Leveraging LLMs for Program Verification

Adharsh Kamath*, Nausheen Mohammed*, Aditya Senthilnathan†, Saikat Chakraborty‡, Pantazis Deligiannis*, Shuvendu K. Lahiri‡, Akash Lal*, Aseem Rastogi*, Subhajit Roy§, Rahul Sharma*
*Microsoft Research, India   †Cornell University   ‡Microsoft Research, Redmond   §IIT Kanpur
FMCAD 2024 (Formal Methods in Computer-Aided Design)

Abstract

We investigate the code reasoning skills of Large Language Models (LLMs) in the context of formal program verification. Specifically, we look at the problem of inferring loop invariants and ranking functions, for proving safety properties and loop termination, respectively. We demonstrate how emergent capabilities of LLMs can be exploited through a combination of prompting techniques and by using them in conjunction with symbolic algorithms.

We curate a dataset of verification problems inspired by past work, and build an LLM-based toolchain called Loopy for proof generation. We perform a rigorous evaluation on this dataset to establish that LLMs have the potential to improve the state of the art in program verification.

The Problem

Formal verification seeks to establish a proof of correctness of a program with respect to a given property. This involves proof construction (e.g., finding loop invariants) and proof checking (e.g., establishing their inductiveness). While proof checking has been automated by SMT solvers, proof construction still requires ingenuity and is much harder to automate.

We follow the guess-and-check methodology: use LLMs to guess proof artifacts (loop invariants, ranking functions, pre/post-conditions), and discharge the proofs using an off-the-shelf formal checker (Frama-C). The soundness comes entirely from the proof checker, so LLM mistakes do not compromise correctness.
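
As a minimal sketch of this guess-and-check loop (not Loopy's actual implementation; llm_guess, check, and MAX_ATTEMPTS are hypothetical stand-ins for an LLM query, a Frama-C invocation, and a retry budget):

#include <stdbool.h>
#include <stdio.h>

#define MAX_ATTEMPTS 3  /* hypothetical retry budget */

/* Stub oracles for illustration only. In Loopy, the guesser would
   query an LLM and the checker would invoke Frama-C. */
static const char *llm_guess(const char *program) {
  (void)program;
  return "loop invariant 0 <= n <= 2*k;";  /* a guessed proof artifact */
}

static bool check(const char *program, const char *artifacts) {
  (void)program; (void)artifacts;
  return true;  /* pretend the checker discharged the proof */
}

/* Guess-and-check: soundness rests entirely on check(), so an
   unreliable guesser can never certify an incorrect program. */
static bool verify(const char *program) {
  for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const char *artifacts = llm_guess(program);
    if (check(program, artifacts))
      return true;   /* proof found and machine-checked */
  }
  return false;      /* inconclusive: no proof found, not a disproof */
}

int main(void) {
  printf("verified: %d\n", verify("void main() { ... }"));
  return 0;
}

Because check is the only trusted component, the worst an unreliable guesser can do is waste attempts.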

How Loopy Works

Loopy combines two key ideas to effectively leverage LLMs for verification:

  1. Prompt Engineering: We design prompt templates with domain-specific instructions that guide LLMs to produce better invariants. This includes explaining what loop invariants are, heuristics for reasoning about loops (e.g., bounding variables, using implications for conditionals), and nudging, i.e., adding generic instructions that encourage the LLM toward the right proof artifacts.
  2. Interplay with Symbolic Tools: LLMs are better at generating ingredients of an invariant than the complete invariant itself. Loopy uses:
    • Houdini algorithm: prunes incorrect guesses from the LLM's outputs to find the largest inductive subset, using only a linear number of calls to the checker (sketched below).
    • Repair procedure: feeds the checker's error messages back to the LLM to fix incorrect invariants, guided by whether they fail on syntax, establishment, or preservation.
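
To make the pruning step concrete, here is a self-contained C sketch of the Houdini loop under simplified assumptions; find_failing is a stub standing in for a real inductiveness check (Frama-C in Loopy), and the candidate strings are illustrative:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stub oracle: report the index of an active candidate that is not
   inductive relative to the others, or -1 if the active set is
   inductive. Here, candidates tagged "BAD" stand in for failures;
   Loopy would instead ask Frama-C. */
static int find_failing(const char *cands[], const bool active[], int n) {
  for (int i = 0; i < n; i++)
    if (active[i] && strstr(cands[i], "BAD"))
      return i;
  return -1;
}

/* Houdini: repeatedly drop a failing candidate until the rest form an
   inductive set. Every call but the last removes a candidate, so at
   most n + 1 checker calls are made, linear in the number of guesses. */
static int houdini(const char *cands[], bool active[], int n) {
  int calls = 0;
  for (;;) {
    calls++;
    int bad = find_failing(cands, active, n);
    if (bad < 0) return calls;  /* fixpoint: inductive subset found */
    active[bad] = false;        /* prune the failing guess and retry */
  }
}

int main(void) {
  const char *cands[] = { "0 <= n <= 2*k", "i == j /* BAD */", "b == 0 ==> i == j" };
  bool active[] = { true, true, true };
  int calls = houdini(cands, active, 3);
  for (int i = 0; i < 3; i++)
    if (active[i]) printf("kept: %s\n", cands[i]);
  printf("checker calls: %d\n", calls);
  return 0;
}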

Example

Given a C program with an assertion, Loopy infers ACSL annotations that let Frama-C verify correctness:

#include <assert.h>

int main(void) {
  int k = 100, b = 0;
  int i = 0, j = 0, n;
  /*@ loop invariant 0 <= n <= 2*k;
      loop invariant b == n % 2;
      loop invariant b == 0 ==> i == j;
      loop invariant b == 1 ==> i == j - 1;
      loop assigns n, b, i, j;
  */
  for (n = 0; n < 2*k; n++) {
    if (b) i++; else j++;
    b = !b;
  }
  assert(i == j);
  return 0;
}

The LLM generates candidate invariants capturing the relationship between b, i, and j. The Houdini algorithm filters out any incorrect candidates, and Frama-C's WP plugin (invoked as frama-c -wp on the annotated file) discharges the proof.

Key Results

Loopy is evaluated on curated datasets of C programs across four verification tasks:

Benchmark      Total   Vanilla LLMs   Loopy       Ultimate (Symbolic)   Loopy + Ultimate
Scalar loops   469     237 (51%)      398 (85%)   430 (92%)             461 (98%)
Array loops    169     60 (36%)       127 (75%)   12 (7%)               128 (75%)
Recursion      31      14 (45%)       16 (52%)    20 (65%)              23 (74%)
Termination    281     49 (17%)       181 (64%)   236 (84%)             255 (91%)

Key takeaways:

  • Loopy beats vanilla LLM prompting on every task, most dramatically on termination (181 vs. 49 benchmarks).
  • Loopy and the symbolic verifier Ultimate are complementary: Ultimate leads on scalar loops and termination, but solves only 12 of the 169 array-loop benchmarks where Loopy solves 127.
  • Running Loopy and Ultimate as a portfolio gives the best results across the board, up to 98% on scalar loops.

Techniques

Domain-Specific Prompts. We design prompt templates with detailed instructions: what loop invariants are, how to reason about loop bodies, and heuristics such as using implications for conditional code and bounding all variables. On scalar loops, this raises the number of benchmarks GPT-4 solves from 237 to 293.
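
For a flavor of these instructions, a nudge of the kind described might read (paraphrased for illustration; not the paper's verbatim template):

  You are given a C program containing a loop. Propose ACSL loop
  invariants that hold on entry to the loop and are preserved by every
  iteration. Try to bound every variable (e.g., 0 <= n <= 2*k), and use
  implications to capture conditional behavior (e.g., b == 0 ==> i == j).
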
Houdini Filtering. LLM outputs often contain the right invariant components spread across multiple completions, mixed with incorrect ones. The Houdini algorithm efficiently finds the largest inductive subset from the union of all LLM outputs, using only a linear number of oracle calls.
LLM-Guided Repair. When Houdini fails, Loopy feeds error messages from Frama-C back to the LLM with a specialized repair prompt. The LLM fixes syntax errors, adds missing establishment clauses, or strengthens preservation conditions. This recovers an additional 15 benchmarks.
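
A schematic of how such feedback could be routed, again in C; the matched message fragments and hint strings are made up for illustration, not Frama-C's actual diagnostics:

#include <stdio.h>
#include <string.h>

typedef enum { SYNTAX, ESTABLISHMENT, PRESERVATION, UNKNOWN } FailureKind;

/* Hypothetical classifier: map a checker error message to the failure
   kind that selects the repair prompt. The matched substrings are
   placeholders, not Frama-C's actual output. */
static FailureKind classify(const char *msg) {
  if (strstr(msg, "syntax")) return SYNTAX;
  if (strstr(msg, "establishment")) return ESTABLISHMENT;
  if (strstr(msg, "preservation")) return PRESERVATION;
  return UNKNOWN;
}

/* Pick the instruction sent back to the LLM along with the message. */
static const char *repair_hint(FailureKind k) {
  switch (k) {
    case SYNTAX:        return "Fix the ACSL syntax of the annotations.";
    case ESTABLISHMENT: return "An invariant does not hold on loop entry; correct or weaken it.";
    case PRESERVATION:  return "An invariant is not preserved by the loop body; strengthen the set.";
    default:            return "Revise the invariants using the checker's message.";
  }
}

int main(void) {
  const char *msg = "invariant preservation could not be proved";
  printf("%s\n", repair_hint(classify(msg)));
  return 0;
}
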
Multiple LLMs. GPT-4, GPT-3.5-Turbo, and CodeLlama-34B-Instruct solve overlapping but distinct sets of benchmarks. GPT-3.5-Turbo solves 31 benchmarks that GPT-4 cannot, suggesting that ensembling multiple LLMs can further improve results.

Contributions

  • A curated dataset of verification problems, covering safety and termination, drawn from past work.
  • Loopy, an LLM-based toolchain that combines domain-specific prompting with symbolic algorithms (Houdini filtering and checker-guided repair) for proof generation.
  • A rigorous evaluation showing that LLMs, especially in combination with symbolic tools, have the potential to improve the state of the art in program verification.

BibTeX

@inproceedings{kamath2024loopy,
  title={Leveraging LLMs for Program Verification},
  author={Kamath, Adharsh and Mohammed, Nausheen and Senthilnathan, Aditya and Chakraborty, Saikat and Deligiannis, Pantazis and Lahiri, Shuvendu K. and Lal, Akash and Rastogi, Aseem and Roy, Subhajit and Sharma, Rahul},
  booktitle={Formal Methods in Computer-Aided Design (FMCAD)},
  doi={10.34727/2024/ISBN.978-3-85448-065-5_16},
  year={2024}
}