We investigate the code reasoning abilities of Large Language Models (LLMs) in the context of formal program verification. Specifically, we study the problem of inferring loop invariants and ranking functions for proving safety properties and loop termination, respectively. We demonstrate how emergent capabilities of LLMs can be exploited through a combination of prompting techniques and by using them in conjunction with symbolic algorithms.
We curate a dataset of verification problems inspired by past work and build an LLM-based toolchain called Loopy for proof generation. A rigorous evaluation on this dataset establishes that LLMs have the potential to improve the state of the art in program verification.
The Problem
Formal verification seeks to establish a proof of correctness of a program with respect to a given property. This involves proof construction (e.g., finding loop invariants) and proof checking (e.g., establishing their inductiveness). While proof checking has been automated by SMT solvers, proof construction still requires ingenuity and is much harder to automate.
We follow the guess-and-check methodology: use LLMs to guess proof artifacts (loop invariants, ranking functions, pre/post-conditions), and discharge the proofs using an off-the-shelf formal checker (Frama-C). The soundness comes entirely from the proof checker, so LLM mistakes do not compromise correctness.
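To make the methodology concrete, here is a minimal sketch of such a guess-and-check loop in Python. It assumes a local Frama-C installation and two illustrative callables that are not Loopy's actual API: propose(program), which wraps an LLM and returns candidate invariants, and annotate(program, candidates), which splices them into the source as ACSL.

# Minimal sketch of the guess-and-check loop; `propose` and `annotate` are
# illustrative placeholders, not Loopy's real interface.
import subprocess
import tempfile

def check_with_frama_c(annotated_c):
    """Run Frama-C's WP plugin on the annotated program."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(annotated_c)
        path = f.name
    result = subprocess.run(["frama-c", "-wp", path], capture_output=True, text=True)
    # A real toolchain would parse WP's proved/total goal report; as a rough
    # placeholder we only check that Frama-C itself exited cleanly.
    return result.returncode == 0

def guess_and_check(program, propose, annotate, rounds=5):
    """Ask the LLM for invariants until the checker accepts them (or give up)."""
    for _ in range(rounds):
        candidates = propose(program)              # LLM guesses proof artifacts
        annotated = annotate(program, candidates)  # insert ACSL annotations
        if check_with_frama_c(annotated):          # soundness rests on this check
            return annotated                       # verified program
    return None                                    # no proof found within budget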
How Loopy Works
Loopy combines two key ideas to effectively leverage LLMs for verification:
Prompt Engineering: We design prompt templates with domain-specific instructions that guide LLMs to produce better invariants. This includes explaining what loop invariants are, heuristics for reasoning about loops (e.g., bounding variables, using implications for conditionals), and nudging — adding generic instructions that encourage the LLM toward the right proof artifacts.
Interplay with Symbolic Tools: LLMs are better at generating ingredients of an invariant than the complete invariant itself. Loopy uses:
Houdini algorithm — prunes incorrect guesses from LLM outputs to find a correct inductive subset, with only a linear number of calls to the checker (see the sketch after this list).
Repair procedure — feeds error messages from the checker back to the LLM to fix incorrect invariants, guided by whether they fail on syntax, establishment, or preservation.
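A minimal sketch of the Houdini-style pruning loop, assuming an illustrative find_failing(candidates) oracle (backed by the checker) that returns a clause violating establishment or preservation, or None when the whole set is inductive. This is not Loopy's actual interface.

def houdini(candidates, find_failing):
    """Drop failing clauses until the remaining set is inductive.

    Each round removes one clause, so the checker runs at most
    len(candidates) + 1 times: the linear bound mentioned above.
    """
    remaining = set(candidates)
    while (failing := find_failing(remaining)) is not None:
        remaining.discard(failing)   # prune the clause that broke inductiveness
    return remaining                 # largest inductive subset of the guesses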
Example
Given a C program with an assertion, Loopy infers ACSL annotations that let Frama-C verify correctness:
#include <assert.h>

void main() {
    int k = 100, b = 0;
    int i = 0, j = 0, n;
    /*@ loop invariant 0 <= n <= 2*k;
      @ loop invariant b == n % 2;
      @ loop invariant b == 0 ==> i == j;
      @ loop invariant b == 1 ==> i == j - 1;
      @*/
    for (n = 0; n < 2*k; n++) {
        if (b) i++; else j++;
        b = !b;
    }
    assert(i == j);
}
The LLM generates candidate invariants capturing the relationship between b, i, and j. The Houdini algorithm filters out any incorrect candidates, and Frama-C verifies the program.
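With these annotations in place, the proof obligations can be discharged by Frama-C's WP plugin (invoked along the lines of frama-c -wp on the annotated file; the exact options Loopy passes may differ).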
Key Results
Loopy is evaluated on curated datasets of C programs across four verification tasks; each entry below reports the number of programs verified (with the percentage of the total):
| Benchmark    | Total | Vanilla LLMs | Loopy     | Ultimate (Symbolic) | Loopy + Ultimate |
|--------------|-------|--------------|-----------|---------------------|------------------|
| Scalar loops | 469   | 237 (51%)    | 398 (85%) | 430 (92%)           | 461 (98%)        |
| Array loops  | 169   | 60 (36%)     | 127 (75%) | 12 (7%)             | 128 (75%)        |
| Recursion    | 31    | 14 (45%)     | 16 (52%)  | 20 (65%)            | 23 (74%)         |
| Termination  | 281   | 49 (17%)     | 181 (64%) | 236 (84%)           | 255 (91%)        |
Key takeaways:
Loopy significantly boosts raw LLM performance: from 51% to 85% on scalar loops and from 17% to 64% on termination.
Houdini alone increases success by 30.7% on scalar loop benchmarks (from 293 to 383 verified programs).
On array benchmarks, Loopy solves 127/169 programs compared to Ultimate's 12/169, demonstrating a major advantage of LLM-based approaches on programs with arrays.
Combining Loopy with Ultimate achieves 98% on scalar loops and 91% on termination, showing that LLM-based and symbolic approaches are complementary.
Loopy solves 31 scalar-loop benchmarks and 19 termination benchmarks that Ultimate cannot, establishing that LLMs can improve the state of the art.
Techniques
Domain-Specific Prompts. We design prompt templates with detailed instructions: what loop invariants are, how to reason about loop bodies, and heuristics like using implications for conditional code and bounding all variables. This raises GPT-4's success from 237 to 293 benchmarks on scalar loops.
Houdini Filtering. LLM outputs often contain the right invariant components spread across multiple completions, mixed with incorrect ones. The Houdini algorithm efficiently finds the largest inductive subset from the union of all LLM outputs, using only a linear number of oracle calls.
LLM-Guided Repair. When Houdini fails, Loopy feeds error messages from Frama-C back to the LLM with a specialized repair prompt. The LLM fixes syntax errors, adds missing establishment clauses, or strengthens preservation conditions. This recovers an additional 15 benchmarks.
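A minimal sketch of this repair loop, with illustrative helpers that are not Loopy's real interface: run_checker(program) returns (ok, error_message) from Frama-C, and llm(prompt) returns a repaired annotated program. The prompt wording is likewise illustrative, not the actual repair prompt.

def repair(annotated, run_checker, llm, max_attempts=3):
    """Feed checker feedback back to the LLM until verification succeeds."""
    for _ in range(max_attempts):
        ok, error = run_checker(annotated)
        if ok:
            return annotated                      # proof goes through
        prompt = (
            "The following ACSL-annotated C program fails verification.\n"
            "Checker output:\n" + error + "\n\n"
            "Program:\n" + annotated + "\n\n"
            "Fix the annotations (not the code) so that every invariant is "
            "established on entry and preserved by the loop body."
        )
        annotated = llm(prompt)                   # LLM proposes a fixed version
    return None                                   # give up after the budget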
Multiple LLMs. GPT-4, GPT-3.5-Turbo, and CodeLlama-34B-Instruct solve overlapping but distinct sets of benchmarks. GPT-3.5-Turbo solves 31 benchmarks that GPT-4 cannot, suggesting that ensembling multiple LLMs can further improve results.
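One natural way to ensemble, sketched here under the same assumptions as above (models is a list of propose-callables, one per LLM), is to pool candidates from every model and let Houdini keep whatever inductive subset survives; this is an illustration, not Loopy's actual code.

def ensemble_candidates(program, models):
    candidates = set()
    for propose in models:
        candidates |= set(propose(program))   # pool guesses across models
    return candidates                         # Houdini then filters the pooled set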
Contributions
A curated dataset of C programs spanning four verification tasks: scalar loop invariants, array loop invariants, recursive pre/post-conditions, and termination ranking functions.
Loopy, an LLM-based toolchain that combines prompt engineering with symbolic algorithms (Houdini, Repair) for automated proof generation.
A rigorous evaluation showing that LLMs can improve the state of the art in program verification, solving benchmarks that dedicated symbolic tools cannot.
BibTeX
@inproceedings{kamath2024loopy,
title={Leveraging LLMs for Program Verification},
author={Kamath, Adharsh and Mohammed, Nausheen and Senthilnathan, Aditya and Chakraborty, Saikat and Deligiannis, Pantazis and Lahiri, Shuvendu K. and Lal, Akash and Rastogi, Aseem and Roy, Subhajit and Sharma, Rahul},
booktitle={Formal Methods in Computer-Aided Design (FMCAD)},
doi={10.34727/2024/ISBN.978-3-85448-065-5_16},
year={2024}
}