Can LLMs Write Provably Correct Code?
- Hamburg, Germany
This article is also available in German.
Software bugs cost the US economy over two trillion dollars a year[1]. A problem we don’t have in Europe — we all write formally verified code, after all.
No, of course not. For most projects, a bug is annoying but not a disaster. The situation is different for software in insulin pumps, flight control systems, and nuclear power plants. Surely there, at least, the code is proven correct?
The answer is sobering: Even there, almost nobody relies on mathematical proofs. The EU Medical Device Regulation requires verification and validation for software, but doesn’t specify how; mathematical proofs are not mandated anywhere[2]. Even in safety-critical software, formal methods are the exception, not the rule[3].
Yet a solution for part of the problem has existed for decades: formal verification, meaning code that is mathematically proven correct — correct relative to a formal specification. Real examples exist. AWS has formally verified parts of its TLS library[4]. The seL4 microkernel runs in Boeing drones[5]. In hardware design, formal verification is standard — Intel has used it since the Pentium 4, after the FDIV bug forced a recall in 1994[6]. But these are exceptions, “exquisite artifacts” as a DARPA paper calls them[5]: hand-picked components, each a research project in itself.
The reasons for low adoption come as a bundle: Writing formal specifications is expensive and unfamiliar. Verifiers give unhelpful feedback on errors. Integration into existing systems is costly. And experts are rare.
Martin Kleppmann, professor at Cambridge and author of “Designing Data-Intensive Applications”, believes this will change[7]. His thesis: AI makes formal verification mainstream. The logic is simple. Proof checkers provide binary feedback — either the proof is correct or it isn’t. If the model hallucinates, that matters less than usual: the checker rejects the proof and the model tries again. But it isn’t irrelevant either: Without a good specification or helpful feedback, the loop can still get stuck.
A paper from JetBrains Research tests exactly that[8].
Brief Aside: What is a Verification Language?
The best-known verification language is Dafny, developed at Microsoft Research. Important to understand: Dafny is a standalone language, not an analysis tool for Java or Python. Code is written directly in Dafny, and verification happens before compilation. The verified code can then be compiled to C#, Go, Java, or JavaScript.
The code looks like a mix of Java and Python, with one crucial difference: Alongside the implementation are guarantees that the code must uphold. The compiler then proves that these guarantees hold.
An example:
method Abs(x: int) returns (y: int)
  ensures 0 <= y
  ensures y == x || y == -x
{
  if x < 0 { return -x; }
  else { return x; }
}
ensures defines postconditions — properties that must hold after the method executes. Here: The result is non-negative, and it’s either x or -x. Dafny checks this on every compile. If the proof fails, the code doesn’t compile.
Verified Dafny code compiles to ordinary Java, C#, or Go. Here’s the Java output:
public static BigInteger Abs(BigInteger x) {
  if (x.compareTo(BigInteger.ZERO) < 0) {
    return x.negate();
  } else {
    return x;
  }
}
The ensures clauses are gone from the Java code — they were proven at compile time and aren’t needed at runtime.
The difference from unit tests: A test checks samples. Dafny proves that the guarantee holds for every possible input. All inputs are covered, not just the ones someone thought of.
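To make the contrast concrete, a conventional test for Abs might look like the sketch below (TestAbs and its use of Dafny’s expect statement are illustrative, not taken from the tutorial):

method TestAbs() {
  var y := Abs(-3);
  expect y == 3;  // checked at runtime, and only for this one input
}

The ensures clauses on Abs, by contrast, are discharged once at verification time and cover every possible int.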
The problem: The Abs example was simple — no loop, no complex logic. With loops, the compiler needs invariants describing what holds in each iteration — essentially mathematical induction, which is often harder to formulate than the algorithm itself. With mutually recursive functions, you need helper lemmas that the verifier can’t find on its own. The proof quickly becomes longer than the code itself.
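A minimal sketch of what that looks like (Sum and SumUpTo are names invented for this illustration, not from the paper): even summing a sequence needs an invariant that ties the running total to the prefix processed so far, otherwise the verifier cannot complete the proof.

function SumUpTo(a: seq<int>, n: nat): int
  requires n <= |a|
{
  if n == 0 then 0 else SumUpTo(a, n - 1) + a[n - 1]
}

method Sum(a: seq<int>) returns (s: int)
  ensures s == SumUpTo(a, |a|)
{
  s := 0;
  var i := 0;
  while i < |a|
    invariant 0 <= i <= |a|
    invariant s == SumUpTo(a, i)  // the running total equals the sum of the prefix processed so far
  {
    s := s + a[i];
    i := i + 1;
  }
}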
For a sorting algorithm or a cryptographic function, it’s worth it. For a web app with a hundred endpoints and constantly changing requirements, not so much.
The official tutorial is surprisingly well written.
The Experiment
At its center is one key question: How much does an LLM need to be given to produce verified code?
Is a text description like “sort this array in ascending order” enough for the model to deliver correct code plus proof? Or does it need the formal specification, with the model just filling in the gaps? In the first case, the human would never have to touch a verification language. In the second, Dafny or Nagini remains a prerequisite — the LLM merely saves the last step.
Six scenarios test the spectrum from “fill the gaps” to “do everything”. In Modes 1 and 2, the implementation is given and the proof is missing, with Mode 1 including the specification and Mode 2 not. In Modes 3 and 4, the specification is given, but implementation and proof are missing. In Modes 5 and 6, there’s only a method signature and a text description — the LLM must deliver everything.
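To make this concrete, a Mode-4-style input could look roughly like the following sketch (FindMax is a made-up task, not one from the benchmark): the specification is given, and everything inside the braces is what the model must produce. As written, the skeleton does not verify yet.

method FindMax(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures exists i :: 0 <= i < a.Length && a[i] == m
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m
{
  // implementation and loop invariants to be generated by the model
}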
The pipeline works exactly as Kleppmann describes: The prompt goes to the LLM, the verifier checks the result, errors are fed back, up to five iterations. Then comes validation: The system checks whether the generated specification implies the reference specification. Trivial postconditions aren’t allowed — things like ensures true, or conditions that don’t constrain any behavior.
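A hypothetical example of what that filter catches (SortAscending is invented here for illustration): the method below verifies, but its postcondition says nothing about the result, so the validation step rejects it.

method SortAscending(a: seq<int>) returns (b: seq<int>)
  ensures true   // constrains nothing: any body satisfies it
{
  b := a;        // not sorted, yet the trivial postcondition holds
}
// A generated specification the validator accepts must imply the reference spec,
// e.g. sortedness plus permutation of the input:
//   ensures forall i, j :: 0 <= i < j < |b| ==> b[i] <= b[j]
//   ensures multiset(b) == multiset(a)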
The benchmarks include 132 Dafny programs, 106 Nagini programs in Python, and 55 Verus programs in Rust, all manually translated from HumanEval — a standard benchmark published by OpenAI in 2021 with 164 programming tasks, from string manipulation to simple algorithms. Each task contains a function signature, a docstring, and unit tests. The benchmark measures whether generated code passes the tests, and has become the de facto standard for comparing code generation models.
The Results
The tests used Claude 3.5 Sonnet, since GPT-4o and Claude 3 Opus performed significantly worse. The table shows three representative scenarios: Mode 1 (adding the proof, with specification and implementation given), Mode 4 (implementation and proof from the specification), and Mode 6 (everything from the text description). The remaining modes fall in between, as expected.
| | Mode 1 | Mode 4 | Mode 6 |
|---|---|---|---|
| Dafny | 86% | 82% | 29% |
| Nagini | 66% | 63% | 15% |
| Verus | 45% | 40% | 15% |
The answer to the key question is clear: Without a formal specification, it doesn’t work. With one, it works surprisingly well.
Concretely, this means: “Sort this array in ascending order” isn’t enough — the success rate is 29% for Dafny and 15% for the others. But when the human provides the specification — method signature plus requires and ensures — the LLM handles the rest. It writes the code, finds the loop invariants, and adds the necessary assertions. For Dafny, this succeeds in over 80% of cases.
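Picking up the hypothetical FindMax skeleton from above, a successful run produces something like this: the human-written specification is untouched, while the body, including the loop invariants, is the part the model contributes.

method FindMax(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures exists i :: 0 <= i < a.Length && a[i] == m
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m
{
  m := a[0];
  var i := 1;
  while i < a.Length
    // the invariants carry the postconditions through every iteration
    invariant 1 <= i <= a.Length
    invariant exists j :: 0 <= j < i && a[j] == m
    invariant forall j :: 0 <= j < i ==> a[j] <= m
  {
    if a[i] > m { m := a[i]; }
    i := i + 1;
  }
}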
In other words: The verification language remains a prerequisite. But the model can handle loop invariants, classically one of the hardest mechanical parts of formal verification[9], while the hardest substantive work, the specification, stays with the human.
And all this with Claude 3.5 Sonnet, a model from mid-2024; the tested alternatives, GPT-4o and Claude 3 Opus, are by now considered dated. Newer models could plausibly improve on these numbers — but the hard limits remain the specification and scaling to real systems.
What This Means
Kleppmann’s thesis is confirmed, but only partially. The feedback loop does work — LLMs can generate formal proofs when given enough context. With a formal specification, success reaches 63-86% for Dafny and Nagini (lower for Verus), depending on language and scenario. This is remarkable, because loop invariants were previously considered the part of formal verification that required human intuition[9]. An LLM can apparently learn this.
But without formal specification, the success rate drops to 15-29%. The model can conduct a proof when someone says what should be proven. It cannot reliably derive what should be proven. The challenge doesn’t disappear, it migrates. Away from the proof, toward the specification.
This sounds like a small difference, but it’s fundamental. Correct code for the wrong requirement helps nobody. If the specification says “sort ascending” and the code provably does that, but it should have sorted descending, the proof doesn’t help. The work that remains with humans isn’t just some residual work — it’s the actual intellectual achievement: defining what correct even means.
Still, this is progress, and bigger than it sounds. Reading Dafny isn’t hard — the syntax is familiar and specifications are often self-explanatory. Writing Dafny so the compiler is happy was the problem until now, because the feedback is binary: the proof goes through or it doesn’t, and that is exactly what makes debugging so tedious. The compiler doesn’t say “you’re missing an invariant here”, it just says “proof failed”. Finding the right invariant requires intuition for mathematical induction, and that is precisely the part AI now handles. A developer who can formulate pre- and postconditions no longer needs to know how to find loop invariants or structure helper lemmas. The entry barrier drops significantly, even if it doesn’t disappear.
For AI labs, formal verification is interesting for another reason: It provides a perfect training signal for reinforcement learning[10]. A unit test can pass even though a bug exists — it just wasn’t hit. A formal proof leaves no such gap — it covers every possible input. When a model trains on millions of such feedback signals, it learns not just to generate proofs, but what makes a valid proof. The quality of feedback determines the quality of learning, and formal verification delivers one of the cleanest feedback signals: binary, unfakeable, without test gaps.
The paper shows that this cycle already works, at least for algorithmic problems at the HumanEval level. The leap to real systems with thousands of lines of code and complex dependencies is still ahead.
1. CISQ, The Cost of Poor Software Quality in the U.S., 2022
2. European Parliament, Regulation (EU) 2017/745 on medical devices, Annex I, Section 17.2
3. Wayne, H., Why Don’t People Use Formal Methods?, 2018
4. Chudnov, A. et al., Continuous Formal Verification of Amazon s2n, CAV, 2018
5. Fisher, K. et al., The HACMS program: using formal methods to eliminate exploitable bugs, Philosophical Transactions of the Royal Society A, 2017
6. Kaivola, R. et al., Formal verification in Intel CPU design, IEEE, 2005
7. Kleppmann, M., Prediction: AI will make formal verification go mainstream, 2025
8. Kozyrev, A. et al., Can LLMs Enable Verification in Mainstream Programming?, JetBrains Research, 2025
9. Si, X. et al., Learning Loop Invariants for Program Verification, NeurIPS, 2018
10. Douglas, S. & Bricken, T., How Does Claude 4 Think?, Dwarkesh Podcast, 2025