LLM inference · Python · 2026

Verifying vLLM Configuration
Arithmetic with ESBMC

A reproducible proof-of-concept by Cyber-Reasoning Consultancy (CRC) applying ESBMC's Python frontend to the integer and index arithmetic of vLLM, the widely deployed high-throughput LLM inference and serving engine. ESBMC surfaced seven CLI-reachable defects; three division-by-zero crashes were confirmed and fixed upstream by the vLLM maintainers in PR #43794. Sources, harness and run instructions are published in lucasccordeiro/vllm.

PR #43794

Three CLI crash bugs in vLLM fixed upstream after our report

Live, CLI-reachable findings surfaced across the vLLM PoC

Verification targets across four tiers, ~4 min end-to-end

Further config-validation gaps filed upstream and still open

Context

vLLM is one of the most widely used open-source engines for serving large language models, and its configuration surface is large: dozens of command-line flags feed directly into the integer and index arithmetic that sizes the KV cache, schedules tokens and lays out memory blocks. A single out-of-range value, a block size of zero, a negative threshold, can slip past argument parsing and validators, then surface much later as an opaque crash deep inside the engine, where it is hard to attribute back to the operator's input.

We wanted to know whether ESBMC's Python frontend could reason about that arithmetic directly, exhaustively, on a developer laptop, and without standing up a GPU or loading a model, and turn a class of "garbage-in, crash-much-later" defects into findings that vLLM's maintainers would accept and fix.

Approach

Modelled on our AWS Neuron PoC, the harness pins vLLM at commit 4438b6e and drives ESBMC (version 8.3.0 or later, Python frontend) over 31 verification targets across four tiers, from focused checks on a single arithmetic helper up to targets that replay a value along its actual command-line-to-crash-site path. The full sweep runs end-to-end in about four minutes with zero unexpected verdicts.

Each candidate defect is paired with a small witness program under harness/ that reproduces ESBMC's counterexample, and, crucially, with a reachability argument establishing whether the failing state is reachable from a normal CLI invocation or only from internal API misuse. That distinction is what separates a reportable, operator-triggerable bug from a defensive-coding observation, and we grade every finding on it rather than reporting raw solver verdicts.

Results

Three division-by-zero crashes confirmed and fixed upstream by vLLM in PR #43794. --block-size 0 (issue #43496), --hash-block-size 0 and --max-model-len 0 were each accepted silently by argument parsing and the config validators, then crashed with a ZeroDivisionError inside cdiv() during KV-cache sizing. The merged fix rejects the bad inputs at construction time with a positive-integer (gt=0) field constraint; a fourth finding, --hash-block-size -k, was incidentally closed by the same constraint.

Three further config-validation gaps filed upstream and still open. --num-gpu-blocks-override 0 is accepted but fails engine init with a bare AssertionError in BlockPool.__init__ (issue #43842); --max-logprobs and --long-prefill-token-threshold both silently accept negative values (issue #43985). Each carries a reproducing witness and a proposed Field constraint mirroring the PR #43794 fix.

Honest scoping, backed by reachability proofs. Two further candidates, a non-power-of-two --block-size and a negative --kv-cache-memory-bytes, were investigated and shown not to be live bugs, caught by existing guards before any crash site. A negative max_num_scheduled_tokens defect is real but programmatic-only, not reachable from the CLI (issue #44123), and an unguarded division in get_num_blocks was shown unreachable from normal invocation. We report each of these as a non-finding rather than padding the count.

Three ESBMC Python-frontend issues filed (#4926, #4909, #4756) in the course of the PoC, the verification work feeding directly back into the open-source tooling that made it possible.

What this means

The configuration arithmetic of a production AI-inference engine is amenable to bounded model checking with ESBMC's Python frontend, at developer-laptop speeds and with no GPU or model in the loop. For teams operating or extending vLLM, the payoff is that a class of "accepted-then-crashes-later" misconfiguration bugs can be turned into fast, deterministic checks, and, as PR #43794 shows, into fixes the upstream project is willing to merge.

References

lucasccordeiro/vllm, public PoC repository (harness, witnesses, tiered targets, findings report). github.com/lucasccordeiro/vllm
vllm-project/vllm PR #43794, "Validate against some config fields being set to 0" (merged); fixes the three division-by-zero crashes. github.com/vllm-project/vllm/pull/43794
vllm-project/vllm issue #43496, --block-size 0 ZeroDivisionError (closed by PR #43794). github.com/vllm-project/vllm/issues/43496
vllm-project/vllm issue #43842, --num-gpu-blocks-override 0 silently accepted (open). github.com/vllm-project/vllm/issues/43842
vllm-project/vllm issue #43985, --max-logprobs and --long-prefill-token-threshold accept negative values (open). github.com/vllm-project/vllm/issues/43985
ESBMC, open-source bounded model checker, with the Python frontend used in this PoC. github.com/esbmc/esbmc

Note. This page describes an independent, reproducible proof-of-concept published by CRC's founder and ESBMC creator and lead developer, Prof. Lucas Cordeiro (University of Manchester). CRC Ltd is not in a commercial partnership with the vLLM project, and the PoC was not commissioned by it; all reports and fixes went through vLLM's public issue tracker and pull requests. The harness is open source; every finding, witness and non-finding, including the three fixes merged in PR #43794 and the candidates shown not to be live bugs, is reproducible from the public repository.