Verifying vLLM Configuration
Arithmetic with ESBMC
A reproducible proof-of-concept by Cyber-Reasoning Consultancy (CRC) applying ESBMC's Python frontend to the integer and index arithmetic of vLLM, the widely deployed high-throughput LLM inference and serving engine. ESBMC surfaced seven CLI-reachable defects; three division-by-zero crashes were confirmed and fixed upstream by the vLLM maintainers in PR #43794. Sources, harness and run instructions are published in lucasccordeiro/vllm.
PR #43794
Three CLI crash bugs in vLLM fixed upstream after our report
7
Live, CLI-reachable findings surfaced across the vLLM PoC
31
Verification targets across four tiers, ~4 min end-to-end
3
Further config-validation gaps filed upstream and still open
Context
vLLM is one of the most widely used open-source engines for serving large language models, and its configuration surface is large: dozens of command-line flags feed directly into the integer and index arithmetic that sizes the KV cache, schedules tokens and lays out memory blocks. A single out-of-range value, a block size of zero, a negative threshold, can slip past argument parsing and validators, then surface much later as an opaque crash deep inside the engine, where it is hard to attribute back to the operator's input.
We wanted to know whether ESBMC's Python frontend could reason about that arithmetic directly, exhaustively, on a developer laptop, and without standing up a GPU or loading a model, and turn a class of "garbage-in, crash-much-later" defects into findings that vLLM's maintainers would accept and fix.
Approach
Modelled on our
AWS Neuron PoC,
the harness pins vLLM at commit
4438b6e
and drives ESBMC (version 8.3.0 or later, Python frontend) over
31 verification targets across four tiers, from focused checks on a
single arithmetic helper up to targets that replay a value along its actual
command-line-to-crash-site path. The full sweep runs end-to-end in about four
minutes with zero unexpected verdicts.
Each candidate defect is paired with a small witness program under
harness/
that reproduces ESBMC's counterexample, and, crucially, with a reachability
argument establishing whether the failing state is reachable from a normal CLI
invocation or only from internal API misuse. That distinction is what separates a
reportable, operator-triggerable bug from a defensive-coding observation, and we
grade every finding on it rather than reporting raw solver verdicts.
Results
Three division-by-zero crashes confirmed and fixed upstream by vLLM in
PR #43794. --block-size 0
(issue #43496),
--hash-block-size 0
and --max-model-len 0
were each accepted silently by argument parsing and the config validators, then
crashed with a ZeroDivisionError
inside cdiv()
during KV-cache sizing. The merged fix rejects the bad inputs at construction
time with a positive-integer (gt=0)
field constraint; a fourth finding,
--hash-block-size -k,
was incidentally closed by the same constraint.
Three further config-validation gaps filed upstream and still open. --num-gpu-blocks-override 0
is accepted but fails engine init with a bare
AssertionError
in BlockPool.__init__
(issue #43842);
--max-logprobs
and --long-prefill-token-threshold
both silently accept negative values
(issue #43985).
Each carries a reproducing witness and a proposed
Field
constraint mirroring the PR #43794 fix.
Honest scoping, backed by reachability proofs. Two further
candidates, a non-power-of-two
--block-size
and a negative
--kv-cache-memory-bytes,
were investigated and shown not to be live bugs, caught by existing
guards before any crash site. A negative
max_num_scheduled_tokens
defect is real but programmatic-only, not reachable from the CLI
(issue #44123),
and an unguarded division in
get_num_blocks
was shown unreachable from normal invocation. We report each of these as a
non-finding rather than padding the count.
Three ESBMC Python-frontend issues filed
(#4926,
#4909,
#4756)
in the course of the PoC, the verification work feeding directly back into the
open-source tooling that made it possible.
What this means
The configuration arithmetic of a production AI-inference engine is amenable to bounded model checking with ESBMC's Python frontend, at developer-laptop speeds and with no GPU or model in the loop. For teams operating or extending vLLM, the payoff is that a class of "accepted-then-crashes-later" misconfiguration bugs can be turned into fast, deterministic checks, and, as PR #43794 shows, into fixes the upstream project is willing to merge.
References
- lucasccordeiro/vllm, public PoC repository (harness, witnesses, tiered targets, findings report). github.com/lucasccordeiro/vllm
- vllm-project/vllm PR #43794, "Validate against some config fields being set to 0" (merged); fixes the three division-by-zero crashes. github.com/vllm-project/vllm/pull/43794
-
vllm-project/vllm issue #43496,
--block-size 0ZeroDivisionError (closed by PR #43794). github.com/vllm-project/vllm/issues/43496 -
vllm-project/vllm issue #43842,
--num-gpu-blocks-override 0silently accepted (open). github.com/vllm-project/vllm/issues/43842 -
vllm-project/vllm issue #43985,
--max-logprobsand--long-prefill-token-thresholdaccept negative values (open). github.com/vllm-project/vllm/issues/43985 - ESBMC, open-source bounded model checker, with the Python frontend used in this PoC. github.com/esbmc/esbmc
Note. This page describes an independent, reproducible proof-of-concept published by CRC's founder and ESBMC creator and lead developer, Prof. Lucas Cordeiro (University of Manchester). CRC Ltd is not in a commercial partnership with the vLLM project, and the PoC was not commissioned by it; all reports and fixes went through vLLM's public issue tracker and pull requests. The harness is open source; every finding, witness and non-finding, including the three fixes merged in PR #43794 and the candidates shown not to be live bugs, is reproducible from the public repository.