Original Evaluation · March 2026 · Superseded by vLLM Re-evaluation May 2026

Legislative Pipeline — Original Model Selection

This was the first evidence-based model selection, made before a 155-hour unattended production run on 11,673 Illinois bills. The setup: Ollama backend, single inference stream, three candidate models, five chunk sizes. The evaluation selected Mistral 7B at 500 tokens. The May 2026 vLLM re-evaluation later reversed that recommendation; the sections below explain why, and why the original call was correct given what was measurable at the time.

Original decision · March 2026
Mistral 7B · 500-token chunks · Ollama
Highest parse-OK rate in the 15-cell grid (54.9%). Lowest cross-configuration variance (±0.10). Zero EMPTY parses at 500 tokens. The correct call given the available measurement stack.
Grid cells: 15 (3 models × 5 chunk sizes)
Sweep runtime: ~15h (unattended, single GPU)
Chunks scored: 40 (blind rubric review)
Winning parse-OK: 54.9% (Mistral @ 500 tokens)

A production decision that needed evidence

The production run was estimated at 155 hours of unattended, sequential inference on a single GPU. A wrong model choice is not a quick re-run — it is a week of compute and a corpus of low-quality extractions. The decision had to be made on evidence, not on a smoke test.
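
The cost of a wrong choice can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted in the text (155 hours, 11,673 bills); the per-bill average is a derived illustration, not a measured value, and it assumes the estimate is spread evenly over the corpus.

```python
# Figures taken from the text above.
ESTIMATED_HOURS = 155
BILL_COUNT = 11_673

# Average wall-clock budget per bill if the estimate holds (derived, not measured).
seconds_per_bill = ESTIMATED_HOURS * 3600 / BILL_COUNT

# The same estimate expressed in days of continuous GPU time.
days_of_compute = ESTIMATED_HOURS / 24

print(f"{seconds_per_bill:.1f} s per bill on average")      # 47.8
print(f"{days_of_compute:.1f} days of continuous inference")  # 6.5
```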

Three candidate models were evaluated across five chunk sizes (500–2,800 tokens): Mistral 7B Instruct v0.3, Qwen 2.5 7B, and Llama 3.2 3B.

◆ The Llama comparison note

This evaluation used Llama 3.2 3B, a 3-billion-parameter model, against 7B-class models. The May 2026 re-evaluation upgraded to Llama 3.1 8B for parameter parity. That upgrade, combined with the vLLM migration, moved Llama from last place to first, by a margin of 20+ percentage points. The original evaluation was not wrong; it was limited by the comparison set available at the time.

System under test

Hardware: Single NVIDIA L4, 23 GB VRAM. One GPU, one inference process — no horizontal scale-out.

Backend: Ollama, which serves one request at a time with no continuous batching. Time to first token (TTFT) and inter-token latency (ITL) were not measurable on Ollama. This became a documented limitation and motivated the vLLM migration.

Pipeline: Bronze (ingest) → Silver (chunk → extract) → Gold (standardize). Each config writes to a profile-isolated Silver path — no shared state between cells.
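
One way to picture the profile isolation is to derive a distinct Silver directory from each (model, chunk size) pair. This is a minimal sketch, not the pipeline's actual layout: the model slugs, the `data/silver` root, and the `silver_path` helper are all hypothetical names chosen for illustration. Only the chunk sizes and the 3 × 5 grid shape come from the evaluation.

```python
from itertools import product
from pathlib import Path

# Hypothetical slugs for the three candidate models.
MODELS = ["mistral-7b-instruct-v0.3", "qwen2.5-7b", "llama3.2-3b"]
# Chunk sizes from the parse-OK table in this evaluation.
CHUNK_SIZES = [500, 700, 900, 1200, 2800]
# Hypothetical root directory for Silver outputs.
SILVER_ROOT = Path("data/silver")


def silver_path(model: str, chunk_size: int) -> Path:
    """Return a Silver directory unique to one grid cell (assumed naming scheme)."""
    return SILVER_ROOT / f"{model}__chunk{chunk_size}"


# Every cell of the 3 x 5 grid, each with its own output path.
cells = list(product(MODELS, CHUNK_SIZES))
paths = [silver_path(model, size) for model, size in cells]

# No two cells share a path, so no cell can overwrite another's output.
assert len(set(paths)) == len(cells)
```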

Idempotency: Extractor flushes in batches and resumes after a kill — a precondition for multi-night unattended runs.
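
The flush-and-resume behaviour can be sketched as an append-only JSONL writer that skips chunks already on disk. This is an illustration under stated assumptions, not the extractor's real code: the JSONL format, the `chunk_id` key, and the function names are hypothetical.

```python
import json
import os
from pathlib import Path


def load_done_ids(out_path):
    """Return the chunk ids that already have a complete record on disk."""
    done = set()
    if not out_path.exists():
        return done
    with out_path.open() as f:
        for line in f:
            try:
                done.add(json.loads(line)["chunk_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # blank or torn line: that chunk is simply redone
    return done


def run_extraction(chunks, extract, out_path, batch_size=32):
    """Process (chunk_id, text) pairs, flushing in batches; safe to re-run after a kill."""
    # If a kill left a torn final line, terminate it so new records start clean.
    if out_path.exists() and out_path.stat().st_size > 0:
        with out_path.open("rb+") as f:
            f.seek(-1, os.SEEK_END)
            if f.read(1) != b"\n":
                f.write(b"\n")

    done = load_done_ids(out_path)
    buffer, written = [], 0

    def flush():
        nonlocal written
        if not buffer:
            return
        with out_path.open("a") as f:
            f.writelines(json.dumps(rec) + "\n" for rec in buffer)
            f.flush()
            os.fsync(f.fileno())  # push the batch to disk before moving on
        written += len(buffer)
        buffer.clear()

    for chunk_id, text in chunks:
        if chunk_id in done:
            continue  # already extracted in an earlier run
        buffer.append({"chunk_id": chunk_id, "result": extract(text)})
        if len(buffer) >= batch_size:
            flush()
    flush()
    return written
```

A kill between flushes loses at most one unflushed batch, and those chunks are picked up again on the next run because their ids are absent from the output file.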

Escalating workload characterization

Not one experiment — a deliberate escalation. Each stage scoped to answer a question the previous stage raised. You don't spend a 15-hour run on a question a 10-minute run can answer.

01

Smoke test — 5 bills

Confirm harness runs end to end · not a measurement

The smoke test pointed at Qwen; a larger sample later overturned that result. This is why smoke tests don't make model decisions.

02

50-bill grid — 3 × 3 cells

First real workload characterization

Model ranking flipped with 10× the sample. Mistral overtook Qwen — a textbook case for why you need a meaningful sample before drawing conclusions.

03

Thorough sweep — 15 cells · ~15 hours

3 models × 5 chunk sizes · 2 repeats · warmup phase

Ran unattended over a weekend. Warmup phase excluded cold-start bias. This is the primary dataset for the decision.

04

Qualitative review — 40 chunks scored

Blind rubric · 0–2 on Accuracy / Specificity / Completeness

Blind to parse status — a structurally partial parse with strong substance outscores a clean parse with shallow answers. Measures whether the model understood the bill, not just whether it produced valid JSON.
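
The rubric can be sketched as a small data structure plus a blinding step. The three axes and the 0–2 scale come from the review described above. Averaging the axes into an overall score, hiding the model name as well as the parse status, and the record field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    accuracy: int
    specificity: int
    completeness: int

    def __post_init__(self):
        # Each axis is scored on the 0-2 scale used in the review.
        for axis in ("accuracy", "specificity", "completeness"):
            if getattr(self, axis) not in (0, 1, 2):
                raise ValueError(f"{axis} must be 0, 1, or 2")

    @property
    def overall(self) -> float:
        # Assumption: the overall score is the plain mean of the three axes.
        return mean((self.accuracy, self.specificity, self.completeness))


def blind_view(record: dict) -> dict:
    """Drop fields a reviewer must not see (hypothetical field names)."""
    hidden = {"parse_status", "model"}
    return {key: value for key, value in record.items() if key not in hidden}


# A structurally partial parse with strong substance...
partial_but_substantive = RubricScore(accuracy=2, specificity=2, completeness=1)
# ...can outscore a clean parse with shallow answers.
clean_but_shallow = RubricScore(accuracy=1, specificity=0, completeness=1)
assert partial_but_substantive.overall > clean_but_shallow.overall
```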

Quantitative and qualitative findings

Parse-OK rate — 15-cell grid

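| Model        | 500 tok | 700 tok | 900 tok | 1,200 tok | 2,800 tok | Peak        |
|--------------|---------|---------|---------|-----------|-----------|-------------|
| Mistral 7B   | 54.9%   | ~48%    | ~42%    | ~35%      | ~8%       | 54.9% @ 500 |
| Qwen 2.5 7B  | ~38%    | ~44%    | ~52%    | ~35%      | ~10%      | ~52% @ 900  |
| Llama 3.2 3B | ~28%    | ~25%    | ~22%    | ~18%      | ~5%       | ~28% @ 500  |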

Approximate values from evaluation report. The 2,800-token collapse was model-independent — all three lost 10–11 of 15 fields with 60–67% EMPTY parse rates.

◆ The chunk-size cliff

Every model collapses past ~2,000 tokens. At 2,800 tokens, EMPTY-parse rates hit 60–67% across all models. Because the degradation is model-independent, the chunk-size decision is separable from the model decision — a property of 7B-class models on dense legal text at long context, not any one model's quirk.

Qualitative rubric review

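| Model        | Chunk size | Overall (0–2) | Notes                                                  |
|--------------|------------|---------------|--------------------------------------------------------|
| Qwen 2.5 7B  | 900        | 1.87          | Highest single score in review                         |
| Mistral 7B   | 500        | 1.68          | Winner at 500 tokens; lowest variance (decision basis) |
| Mistral 7B   | 900        | 1.58          | Close second                                           |
| Llama 3.2 3B | 500        | ~1.2          | Last; recurring policy-domain mis-tag                  |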

Mistral — stability winner

Cross-configuration variance: ±0.10 between chunk sizes. No catastrophic failure mode. No saturation signature at any chunk size. For a 155-hour unattended run on a shifting corpus, predictability was weighted above peak score.

Qwen — peak quality, fragile

Peak score 1.87 at chunk_size=900, but cross-configuration variance of ±0.35, which is 3.5× Mistral's ±0.10. Known "None." failure mode at certain chunk sizes. Retained as a candidate for later revisit.

USE method — GPU resource analysis

Utilization, Saturation, Errors applied to the L4 GPU. USE characterizes infrastructure health — not output quality. Separate axes.

Utilization

The L4 ran effectively saturated on a single inference stream. Pre-flight checks confirmed the GPU was idle (0% utilization) before each run. A single stream leaves concurrency headroom unmeasured, and that gap was the primary motivation for the vLLM migration.

Saturation — one textbook case

The Qwen 2.5 × 1,200-token cell is a textbook saturation signature: p99 latency of 137s versus 13–16s for healthy cells, throughput collapsed 7–9×, and 32 of 165 chunks lost. The result reproduced three times. A p99 that detaches from the median while throughput collapses indicates work arriving faster than the resource can drain it.
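
A saturation check of this shape can be sketched as two ratios: tail latency against the median, and baseline throughput against observed throughput. The thresholds and the function names below are hypothetical, chosen only so that the pattern described above would trip the check. The sample latencies are synthetic, not the evaluation's raw data.

```python
from statistics import median, quantiles


def p99(latencies_s):
    """99th percentile; quantiles(n=100) returns 99 cut points, index 98 is p99."""
    return quantiles(latencies_s, n=100, method="inclusive")[98]


def looks_saturated(latencies_s, throughput, baseline_throughput,
                    tail_ratio_limit=5.0, throughput_drop_limit=3.0):
    """Flag a cell when p99 detaches from the median AND throughput collapses.

    Both limits are illustrative assumptions, not values from the evaluation.
    """
    tail_ratio = p99(latencies_s) / median(latencies_s)
    throughput_drop = baseline_throughput / throughput
    return tail_ratio > tail_ratio_limit and throughput_drop > throughput_drop_limit


# Synthetic examples: a healthy cell with a tight tail, and a cell whose
# tail latency and throughput both degrade sharply.
healthy = [10.0] * 95 + [14.0] * 5
degraded = [12.0] * 90 + [137.0] * 10

assert not looks_saturated(healthy, throughput=8.0, baseline_throughput=8.0)
assert looks_saturated(degraded, throughput=1.0, baseline_throughput=8.0)
```

Requiring both conditions keeps a single slow outlier from being reported as saturation when throughput is otherwise steady.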

Errors

Outside the Qwen 1,200-token saturation cell, errors were dominated by parse failures, not crashes. Parse status (OK / PARTIAL / EMPTY) was promoted to a first-class signal so this class of error is counted, not absorbed.
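
Treating parse status as a first-class signal amounts to classifying every model response and counting the outcomes. The sketch below assumes one plausible set of definitions: OK when every expected field is filled, EMPTY when the output is not a JSON object or no field is filled, PARTIAL otherwise. These definitions and the field names are assumptions; the evaluation's actual rules are not given in this document.

```python
import json
from collections import Counter
from enum import Enum


class ParseStatus(Enum):
    OK = "OK"
    PARTIAL = "PARTIAL"
    EMPTY = "EMPTY"


def classify(raw_output, expected_fields):
    """Classify one model response (assumed definitions, see lead-in)."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ParseStatus.EMPTY
    if not isinstance(parsed, dict):
        return ParseStatus.EMPTY
    # A field counts as filled unless it is missing, null, or an empty container/string.
    filled = [f for f in expected_fields if parsed.get(f) not in (None, "", [], {})]
    if not filled:
        return ParseStatus.EMPTY
    if len(filled) == len(expected_fields):
        return ParseStatus.OK
    return ParseStatus.PARTIAL


def parse_ok_rate(statuses):
    """Share of responses classified OK; every status is counted, none absorbed."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts[ParseStatus.OK] / total if total else 0.0


# Hypothetical field names, for illustration only.
FIELDS = ["sponsor", "policy_domain", "effective_date"]
outputs = [
    '{"sponsor": "A", "policy_domain": "tax", "effective_date": "2026-01-01"}',
    '{"sponsor": "A", "policy_domain": ""}',
    "None.",
    "{}",
]
statuses = [classify(o, FIELDS) for o in outputs]
print(parse_ok_rate(statuses))  # 0.25
```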

What Ollama could not expose

TTFT, ITL, KV cache utilization, and per-stream decode throughput were not measurable on Ollama's single-stream architecture. These metrics — used extensively in the May 2026 vLLM evaluation — were explicitly flagged as a known limitation at evaluation time.

What this evaluation did not establish

A characterization is only as credible as its statement of what it did not prove.

Empty inference fields — uniform across all models

Several fields came back empty for every model at every chunk size. A uniform failure points at the prompt or parser, not any model. Flagged as the highest-value pre-production action item. The May 2026 evaluation refined the extraction schema to 15 fields with clearer separation between binary and substantive fields.
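
The "uniform failure" test can be sketched as a set operation over per-cell fill rates: a field is suspect only if it is empty in every cell. The data layout, the function name, and the field names below are hypothetical, and the sample rates are synthetic.

```python
def uniformly_empty_fields(fill_rates):
    """Return fields whose fill rate is zero in every (model, chunk_size) cell.

    fill_rates maps (model, chunk_size) -> {field_name: fraction_filled}.
    A field missing from a cell's dict is treated as never filled there.
    """
    cells = list(fill_rates.values())
    all_fields = set().union(*cells)  # union of field names across cells
    return {
        field
        for field in all_fields
        if all(cell.get(field, 0.0) == 0.0 for cell in cells)
    }


# Synthetic example: "fiscal_impact" is empty everywhere, which points at the
# prompt or parser; "sponsor" is empty for one cell only, which points at that cell.
rates = {
    ("mistral", 500): {"sponsor": 0.9, "fiscal_impact": 0.0},
    ("qwen", 900): {"sponsor": 0.8, "fiscal_impact": 0.0},
    ("llama", 500): {"sponsor": 0.0, "fiscal_impact": 0.0},
}
print(uniformly_empty_fields(rates))  # {'fiscal_impact'}
```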

Single-stream, single-GPU — concurrency never exercised

Throughput numbers from this evaluation are a valid floor, not a capacity model. The May 2026 concurrency sweep found that vLLM at c=24 delivers 9× Ollama's throughput — a gain completely invisible here. Mistral's p99 explosion under concurrency (15s → 74s) was also invisible on Ollama.

Parameter-count mismatch in the Llama comparison

This evaluation compared Llama 3.2 3B against Mistral and Qwen at 7B. A 3B model finishing behind 7B models is expected and should not be interpreted as a statement about the Llama model family. When the May 2026 re-evaluation upgraded to Llama 3.1 8B for parameter parity, Llama finished 37+ percentage points above Mistral's original winning parse-OK score.

No ground-truth labels

Rubric scores are expert judgments, not measurements against a gold standard. The May 2026 evaluation addressed this with two independent Opus 4.8 passes (high effort then max effort), finding 60/60 agreement and 0 revisions.

Why the May 2026 evaluation reversed the recommendation
◆ Superseded — May 2026 vLLM Re-evaluation

Change 1 — Backend: Ollama → vLLM. Ollama cannot expose per-stream TTFT, ITL, KV cache utilization, or real concurrency behavior. Moving to vLLM 0.6.6 made those measurements possible for the first time. The most consequential discovery: Mistral's p99 latency grows 5× under concurrency (15s → 74s), driven by KV cache head-of-line blocking from 3,000-token prompts. This disqualifying failure mode was completely invisible to the Ollama evaluation.

Change 2 — Model: Llama 3.2 3B → Llama 3.1 8B. Upgrading to a parameter-parity model changed the answer. Llama 3.1 8B at chunk_size=1,200 achieves 92.5% parse-OK — 37.6pp above Mistral's original winning score of 54.9%. The original evaluation was not wrong to rank Llama last — a 3B model losing to 7B models is expected. It was wrong to draw conclusions about the Llama family from that result.

Neither change was a methodology failure — both were resource constraints at evaluation time. The lesson: the answer you get depends on the question you can ask, and the question you can ask depends on your measurement stack.

● Full current evaluation

The complete May 2026 vLLM evaluation (six experiments, 126 cells, qualitative review, SLO analysis) is documented in the main case study. Current recommendation: Llama 3.1 8B AWQ at chunk_size=1,200, concurrency=12–24.