Two rounds of scoring on real Illinois bill extractions: first across models, then within each model across chunk sizes. Both rounds ask whether the parse-OK rate tracks substantive extraction quality. The short answer: yes across models, no within models. Here is every chunk, every model, and the rationale behind every score.
01 / Question
The cross-model question: when parse-OK says configuration A scores higher than B, does substantive quality agree? Answer: yes — confirmed on all 60 cells.
llama31_1200 — Llama 3.1 8B AWQ · 1200-token chunks. Parse-OK peak: 92.5%.
qwen_500 — Qwen 2.5 7B AWQ · 500-token chunks. Best Qwen parse-OK at time of review: 62.5%.
mistral_3000 — Mistral 7B AWQ · 3000-token chunks. Parse-OK peak: 72.0%.
Each profile pairs a model with a chunk size, so the amount of bill text the model sees per extraction is part of what's being evaluated.
Pass 1 — Claude Opus 4.8 at high effort. Scored each profile's extractions against source text, blind to parse_status. 60 cells.
Pass 2 — Claude Opus 4.8 at max effort, fresh context, no access to pass-1 scores. Independent re-evaluation of all 60 cells.
Result: 60/60 agreement. Zero cells revised. Pass 2 found the scoring internally consistent and rubric-aligned.
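The agreement figure can be reproduced with a simple cell-by-cell comparison. The sketch below is illustrative only: it assumes each pass is stored as a mapping from a (profile, chunk) key to an (Accuracy, Specificity, Completeness) triple, which is a hypothetical layout and not necessarily how the review stored its scores.

```python
from typing import Dict, Tuple

# Hypothetical layout: one cell per (profile name, chunk number).
Cell = Tuple[str, int]
# One score triple per cell: (accuracy, specificity, completeness).
Score = Tuple[int, int, int]


def agreement(pass1: Dict[Cell, Score], pass2: Dict[Cell, Score]) -> Tuple[int, int]:
    """Return (matching_cells, total_cells) for two independent scoring passes."""
    if pass1.keys() != pass2.keys():
        raise ValueError("the two passes did not score the same set of cells")
    matches = sum(1 for cell in pass1 if pass1[cell] == pass2[cell])
    return matches, len(pass1)


def revised_cells(pass1: Dict[Cell, Score], pass2: Dict[Cell, Score]) -> list:
    """List the cells where pass 2 disagrees with pass 1, in sorted order."""
    return sorted(cell for cell in pass1 if pass1[cell] != pass2.get(cell))
```

With three profiles and 20 chunks each, `agreement` returns a total of 60, and a 60/60 result corresponds to `revised_cells` returning an empty list.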
◆ Critical methodological choice
Scoring used chunk-fidelity: did the model extract this chunk correctly? Not whole-bill understanding. At Chunk 8, mistral_3000's large context window produces a highly specific answer about a different section of the same bill. Under chunk-fidelity that is Accuracy:1. Under whole-bill summarization it would flip to 2. Both passes flagged this as the single call the result turns on.
02 / Rubric
Per-cell average = (Accuracy + Specificity + Completeness) / 3.
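As a minimal sketch, the per-cell formula translates directly into code. The `profile_mean` helper is an assumption: the report does not state how per-cell averages are rolled up into a profile's aggregate, so a plain mean over chunks is used here for illustration.

```python
from statistics import mean
from typing import Iterable, Tuple

Score = Tuple[float, float, float]  # (accuracy, specificity, completeness)


def cell_average(accuracy: float, specificity: float, completeness: float) -> float:
    """Per-cell average as defined by the rubric: the mean of the three dimensions."""
    return (accuracy + specificity + completeness) / 3


def profile_mean(scores: Iterable[Score]) -> float:
    """Assumed aggregate: the mean of per-cell averages across a profile's chunks."""
    return mean(cell_average(*score) for score in scores)
```

For example, a cell scored Accuracy 1, Specificity 2, Completeness 0 averages to 1.0.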
03 / Results
Aggregate scores across all 20 chunks. Pass 1 and pass 2 agree on all 60 cells.
◆ The finding
Context-window size interacts with extraction fidelity. The 3000-token config imports content from other sections of the same bill and loses the chunk's headline; the 1200-token config stays on-chunk most reliably; the 500-token config is faithful but too compressed to be complete. On six self-contained single-amendment bills all three converge near 2.0 — on multi-section bills, mistral collapses while llama holds. This is a property of the configurations (model + chunk size together), not the models alone.
04 / The Review
Expand any chunk to see the source text, each model's full extraction across all 11 fields, and the pass-1 score with rationale confirmed by pass 2.
01 / Question
Phase 1 confirmed cross-model reliability. Phase 2 asks a harder question: when parse-OK says chunk size X is better than Y for the same model, does substantive quality agree?
◆ Answer: No
In every informative pair, the smaller chunk size scored ≥ the larger. The parse-OK peak is not the quality peak for any of the three models. The mechanism is consistent: larger chunks cause off-section drift, producing parse-OK-invisible failures — the model populates all 11 fields with confident, specific answers about a different section of the same bill.
This does not change the production recommendation. The cross-model ranking (Phase 1, 60/60) is unaffected. Llama leads by 20+ percentage points. What this finding adds: the proxy's within-model reliability is now characterized — directional for Llama, reversed for Qwen and Mistral.
Each model was scored at the chunk sizes that bracket its parse-OK peak. Smaller size first.
Llama: 900 · 1200 · 2000 — scored 900 vs 2000 (1200 is the peak, not directly scored)
Qwen: 2000 · 3000 · 4000 — scored 2000 vs 3000 and 3000 vs 4000
Mistral: 2000 · 3000 · 5000 — scored 2000 vs 3000 and 3000 vs 5000
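Each scored pair can be read as a direction check: does the chunk size with the higher parse-OK rate also have the higher quality score? The function below is a hypothetical sketch of that check. The name `compare_pair`, the three labels, and the choice to label any exact tie separately are assumptions made for illustration, not the review's own definitions.

```python
def compare_pair(parse_ok_small: float, parse_ok_large: float,
                 quality_small: float, quality_large: float) -> str:
    """Classify one (smaller chunk, larger chunk) pair for a single model.

    Returns "agree" if parse-OK and quality move in the same direction,
    "reversed" if they move in opposite directions, and "tie" if either
    measure is identical at both chunk sizes.
    """
    proxy_diff = parse_ok_large - parse_ok_small
    quality_diff = quality_large - quality_small
    if proxy_diff == 0 or quality_diff == 0:
        return "tie"
    same_direction = (proxy_diff > 0) == (quality_diff > 0)
    return "agree" if same_direction else "reversed"
```

A pair where parse-OK rises with chunk size while the quality average falls is classified as "reversed", which is the pattern reported here for Qwen and Mistral.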
A larger chunk gives the model more bill context. Instead of staying on the displayed chunk, it anchors on an adjacent section — and produces specific, fully-populated JSON about the wrong part of the bill. Parse-OK validates the shape; it cannot see the mis-reference.
Additionally, Qwen emits structurally valid JSON whose fields all contain the placeholder "None.", which passes parse-OK while carrying no information.
02 / All Profiles
Select a profile to see its scores and all 20 expandable chunks. Profiles are grouped by model so you can read across chunk sizes naturally.
03 / Cross-Cutting Patterns
Patterns confirmed across all scored pairs.
Off-section drift appears in every model at every larger chunk size. The model gives a confident, specific answer about a neighboring section: correct about the bill, wrong about this chunk. Accuracy and Completeness drop; Specificity stays high because the answer is specific, just misattributed.
At chunk_size=3000, Qwen returns parse_status=OK and missing=0 while all 11 fields contain the string "None." Chunk 2 of the Qwen 2000 vs 3000 pair is the clearest single case: fully valid JSON, substantively empty. Parse-OK cannot see this.
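A content-level check can catch this case where a shape-only check cannot. The sketch below is one possible guard, not part of the reviewed pipeline: the placeholder set and the function names are assumptions, and the real extraction schema's field names are not used.

```python
import json

# Assumed placeholder spellings; only "None." is attested in the review.
PLACEHOLDERS = {"", "none", "none.", "n/a", "null"}


def _is_placeholder(value) -> bool:
    """True if a single field value carries no information."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in PLACEHOLDERS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def is_substantively_empty(raw_json: str) -> bool:
    """True if the JSON parses but every field is a placeholder."""
    record = json.loads(raw_json)
    if not isinstance(record, dict) or not record:
        return True
    return all(_is_placeholder(value) for value in record.values())
```

An extraction with 11 fields that all read "None." parses cleanly and reports no missing fields, yet `is_substantively_empty` returns True for it. Note that this check does nothing for off-section drift, where the fields are populated with real but misattributed content.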
In every pair the quality gap is concentrated in Completeness. Accuracy drops slightly; Specificity barely moves (drifted answers stay specific). But Completeness collapses because the displayed chunk's actual content is never extracted.
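To see where a pair's gap sits, the per-dimension means can be compared directly. This is a minimal sketch assuming each chunk's score is a dict keyed by dimension name, which is a hypothetical layout chosen for readability.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("accuracy", "specificity", "completeness")


def dimension_gaps(small: List[Dict[str, float]],
                   large: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean score at the smaller chunk size minus the larger, per dimension.

    A positive value means the smaller chunk size scored higher.
    """
    return {
        dim: mean(cell[dim] for cell in small) - mean(cell[dim] for cell in large)
        for dim in DIMENSIONS
    }


def largest_gap(gaps: Dict[str, float]) -> str:
    """Name of the dimension with the largest gap in favor of the smaller size."""
    return max(gaps, key=gaps.get)
```

Under the pattern described above, `largest_gap` would return "completeness" for each pair, with the specificity gap near zero.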
Both qwen_3000 and mistral_3000 drift to the same wrong section on the identical HB4820 general-definitions chunk — a veterans-housing operative provision. The off-section-drift mechanism is shared across model families at the larger chunk size.