the proof ledger
Measured, never estimated.
In plain terms: we tested whether an AI writes cheaper, more reliable code in curt than in other languages. The honest answer is mixed — and it's all here. Every claim carries a receipt — a script that reproduces the number exactly — and the wins and the losses share the same table.
| Claim | Measured | Reproduce |
|---|---|---|
| Token cost vs Python | 1.10× median (n=22) | tools/tokens ↗ |
| Token cost vs Go / Rust | 2.34× / 2.63× cheaper | tools/tokens ↗ |
| Concurrent TCP echo server | 32 vs 55 · 94 · 123 | DESIGN.md ↗ |
| Grammar-masked generation | 0 / 200 violations | grammar/DEMO ↗ |
| vs Zerolang — claude-sonnet | curt wins 3/3 · 7.7× $ | VS-ZERO ↗ |
| vs Zerolang — claude-haiku | split: 78% vs 89% | VS-ZERO ↗ |
| Fix synthesis (1-turn repair) | 24/32 · 0% wrong-payload | BENCHMARK ↗ |
| Python wins small tasks | honest · 54/54 | VS-ZERO ↗ |
The Zerolang head-to-head
A pre-registered showdown (frozen before any lane was generated): nine shared RosettaCode tasks, each language carrying its own canonical docs, single shot plus up to two repair turns, Python control, two models × three samples. 162 cells, $0.77, all transcripts committed.
| model | lang | solved | $/solved | median out-tok | turns |
|---|---|---|---|---|---|
| haiku | curt | 21/27 | $0.0062 | 33 | 1.24 |
| haiku | zero | 24/27 | $0.0050 | 168 | 1.00 |
| haiku | python | 27/27 | $0.0005 | 37 | 1.00 |
| sonnet | curt | 27/27 | $0.0020 | 35 | 1.00 |
| sonnet | zero | 27/27 | $0.0156 | 168 | 1.15 |
| sonnet | python | 27/27 | $0.0011 | 40 | 1.00 |
Split decision, reported in full. On claude-sonnet, curt won all three pre-registered axes — 7.7× cheaper per solved task at perfect success parity, ~5× leaner output (35 vs 168 tokens), faster convergence. On claude-haiku it missed success parity (78% vs 89%), which voids the dollar axis by rule. And the Python control won every cost axis on both models — on small, well-trained tasks, no agent language beats the incumbent. curt's losses are filed as roadmap work with measured targets.
The diagnostics tournament
curt's prose error hint vs Zerolang's typed-repair-identifier design, tournamented on curt's own repair corpus — 32 toolchain-verified broken programs, each rendered four ways, fed to claude-haiku for repair. The verdict rule was pre-registered before any API call.
| rendering | diag o200k | turn-1 | final | lane $ |
|---|---|---|---|---|
| A — shipped prose hint | 38.4 | 18/32 | 21/32 | $0.152 |
| B — typed (Zero-style) | 43.3 | 21/32 | 25/32 | $0.143 |
| C — hybrid | 60.3 | 22/32 | 26/32 | $0.139 |
| D — typed + replacement payload | 82.8 | 32/32 | 32/32 | $0.107 |
curt adopted the winner the same day. Typed fields beat the prose hint by +9.4pp turn-1 at 1.13× tokens, so curt's diagnostics now emit typed want/got fields plus a stable repair{id,summary} — at single-line economy (~44 tokens vs Zero's captured 114). Row D is the prize: a machine-applicable replacement repairs 32/32 single-shot, because turn count dominates diagnostic size in loop dollars.
Reproduce any number: .ci-venv/bin/python tools/bench/headtohead.py report · tools/bench/diag_tourney.py report · tools/tokens/count.py — each deterministic over committed transcripts.