Momentum Federal

Eval dashboard

Held-out metrics for the two fine-tunes, broken out by data split.

Gold benchmark — production-truth

8 opps · locked · stricter gates (agreement ≥ 99%, RMSE_fit < 5, latency ≤ 3s)

No gold benchmark runs yet

8 opps locked in gold. The next call to /api/training/checkpoints/[id]/promote on a SCORING checkpoint runs the gold benchmark automatically and lands a row here.
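The promotion gates above amount to a three-condition predicate. The sketch below is illustrative only: the interface and function names are assumptions, not the actual promote-route code.

```typescript
// Illustrative gate check for gold-benchmark promotion.
// Field names are assumptions; the real promote route may differ.
interface GoldRun {
  agreement: number; // qualified-agreement fraction, 0..1
  rmseFit: number;   // RMSE on the Fit score
  latencyMs: number; // mean FT latency in milliseconds
}

function passesGoldGates(run: GoldRun): boolean {
  return (
    run.agreement >= 0.99 && // agreement ≥ 99%
    run.rmseFit < 5 &&       // RMSE_fit < 5
    run.latencyMs <= 3000    // latency ≤ 3s
  );
}
```

Under these gates, the latest validation run (agreement 98%, RMSE fit 7.1, mean latency 3599ms) would not promote.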

FT vs Claude on held-out validation

Validation set: 14 opps · pnpm tsx scripts/benchmark-ft-vs-claude.ts

Latest run · 5/15/2026, 4:21:26 AM · n=160

RMSE fit 7.1 · RMSE tech 8.3 · RMSE $ 12.9

MAE fit 4.6 · MAE tech 5.4 · MAE $ 9.5

Qualified agreement 98% · FT qualified 8 · Claude qualified 10 · Mean FT latency 3599ms
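The RMSE and MAE columns are presumably computed per score dimension (fit, tech, $) over FT predictions against Claude's held-out labels; these helpers are a minimal sketch of that arithmetic, not the actual code in scripts/benchmark-ft-vs-claude.ts.

```typescript
// Root-mean-square error: sqrt of the mean squared difference.
function rmse(pred: number[], truth: number[]): number {
  const sumSq = pred.reduce((acc, p, i) => acc + (p - truth[i]) ** 2, 0);
  return Math.sqrt(sumSq / pred.length);
}

// Mean absolute error: mean of the absolute differences.
function mae(pred: number[], truth: number[]): number {
  const sumAbs = pred.reduce((acc, p, i) => acc + Math.abs(p - truth[i]), 0);
  return sumAbs / pred.length;
}
```

RMSE penalizes large misses more than MAE, which is why RMSE fit (7.1) sits above MAE fit (4.6) on the same run.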

History (last 3)

  • 5/15/2026 · RMSE fit 7.1 · agree 98%
  • 5/15/2026 · RMSE fit 11.3 · agree 92%
  • 5/15/2026 · RMSE fit 11.2 · agree 94%

Shadow inference health (last 24h)

Live counts from /api/screen — backed by the ScoreRunError table

ft errors 1 · base errors 1 · claude errors 0

FT-Claude Δ ≥ 15: 8 (opps where FT and Claude disagree on Fit by 15+ points)
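The Δ ≥ 15 count is a simple filter over paired scores. The row shape here is an assumption for illustration, not the actual shadow-inference schema.

```typescript
// Hypothetical paired-score row; the real schema may differ.
interface OppScores {
  oppId: string;
  ftFit: number;     // FT model's Fit score
  claudeFit: number; // Claude's Fit score
}

// Opps where the two arms disagree on Fit by at least `delta` points.
function bigDisagreements(rows: OppScores[], delta = 15): OppScores[] {
  return rows.filter(r => Math.abs(r.ftFit - r.claudeFit) >= delta);
}
```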

Behavioral drift (30-day rolling)

8 snapshots · run pnpm tsx scripts/compute-behavioral-snapshot.ts nightly
Date        Arm     n    Mean fit  Mean tech  Mean $  Qualified rate  Mean risk flags  Rationale chars
2026-05-15  base    77   36.7      42.1       51.9    13%             6.7              645
2026-05-15  claude  530  13.3      12.8       36.9    3%              7.9              557
2026-05-15  ft      84   20.6      22.3       37.8    0%              5.8              426
2026-05-14  base    16   61.5      67.7       67.9    81%             6.4              675
2026-05-14  claude  294  21.5      27.1       45.0    13%             7.4              590
2026-05-14  ft      8    37.5      43.8       52.9    13%             5.8              440
2026-05-13  base    3    63.0      65.3       63.3    100%            6.3              722
2026-05-13  claude  117  22.1      24.6       37.3    0%              7.8              622
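Each snapshot row is a per-arm, per-day aggregate. This is a sketch of the aggregation the nightly script presumably performs; the record shape and field names are assumptions, not the actual compute-behavioral-snapshot.ts code.

```typescript
// Hypothetical per-opp score record; the real schema may differ.
interface ScoreRow {
  arm: "ft" | "base" | "claude";
  fit: number;
  qualified: boolean;
}

// Aggregate one arm's rows into a snapshot-style summary.
function snapshot(rows: ScoreRow[], arm: ScoreRow["arm"]) {
  const sub = rows.filter(r => r.arm === arm);
  const n = sub.length;
  return {
    arm,
    n,
    meanFit: sub.reduce((s, r) => s + r.fit, 0) / n,
    qualifiedRate: sub.filter(r => r.qualified).length / n,
  };
}
```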

Scoring (Qwen3-4B-Instruct + LoRA)

Split   n   Fit MAE ↓  Tech MAE ↓  Threshold-70 acc ↑  Risk Jaccard ↑  Bad JSON
ft      11  2.00       2.73        1.00                0.01            0
claude  11  1.00       1.64        1.00                0.03            0
base    11  18.00      28.55       1.00                0.01            0
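The Risk Jaccard column presumably measures set overlap between predicted and reference risk flags, |A ∩ B| / |A ∪ B|. A minimal sketch, assuming flags are string labels:

```typescript
// Jaccard similarity between two risk-flag sets: |A ∩ B| / |A ∪ B|.
function riskJaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // both empty: perfect match
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return intersection / union;
}
```

For example, {price-risk, staffing} vs {staffing, schedule} share one of three distinct flags, giving 0.33.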

Drafting (Qwen3-30B-A3B-Instruct + LoRA)

Split / arm       n  FK ↓  Factual ↑  Plain ↑  Complete ↑  Win-theme ↑
holdout / ft      9  12.4  3.62       3.25     2.00        3.62
holdout / claude  9  11.3  3.89       3.00     2.00        4.00
holdout / base    9  12.0  3.67       2.67     2.00        3.50