Eval dashboard
Held-out metrics for the two fine-tunes, broken out by data split.
Gold benchmark — production-truth
8 opps · locked · stricter gates (agreement ≥99%, RMSE_fit <5, latency ≤3s)
No gold benchmark runs yet
8 opps locked in gold. The next /api/training/checkpoints/[id]/promote on a SCORING checkpoint runs the gold benchmark automatically and lands a row here.
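The gold gates above amount to a pass/fail check on three numbers. The following is a minimal sketch of that check, not the promote endpoint's actual logic: the `BenchmarkRun` shape, the field names, and the `failedGoldGates` helper are all hypothetical. Only the thresholds come from this page.

```typescript
// Hypothetical shape of one benchmark run; field names are assumptions.
interface BenchmarkRun {
  agreementPct: number; // qualified agreement, in percent
  rmseFit: number; // RMSE on the fit score
  latencyMs: number; // mean FT latency, in milliseconds
}

// Thresholds as stated on the dashboard: agreement >= 99%, RMSE_fit < 5, latency <= 3s.
const GOLD_GATES = {
  minAgreementPct: 99,
  rmseFitUpperBound: 5, // exclusive bound
  maxLatencyMs: 3000,
};

// Returns the names of the gates a run fails; an empty array means it passes.
function failedGoldGates(run: BenchmarkRun): string[] {
  const failures: string[] = [];
  if (run.agreementPct < GOLD_GATES.minAgreementPct) failures.push("agreement");
  if (!(run.rmseFit < GOLD_GATES.rmseFitUpperBound)) failures.push("rmse_fit");
  if (run.latencyMs > GOLD_GATES.maxLatencyMs) failures.push("latency");
  return failures;
}

// Illustration only: the latest held-out validation figures shown on this page
// (agreement 98%, RMSE fit 7.1, mean FT latency 3599ms), fed through the gold gates.
const latestValidation: BenchmarkRun = { agreementPct: 98, rmseFit: 7.1, latencyMs: 3599 };
const validationFailures = failedGoldGates(latestValidation);
console.log(validationFailures);
```

Returning the list of failed gates instead of a single boolean makes it easier to show which gate blocked a promotion.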
FT vs Claude on held-out validation
Validation set: 14 opps · pnpm tsx scripts/benchmark-ft-vs-claude.ts
Latest run · 5/15/2026, 4:21:26 AM · n=160
RMSE fit ↓
7.1
RMSE tech ↓
8.3
RMSE $ ↓
12.9
MAE fit ↓
4.6
MAE tech ↓
5.4
MAE $ ↓
9.5
Qualified agreement 98% · FT qualified 8 · Claude qualified 10 · Mean FT latency 3599ms
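For reference, the error and agreement figures above can be sketched as follows. This assumes Claude's scores are the reference and FT's are the predictions, and that "qualified agreement" is the share of opps where both arms give the same qualified flag. Neither assumption is confirmed by this page, and the function names and sample numbers are illustrative only.

```typescript
// Throws unless both arrays are non-empty and the same length.
function assertPaired(a: unknown[], b: unknown[]): void {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("inputs must be non-empty and the same length");
  }
}

// Root-mean-square error: squares each miss, so large misses dominate.
function rmse(pred: number[], ref: number[]): number {
  assertPaired(pred, ref);
  const sumSq = pred.reduce((acc, p, i) => acc + (p - ref[i]) ** 2, 0);
  return Math.sqrt(sumSq / pred.length);
}

// Mean absolute error: the average size of a miss.
function mae(pred: number[], ref: number[]): number {
  assertPaired(pred, ref);
  const sumAbs = pred.reduce((acc, p, i) => acc + Math.abs(p - ref[i]), 0);
  return sumAbs / pred.length;
}

// Fraction of opps where both arms give the same qualified / not-qualified flag.
function agreementRate(a: boolean[], b: boolean[]): number {
  assertPaired(a, b);
  return a.filter((flag, i) => flag === b[i]).length / a.length;
}

// Made-up scores for three opps, purely to exercise the functions.
const ftFit = [10, 20, 30];
const claudeFit = [13, 16, 30];
const sampleRmse = rmse(ftFit, claudeFit);
const sampleMae = mae(ftFit, claudeFit);
const sampleAgreement = agreementRate([true, false, false], [true, false, true]);
console.log(sampleRmse, sampleMae, sampleAgreement);
```

RMSE is never smaller than MAE on the same errors, which matches the tiles above (7.1 vs 4.6 on fit). A wide gap between the two suggests a few large misses rather than uniform error.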
History (last 3)
- 5/15/2026 · RMSE fit 7.1 · agree 98%
- 5/15/2026 · RMSE fit 11.3 · agree 92%
- 5/15/2026 · RMSE fit 11.2 · agree 94%
Shadow inference health (last 24h)
Live counts from /api/screen, backed by the ScoreRunError table
ft errors
1
base errors
1
claude errors
0
FT-Claude Δ ≥ 15
8
opps where FT & Claude disagree on Fit by 15+
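The disagreement tile can be read as a filter over paired scores. A minimal sketch, assuming each shadow-scored opp carries both arms' Fit scores: the `ShadowScore` shape and its field names are hypothetical, and only the 15-point threshold comes from this page.

```typescript
// Hypothetical record for one opp scored by both arms.
interface ShadowScore {
  oppId: string;
  ftFit: number;
  claudeFit: number;
}

// Returns the opps whose FT and Claude Fit scores differ by at least `threshold` points.
function largeFitDisagreements(scores: ShadowScore[], threshold = 15): ShadowScore[] {
  return scores.filter((s) => Math.abs(s.ftFit - s.claudeFit) >= threshold);
}

// Made-up scores, purely to exercise the filter.
const sampleScores: ShadowScore[] = [
  { oppId: "a", ftFit: 40, claudeFit: 55 }, // gap of exactly 15: counted
  { oppId: "b", ftFit: 70, claudeFit: 56 }, // gap of 14: not counted
  { oppId: "c", ftFit: 10, claudeFit: 30 }, // gap of 20: counted
];
const flagged = largeFitDisagreements(sampleScores);
console.log(flagged.map((s) => s.oppId));
```

Returning the matching opps rather than only a count keeps the ids available for manual review.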
Behavioral drift (30-day rolling)
8 snapshots · run pnpm tsx scripts/compute-behavioral-snapshot.ts nightly

| Date | Arm | n | Mean fit | Mean tech | Mean $ | Qualified rate | Mean risk flags | Rationale chars |
|---|---|---|---|---|---|---|---|---|
| 2026-05-15 | base | 77 | 36.7 | 42.1 | 51.9 | 13% | 6.7 | 645 |
| 2026-05-15 | claude | 530 | 13.3 | 12.8 | 36.9 | 3% | 7.9 | 557 |
| 2026-05-15 | ft | 84 | 20.6 | 22.3 | 37.8 | 0% | 5.8 | 426 |
| 2026-05-14 | base | 16 | 61.5 | 67.7 | 67.9 | 81% | 6.4 | 675 |
| 2026-05-14 | claude | 294 | 21.5 | 27.1 | 45.0 | 13% | 7.4 | 590 |
| 2026-05-14 | ft | 8 | 37.5 | 43.8 | 52.9 | 13% | 5.8 | 440 |
| 2026-05-13 | base | 3 | 63.0 | 65.3 | 63.3 | 100% | 6.3 | 722 |
| 2026-05-13 | claude | 117 | 22.1 | 24.6 | 37.3 | 0% | 7.8 | 622 |
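
One way to read the table above is as a per-arm time series of day-over-day changes. The sketch below computes the change in mean fit for one arm, using the rows shown on this page. The `Snapshot` shape and the `meanFitDeltas` helper are hypothetical and are not taken from `compute-behavioral-snapshot.ts`.

```typescript
type Arm = "base" | "claude" | "ft";

// Hypothetical subset of one snapshot row; values below are copied from the table.
interface Snapshot {
  date: string; // ISO date, so string order equals chronological order
  arm: Arm;
  n: number;
  meanFit: number;
}

interface Delta {
  from: string;
  to: string;
  delta: number; // change in mean fit between the two snapshots
  minN: number; // smaller sample size of the pair, as a rough noise indicator
}

const snapshots: Snapshot[] = [
  { date: "2026-05-15", arm: "base", n: 77, meanFit: 36.7 },
  { date: "2026-05-15", arm: "claude", n: 530, meanFit: 13.3 },
  { date: "2026-05-15", arm: "ft", n: 84, meanFit: 20.6 },
  { date: "2026-05-14", arm: "base", n: 16, meanFit: 61.5 },
  { date: "2026-05-14", arm: "claude", n: 294, meanFit: 21.5 },
  { date: "2026-05-14", arm: "ft", n: 8, meanFit: 37.5 },
  { date: "2026-05-13", arm: "base", n: 3, meanFit: 63.0 },
  { date: "2026-05-13", arm: "claude", n: 117, meanFit: 22.1 },
];

// Day-over-day change in mean fit for a single arm, oldest pair first.
function meanFitDeltas(rows: Snapshot[], arm: Arm): Delta[] {
  const series = rows
    .filter((r) => r.arm === arm)
    .sort((a, b) => a.date.localeCompare(b.date));
  const deltas: Delta[] = [];
  for (let i = 1; i < series.length; i++) {
    const prev = series[i - 1];
    const curr = series[i];
    deltas.push({
      from: prev.date,
      to: curr.date,
      delta: curr.meanFit - prev.meanFit,
      minN: Math.min(prev.n, curr.n),
    });
  }
  return deltas;
}

const ftDeltas = meanFitDeltas(snapshots, "ft");
const claudeDeltas = meanFitDeltas(snapshots, "claude");
console.log(ftDeltas, claudeDeltas);
```

The `minN` field is there because several rows rest on very few opps (n=3 and n=8). A large delta between such rows may reflect sampling noise rather than behavioral drift.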
Scoring (Qwen3-4B-Instruct + LoRA)
| Arm | n | Fit MAE ↓ | Tech MAE ↓ | Threshold-70 acc ↑ | Risk Jaccard ↑ | Bad JSON |
|---|---|---|---|---|---|---|
| ft | 11 | 2.00 | 2.73 | 1.00 | 0.01 | 0 |
| claude | 11 | 1.00 | 1.64 | 1.00 | 0.03 | 0 |
| base | 11 | 18.00 | 28.55 | 1.00 | 0.01 | 0 |
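
The two less familiar columns above can be sketched as follows. This assumes "Risk Jaccard" is the Jaccard similarity between predicted and reference risk-flag sets, and that "Threshold-70 acc" is the share of opps where prediction and reference fall on the same side of a score of 70. Both readings, the `>=` comparison, and the function names are assumptions.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over de-duplicated flags.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  // Convention chosen here (an assumption): two empty sets count as identical.
  if (setA.size === 0 && setB.size === 0) return 1;
  const intersection = Array.from(setA).filter((x) => setB.has(x)).length;
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

// Share of items where pred and ref land on the same side of the threshold.
function thresholdAccuracy(pred: number[], ref: number[], threshold = 70): number {
  if (pred.length === 0 || pred.length !== ref.length) {
    throw new Error("pred and ref must be non-empty and the same length");
  }
  const matches = pred.filter((p, i) => p >= threshold === ref[i] >= threshold).length;
  return matches / pred.length;
}

// Made-up inputs, purely to exercise the functions.
const sampleJaccard = jaccard(["pricing", "timeline"], ["timeline", "staffing"]);
const sampleAcc = thresholdAccuracy([72, 65, 80], [75, 71, 90]);
// Scores can be far apart yet all on the same side of 70.
const farButSameSide = thresholdAccuracy([20, 30, 40], [45, 50, 60]);
console.log(sampleJaccard, sampleAcc, farButSameSide);
```

The last case shows one way a row can combine a large Fit MAE with a threshold accuracy of 1.00: every score misses by a wide margin but none crosses the cutoff.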
Drafting (Qwen3-30B-A3B-Instruct + LoRA)
| Split / arm | n | FK ↓ | Factual ↑ | Plain ↑ | Complete ↑ | Win-theme ↑ |
|---|---|---|---|---|---|---|
| holdout / ft | 9 | 12.4 | 3.62 | 3.25 | 2.00 | 3.62 |
| holdout / claude | 9 | 11.3 | 3.89 | 3.00 | 2.00 | 4.00 |
| holdout / base | 9 | 12.0 | 3.67 | 2.67 | 2.00 | 3.50 |
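
If the FK column is the Flesch-Kincaid grade level (an assumption; the page does not expand the abbreviation), it comes from the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below takes pre-computed counts, because syllable counting is tokenizer-dependent and is not shown here. The function name and sample counts are illustrative.

```typescript
// Flesch-Kincaid grade level from raw counts; lower means easier to read.
function fleschKincaidGrade(words: number, sentences: number, syllables: number): number {
  if (words <= 0 || sentences <= 0) {
    throw new Error("words and sentences must be positive");
  }
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

// Made-up counts: 100 words in 5 sentences with 150 syllables.
const sampleGrade = fleschKincaidGrade(100, 5, 150);
console.log(sampleGrade.toFixed(2));
```

Under that reading, the values of 11 to 12 in the table correspond roughly to late high-school reading level.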