Momentum Federal

Eval dashboard

Held-out metrics for the two fine-tunes, broken out by data split.

Gold benchmark — production-truth

8 opps · locked · stricter gates (agreement ≥ 99%, RMSE_fit < 5, latency ≤ 3s)

No gold benchmark runs yet

8 opps locked in gold. The next call to /api/training/checkpoints/[id]/promote on a SCORING checkpoint runs the gold benchmark automatically and lands a row here.
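The promotion gates above amount to a three-condition predicate. The sketch below is illustrative only: the interface and function names are assumptions, not the actual promote-route code.

```typescript
// Illustrative gate check for gold-benchmark promotion.
// Field names are assumptions; the real promote route may differ.
interface GoldRun {
  agreement: number; // qualified-agreement fraction, 0..1
  rmseFit: number;   // RMSE on the Fit score
  latencyMs: number; // mean FT latency in milliseconds
}

function passesGoldGates(run: GoldRun): boolean {
  return (
    run.agreement >= 0.99 && // agreement ≥ 99%
    run.rmseFit < 5 &&       // RMSE_fit < 5
    run.latencyMs <= 3000    // latency ≤ 3s
  );
}
```

Under these gates, the latest validation run (agreement 98%, RMSE fit 7.1, mean latency 3599ms) would not promote.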

FT vs Claude on held-out validation

Validation set: 14 opps · pnpm tsx scripts/benchmark-ft-vs-claude.ts

Latest run · 5/15/2026, 4:21:26 AM · n=160

RMSE fit 7.1 · RMSE tech 8.3 · RMSE $ 12.9

MAE fit 4.6 · MAE tech 5.4 · MAE $ 9.5

Qualified agreement 98% · FT qualified 8 · Claude qualified 10 · Mean FT latency 3599ms
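The RMSE and MAE columns are presumably computed per score dimension (fit, tech, $) over FT predictions against Claude's held-out labels; these helpers are a minimal sketch of that arithmetic, not the actual code in scripts/benchmark-ft-vs-claude.ts.

```typescript
// Root-mean-square error: sqrt of the mean squared difference.
function rmse(pred: number[], truth: number[]): number {
  const sumSq = pred.reduce((acc, p, i) => acc + (p - truth[i]) ** 2, 0);
  return Math.sqrt(sumSq / pred.length);
}

// Mean absolute error: mean of the absolute differences.
function mae(pred: number[], truth: number[]): number {
  const sumAbs = pred.reduce((acc, p, i) => acc + Math.abs(p - truth[i]), 0);
  return sumAbs / pred.length;
}
```

RMSE penalizes large misses more than MAE, which is why RMSE fit (7.1) sits above MAE fit (4.6) on the same run.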

History (last 3)

  • 5/15/2026 · RMSE fit 7.1 · agree 98%
  • 5/15/2026 · RMSE fit 11.3 · agree 92%
  • 5/15/2026 · RMSE fit 11.2 · agree 94%

Shadow inference health (last 24h)

Live counts from /api/screen — backed by the ScoreRunError table

ft errors 1 · base errors 1 · claude errors 0

FT-Claude Δ ≥ 15: 8 (opps where FT and Claude disagree on Fit by 15+ points)
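The Δ ≥ 15 count is a simple filter over paired scores. The row shape here is an assumption for illustration, not the actual shadow-inference schema.

```typescript
// Hypothetical paired-score row; the real schema may differ.
interface OppScores {
  oppId: string;
  ftFit: number;     // FT model's Fit score
  claudeFit: number; // Claude's Fit score
}

// Opps where the two arms disagree on Fit by at least `delta` points.
function bigDisagreements(rows: OppScores[], delta = 15): OppScores[] {
  return rows.filter(r => Math.abs(r.ftFit - r.claudeFit) >= delta);
}
```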

Behavioral drift (30-day rolling)

8 snapshots · run pnpm tsx scripts/compute-behavioral-snapshot.ts nightly
Date        Arm     n    Mean fit  Mean tech  Mean $  Qualified rate  Mean risk flags  Rationale chars
2026-05-15  base    77   36.7      42.1       51.9    13%             6.7              645
2026-05-15  claude  530  13.3      12.8       36.9    3%              7.9              557
2026-05-15  ft      84   20.6      22.3       37.8    0%              5.8              426
2026-05-14  base    16   61.5      67.7       67.9    81%             6.4              675
2026-05-14  claude  294  21.5      27.1       45.0    13%             7.4              590
2026-05-14  ft      8    37.5      43.8       52.9    13%             5.8              440
2026-05-13  base    3    63.0      65.3       63.3    100%            6.3              722
2026-05-13  claude  117  22.1      24.6       37.3    0%              7.8              622
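Each snapshot row is a per-arm, per-day aggregate. This is a sketch of the aggregation the nightly script presumably performs; the record shape and field names are assumptions, not the actual compute-behavioral-snapshot.ts code.

```typescript
// Hypothetical per-opp score record; the real schema may differ.
interface ScoreRow {
  arm: "ft" | "base" | "claude";
  fit: number;
  qualified: boolean;
}

// Aggregate one arm's rows into a snapshot-style summary.
function snapshot(rows: ScoreRow[], arm: ScoreRow["arm"]) {
  const sub = rows.filter(r => r.arm === arm);
  const n = sub.length;
  return {
    arm,
    n,
    meanFit: sub.reduce((s, r) => s + r.fit, 0) / n,
    qualifiedRate: sub.filter(r => r.qualified).length / n,
  };
}
```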

Scoring (Qwen3-4B-Instruct + LoRA)

Split   n   Fit MAE ↓  Tech MAE ↓  Threshold-70 acc ↑  Risk Jaccard ↑  Bad JSON
ft      11  2.00       2.73        1.00                0.01            0
claude  11  1.00       1.64        1.00                0.03            0
base    11  18.00      28.55       1.00                0.01            0
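The Risk Jaccard column presumably measures set overlap between predicted and reference risk flags, |A ∩ B| / |A ∪ B|. A minimal sketch, assuming flags are string labels:

```typescript
// Jaccard similarity between two risk-flag sets: |A ∩ B| / |A ∪ B|.
function riskJaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // both empty: perfect match
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return intersection / union;
}
```

For example, {price-risk, staffing} vs {staffing, schedule} share one of three distinct flags, giving 0.33.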

Drafting (Qwen3-30B-A3B-Instruct + LoRA)

Split / arm       n  FK ↓  Factual ↑  Plain ↑  Complete ↑  Win-theme ↑
holdout / ft      9  12.4  3.62       3.25     2.00        3.62
holdout / claude  9  11.3  3.89       3.00     2.00        4.00
holdout / base    9  12.0  3.67       2.67     2.00        3.50