◆Momentum Federal
Pitch deck · v2 · May 2026
Shadow mode is live in production
The procurement reasoning engine for federal contractors.
An always-on screening platform that ranks every incoming RFP, learns from every human correction, and gets measurably sharper every week. Three models, one source of truth, a self-improving loop.
- I · The problem
- II · The solution
- III · How scoring works · v2
- IV · The moat is the loop
- V · Measured improvement
- VI · Today’s ledger
- VII · Defensibility
- VIII · Economics
- IX · Federal-grade sovereignty
- X · Why now
- XI · State of the system
- XII · The close
I · The problem
Federal contractors are losing bids they should have won.
RFPs per week
Across SAM.gov, grants.gov, and contract vehicles
RFPs reviewable per day
The hard ceiling on human screening
Go unscreened
Invisible loss, every single week
Pipeline volume vastly exceeds review capacity. The result is two failure modes that both cost real money. False negatives: bids you should have pursued slip past unread. False positives: capture teams burn weeks on bids that were structurally unviable from the outset.
II · The solution
An always-on screening engine that gets sharper every week.
01
Ingest
SAM.gov, grants.gov, and direct upload feed the pipeline.
02
Gate
Deterministic rules cut 18% of opps before any LLM call. Zero cost on obvious no-fits.
03
Score
Three models score every gate-passing RFP in parallel.
04
Surface
Capture team sees a ranked, reasoned pipeline with disagreement flags.
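The ingest-gate-score-surface flow above hinges on the deterministic gate running before any model call. A minimal sketch of such a gate, with hypothetical rules (`BLOCKED_NAICS` and `MAX_CEILING` are illustrative; real gates are configured per vendor profile):

```python
from dataclasses import dataclass

@dataclass
class Opportunity:
    naics: str
    set_aside: str
    ceiling: float

# Illustrative rule set: obvious no-fits are rejected at zero LLM cost.
BLOCKED_NAICS = {"111110"}      # e.g. an industry code the org never bids
MAX_CEILING = 50_000_000        # above the org's realistic capacity

def passes_gate(opp: Opportunity) -> bool:
    """Deterministic pre-screen: cut obvious no-fits before any LLM call."""
    if opp.naics in BLOCKED_NAICS:
        return False
    if opp.ceiling > MAX_CEILING:
        return False
    return True
```

Only opportunities that clear this check proceed to the three-arm scoring step.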
III · How scoring works · v2
Shadow mode is live. The button moved into the pipeline.
Auto-screening previously ran the teacher only; the fine-tune and the control fired only when a human clicked the score button, so training data trickled in at click speed. That has changed. All three arms now fire automatically on every gate-passing opportunity. The teacher still drives the official qualifying score; the fine-tune and the control observe in parallel and accumulate training signal passively. The button is now a manual re-run.
Teacher
Gates the score
Claude Sonnet 4.6
Source of truth for the official qualifying score. Reasoning the capture team reads.
- Latency
- ~12s
- Cost
- $0.020 / call
- Host
- Anthropic API
Student
Shadow training
Qwen3-4B + LoRA
Distilled from the teacher. Earns its keep on every screen by generating eval signal and training rows.
- Latency
- ~5s
- Cost
- $0.001 / call
- Host
- Sidecar → Tinker
Control
Regression check
Haiku 4.5 (Base)
Stand-in for vanilla un-trained Qwen. If the student ever drifts below it, training has gone backwards.
- Latency
- ~5s
- Cost
- $0.007 / call
- Host
- Anthropic API
Observability
Disagreement badges appear on every RFP row when the student and teacher diverge by ten points or more on Fit: Δ 10–15 shadow drift · Δ 15–25 needs review · Δ 25+ strong disagree. Forty held-out validation opportunities, never seen during training, drive RMSE, MAE, and agreement metrics after every retrain. Per-arm failures are isolated and surfaced; schema drift becomes loud, not silent.
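The badge tiers reduce to a simple threshold map over the teacher/student Fit delta; a sketch (function name is illustrative):

```python
from typing import Optional

def disagreement_badge(teacher_fit: float, student_fit: float) -> Optional[str]:
    """Map the absolute teacher/student Fit delta to a badge tier."""
    delta = abs(teacher_fit - student_fit)
    if delta >= 25:
        return "strong disagree"
    if delta >= 15:
        return "needs review"
    if delta >= 10:
        return "shadow drift"
    return None  # below ten points: no badge on the row
```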
IV · The moat
The moat is the loop, not the model.
1
Production inference
Every RFP scored by all three arms.
2
Disagreement capture
Auto-classifier flags meaningful gaps.
3
Human review
Capture team picks the winner per case.
4
Training dataset
Corrections override the teacher signal.
5
New checkpoint
Promoted only if gates pass.
Over months, the student outperforms the teacher on YOUR pipeline because it is not imitating a generalist — it is learning your team's domain judgment. The data flywheel rewards early movers, not feature copiers.
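Step 4 of the loop, where corrections override the teacher signal, can be sketched as a labeling rule: when a human has resolved a disagreement, their score becomes the training label; otherwise the teacher's stands. Field names are illustrative:

```python
from typing import Optional

def build_training_row(rfp_id: str, teacher_score: float,
                       human_score: Optional[float]) -> dict:
    """Human corrections override the teacher label in the next dataset."""
    corrected = human_score is not None
    return {
        "rfp_id": rfp_id,
        "label": human_score if corrected else teacher_score,
        "source": "human" if corrected else "teacher",
    }
```

Over many cycles the dataset shifts from pure teacher imitation toward the capture team's own judgment.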
V · Measured improvement
One retrain. Dramatically better numbers.
| Metric | Before · 125 ex / 3 ep | After · 233 ex / 6 ep | Δ |
|---|---|---|---|
| RMSE on Fit | 11.16 | 7.10 | −36% |
| RMSE on Tech | 12.19 | 8.35 | −31% |
| RMSE on Financial | 16.00 | 12.92 | −19% |
| MAE on Fit | 6.19 | 4.57 | −26% |
| Qualified agreement | 93.8% | 97.5% | +3.7 pp |
| FT qualified (of 10) | 2 | 8 | false-neg −75% |
| Mean FT latency | 4,619 ms | 3,599 ms | −22% |
The conservative-bias fix is the headline. The old fine-tune qualified only 2 of the 10 RFPs that the teacher qualified, hiding eight viable bids from the capture team. The new fine-tune qualifies 8 of 10. Final training loss dropped 10× over six epochs (3.28 → 0.31 nats/token at step 300).
VI · Today’s ledger
Operational ML platform, not a prototype.
Live production data, current as of this issue.
Agreement with teacher: 97.5%
Up from 93.8%
Claude-scored training rows: 941
Across 10 vendor archetypes
Captured disagreements
Ready for review
Locked gold-benchmark RFPs: 24
Never rotates
On the wire: 845 real opportunities · 771 score runs across all three arms · 211 prime contractors tracked.
VII · Defensibility
Anyone can call Claude. Very few can do this.
- 01 Locked gold benchmark
Twenty-four RFPs that never rotate. Honest cross-time deltas.
- 02 Three-arm shadow inference
Teacher, student, control. Parallel on every call.
- 03 Disagreement classifier
Auto-flags meaningful student/teacher gaps.
- 04 Active-learning feedback
Human corrections override the teacher in the next dataset.
- 05 Promotion gates
Checkpoints are blocked unless validation and gold thresholds pass.
- 06 Behavioral drift monitoring
Two-sigma daily alarms on distribution shifts.
- 07 Training lineage
Every checkpoint traceable to its dataset.
- 08 Archetype-balanced corpus
Ten vendor profiles. No single-customer overfit.
- 09 Sidecar isolation
Fine-tune runs in your process. Swap providers without losing data.
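The two-sigma drift alarm named above is a standard distribution check; a minimal sketch, assuming a trailing window of daily mean scores:

```python
from statistics import mean, stdev

def drift_alarm(history: list[float], today: float) -> bool:
    """Two-sigma daily alarm: flag when today's mean score sits more than
    two standard deviations from the trailing distribution."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > 2 * sigma
```

In practice the same check can run per criterion (Fit, Tech, Financial) and per arm.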
Time to copy. Inference path: one week. Eval and governance discipline: months. Real disagreement history: years.
VIII · Economics
Cost-to-serve is the floor. Pricing sits above it.
Per RFP screened: $0.028
All three arms ($0.020 + $0.001 + $0.007) · vs $300+/hr senior capture time
Per retrain cycle: $5–$15
Tinker compute · LoRA · six epochs
Per year at 100 RFPs/day: $1,022
Linear scaling · $10,220 at 1,000 RFPs/day
The cost lever. The teacher is the most expensive arm. Every retrain pushes the student closer to standalone primary scorer, with the teacher re-invoked only on disagreement. When that crossover lands, per-RFP cost drops to roughly $0.001. Customer pricing is per-seat and per-org — anchored to capture-team payroll, not cost-to-serve.
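The crossover economics follow directly from the per-call costs quoted for each arm. A sketch of the cost model; the 10% disagreement rate after crossover is an illustrative assumption, not a measured figure:

```python
TEACHER, STUDENT, CONTROL = 0.020, 0.001, 0.007  # $/call, from the arm specs

def cost_per_rfp(crossover: bool, disagreement_rate: float = 0.10) -> float:
    """Before crossover: all three arms on every RFP.
    After: the student is primary; the teacher fires only on disagreement."""
    if not crossover:
        return TEACHER + STUDENT + CONTROL
    return STUDENT + disagreement_rate * TEACHER

def annual_cost(rfps_per_day: int, crossover: bool = False) -> float:
    return cost_per_rfp(crossover) * rfps_per_day * 365
```

At $0.028 per RFP, 1,000 RFPs/day works out to $10,220/year, matching the figure above; post-crossover the per-RFP cost approaches the student's $0.001.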
IX · Federal-grade sovereignty
Built for the buyer. Operational data never leaves.
On your infrastructure
- —Opportunity records, profiles, score runs
- —Documents (extracted text and binaries)
- —Capture strategy and disagreement reviews
- —Behavioral drift snapshots and eval history
- —Student inference via the Python sidecar
External dependencies
- —Anthropic API for teacher signal
- —Tinker for student weight storage and sampler
- —Optional: SAM.gov and grants.gov ingestion
Flip a feature flag and 100% of production scoring routes through the on-prem student. Full air-gap path is documented — swap the teacher for a self-hosted model.
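The routing decision described above is a single branch on a feature flag; a sketch, with an illustrative flag name (`ON_PREM_SCORING` is an assumption, not the real config key):

```python
def pick_scorer(flags: dict) -> str:
    """Feature-flag routing: when the on-prem flag is set, 100% of production
    scoring goes through the local student sidecar; otherwise the teacher API."""
    if flags.get("ON_PREM_SCORING"):
        return "sidecar_student"      # on-prem Qwen3-4B + LoRA
    return "anthropic_teacher"        # hosted Claude Sonnet
```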
X · Why now
$755B in federal contracts. 108,899 contractors. One unscreened pile.
FY24 federal contract spend
Up from $598B in FY19
Companies competing for awards
40,856 large · 78,677 SB
Defense share of FY24
VA · NASA · DOE · HHS · GSA round out the top
Frontier models are good enough to teach
Claude Sonnet 4.6 reasons sharply enough to distill from.
Distillation makes per-call cost viable
$0.001 per student call. The math wasn't there 18 months ago.
Flywheels reward early movers
Resolved human corrections compound. Month 12, the gap is years wide.
XI · State of the system
Live today. More to build.
Live now
- ✓Shadow mode in production — all three arms on every gate-passing opp
- ✓Held-out validation set + RMSE tracking
- ✓Disagreement badges on /rfps
- ✓Tinker training pipeline · 10× loss reduction
- ✓941 Claude rows across 10 archetypes
- ✓Locked 24-RFP gold benchmark
- ✓Behavioral drift snapshots
- ✓Promotion gates that block bad checkpoints
Next up
- →Scheduled SAM.gov ingestion — daily, not manual
- →Capture-team workflow — queue, star, follow-up pipeline
- →Specialized fine-tunes per criterion — Fit / Tech / Financial
- →Direct preference optimization on resolved disagreements
- →Flip the gate to FT when RMSE drops below 5
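The promotion gates that block bad checkpoints can be sketched as a conjunction of thresholds. The specific cutoffs below are illustrative (the deck names only the RMSE-below-5 flip condition and a control-arm floor):

```python
def promote_checkpoint(val_rmse: float, gold_agreement: float,
                       control_rmse: float,
                       rmse_gate: float = 8.0,
                       agree_gate: float = 0.95) -> bool:
    """A checkpoint ships only if it clears the held-out RMSE threshold,
    the gold-benchmark agreement floor, and beats the control arm."""
    return (val_rmse <= rmse_gate
            and gold_agreement >= agree_gate
            and val_rmse < control_rmse)

def flip_gate_to_ft(val_rmse: float) -> bool:
    """Hand the official qualifying score to the fine-tune once RMSE < 5."""
    return val_rmse < 5.0
```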
XII · The close
By the time a competitor catches up to feature parity, our student will be on its 50th retrain.
Inference path takes a week. Eval and governance discipline take months. Real disagreement history takes years. We have a year-long head start on the part nobody can shortcut.
01 / 03
Product demo
See the live pipeline, gold benchmark, and three-arm comparison.
02 / 03
Technical deep dive
Walk the codebase, training scripts, and promotion gates.
03 / 03
Partnership
Capital, federal-prime channel, or pilot deployment.