◆Momentum Federal
Pitch deck · v2 · May 2026
Shadow mode is live in production
The procurement reasoning engine for federal contractors.
An always-on screening platform that ranks every incoming RFP, learns from every human correction, and gets measurably sharper every week. Three models, one source of truth, a self-improving loop.
- I · The problem
- II · The solution
- III · How scoring works · v2
- IV · The moat is the loop
- V · Measured improvement
- VI · Today’s ledger
- VII · Defensibility
- VIII · Economics
- IX · Federal-grade sovereignty
- X · Why now
- XI · State of the system
- XII · The close
I · The problem
Federal contractors are losing bids they should have won.
RFPs per week
Across SAM.gov, grants.gov, and contract vehicles
RFPs reviewable per day
The hard ceiling on human screening
Go unscreened
Invisible loss, every single week
Pipeline volume vastly exceeds review capacity. The result is two failure modes that both cost real money. False negatives: bids you should have pursued slip past unread. False positives: capture teams burn weeks on bids that were structurally unviable from the outset.
II · The solution
An always-on screening engine that gets sharper every week.
01
Ingest
SAM.gov, grants.gov, and direct upload feed the pipeline.
02
Gate
Deterministic rules cut 18% of opps before any LLM call. Zero cost on obvious no-fits.
03
Score
Three models score every gate-passing RFP in parallel.
04
Surface
Capture team sees a ranked, reasoned pipeline with disagreement flags.
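The ingest-gate-score-surface flow above hinges on the deterministic gate running before any model call. A minimal sketch of such a gate, with hypothetical rules (`BLOCKED_NAICS` and `MAX_CEILING` are illustrative; real gates are configured per vendor profile):

```python
from dataclasses import dataclass

@dataclass
class Opportunity:
    naics: str
    set_aside: str
    ceiling: float

# Illustrative rule set: obvious no-fits are rejected at zero LLM cost.
BLOCKED_NAICS = {"111110"}      # e.g. an industry code the org never bids
MAX_CEILING = 50_000_000        # above the org's realistic capacity

def passes_gate(opp: Opportunity) -> bool:
    """Deterministic pre-screen: cut obvious no-fits before any LLM call."""
    if opp.naics in BLOCKED_NAICS:
        return False
    if opp.ceiling > MAX_CEILING:
        return False
    return True
```

Only opportunities that clear this check proceed to the three-arm scoring step.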
III · How scoring works · v2
Shadow mode is live. The button moved into the pipeline.
Auto-screening previously ran the teacher only; the fine-tune and the control fired only when a human clicked the score button, so training data trickled in at click speed. That has changed. All three arms now fire automatically on every gate-passing opportunity. The teacher still drives the official qualifying score; the fine-tune and the control observe in parallel and accumulate training signal passively. The button is now a manual re-run.
Teacher
Gates the score
Claude Sonnet 4.6
Source of truth for the official qualifying score. Reasoning the capture team reads.
- Latency
- ~12s
- Cost
- $0.020 / call
- Host
- Anthropic API
Student
Shadow training
Qwen3-4B + LoRA
Distilled from the teacher. Earns its keep on every screen by generating eval signal and training rows.
- Latency
- ~5s
- Cost
- $0.001 / call
- Host
- Sidecar → Tinker
Control
Regression check
Haiku 4.5 (Base)
Stand-in for vanilla un-trained Qwen. If the student ever drifts below it, training has gone backwards.
- Latency
- ~5s
- Cost
- $0.007 / call
- Host
- Anthropic API
Observability
Disagreement badges appear on every RFP row when the student and teacher diverge by ten points or more on Fit: Δ 10–15 shadow drift · Δ 15–25 needs review · Δ 25+ strong disagree. Forty held-out validation opportunities, never seen during training, drive RMSE, MAE, and agreement metrics after every retrain. Per-arm failures are isolated and surfaced; schema drift becomes loud, not silent.
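The badge tiers reduce to a simple threshold map over the teacher/student Fit delta; a sketch (function name is illustrative):

```python
from typing import Optional

def disagreement_badge(teacher_fit: float, student_fit: float) -> Optional[str]:
    """Map the absolute teacher/student Fit delta to a badge tier."""
    delta = abs(teacher_fit - student_fit)
    if delta >= 25:
        return "strong disagree"
    if delta >= 15:
        return "needs review"
    if delta >= 10:
        return "shadow drift"
    return None  # below ten points: no badge on the row
```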
IV · The moat
The moat is the loop, not the model.
1
Production inference
Every RFP scored by all three arms.
2
Disagreement capture
Auto-classifier flags meaningful gaps.
3
Human review
Capture team picks the winner per case.
4
Training dataset
Corrections override the teacher signal.
5
New checkpoint
Promoted only if gates pass.
Over months, the student outperforms the teacher on YOUR pipeline because it is not imitating a generalist — it is learning your team's domain judgment. The data flywheel rewards early movers, not feature copiers.
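Step 4 of the loop, where corrections override the teacher signal, can be sketched as a labeling rule: when a human has resolved a disagreement, their score becomes the training label; otherwise the teacher's stands. Field names are illustrative:

```python
from typing import Optional

def build_training_row(rfp_id: str, teacher_score: float,
                       human_score: Optional[float]) -> dict:
    """Human corrections override the teacher label in the next dataset."""
    corrected = human_score is not None
    return {
        "rfp_id": rfp_id,
        "label": human_score if corrected else teacher_score,
        "source": "human" if corrected else "teacher",
    }
```

Over many cycles the dataset shifts from pure teacher imitation toward the capture team's own judgment.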
V · Measured improvement
One retrain. Dramatically better numbers.
| Metric | Before · 125 ex / 3 ep | After · 233 ex / 6 ep | Δ |
|---|---|---|---|
| RMSE on Fit | 11.16 | 7.10 | −36% |
| RMSE on Tech | 12.19 | 8.35 | −31% |
| RMSE on Financial | 16.00 | 12.92 | −19% |
| MAE on Fit | 6.19 | 4.57 | −26% |
| Qualified agreement | 93.8% | 97.5% | +3.7 pp |
| FT qualified (of 10) | 2 | 8 | false-neg −75% |
| Mean FT latency | 4,619 ms | 3,599 ms | −22% |
The conservative-bias fix is the headline. The old fine-tune qualified only 2 of the 10 RFPs that the teacher qualified, hiding eight viable bids from the capture team. The new fine-tune qualifies 8 of 10. Final training loss dropped 10× over six epochs (3.28 → 0.31 nats/token at step 300).
VI · Today’s ledger
Operational ML platform, not a prototype.
Live production data, current as of this issue.
Agreement with teacher: 97.5%
Up from 93.8%
Claude-scored training rows: 941
Across 10 vendor archetypes
Captured disagreements
Ready for review
Locked gold-benchmark RFPs: 24
Never rotates
On the wire: 845 real opportunities · 771 score runs across all three arms · 211 prime contractors tracked.
VII · Defensibility
Anyone can call Claude. Very few can do this.
- 01 Locked gold benchmark
Twenty-four RFPs that never rotate. Honest cross-time deltas.
- 02 Three-arm shadow inference
Teacher, student, control. Parallel on every call.
- 03 Disagreement classifier
Auto-flags meaningful student/teacher gaps.
- 04 Active-learning feedback
Human corrections override the teacher in the next dataset.
- 05 Promotion gates
Checkpoints are blocked unless validation and gold thresholds pass.
- 06 Behavioral drift monitoring
Two-sigma daily alarms on distribution shifts.
- 07 Training lineage
Every checkpoint traceable to its dataset.
- 08 Archetype-balanced corpus
Ten vendor profiles. No single-customer overfit.
- 09 Sidecar isolation
Fine-tune runs in your process. Swap providers without losing data.
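The two-sigma drift alarm named above is a standard distribution check; a minimal sketch, assuming a trailing window of daily mean scores:

```python
from statistics import mean, stdev

def drift_alarm(history: list[float], today: float) -> bool:
    """Two-sigma daily alarm: flag when today's mean score sits more than
    two standard deviations from the trailing distribution."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > 2 * sigma
```

In practice the same check can run per criterion (Fit, Tech, Financial) and per arm.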
Time to copy. Inference path: one week. Eval and governance discipline: months. Real disagreement history: years.
VIII · Economics
Cost-to-serve is the floor. Pricing sits above it.
Per RFP screened: $0.028
All three arms ($0.020 + $0.001 + $0.007) · vs $300+/hr senior capture time
Per retrain cycle: $5–$15
Tinker compute · LoRA · six epochs
Per year at 100 RFPs/day: $1,022
Linear scaling · $10,220 at 1,000 RFPs/day
The cost lever. The teacher is the most expensive arm. Every retrain pushes the student closer to standalone primary scorer, with the teacher re-invoked only on disagreement. When that crossover lands, per-RFP cost drops to roughly $0.001. Customer pricing is per-seat and per-org — anchored to capture-team payroll, not cost-to-serve.
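The crossover economics follow directly from the per-call costs quoted for each arm. A sketch of the cost model; the 10% disagreement rate after crossover is an illustrative assumption, not a measured figure:

```python
TEACHER, STUDENT, CONTROL = 0.020, 0.001, 0.007  # $/call, from the arm specs

def cost_per_rfp(crossover: bool, disagreement_rate: float = 0.10) -> float:
    """Before crossover: all three arms on every RFP.
    After: the student is primary; the teacher fires only on disagreement."""
    if not crossover:
        return TEACHER + STUDENT + CONTROL
    return STUDENT + disagreement_rate * TEACHER

def annual_cost(rfps_per_day: int, crossover: bool = False) -> float:
    return cost_per_rfp(crossover) * rfps_per_day * 365
```

At $0.028 per RFP, 1,000 RFPs/day works out to $10,220/year, matching the figure above; post-crossover the per-RFP cost approaches the student's $0.001.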
IX · Federal-grade sovereignty
Built for the buyer. Operational data never leaves.
On your infrastructure
- —Opportunity records, profiles, score runs
- —Documents (extracted text and binaries)
- —Capture strategy and disagreement reviews
- —Behavioral drift snapshots and eval history
- —Student inference via the Python sidecar
External dependencies
- —Anthropic API for teacher signal
- —Tinker for student weight storage and sampler
- —Optional: SAM.gov and grants.gov ingestion
Flip a feature flag and 100% of production scoring routes through the on-prem student. Full air-gap path is documented — swap the teacher for a self-hosted model.
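The routing decision described above is a single branch on a feature flag; a sketch, with an illustrative flag name (`ON_PREM_SCORING` is an assumption, not the real config key):

```python
def pick_scorer(flags: dict) -> str:
    """Feature-flag routing: when the on-prem flag is set, 100% of production
    scoring goes through the local student sidecar; otherwise the teacher API."""
    if flags.get("ON_PREM_SCORING"):
        return "sidecar_student"      # on-prem Qwen3-4B + LoRA
    return "anthropic_teacher"        # hosted Claude Sonnet
```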
X · Why now
$755B in federal contracts. 108,899 contractors. One unscreened pile.
FY24 federal contract spend
Up from $598B in FY19
Companies competing for awards
40,856 large · 78,677 SB
Defense share of FY24
VA · NASA · DOE · HHS · GSA round out the top
Frontier models are good enough to teach
Claude Sonnet 4.6 reasons sharply enough to distill from.
Distillation makes per-call cost viable
$0.001 per student call. The math wasn't there 18 months ago.
Flywheels reward early movers
Resolved human corrections compound. Month 12, the gap is years wide.
XI · State of the system
Live today. More to build.
Live now
- ✓Shadow mode in production — all three arms on every gate-passing opp
- ✓Held-out validation set + RMSE tracking
- ✓Disagreement badges on /rfps
- ✓Tinker training pipeline · 10× loss reduction
- ✓941 Claude rows across 10 archetypes
- ✓Locked 24-RFP gold benchmark
- ✓Behavioral drift snapshots
- ✓Promotion gates that block bad checkpoints
Next up
- →Scheduled SAM.gov ingestion — daily, not manual
- →Capture-team workflow — queue, star, follow-up pipeline
- →Specialized fine-tunes per criterion — Fit / Tech / Financial
- →Direct preference optimization on resolved disagreements
- →Flip the gate to FT when RMSE drops below 5
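The promotion gates that block bad checkpoints can be sketched as a conjunction of thresholds. The specific cutoffs below are illustrative (the deck names only the RMSE-below-5 flip condition and a control-arm floor):

```python
def promote_checkpoint(val_rmse: float, gold_agreement: float,
                       control_rmse: float,
                       rmse_gate: float = 8.0,
                       agree_gate: float = 0.95) -> bool:
    """A checkpoint ships only if it clears the held-out RMSE threshold,
    the gold-benchmark agreement floor, and beats the control arm."""
    return (val_rmse <= rmse_gate
            and gold_agreement >= agree_gate
            and val_rmse < control_rmse)

def flip_gate_to_ft(val_rmse: float) -> bool:
    """Hand the official qualifying score to the fine-tune once RMSE < 5."""
    return val_rmse < 5.0
```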
XII · The close
By the time a competitor catches up to feature parity, our student will be on its 50th retrain.
Inference path takes a week. Eval and governance discipline take months. Real disagreement history takes years. We have a year-long head start on the part nobody can shortcut.
01 / 03
Product demo
See the live pipeline, gold benchmark, and three-arm comparison.
02 / 03
Technical deep dive
Walk the codebase, training scripts, and promotion gates.
03 / 03
Partnership
Capital, federal-prime channel, or pilot deployment.