P-01
Evaluation before code
No model touches production traffic before the eval set is signed off by the business owner.
// production AI · est. 2023 · booking Q1 2027
Twelve people. Six sectors. One bar across all of them. We build evaluation-first systems for banks, hospitals, law firms, and operators where a hallucination is a regulatory event, not a UX bug.
$ harness eval --suite=production --canary
loading suite ./evals/underwriting.yaml
spinning up runners x 24 -- ready
✓ regression 412/412 pass
✓ adversarial 128/128 pass
→ canary passed. ramp to 100% in 6m12s.
// 01 · what makes us different
P-01
No model touches production traffic before the eval set is signed off by the business owner.
P-02
Systems escalate to a human when uncertain. Confidence is calibrated, not asserted.
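In practice, P-02 can reduce to a routing rule: scale the model's raw score into a calibrated probability, then hand the case to a human below a threshold. A minimal sketch, with the temperature and threshold purely illustrative:

```python
import math

def calibrate(raw_score: float, temperature: float = 1.8) -> float:
    """Temperature-scale a raw model score into a calibrated probability."""
    logit = math.log(raw_score / (1.0 - raw_score))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def route(raw_score: float, threshold: float = 0.9) -> str:
    """Auto-resolve only when calibrated confidence clears the bar."""
    confidence = calibrate(raw_score)
    return "auto_resolve" if confidence >= threshold else "escalate_to_human"
```

A raw score of 0.99 calibrates to roughly 0.93 and auto-resolves; 0.80 calibrates to roughly 0.68 and escalates. The temperature itself is fit on held-out eval data, not asserted.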
P-03
Models see only the fields the task requires. Outputs carry the privilege of the inputs.
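P-03 is enforceable as a field allowlist per task, with the output classified at the highest privilege of any input it saw. A minimal sketch; the task name, field names, and privilege levels are illustrative:

```python
from typing import Any

# Per-task allowlist: the model never receives fields outside this set.
TASK_FIELDS = {"summarize_claim": ["claim_text", "diagnosis_code"]}

# Privilege label per field, lowest to highest.
FIELD_PRIVILEGE = {"claim_text": "internal", "diagnosis_code": "phi"}
PRIVILEGE_ORDER = ["public", "internal", "phi"]

def minimized_view(task: str, record: dict[str, Any]) -> dict[str, Any]:
    """Project a record down to only the fields the task requires."""
    return {f: record[f] for f in TASK_FIELDS[task] if f in record}

def output_privilege(task: str) -> str:
    """The output carries the highest privilege among its inputs."""
    levels = [FIELD_PRIVILEGE[f] for f in TASK_FIELDS[task]]
    return max(levels, key=PRIVILEGE_ORDER.index)
```

Here a record containing an SSN never reaches the model for this task, and because a PHI field is in scope, the summary itself is handled as PHI.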
P-04
Every inference is reproducible from inputs, weights, prompt, and policy version.
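P-04 amounts to keying every inference on a content hash of the four things that determine it. A minimal sketch of such a replay key, with all identifiers illustrative:

```python
import hashlib
import json

def inference_key(inputs: dict, weights: str, prompt: str, policy: str) -> str:
    """Deterministic key over everything that determines an inference:
    inputs, weights version, prompt version, and policy version."""
    payload = json.dumps(
        {"inputs": inputs, "weights": weights, "prompt": prompt, "policy": policy},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Two calls with identical versions and inputs produce the same key; bump any one component, such as the policy version, and the key changes, so an audit can tell exactly which configuration produced which output.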
// 02 · selected work
CS-002 / B2B SaaS · customer support / AI audit
An autonomous agent was resolving 38% of inbound tickets. Finance flagged a refund-rate anomaly; the support team couldn't explain it.
CS-001 / Consumer lending / Eval-first build
Twelve underwriters reviewing 800 applications a day. Average decision time was 14 minutes; senior staff were spending half their week on the easiest 60% of files.
CS-003 / Logistics · enterprise / Embedded team
The internal team had built two models that worked in notebooks but had never made it to production. They needed a reference for what 'ready' looks like — eval surface, deployment patterns, on-call.
// 03 · who we work with
I-01
Where 'wrong' is a regulatory finding, not a UX bug.
I-02
PHI in, evidence-cited outputs, every action logged.
I-03
Privilege never leaks. Citations are real or it's a defect.
I-04
Margin-aware automation. Hallucinations cost money in this sector — literally.
I-05
Telemetry-rich. Explainability-mandatory. Downtime costs more than the engagement.
I-06
Your engineers are sharp. We bring eval discipline they haven't built yet.
// 04 · capabilities
End-to-end production scaffolding: orchestration, observability, evals, guardrails, and cost controls. The plumbing your in-house team will not have to build.
Goal-directed agents wired into your real systems — CRM, ERP, data warehouses, internal APIs. Built for measurable workflows, not chat windows.
Domain adaptation on your proprietary data. From SFT to DPO and RFT pipelines, with rigorous offline + online evaluation before anything ships.
Bounded engagements: a problem, a budget, a deadline. We embed with your team or run the project end-to-end through delivery and handoff.
When the team is big enough to matter: an internal AI platform with shared evals, a model gateway, prompt registry, and a path to self-serve.
For CTOs and CEOs: a senior partner across architecture, vendor selection, build-vs-buy, hiring, and roadmap. Quarterly cadence, no consulting deck theatre.
// next
The bench is deliberately small. We decline work that lacks production access, ownership, or a serious sponsor.
Submit project intake