1,104 tests across 29 suites
Eval and regression suites acted as hard deployment gates for risky releases.
Production retrieval systems for teams that needed fast, grounded answers over private knowledge, plus an evaluation harness that could stop regressions before deployment.
RAG quality is only credible when retrieval, refusals, latency, and cost are measured together before release.
Suite-level status for retrieval quality, groundedness, refusal correctness, hallucination rate, latency, cost, and regressions.
Golden set: Known-good questions, expected source chunks, forbidden answers, refusal cases, and tenant/context boundaries.
Budget sheet: P95 thresholds, per-query spend ceilings, model-routing rules, cache-hit assumptions, and fallback behavior.
Failure taxonomy: Retrieval drift, stale chunks, citation mismatch, prompt regression, timeout behavior, weak refusals, and cost creep. (All three artifacts are sketched in code after this list.)
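The sketch below shows one way these three artifacts could be expressed in code so a harness can check them mechanically. It is a minimal illustration: names such as GoldenCase, BudgetSheet, FailureMode, and grade_case are hypothetical stand-ins, not the project's actual schema.

```python
# Illustrative schemas for the golden set, budget sheet, and failure taxonomy.
# All identifiers here are hypothetical, not the real harness.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    """Failure taxonomy: the buckets a failed case or alert gets tagged with."""
    RETRIEVAL_DRIFT = "retrieval_drift"
    STALE_CHUNKS = "stale_chunks"
    CITATION_MISMATCH = "citation_mismatch"
    PROMPT_REGRESSION = "prompt_regression"
    TIMEOUT = "timeout"
    WEAK_REFUSAL = "weak_refusal"
    COST_CREEP = "cost_creep"


@dataclass
class GoldenCase:
    """One golden-set entry: a known-good question plus the evidence and boundaries it must respect."""
    question: str
    expected_chunk_ids: list[str]                                # source chunks the retriever must surface
    forbidden_phrases: list[str] = field(default_factory=list)   # content that makes the answer a failure
    must_refuse: bool = False                                    # refusal case: the only correct answer is a refusal
    tenant_id: Optional[str] = None                              # tenant/context boundary retrieval must not cross


@dataclass
class BudgetSheet:
    """Operating ceilings an eval run is checked against before release."""
    p95_latency_ms: float = 1500.0
    max_cost_per_query_usd: float = 0.01
    assumed_cache_hit_rate: float = 0.40
    fallback_model: str = "smaller-fallback-model"               # routed to when the primary breaches budget


def grade_case(case: GoldenCase, answer: str, retrieved_chunk_ids: list[str]) -> dict[str, bool]:
    """Score one golden case on retrieval hit, forbidden content, and refusal correctness."""
    refused = answer.strip().lower().startswith(("i don't know", "i can't answer", "i cannot answer"))
    return {
        "retrieval_hit": any(cid in retrieved_chunk_ids for cid in case.expected_chunk_ids),
        "no_forbidden_content": not any(p.lower() in answer.lower() for p in case.forbidden_phrases),
        "refusal_correct": refused == case.must_refuse,
    }
```

In a setup like this, each graded dimension rolls up into the suite-level status, and a failing case can be tagged with a FailureMode so regressions are tracked by cause rather than as a single pass rate.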
The critical move was pairing retrieval architecture with evaluation, monitoring, and CI/CD gates so teams could tell whether changes improved the system or quietly degraded it.
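As one illustration of that pairing, a deployment gate can be a small CI step that consumes the harness output and blocks the release when any suite drops below its threshold. The run_gate function, suite names, and threshold values below are assumptions for the sketch, not the actual pipeline.

```python
# Sketch of a hard deployment gate: CI runs this after the eval harness, and the
# release is blocked on a nonzero exit code. Suite names and thresholds are
# placeholder values, not the real configuration.
import sys

GATE_THRESHOLDS = {
    "retrieval_quality": 0.95,    # minimum pass rate a suite needs before shipping
    "groundedness": 0.97,
    "refusal_correctness": 0.98,
    "regression": 1.00,           # any regression-suite failure blocks the release outright
}


def run_gate(suite_pass_rates: dict[str, float]) -> int:
    """Return 0 when every gated suite meets its threshold, 1 otherwise."""
    exit_code = 0
    for suite, threshold in GATE_THRESHOLDS.items():
        score = suite_pass_rates.get(suite)
        if score is None or score < threshold:
            print(f"GATE FAIL: {suite} at {score}, requires >= {threshold}")
            exit_code = 1
        else:
            print(f"gate ok:   {suite} at {score:.3f}")
    return exit_code


if __name__ == "__main__":
    # In CI this would read the harness's real report; a hard-coded sample keeps the sketch runnable.
    sample = {"retrieval_quality": 0.96, "groundedness": 0.99,
              "refusal_correctness": 0.97, "regression": 1.00}
    sys.exit(run_gate(sample))
```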
Many AI candidates can wire a vector database to a chatbot. The scarcer skill is making retrieval reliable enough to survive model updates, prompt changes, latency pressure, private data constraints, and release velocity. This case study demonstrates that operating layer.
Architected retrieval systems around practical latency budgets, not only answer quality (see the latency sketch after this list).
A serverless tradeoff analysis landed on Lambda over EC2 while still meeting the operating targets.
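To make the latency-budget point concrete, the sketch below measures p95 over a sample of eval queries and compares it to a budget. query_rag is a stand-in for the real pipeline call, and the 1,500 ms default is an assumed figure, not a published target.

```python
# Sketch of a latency-budget check: run sample queries through the pipeline,
# compute the 95th-percentile latency, and compare it to the budget.
# query_rag and the budget value are placeholders, not the real system.
import time
from statistics import quantiles
from typing import Callable


def p95_latency_ms(queries: list[str], query_rag: Callable[[str], str]) -> float:
    """Time each query end to end and return the 95th-percentile latency in milliseconds."""
    latencies_ms = []
    for q in queries:
        start = time.perf_counter()
        query_rag(q)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(..., n=20) yields 19 cut points; the 19th (index 18) is the p95 boundary.
    return quantiles(latencies_ms, n=20)[18]


def within_latency_budget(queries: list[str],
                          query_rag: Callable[[str], str],
                          budget_ms: float = 1500.0) -> bool:
    """True when the measured p95 stays under the per-release latency budget."""
    return p95_latency_ms(queries, query_rag) <= budget_ms
```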
I build retrieval systems with the eval, observability, latency, and cost controls needed for production use.