Case Study | RAG Reliability

RAG systems with eval gates, latency targets, and cost controls.

Production retrieval systems for teams that needed fast, grounded answers over private knowledge, plus an evaluation harness that could stop regressions before deployment.

5 production RAG systems deployed for private knowledge workflows
~50 QPS sustained with sub-300ms P95 latency targets
1,104 passing eval and regression tests across 29 suites
65% inference cost reduction through routing, caching, and architecture choices
Architecture

A retrieval stack built to be measured, not guessed at.

The critical move was pairing retrieval architecture with evaluation, monitoring, and CI/CD gates so teams could tell whether changes improved the system or quietly degraded it.

System flow

From documents to gated answers

Knowledge inputs

Content sources: Web crawls, PDFs, analyst reports, briefs, transcripts, and structured documents.
Ingestion: Chunking, metadata extraction, tenant labels, and quality scoring (chunking sketch below).
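
To make the ingestion step concrete, here is a minimal chunking sketch. The `Chunk` shape, field names, and length-based quality score are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field

# Hypothetical ingestion record; field names are illustrative.
@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_document(text: str, tenant: str, source: str,
                   max_chars: int = 1200, overlap: int = 200) -> list[Chunk]:
    """Fixed-size chunking with overlap. Each chunk carries tenant and
    source labels so retrieval can filter per tenant, plus a crude
    length-based quality score to flag fragments too short to be useful."""
    chunks = []
    step = max_chars - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + max_chars]
        if not piece.strip():
            continue
        chunks.append(Chunk(
            text=piece,
            metadata={
                "tenant": tenant,
                "source": source,
                "offset": start,
                "quality": min(1.0, len(piece) / max_chars),
            },
        ))
    return chunks
```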

Retrieval layer

Vector search: Pinecone, Weaviate, pgvector, and hybrid retrieval strategies.
Reranking: Dynamic strategy selection and iterative retrieval loops (fusion sketch below).
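
The case study names hybrid retrieval without fixing a merge method; reciprocal rank fusion is one common way to combine dense and keyword rankings before a reranker, sketched below under that assumption.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists from multiple retrievers (e.g. dense vector
    search and keyword search) into one ranking. Standard RRF: each
    document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense and keyword results before a cross-encoder rerank.
dense = ["doc3", "doc1", "doc7"]
keyword = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([dense, keyword]))  # doc1 and doc3 rise to the top
```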

Answer layer

Model routing: Frontier models for complex reasoning, SLMs/local models for high-volume tasks (routing sketch below).
Grounding: Answer confidence, source checks, refusal behavior, and citations.
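
A minimal sketch of cost-aware routing, assuming a simple two-tier policy; the model identifiers, prices, threshold values, and confidence signal are hypothetical placeholders, not the production configuration.

```python
# Illustrative two-tier router; identifiers and rates are placeholders.
ROUTES = {
    "frontier": {"model": "frontier-llm",      "usd_per_1k_tokens": 0.0030},
    "slm":      {"model": "local-8b-instruct", "usd_per_1k_tokens": 0.0002},
}

def route(query: str, retrieval_confidence: float) -> str:
    """Send high-volume, well-grounded queries to the small/local model and
    reserve the frontier model for long or low-confidence requests."""
    needs_reasoning = len(query.split()) > 40 or retrieval_confidence < 0.6
    tier = "frontier" if needs_reasoning else "slm"
    return ROUTES[tier]["model"]

print(route("What is our parental leave policy?", retrieval_confidence=0.9))
# -> local-8b-instruct
```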

Quality gate

Eval suites: RAGAS, LangSmith, Promptfoo, LLM-as-judge, and static checks.
CI/CD stops: Hard gates on hallucination rate, latency, answer quality, and cost per query (gate sketch below).
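
A sketch of what a hard CI/CD gate can look like: read an eval report, compare each metric to a threshold, and exit nonzero on any violation. The report format, metric names, and limits are illustrative assumptions.

```python
import json
import sys

# Hypothetical gate thresholds; real limits would live in release config.
GATES = {
    "hallucination_rate": ("max", 0.02),
    "p95_latency_ms":     ("max", 300),
    "answer_quality":     ("min", 0.85),
    "cost_per_query_usd": ("max", 0.01),
}

def check(report_path: str) -> int:
    """Read an eval report ({metric: value}) and return a nonzero exit
    code if any gate fails, so CI blocks the release."""
    with open(report_path) as f:
        metrics = json.load(f)
    failures = []
    for name, (kind, limit) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from report")
        elif (kind == "max" and value > limit) or (kind == "min" and value < limit):
            failures.append(f"{name}: {value} violates {kind} {limit}")
    for line in failures:
        print("GATE FAILED:", line)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```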

What I owned

  • Architected serverless RAG on AWS Lambda and Bedrock processing 100K+ queries per month; a Lambda-vs-EC2 tradeoff analysis delivered a ~70% infrastructure cost reduction at the same latency targets.
  • Built retrieval strategies that moved from static search to agentic RAG with dynamic retrieval, reranking, and iterative loops, improving retrieval accuracy by 40% and cutting inference cost by 65%.
  • Implemented cost-aware model routing, prompt caching (cutting repetitive-query costs by up to 90%), semantic caching (sketched below), and fallback strategies across multiple LLM providers.
  • Built eval harnesses covering answer quality, latency, refusal behavior, retrieval quality, cost, and regression risk.
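
The caching bullet above mentions semantic caching; a minimal in-memory sketch follows, assuming an injected `embed` function (whatever embedding model the stack already uses) and a hypothetical similarity threshold.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache answers keyed by query embedding: a new query close enough
    to a cached one reuses the stored answer instead of calling a model.
    The 0.95 threshold is an illustrative default, not a tuned value."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        qvec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qvec, e[0]), default=None)
        if best and cosine(qvec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```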

Why it matters to hiring teams

Many AI candidates can wire a vector database to a chatbot. The scarcer skill is making retrieval reliable enough to survive model updates, prompt changes, latency pressure, private data constraints, and release velocity. This case study demonstrates that operating layer.

RAGAS · LangSmith · Promptfoo · AWS Lambda · Bedrock · pgvector
Evidence

Production outcomes, not retrieval theater.

Reliability

1,104 tests across 29 suites

Eval and regression suites acted as hard deployment gates for risky releases.

Performance

Sub-300ms P95 target

Architected retrieval systems around practical latency budgets, not only answer quality.

Economics

70% infrastructure cost reduction

A serverless tradeoff analysis favored Lambda over EC2 while holding the same latency and reliability targets (break-even sketch below).
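
The engagement's actual pricing inputs are not in this write-up, so the sketch below only shows the shape of the break-even arithmetic with placeholder rates; it ignores Lambda per-request charges and EC2 right-sizing for simplicity.

```python
# Lambda-vs-EC2 break-even shape. Every rate is a placeholder, not a
# quoted AWS price; plug in current pricing for your region.
LAMBDA_USD_PER_GB_SECOND = 0.0000167  # placeholder
LAMBDA_MEMORY_GB = 1.0                # placeholder function size
MS_PER_QUERY = 250                    # placeholder mean duration
EC2_USD_PER_HOUR = 0.10               # placeholder instance price

def lambda_monthly_usd(queries: int) -> float:
    gb_seconds = queries * (MS_PER_QUERY / 1000) * LAMBDA_MEMORY_GB
    return gb_seconds * LAMBDA_USD_PER_GB_SECOND

def ec2_monthly_usd() -> float:
    # Always-on instance bills whether or not traffic arrives.
    return EC2_USD_PER_HOUR * 24 * 30

for q in (100_000, 1_000_000, 10_000_000):
    print(f"{q:>10,} queries/mo  lambda=${lambda_monthly_usd(q):7.2f}  ec2=${ec2_monthly_usd():.2f}")
# Bursty, sub-break-even traffic favors Lambda; steady high volume favors EC2.
```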

Need RAG that can be measured and shipped?

I build retrieval systems with the eval, observability, latency, and cost controls needed for production use.