Atlas Labs
Lab Notes

RAG in the real world: evals, drift, and data contracts


Most RAG demos look great. You chunk some documents, embed them, hook up a retriever, and get plausible-looking answers. The benchmark numbers are solid. The demo is smooth.

Then you put it in front of real users, with real data, running for weeks at a time, and things start to drift.

This post is about the failure modes that matter in production RAG and the engineering practices that catch them before they become incidents.

Why RAG systems degrade over time

RAG failures tend to cluster in three places:

  1. Retrieval quality drifts as your document corpus changes
  2. Generation quality degrades as production queries drift away from the distribution the system was tested on
  3. The pipeline silently breaks at component boundaries — stale indexes, schema changes, corrupt embeddings

The insidious part is that none of these cause hard errors. The system keeps responding. Users just start getting worse answers, and you often don't find out until someone complains loudly enough.

Building an eval harness before you launch

The best time to set up evaluation infrastructure is before you need it. That means:

Defining a golden dataset early. A set of (question, expected answer, source document) triples that you can run against any version of your pipeline. This doesn't need to be large — 50–200 high-quality examples is often enough to catch regressions.
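To make that concrete: a golden dataset can be as simple as a JSONL file of triples that gets versioned alongside the pipeline. The field names and examples below are illustrative, not a required schema:

```python
import json

# Each golden example pairs a question with the answer we expect and the
# document retrieval should surface. Field names here are placeholders.
golden_examples = [
    {
        "question": "What is the refund window for annual plans?",
        "expected_answer": "Annual plans can be refunded within 30 days of purchase.",
        "source_document": "docs/billing/refund-policy.md",
    },
    {
        "question": "Which regions support SSO?",
        "expected_answer": "SSO is available in all regions except gov-cloud.",
        "source_document": "docs/auth/sso.md",
    },
]

# Persist as JSONL so the eval harness always runs against a fixed snapshot.
with open("golden_dataset.jsonl", "w") as f:
    for example in golden_examples:
        f.write(json.dumps(example) + "\n")
```

Keeping the file in the repo means every pipeline change is evaluated against the exact same examples.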

Measuring what matters:

  • Faithfulness: Does the answer stick to what the retrieved documents actually say?
  • Answer relevancy: Does the answer address what the user actually asked?
  • Context precision: Are the retrieved chunks relevant to the question?
  • Context recall: Did retrieval surface the right documents at all?

Tools like RAGAS, TruLens, and DeepEval give you frameworks for these metrics. The key is treating them like unit tests — run them on every pipeline change, fail the build if scores drop below thresholds.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=golden_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

# Fail CI if faithfulness drops below 0.85
assert results["faithfulness"] >= 0.85, f"Faithfulness regression: {results['faithfulness']}"

Drift detection in production

Offline evals on a golden set will catch many regressions, but not all. Production queries have a distribution that's impossible to fully anticipate. You need online monitoring too.

What to monitor:

Retrieval-side signals:

  • Average similarity score of top-k retrieved chunks (dropping scores indicate embedding drift or corpus issues)
  • Retrieval latency (sudden increases can indicate index problems)
  • Cache hit rates (useful for understanding query distribution)

Generation-side signals:

  • Response length distribution (sudden changes often indicate prompt or context issues)
  • LLM-as-judge scores on sampled production queries
  • User feedback signals (thumbs up/down, regeneration rate, abandonment)

System-level signals:

  • Embedding freshness (how old are the most recently indexed documents?)
  • Index health (are all expected documents present?)

Set up dashboards for these and configure alerts on deviations. Prometheus + Grafana works well here; so does Datadog if you're already in that ecosystem.
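As a minimal sketch of the similarity-score signal above, here is a stdlib-only rolling monitor. The baseline and tolerance values are assumptions; in practice you would export these numbers to Prometheus or Datadog rather than check them in-process:

```python
from collections import deque

class SimilarityDriftMonitor:
    """Tracks the rolling mean of top-k similarity scores and flags
    deviations from a known-good baseline. A stdlib-only sketch; a real
    deployment would export these values to a metrics backend instead."""

    def __init__(self, baseline, window=1000, tolerance=0.10):
        self.baseline = baseline        # mean similarity on healthy traffic
        self.tolerance = tolerance      # allowed relative deviation
        self.values = deque(maxlen=window)

    def record(self, topk_scores):
        """Record the mean top-k similarity for one query."""
        self.values.append(sum(topk_scores) / len(topk_scores))

    def is_drifting(self):
        """True once the rolling mean falls below baseline * (1 - tolerance)."""
        if not self.values:
            return False
        rolling = sum(self.values) / len(self.values)
        return rolling < self.baseline * (1 - self.tolerance)

monitor = SimilarityDriftMonitor(baseline=0.80)
monitor.record([0.82, 0.79, 0.75])   # healthy query
monitor.record([0.55, 0.52, 0.48])   # suspiciously low scores
```

The same shape works for latency and response-length signals: keep a rolling window, compare it to a baseline, alert on sustained deviation.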

Data contracts between pipeline components

The most overlooked source of RAG failures is the implicit contract between pipeline components. When document structure changes, embeddings go stale, or a new data source gets added with different chunking, retrieval silently degrades.

Data contracts make these dependencies explicit:

# contract: document-ingestion-to-embeddings
schema:
  chunk_id: string
  content: string
  source_url: string
  updated_at: datetime
  metadata: object

freshness:
  max_age_hours: 24
  alert_threshold_hours: 12

quality:
  min_chunk_length: 100
  max_chunk_length: 2000
  required_fields: [chunk_id, content, source_url]

Tools like Great Expectations, Soda, or even simple schema validation at ingestion time can enforce these contracts. The goal is to make violations loud and immediate rather than silent and slow.
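For illustration, the contract above maps onto a hand-rolled validator like the sketch below. In a real pipeline Great Expectations or Soda would replace this, but the checks are the same:

```python
from datetime import datetime, timedelta, timezone

# These constants mirror the contract's quality and freshness sections.
REQUIRED_FIELDS = {"chunk_id", "content", "source_url"}
MIN_CHUNK_LENGTH, MAX_CHUNK_LENGTH = 100, 2000
MAX_AGE = timedelta(hours=24)

def validate_chunk(chunk: dict) -> list[str]:
    """Return a list of contract violations for one ingested chunk."""
    violations = []
    missing = REQUIRED_FIELDS - chunk.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    content = chunk.get("content", "")
    if not MIN_CHUNK_LENGTH <= len(content) <= MAX_CHUNK_LENGTH:
        violations.append(f"content length {len(content)} outside bounds")
    updated_at = chunk.get("updated_at")
    if updated_at and datetime.now(timezone.utc) - updated_at > MAX_AGE:
        violations.append("chunk is stale (older than 24h)")
    return violations

# A chunk that is both too short and missing a field fails loudly at
# ingestion time, instead of silently degrading retrieval later.
bad_chunk = {"chunk_id": "c1", "content": "too short"}
```

Run the validator at the ingestion boundary and reject (or quarantine) any batch with violations.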

Handling corpus drift

Your document corpus changes. New documents are added, old ones are updated, some are deleted. If your embeddings don't keep up, retrieval degrades.

A few patterns that work well:

Incremental re-embedding with change detection. Rather than re-embedding everything on a schedule, detect changed documents at ingestion time (content hashing works well) and only re-embed those.
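A minimal sketch of that change detection, assuming you keep a store of chunk_id → content hash from the previous ingestion run (names are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(incoming: dict, stored_hashes: dict) -> list:
    """Return the chunk_ids that are new or whose content changed.

    `incoming` maps chunk_id -> current text; `stored_hashes` maps
    chunk_id -> hash recorded at the previous ingestion run.
    """
    return [
        chunk_id
        for chunk_id, text in incoming.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]
```

Unchanged chunks are skipped entirely, so a daily ingestion run only pays embedding cost for what actually moved.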

Shadow indexing for major model upgrades. When you upgrade embedding models, maintain two indexes in parallel. Route a percentage of traffic to the new index, compare retrieval quality, then cut over when confident.
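Routing a fraction of traffic to the shadow index can be as simple as the sketch below. The index objects and their `search` method are assumptions for illustration, not any specific library's API:

```python
import random

def route_query(query, primary_index, shadow_index,
                shadow_fraction=0.05, rng=random.random):
    """Send a small fraction of traffic to the shadow (new) index while
    the primary keeps serving everything else. Returns the results plus
    a label so downstream quality metrics can be compared per index."""
    if rng() < shadow_fraction:
        return shadow_index.search(query), "shadow"
    return primary_index.search(query), "primary"
```

Tag every logged eval score with the index label; once the shadow's retrieval quality matches or beats the primary's, ramp `shadow_fraction` up and cut over.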

Soft deletes with TTL. Instead of immediately removing documents from the index, mark them as soft-deleted and let them expire. This prevents sudden drops in retrieval coverage and makes rollbacks easier.
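A sketch of the soft-delete filter, assuming each indexed document carries an optional `deleted_at` timestamp (the field name and TTL are illustrative):

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)  # grace period before a soft-deleted doc expires

def is_visible(doc: dict, now=None) -> bool:
    """A doc stays retrievable until its soft-delete marker passes the TTL,
    so a bad corpus update can be rolled back by clearing `deleted_at`."""
    now = now or datetime.now(timezone.utc)
    deleted_at = doc.get("deleted_at")
    return deleted_at is None or now - deleted_at < TTL
```

Apply the filter at query time (or in a periodic compaction job) rather than hard-deleting at ingestion.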

The operational checklist

Before you call a RAG system production-ready:

  • [ ] Golden dataset established and eval harness running in CI
  • [ ] Freshness monitoring on the embedding index
  • [ ] Retrieval quality metrics tracked in production (similarity scores, latency)
  • [ ] LLM-as-judge sampling on production queries (even 1% is valuable)
  • [ ] Data contracts defined between ingestion, embedding, and retrieval
  • [ ] Runbook for common failure modes (stale index, embedding model change, corpus update)
  • [ ] Alerting on retrieval quality degradation with defined SLOs

RAG is not a set-it-and-forget-it technology. The systems that work in production are the ones with the operational rigor to catch drift before users do.