RAG before fine-tuning: a practical default for AI in production

Most AI engagements that arrive at our door open with the same question. Should we use RAG, or should we fine-tune a model? The team has read the same articles, talked to the same vendors, and arrived at the same uncertainty. The vendors all sell fine-tuning; the articles all explain RAG.

Our default answer is RAG. Not because RAG is better (in absolute terms, it isn't) but because the engagement shape favours it. It is faster to ship and easier to audit, and its failure modes are obvious enough that a team can operate the system after handoff. Fine-tuning is the right call in a narrower set of cases than most teams expect.

What RAG actually is

Retrieval-augmented generation, in three steps. Split your corpus into chunks, embed them, and index them in a vector database. At query time, retrieve the top-k chunks most similar to the user's question. Stuff those chunks into the prompt and ask the model to answer using them. The model doesn't learn your data; it reads it fresh on every query.

That's the whole architecture. Everything interesting is in the chunking strategy, the embedding model, the re-ranker (or absence of one), and the prompt that combines retrieval results with the user's intent. Months of engineering get spent there, but the shape of the system stays simple enough that an engineer joining mid-project can ship a fix in their first week.
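The three steps above can be sketched in a few lines. This is a toy under loud assumptions, not a production design: word counts stand in for a real embedding model, an in-memory list stands in for the vector database, and the function names (`embed`, `build_index`, `retrieve`, `build_prompt`) and the sample corpus are hypothetical. The call to the model itself is left out; the sketch stops at the prompt it would receive.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1: index the corpus (a plain list standing in for a vector database)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Step 2: return up to k chunks most similar to the question, dropping zero-score hits."""
    q = embed(question)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 3: put the retrieved chunks in the prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical three-chunk corpus, for illustration only.
corpus = [
    "Refunds are processed within 14 days of the return being received.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to EU countries takes 3 to 5 business days.",
]
index = build_index(corpus)
question = "How long do refunds take?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping the toy pieces for a real embedding model, a real vector store, and a re-ranker changes the quality of what `retrieve` returns, not the shape of the code around it.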

Three reasons RAG wins by default

01. Time-to-first-useful-answer. A working RAG prototype is two weeks of work for one engineer. A fine-tuning pipeline that produces a measurably better model, including the eval harness needed to prove the lift, takes two to three months. Most clients can't justify the second timeline for the same problem.
02. Auditability. When RAG produces a wrong answer, we trace the retrieval results, see what the model was actually given, and fix the chunk that misled it. When a fine-tuned model produces a wrong answer, we shrug. The fix is another training run.
03. Cost shape. RAG cost scales linearly with queries (embeddings + inference). Fine-tuning cost is a fixed upfront block (training + re-training when the base model updates) plus inference. For most production loads under 1M queries/month, RAG is cheaper across the lifecycle.
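The cost-shape argument is simple arithmetic, and it can help to write it down. The sketch below assumes the fine-tuned model is cheaper per query than the RAG setup and that training cost is amortised evenly between retrains. Every figure in it is a placeholder, chosen so the break-even lands on the 1M queries/month mark mentioned above; none of them is a measurement, and the function names are hypothetical.

```python
from typing import Optional


def monthly_cost_rag(queries: int, cost_per_query: float) -> float:
    """RAG: cost grows linearly with query volume (embedding + inference per query)."""
    return queries * cost_per_query


def monthly_cost_fine_tuned(queries: int, cost_per_query: float,
                            training_cost: float, months_between_retrains: int) -> float:
    """Fine-tuning: training cost amortised over the retrain interval, plus inference."""
    return training_cost / months_between_retrains + queries * cost_per_query


def break_even_queries(rag_per_query: float, ft_per_query: float,
                       training_cost: float, months_between_retrains: int) -> Optional[float]:
    """Monthly volume above which fine-tuning becomes cheaper, or None if it never does."""
    saving_per_query = rag_per_query - ft_per_query
    if saving_per_query <= 0:
        return None  # fine-tuned inference is not cheaper, so the fixed cost is never recovered
    return (training_cost / months_between_retrains) / saving_per_query


# Placeholder figures, NOT measurements. Replace with your own.
RAG_PER_QUERY = 0.010        # assumed cost per RAG query (embedding + frontier inference)
FT_PER_QUERY = 0.004         # assumed cost per query on a smaller fine-tuned model
TRAINING_COST = 36_000.0     # assumed cost of one training run, eval harness included
RETRAIN_EVERY_MONTHS = 6     # assumed interval between retrains

threshold = break_even_queries(RAG_PER_QUERY, FT_PER_QUERY, TRAINING_COST, RETRAIN_EVERY_MONTHS)
print(f"Fine-tuning is cheaper above roughly {threshold:,.0f} queries/month")
```

The useful output is not the number but its sensitivity: halve the retrain interval or narrow the per-query gap and the threshold moves a long way.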

When fine-tuning earns its place

Three signatures. If you see all three, fine-tuning is probably right. If you see only one, it isn't.

The task shape is narrow and stable. Classification into 12 categories that won't change. Format-conformant extraction. Tone-matching against a long, consistent voice corpus. Anything where the model needs to internalise a pattern, not look up a fact.
Latency or cost demands a smaller model. If a fine-tuned 70B model beats GPT-class quality on the task at a quarter of the latency, fine-tune. If the production budget can tolerate a frontier model, don't.
The corpus is small enough to fit in training data but too large to fit in a prompt. The awkward middle — say, 50–500MB of domain content. Smaller fits a prompt; larger usually retrieves better than it trains.
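The three signatures can be written as a checklist. The sketch below is one possible encoding, not a formal rule: the `TaskProfile` fields and the `recommend` function are hypothetical names, the 50 to 500 MB bounds are the rough range quoted above, and the two-signature case is labelled a judgment call because the text only covers all three and only one.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    narrow_and_stable: bool    # e.g. fixed-category classification, format-conformant extraction
    needs_smaller_model: bool  # latency or cost rules out a frontier model
    corpus_mb: float           # size of the domain corpus in megabytes


def corpus_in_awkward_middle(corpus_mb: float) -> bool:
    """Too large for a prompt, small enough to train on (rough 50-500 MB range)."""
    return 50 <= corpus_mb <= 500


def recommend(profile: TaskProfile) -> str:
    signatures = sum([
        profile.narrow_and_stable,
        profile.needs_smaller_model,
        corpus_in_awkward_middle(profile.corpus_mb),
    ])
    if signatures == 3:
        return "fine-tuning is probably right"
    if signatures <= 1:
        return "stay with RAG"
    # Assumption: the text does not say what two signatures mean.
    return "judgment call"


print(recommend(TaskProfile(narrow_and_stable=True, needs_smaller_model=True, corpus_mb=200)))
print(recommend(TaskProfile(narrow_and_stable=True, needs_smaller_model=False, corpus_mb=5)))
```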

The hybrid we ship most often

Most production AI features we deliver end up looking the same: RAG over the client's content corpus, a frontier model (Claude, GPT) for synthesis, a small fine-tuned classifier for routing (e.g., 'is this question answerable from the docs, or does it need a human?'), and a deterministic fallback when retrieval returns zero hits. Quality gates measure recall on a held-out eval set, latency at p95, and cost per query. We report all three weekly.
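The control flow of that hybrid is short enough to sketch. Everything below is a stand-in: the keyword router takes the place of the fine-tuned classifier, a dictionary lookup takes the place of retrieval, string formatting takes the place of the frontier model, and the names (`answer`, `toy_router`, `toy_retrieve`, `toy_synthesize`, `FALLBACK`) are hypothetical. What it shows is the order of the checks, with the deterministic fallback firing when retrieval returns nothing.

```python
from typing import Callable

FALLBACK = "I couldn't find this in the documentation. Passing you to a person."


def answer(question: str,
           is_answerable: Callable[[str], bool],
           retrieve: Callable[[str], list[str]],
           synthesize: Callable[[str, list[str]], str]) -> dict:
    """Route, retrieve, then synthesise; fall back deterministically on zero hits."""
    if not is_answerable(question):      # small classifier: docs or human?
        return {"route": "human", "answer": FALLBACK}
    chunks = retrieve(question)          # RAG over the content corpus
    if not chunks:                       # zero hits: never let the model guess
        return {"route": "fallback", "answer": FALLBACK}
    return {"route": "rag", "answer": synthesize(question, chunks)}


# Toy stand-ins so the sketch runs end to end.
DOCS = {"refund": ["Refunds are processed within 14 days."]}


def toy_router(question: str) -> bool:
    """Stand-in for the fine-tuned routing classifier."""
    return "lawsuit" not in question.lower()


def toy_retrieve(question: str) -> list[str]:
    """Stand-in for vector retrieval: match on a keyword."""
    q = question.lower()
    return [chunk for key, chunks in DOCS.items() if key in q for chunk in chunks]


def toy_synthesize(question: str, chunks: list[str]) -> str:
    """Stand-in for the frontier-model call."""
    return f"Based on the docs: {chunks[0]}"


for q in ["How do refunds work?", "Where is your office?", "I want to file a lawsuit"]:
    result = answer(q, toy_router, toy_retrieve, toy_synthesize)
    print(result["route"], "->", result["answer"])
```

Passing the three components in as functions keeps each one replaceable and testable on its own, which is also what makes the per-component quality gates practical.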

When the answer is no AI at all

A small but non-trivial fraction of AI engagements get killed on the discovery call. The signal: the team wants AI because the board asked, not because the feature would change a metric. We have the conversation and recommend a non-AI solution — usually a search index, an automated workflow, or a redesigned form — that ships faster and costs less.

Saying no to AI work is one of the more useful things a studio can offer right now. Almost every team we talk to is over-indexing on the model and under-indexing on the problem.

Working through whether your project wants RAG, fine-tuning, or no AI at all? The discovery call is the right place to figure it out.
