How I built a RAG system from scratch
June 2025 · 8 min read
When I was brought in to build a Q&A system on top of insurance policy documents, I knew RAG (Retrieval-Augmented Generation) was the right approach. But knowing what to build and knowing how to make it actually work in production are very different things.
Here’s what I learned — and what I’d do differently.
Why RAG, not fine-tuning?
My first instinct was to fine-tune a model on the documents. But there are two problems with this for a document Q&A use case:
- Fine-tuning teaches style, not facts. A fine-tuned model learns how the source material is written, but it doesn’t reliably memorise specific clauses or numbers. Ask it “what’s the premium for plan B?” and it might hallucinate a confident answer.
- Documents change. Insurance policies get updated quarterly. Fine-tuning would require retraining every update cycle — expensive and slow.
RAG sidesteps both problems. The pipeline retrieves the actual text at query time, so the model answers from what’s in the documents rather than from what it memorised during training. And when documents change, you just re-index; no retraining needed.
The pipeline in brief
A RAG pipeline has two phases:
Offline (build the index): load documents → chunk into passages → embed each passage → store in a vector database
Online (answer a query): embed the query → find the top-k most similar passages → pass those passages + the query to an LLM → get a grounded answer
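A minimal sketch of the two phases, to make the data flow concrete. It is not the production implementation: the `embed` function is a toy word-count stand-in for a real embedding model, the "vector database" is a plain list, the example passages are made up, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(passages: list[str]) -> list[tuple[str, Counter]]:
    """Offline phase: embed every passage and store it next to its text."""
    return [(p, embed(p)) for p in passages]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online phase: embed the query and return the k most similar passages."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up passages, purely for illustration.
passages = [
    "Plan B covers hospital stays after a 30 day waiting period.",
    "Claims must be submitted within 90 days of the incident.",
    "Premiums are reviewed once a year.",
]
index = build_index(passages)
top = retrieve(index, "When must claims be submitted?", k=1)
print(top)
```

In a real system the retrieved passages would then be formatted into the prompt and sent to the LLM together with the question.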
Simple in theory. Surprisingly tricky in practice.
The chunk size problem
My first run used LangChain’s default 512-token chunks with 50-token overlap. Retrieval was accurate in the sense that it found relevant passages, but the answers were frequently incomplete: insurance policy clauses often span full paragraphs, with important conditions buried three sentences in.
I tested three sizes:
| Chunk size | Retrieval precision | Answer completeness |
|---|---|---|
| 512 tokens | High | Often incomplete |
| 800 tokens | High | Good |
| 1200 tokens | Slightly lower | Good, but noisy |
800 tokens with a 100-token overlap turned out to be the sweet spot for this document type. The lesson: chunk size is domain-dependent. Academic papers and legal documents need bigger chunks than FAQs or product descriptions.
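To show what "800 tokens with 100 overlap" means mechanically, here is a small sliding-window chunker. It is a sketch under one loud assumption: the input is already a list of tokens (the example uses placeholder tokens), whereas a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens standing in for a tokenized 2,000-token document.
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With these settings each window starts 700 tokens after the previous one, so a condition that sits near a chunk boundary appears in both neighbouring chunks.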
The embedding model choice
I compared text-embedding-ada-002 and text-embedding-3-small:
- 3-small was ~20% cheaper and performed marginally better on domain-specific similarity tasks
- Both significantly outperformed sentence transformers for this use case (long-form financial/legal text)
I’d lean toward 3-small for most production use cases today.
The prompt matters more than you think
Early answers were technically correct but often confusing or verbose. The key improvements came from the prompt:
PROMPT = """You are an insurance policy assistant.
Answer using ONLY the provided context.
If the answer is not in the context, say exactly:
"I couldn't find this in the available policy documents."
Keep answers concise and always cite: [Document Name, Page N].
Context:
{context}
Question: {question}"""
Three things that made a big difference:
1. Explicit fallback instruction: stops the model from guessing when context is insufficient
2. Citation requirement: makes answers auditable, builds trust
3. Concision instruction: early answers were 3× longer than needed
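The citation requirement only works if the context the model sees carries the document name and page number. A sketch of how the template might be filled, assuming each retrieved passage comes with that metadata; the `Passage` class, its field names, and the sample passage are hypothetical and not taken from the original project.

```python
from dataclasses import dataclass

PROMPT = """You are an insurance policy assistant.
Answer using ONLY the provided context.
If the answer is not in the context, say exactly:
"I couldn't find this in the available policy documents."
Keep answers concise and always cite: [Document Name, Page N].
Context:
{context}
Question: {question}"""


@dataclass
class Passage:
    """A retrieved chunk plus the metadata needed for citations (hypothetical shape)."""
    document: str
    page: int
    text: str


def build_prompt(passages: list[Passage], question: str) -> str:
    """Label each passage with its source so the model can cite it, then fill the template."""
    context = "\n\n".join(
        f"[{p.document}, Page {p.page}]\n{p.text}" for p in passages
    )
    return PROMPT.format(context=context, question=question)


# Made-up passage, purely for illustration.
prompt = build_prompt(
    [Passage("Plan B Policy", 12, "Claims must be submitted within 90 days.")],
    "What is the claims deadline?",
)
print(prompt)
```

Prefixing every passage with its source label gives the model something concrete to copy into the citation, instead of asking it to infer where the text came from.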
What I’d do differently
- Hybrid search from day one. Pure vector similarity misses exact keyword matches (policy numbers, specific clause references). BM25 + vector combined would have improved recall.
- Chunk at semantic boundaries. Recursive character splitting doesn’t respect document structure. For PDFs with clear sections, splitting on section headers would produce cleaner chunks.
- Evaluate systematically. I evaluated by manually checking 50 queries. A proper RAG eval framework (RAGAS, for example) would have given me faster, more rigorous feedback on retrieval and generation quality separately.
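For the hybrid search point, one common way to combine a keyword ranking and a vector ranking is reciprocal rank fusion (RRF). The sketch below shows RRF as an example technique, not something the original system used. It assumes the BM25 and vector searches have already run and each returned an ordered list of chunk IDs; the IDs are made up, and `k=60` is the conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up results from a keyword (BM25) search and a vector search.
bm25_ranking = ["clause_7", "clause_2", "clause_9"]
vector_ranking = ["clause_2", "clause_4", "clause_7"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while an exact keyword hit that the vector search missed (here `clause_9`) still makes it into the fused list.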
Final thoughts
RAG is genuinely powerful for document Q&A — but the defaults won’t cut it for production. The work is in the details: chunk size, overlap, embedding choice, prompt design, and evaluation. Each decision compounds.
For the Old Mutual project, the system ultimately reduced average query resolution from 15 minutes to under 30 seconds, with 97% answer accuracy on human evaluation. Worth every iteration.
Want to see the full implementation? Check out the project case study or the GitHub repo.