Case Study · NLP · LLM Systems · Backend
RAG System — Old Mutual
A Retrieval-Augmented Generation pipeline that enables internal teams to ask questions against insurance policy documents and receive accurate, sourced answers in seconds — not hours.
The Problem
Old Mutual’s product teams managed hundreds of policy documents totalling thousands of pages. When customers or internal agents asked specific questions (“Does this policy cover X?”, “What’s the claims window for product Y?”), staff had to manually search through PDFs — taking 10–20 minutes per query and introducing inconsistency.
Goal: Build a natural language Q&A system that retrieves the right document sections and generates accurate, cited answers — deployable as an internal API.
What is RAG?
RAG (Retrieval-Augmented Generation) combines the precision of search with the fluency of large language models:
- Index — Documents are chunked and converted to vector embeddings
- Retrieve — A user’s question is embedded and matched to the most relevant chunks
- Generate — An LLM uses the retrieved chunks as context to compose a grounded answer
This reduces hallucination (the model is instructed to answer only from the retrieved context) and keeps answers sourced and auditable.
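The three steps can be illustrated with a toy, dependency-free sketch. It is not the production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the sample chunks are invented, and the final step only builds the prompt an LLM would receive rather than calling one.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Index: embed every chunk once, up front (invented sample chunks).
chunks = [
    "Claims must be submitted within 30 days of the incident.",
    "Premiums are payable monthly by debit order.",
    "Cover excludes damage caused by wear and tear.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, k: int = 2) -> list[str]:
    """Retrieve: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, k: int = 2) -> str:
    """Generate (input side): place the retrieved chunks in the LLM prompt."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "Within how many days must claims be submitted?"
top = retrieve(question, k=1)
prompt = build_prompt(question, k=1)
```

Here `top` holds only the claims-window chunk, so the prompt handed to the model contains that clause and nothing about premiums or exclusions.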
System Architecture
[PDF Documents]
↓
[Document Loader + Chunker]
↓
[OpenAI text-embedding-3-small]
↓
[FAISS Vector Store]
↑
[Query] → [Embed Query] → [Similarity Search] → [Top-k Chunks]
↓
[GPT-4o-mini + Prompt]
↓
[Answer + Sources]
Implementation
Document Ingestion & Chunking
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load all PDFs from the policy documents folder
loader = PyPDFDirectoryLoader("./data/policies/")
documents = loader.load()

# Chunking strategy: 800 tokens, 100 overlap
# Important for long-form insurance docs with dense clauses
# from_tiktoken_encoder measures length in tokens; the plain constructor counts characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")  # 4,312 chunks from 87 documents
Embedding & Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build and persist vector store
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("./vectorstore/old_mutual_policies")

# Later: load for inference
# The saved index includes a pickle file, so only enable this flag for files you created yourself
vectorstore = FAISS.load_local(
    "./vectorstore/old_mutual_policies",
    embeddings,
    allow_dangerous_deserialization=True,
)
RAG Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are an insurance policy assistant for Old Mutual.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I couldn't find this in the available policy documents."
Always cite the document name and page number.
Context:
{context}
Question: {question}
Answer:""",
)

# Prefix each chunk with its source and page so the model has something to cite;
# by default the "stuff" chain inserts only the chunk text into {context}
DOCUMENT_PROMPT = PromptTemplate(
    input_variables=["page_content", "source", "page"],
    template="[Source: {source}, page {page}]\n{page_content}",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    chain_type_kwargs={"prompt": PROMPT, "document_prompt": DOCUMENT_PROMPT},
    return_source_documents=True,
)
FastAPI Endpoint
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Old Mutual Policy Q&A API")


class QueryRequest(BaseModel):
    question: str
    # Accepted for API compatibility; the retriever above is fixed at k=5
    top_k: int = 5


@app.post("/query")
def query_policies(request: QueryRequest):
    # Declared with plain def so FastAPI runs the blocking chain call in its thread pool
    result = qa_chain.invoke({"query": request.question})
    sources = [
        {
            "document": doc.metadata.get("source", "Unknown"),
            "page": doc.metadata.get("page", "N/A"),
            "excerpt": doc.page_content[:200],
        }
        for doc in result["source_documents"]
    ]
    return {
        "answer": result["result"],
        "sources": sources,
        "query": request.question,
    }
Results
- Answer accuracy (human eval): 97%
- Avg query resolution: <30s
- Policies indexed: 87
- Faster than manual search: 30×
A blind evaluation was conducted on 50 test queries. Human reviewers rated answers on correctness, completeness, and proper citation.
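As a rough check on the headline figures, the sketch below reproduces the speed-up arithmetic from the numbers quoted in this case study (10 to 20 minutes per manual query versus under 30 seconds) and shows one way reviewer ratings could be rolled up into an accuracy score. The rating records, the field names, and the rule that an answer counts as accurate only when it passes all three criteria are assumptions for illustration, not the actual evaluation protocol or data.

```python
def speedup(manual_minutes: float, rag_seconds: float) -> float:
    """How many times faster a RAG answer is than a manual search."""
    return manual_minutes * 60 / rag_seconds


# Manual search took 10-20 minutes; 30 seconds is the upper bound for RAG.
low = speedup(10, 30)   # lower end of the manual range
mid = speedup(15, 30)   # midpoint of the manual range
high = speedup(20, 30)  # upper end of the manual range

# Hypothetical reviewer ratings: one record per test query.
ratings = [
    {"correct": True, "complete": True, "cited": True},
    {"correct": True, "complete": False, "cited": True},
    {"correct": True, "complete": True, "cited": True},
    {"correct": True, "complete": True, "cited": True},
]


def accuracy(records: list[dict]) -> float:
    """Share of answers that pass every criterion (assumed scoring rule)."""
    passed = sum(all(record.values()) for record in records)
    return passed / len(records)


score = accuracy(ratings)  # 3 of the 4 hypothetical answers pass
```

Under these assumptions the speed-up ranges from 20× to 40×, and the quoted 30× matches the 15-minute midpoint of the manual range.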
Key Design Decisions
Chunk size of 800 tokens (rather than the commonly used 512): Insurance policy clauses often span full paragraphs. Smaller chunks split clauses across chunk boundaries and degraded answer quality. Of the three sizes tested (512, 800, and 1200), 800 gave the best retrieval precision.
text-embedding-3-small over ada-002: roughly 80% cheaper per token (about one-fifth of ada-002's list price at launch), with slightly better semantic similarity on domain-specific legal/financial text in benchmarks.
Source citation in prompt: Requiring the model to cite document name and page in the prompt template made answers auditable, which is critical in a regulated financial environment.
k=5 retrieval: Returning 5 chunks gave enough context for complex multi-clause questions without overloading the context window.
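To make the chunk-size and k trade-offs concrete, here is a small sketch of fixed-size chunking with overlap and of the context budget implied by k=5. It treats each list element as one token, which is a simplification, and it is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to cut at separator boundaries). The helper names are hypothetical; the 128,000-token figure is gpt-4o-mini's published context window.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def context_tokens(k: int, chunk_size: int) -> int:
    """Upper bound on the retrieved context passed to the model."""
    return k * chunk_size


# Stand-in for a 2,000-token document.
words = [f"w{i}" for i in range(2000)]
chunks = chunk_with_overlap(words, size=800, overlap=100)
sizes = [len(chunk) for chunk in chunks]

budget = context_tokens(k=5, chunk_size=800)
share = budget / 128_000  # fraction of a 128k-token context window
```

With these settings the last 100 tokens of each chunk reappear at the start of the next one, so a clause that straddles a boundary is still seen whole in at least one chunk. Five 800-token chunks come to at most 4,000 tokens, a small fraction of the model's window.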
Challenges
- Scanned PDFs: About 15% of documents were scanned images, not text PDFs. Added a pdf2image + pytesseract OCR preprocessing step to handle these.
- Table extraction: Policy tables (premium tiers, coverage limits) were misread by the splitter. Implemented a custom table-detection step using pdfplumber.
- Confidential data handling: Raw documents and the FAISS store stayed on-premise; no files were uploaded to any cloud service. Only extracted text chunks were sent to the OpenAI API for embedding and answer generation.
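The first two challenges can be sketched without the PDF libraries themselves. The helpers below decide whether a page needs OCR based on how much text its text layer yields, and flatten an extracted table into one self-describing line per row so a text splitter cannot separate values from their headers. The input shape (a list of rows, each a list of cell strings or None) mirrors what pdfplumber's `extract_tables()` yields for each table. The function names, the 20-character threshold, and the sample table are assumptions for illustration, not the project's actual code.

```python
from typing import Optional


def needs_ocr(page_text: Optional[str], min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is missing or nearly empty."""
    return len((page_text or "").strip()) < min_chars


def table_to_lines(table: list[list[Optional[str]]]) -> list[str]:
    """Turn a header row plus data rows into 'header: value' lines."""
    if not table:
        return []
    header, *rows = table
    lines = []
    for row in rows:
        cells = [
            f"{(h or '').strip()}: {(c or '').strip()}"
            for h, c in zip(header, row)
            if (c or "").strip()  # skip empty cells
        ]
        lines.append("; ".join(cells))
    return lines


# Invented sample table in the row-list shape described above.
sample = [
    ["Tier", "Monthly premium", "Cover limit"],
    ["Basic", "R150", "R100,000"],
    ["Plus", "R280", None],
]
lines = table_to_lines(sample)
```

Each resulting line, such as the one for the Basic tier, carries its own column names, so it stays meaningful even if it ends up in a different chunk from the rest of the table.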