Case Study · NLP · LLM Systems · Backend

RAG System — Old Mutual

A Retrieval-Augmented Generation pipeline that lets internal teams ask questions against insurance policy documents and receive accurate, sourced answers in seconds rather than the 10–20 minutes a manual search took.

Domain: Insurance / Enterprise AI
Stack: LangChain · OpenAI · FAISS · FastAPI
Timeline: 6 weeks
Status: ✅ Delivered

The Problem

Old Mutual’s product teams managed hundreds of policy documents totalling thousands of pages. When customers or internal agents asked specific questions (“Does this policy cover X?”, “What’s the claims window for product Y?”), staff had to manually search through PDFs — taking 10–20 minutes per query and introducing inconsistency.

Goal: Build a natural language Q&A system that retrieves the right document sections and generates accurate, cited answers — deployable as an internal API.

What is RAG?

RAG (Retrieval-Augmented Generation) combines the precision of search with the fluency of large language models:

Index — Documents are chunked and converted to vector embeddings
Retrieve — A user’s question is embedded and matched to the most relevant chunks
Generate — An LLM uses the retrieved chunks as context to compose a grounded answer

This reduces hallucination (the model is instructed to answer only from the retrieved context) and keeps answers sourced and auditable.
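
The three steps above can be sketched without any external services. This is a toy illustration only: the word-count `embed` function stands in for a real embedding model, the three sample chunks are invented, and the final LLM call is omitted so that only the index, retrieve, and prompt-assembly logic is shown.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Index: embed every chunk once, ahead of time (sample text is invented)
chunks = [
    "Claims must be submitted within 90 days of the incident.",
    "Premiums are payable monthly by debit order.",
    "Cover excludes damage caused by floods.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 2) -> list[str]:
    """2. Retrieve: rank chunks by similarity to the embedded question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, k: int = 2) -> str:
    """3. Generate: assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt("How many days do I have to submit claims?")` places the claims-window chunk first in the context, which is the text the model is then asked to answer from.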

System Architecture

[PDF Documents]
      ↓
[Document Loader + Chunker]
      ↓
[OpenAI text-embedding-3-small]
      ↓
[FAISS Vector Store]
      ↑
[Query] → [Embed Query] → [Similarity Search] → [Top-k Chunks]
                                                       ↓
                                               [GPT-4o-mini + Prompt]
                                                       ↓
                                               [Answer + Sources]

Implementation

Document Ingestion & Chunking

# Requires the langchain-community, langchain-text-splitters and pypdf packages
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load all PDFs from the policy documents folder
loader = PyPDFDirectoryLoader("./data/policies/")
documents = loader.load()

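# Chunking strategy: chunk_size=800 with 100 overlap.
# Note: RecursiveCharacterTextSplitter counts characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to size chunks in tokens.
# Overlap matters for long-form insurance docs with dense clauses.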
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")  # 4,312 chunks from 87 documents
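
To make the effect of `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); it only illustrates how consecutive chunks share a margin of text so that a clause cut at a boundary still appears whole in one of them. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 1)` yields three windows in which each chunk begins with the last character of the previous one.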

Embedding & Vector Store

# Requires the langchain-openai, langchain-community and faiss-cpu packages
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build and persist vector store
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("./vectorstore/old_mutual_policies")

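# Later: load for inference.
# allow_dangerous_deserialization=True unpickles the saved docstore, so only
# load index files that this pipeline wrote itself.
vectorstore = FAISS.load_local(
    "./vectorstore/old_mutual_policies",
    embeddings,
    allow_dangerous_deserialization=True
)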

RAG Chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are an insurance policy assistant for Old Mutual.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I couldn't find this in the available policy documents."
Always cite the document name and page number.

Context:
{context}

Question: {question}

Answer:"""
)

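# The "stuff" chain passes only page_content by default; label each chunk so the model can cite it
DOCUMENT_PROMPT = PromptTemplate(
    input_variables=["page_content", "source", "page"],
    template="[Source: {source}, page {page}]\n{page_content}",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    chain_type_kwargs={"prompt": PROMPT, "document_prompt": DOCUMENT_PROMPT},
    return_source_documents=True
)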
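
Because the prompt tells the model to reply with a fixed sentence when the context does not contain the answer, a caller can detect that case with a simple string check. The helper below is a hypothetical sketch, not part of the delivered system; it assumes the model repeats the fallback sentence close to verbatim, which a prompt alone cannot guarantee.

```python
FALLBACK = "I couldn't find this in the available policy documents."

def is_answered(answer: str) -> bool:
    """Return False when the reply contains the fallback sentence from the prompt."""
    # Normalise curly apostrophes so "couldn’t" and "couldn't" compare equal
    normalised = answer.replace("\u2019", "'").lower()
    return FALLBACK.lower() not in normalised
```

A flag like this could be used to log unanswered questions, which in turn indicates which documents or topics are missing from the index.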

FastAPI Endpoint

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Old Mutual Policy Q&A API")

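class QueryRequest(BaseModel):
    question: str
    top_k: int = 5  # accepted but not yet applied: the retriever above is fixed at k=5

@app.post("/query")
def query_policies(request: QueryRequest):
    # Plain `def` so FastAPI runs this blocking chain call in its threadpool
    result = qa_chain.invoke({"query": request.question})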
    sources = [
        {
            "document": doc.metadata.get("source", "Unknown"),
            "page": doc.metadata.get("page", "N/A"),
            "excerpt": doc.page_content[:200]
        }
        for doc in result["source_documents"]
    ]
    return {
        "answer": result["result"],
        "sources": sources,
        "query": request.question
    }
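
One detail worth handling in the response is the page number: PyPDF-based LangChain loaders store `page` as a 0-based index, so returning it unchanged cites one page too low. The sketch below works on plain metadata dicts and also collapses duplicate citations when several chunks come from the same page. The function name `summarise_sources` is hypothetical, and the 0-based assumption should be checked against whichever loader produced the metadata.

```python
def summarise_sources(metadatas: list[dict]) -> list[dict]:
    """Collapse chunk metadata into unique (document, page) citations."""
    seen = set()
    sources = []
    for meta in metadatas:
        document = meta.get("source", "Unknown")
        page = meta.get("page")
        # Assumption: page is a 0-based index, so add 1 for a human-readable page number
        display_page = page + 1 if isinstance(page, int) else "N/A"
        key = (document, display_page)
        if key not in seen:
            seen.add(key)
            sources.append({"document": document, "page": display_page})
    return sources
```

In the endpoint this could be called as `summarise_sources([doc.metadata for doc in result["source_documents"]])`.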

Results

Answer accuracy (human eval): 97%
Avg query resolution: <30s
Policies indexed: 87
Faster than manual search: 30×

A blind evaluation was conducted on 50 test queries. Human reviewers rated answers on correctness, completeness, and proper citation.
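
The source does not state how the reviewer ratings were combined into a single accuracy figure. As one possible aggregation, assuming each answer receives a pass/fail mark per criterion, the sketch below reports a pass rate per criterion plus the share of answers that pass all three. The criterion keys and the function name are illustrative.

```python
from statistics import mean

CRITERIA = ("correctness", "completeness", "citation")

def score_answers(ratings: list[dict[str, bool]]) -> dict[str, float]:
    """Pass rate per criterion, plus the share of answers passing every criterion."""
    scores = {c: mean(1.0 if r[c] else 0.0 for r in ratings) for c in CRITERIA}
    scores["all_criteria"] = mean(
        1.0 if all(r[c] for c in CRITERIA) else 0.0 for r in ratings
    )
    return scores
```

Reporting the per-criterion rates alongside the combined figure shows whether failures come from wrong answers or from missing citations.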

Key Design Decisions

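Chunk size of 800 (rather than the commonly used 512): Insurance policy clauses often span full paragraphs. Smaller chunks broke clause context, degrading answer quality. Tested 512, 800, and 1200; 800 gave the best retrieval precision. Note that RecursiveCharacterTextSplitter measures chunk_size in characters unless it is given a token-based length function, so sizing chunks in tokens requires a tokenizer-aware splitter.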
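
A comparison like this needs a retrieval metric. The sketch below computes precision@k over a small labelled query set; it assumes each test query has a known set of relevant chunk IDs, which the source does not describe, and the function names are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k

def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```

Running the same query set against an index built at each chunk size and comparing `mean_precision_at_k` gives a like-for-like basis for the choice.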

text-embedding-3-small over ada-002: roughly 80% cheaper per token (one fifth of the ada-002 list price at launch), with slightly better semantic similarity on domain-specific legal/financial text in benchmarks.

Source citation in prompt: Requiring the model to cite document name and page in the prompt template made answers auditable, which is critical in a regulated financial environment.

k=5 retrieval: Returning 5 chunks gave enough context for complex multi-clause questions without overloading the context window.

Challenges

Scanned PDFs: About 15% of documents were scanned images, not text PDFs. Added a pdf2image + pytesseract OCR preprocessing step to handle these.
Table extraction: Policy tables (premium tiers, coverage limits) were misread by the splitter. Implemented a custom table-detection step using pdfplumber.
Confidential data handling: Raw policy files and the FAISS index stayed on-premise, and no documents were uploaded to any cloud service. Only extracted text chunks and user questions were sent to the OpenAI API, for embedding and answer generation.
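
The source does not say how scanned documents were told apart from text PDFs. One common heuristic, sketched here as an assumption, is to attempt normal text extraction first and route any page that yields almost no text to OCR. The `min_chars` threshold and both function names are illustrative.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scanned image."""
    return len(page_text.strip()) < min_chars

def route_pages(pages: list[str]) -> dict[str, list[int]]:
    """Group 1-based page numbers by whether they need OCR or can use extracted text."""
    routes: dict[str, list[int]] = {"text": [], "ocr": []}
    for number, text in enumerate(pages, start=1):
        routes["ocr" if needs_ocr(text) else "text"].append(number)
    return routes
```

Pages listed under `"ocr"` would then go through the pdf2image + pytesseract step, while the rest keep the faster direct extraction.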

Future Improvements

Add conversation memory for multi-turn queries
Hybrid search: combine BM25 + vector similarity for better recall
Document freshness tracking — re-index when policies are updated
Fine-tune embedding model on domain-specific terminology
Build an internal Slack bot interface
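
For the hybrid search item, one widely used way to merge a BM25 ranking with a vector ranking is Reciprocal Rank Fusion, which needs only the rank positions and not comparable scores. This is a sketch of one option rather than a planned design; the chunk IDs are placeholders and `k=60` is the constant conventionally used with this method.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

bm25_ranking = ["c1", "c2", "c3"]    # placeholder keyword-search results
vector_ranking = ["c2", "c4", "c1"]  # placeholder embedding-search results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Chunks that appear near the top of both lists rise to the front, while chunks found by only one retriever are still kept, which is where the recall gain comes from.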
