Case Study · Machine Learning

Loan Default Prediction

Building a classification pipeline that predicts borrower default risk from financial and behavioural features — with interpretable outputs that help lenders make fairer, faster decisions.

<span class="case-meta-label">Domain</span>
<span class="case-meta-value">FinTech / Credit Risk</span>

<span class="case-meta-label">Stack</span>
<span class="case-meta-value">Python · scikit-learn · XGBoost · SHAP</span>

<span class="case-meta-label">Timeline</span>
<span class="case-meta-value">8 weeks</span>

<span class="case-meta-label">Status</span>
<span class="case-meta-value">✅ Complete</span>

The Problem

Loan default is one of the most costly risks in consumer lending. Traditional credit scoring relies on narrow signals (credit score, income) that miss important behavioural indicators — and often exclude borrowers from underserved communities who could repay.

Goal: Build a model that:

1. Predicts default probability with high accuracy
2. Is interpretable enough for loan officers to trust and explain
3. Can run in real-time via API

Data

The dataset (sourced from a public Kaggle credit risk competition) contains 32,581 loan records with 11 features:

Feature	Type	Description
`person_age`	numeric	Applicant age
`person_income`	numeric	Annual income (USD)
`loan_amnt`	numeric	Requested loan amount
`loan_int_rate`	numeric	Loan interest rate
`loan_percent_income`	numeric	Loan as % of income
`cb_person_default_on_file`	binary	Historical default flag
`loan_grade`	ordinal	Lender-assigned grade (A–G)
…	…	…

Class distribution: 78% non-default · 22% default (moderate imbalance handled with class weights)
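The class ratio translates directly into a weight for the positive (default) class. The following is a minimal sketch, assuming labels are coded 1 for default and 0 otherwise; the helper name `pos_weight` and the toy label array are illustrative and not part of the project code.

```python
import numpy as np


def pos_weight(y):
    """Return n_negative / n_positive, a common starting value for a positive-class weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples in y")
    return n_neg / n_pos


# Toy labels that mirror the 78% / 22% split (illustrative, not the real data)
y_example = np.array([0] * 78 + [1] * 22)
weight = pos_weight(y_example)
print(f"positive-class weight: {weight:.2f}")
```

For a 78/22 split the ratio is about 3.55, which is in line with the `scale_pos_weight=3.5` value used when training XGBoost.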

Preprocessing

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_risk_dataset.csv")

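# Drop duplicates and handle nulls
df = df.drop_duplicates()
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may leave df unchanged
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())
df['person_emp_length'] = df['person_emp_length'].fillna(0)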

# Encode categoricals
le = LabelEncoder()
df['loan_grade'] = le.fit_transform(df['loan_grade'])
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].map({'Y': 1, 'N': 0})

# Feature engineering
df['debt_to_income'] = df['loan_amnt'] / (df['person_income'] + 1)
df['income_per_year_employed'] = df['person_income'] / (df['person_emp_length'] + 1)

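X = df.drop('loan_status', axis=1)
y = df['loan_status']

# Assumption: any text columns still left in X are nominal categoricals;
# one-hot encode them so the models receive numeric input only
X = pd.get_dummies(X, dtype=int)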

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Methodology

Three models were evaluated using 5-fold cross-validation:

Model	CV AUC-ROC	Notes
Logistic Regression	0.82	Good baseline, less expressive
Random Forest	0.88	Strong, moderate interpretability
XGBoost	0.91	Best performance, chosen model

XGBoost Training

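from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out part of the training data for early stopping so the test set
# stays untouched until the final evaluation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=3.5,       # handle class imbalance (roughly 78 / 22)
    eval_metric='auc',
    early_stopping_rounds=30,   # set here: the fit() argument was deprecated in XGBoost 1.6 and later removed
    random_state=42
)

model.fit(
    X_fit, y_fit,
    eval_set=[(X_val, y_val)],
    verbose=False
)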

y_pred_proba = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.4f}")

Results

<div class="result-number">89%</div>
<div class="result-label">Accuracy</div>

<div class="result-number">0.91</div>
<div class="result-label">AUC-ROC</div>

<div class="result-number">86%</div>
<div class="result-label">Recall (defaults)</div>

<div class="result-number">83%</div>
<div class="result-label">Precision</div>
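Headline metrics of this kind can be derived from predicted probabilities as sketched below. The 0.5 decision threshold, the helper name `summarize`, and the toy arrays are assumptions for illustration; the source does not state which threshold produced the reported figures.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def summarize(y_true, y_proba, threshold=0.5):
    """Compute threshold-based metrics plus AUC-ROC from default probabilities."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    # Turn probabilities into hard predictions at the chosen threshold
    y_pred = (y_proba >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_proba),  # threshold-independent
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
    }


# Toy example: 3 defaults among 8 loans (invented numbers)
y_true_demo = [0, 0, 0, 0, 1, 1, 0, 1]
y_proba_demo = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.05, 0.9]
metrics = summarize(y_true_demo, y_proba_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lowering the threshold generally raises recall on defaults at the cost of precision, so the operating point is a business decision as much as a modelling one.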

Feature Importance (SHAP)

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")

Top predictive features:

1. `loan_percent_income`: strongest signal (high debt relative to income → higher default)
2. `loan_int_rate`: high rates correlate with riskier applicants
3. `cb_person_default_on_file`: historical behaviour is predictive
4. `loan_grade`: lender assessment provides useful signal
5. `debt_to_income`: engineered feature that added lift
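The bar-style summary plot orders features by their mean absolute SHAP value. The same ranking can be reproduced numerically, as in this sketch; the function name `rank_by_mean_abs_shap` and the toy matrix are invented for illustration and are not the project's SHAP values.

```python
import numpy as np


def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| across rows (rows = samples, columns = features)."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    importance = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest importance first
    return [(feature_names[i], float(importance[i])) for i in order]


# Toy SHAP matrix for three applicants and three features (invented numbers)
names = ["loan_percent_income", "loan_int_rate", "loan_grade"]
toy_shap = np.array([
    [0.4, -0.1, 0.05],
    [-0.6, 0.2, -0.05],
    [0.5, -0.3, 0.05],
])

ranking = rank_by_mean_abs_shap(toy_shap, names)
for feature, score in ranking:
    print(f"{feature}: {score:.3f}")
```

Taking the absolute value first matters: a feature that pushes some applicants strongly up and others strongly down would otherwise average out to near zero.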

Insights & Impact

Interpretability matters: SHAP plots allowed loan officers to understand why a prediction was made — not just the probability. This was critical for regulatory compliance.
Feature engineering lifted AUC by ~0.02: The derived debt_to_income and income_per_year_employed features improved performance over raw features alone.
Class imbalance needed explicit handling: Without scale_pos_weight, the model missed 30% of actual defaults.
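A per-applicant explanation of the kind a loan officer might read can be sketched from one row of SHAP values. This relies on SHAP's additivity property (base value plus the row's contributions equals the model output) and assumes the values are in log-odds space, which is the usual default for tree explainers on binary classifiers. The helper `explain_row` and all numbers are hypothetical.

```python
import numpy as np


def explain_row(shap_row, feature_names, base_value, top_k=3):
    """Return the implied default probability and the top_k signed contributions."""
    shap_row = np.asarray(shap_row, dtype=float)
    # Additivity: model output (log-odds) = base value + sum of contributions
    log_odds = base_value + shap_row.sum()
    probability = 1.0 / (1.0 + np.exp(-log_odds))
    # Largest absolute contributions first; the sign shows the direction of the push
    order = np.argsort(np.abs(shap_row))[::-1][:top_k]
    drivers = [(feature_names[i], float(shap_row[i])) for i in order]
    return float(probability), drivers


# Invented values for a single applicant
names = ["loan_percent_income", "loan_int_rate", "person_income", "person_age"]
row = [0.9, 0.3, -0.2, 0.05]
prob, drivers = explain_row(row, names, base_value=-1.05)

print(f"Estimated default probability: {prob:.2f}")
for feature, contribution in drivers:
    direction = "raises" if contribution > 0 else "lowers"
    print(f"  {feature} {direction} risk ({contribution:+.2f} log-odds)")
```

Presenting only the few largest signed contributions keeps the explanation short enough to include alongside a lending decision.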

Future Improvements

Deploy as a FastAPI endpoint with real-time scoring
Integrate with a credit bureau API for live feature enrichment
Experiment with calibrated probabilities (Platt scaling) for better risk tiers
Build a monitoring dashboard for data drift detection
Run a fairness audit across demographic groups
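As a starting point for the calibration item, Platt scaling is available in scikit-learn through `CalibratedClassifierCV` with `method="sigmoid"`. The sketch below uses synthetic data and a random forest as a stand-in base model; the tier cut-offs (0.10 and 0.30) are hypothetical and would need to be set with the lender.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (roughly 78% / 22% classes)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=6,
    weights=[0.78, 0.22], random_state=42,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# method="sigmoid" is Platt scaling: a logistic curve is fitted to the
# base model's scores on held-out folds (cv=5)
base = RandomForestClassifier(n_estimators=100, random_state=42)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_fit, y_fit)

proba = calibrated.predict_proba(X_hold)[:, 1]
brier = brier_score_loss(y_hold, proba)  # lower means better calibrated
print(f"Brier score: {brier:.4f}")

# Hypothetical risk tiers: 0 = low, 1 = medium, 2 = high
tiers = np.digitize(proba, bins=[0.10, 0.30])
print("Loans per tier:", np.bincount(tiers, minlength=3))
```

Calibration matters for tiering because tier boundaries are expressed as probabilities; an uncalibrated score of 0.30 does not necessarily correspond to a 30% default rate.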
