Case Study · Machine Learning
Loan Default Prediction
Building a classification pipeline that predicts borrower default risk from financial and behavioural features — with interpretable outputs that help lenders make fairer, faster decisions.
The Problem
Loan default is one of the most costly risks in consumer lending. Traditional credit scoring relies on narrow signals (credit score, income) that miss important behavioural indicators — and often exclude borrowers from underserved communities who could repay.
Goal: Build a model that:
1. Predicts default probability with high accuracy
2. Is interpretable enough for loan officers to trust and explain
3. Can run in real-time via API
Data
The dataset (sourced from a public Kaggle credit risk competition) contains 32,581 loan records with 11 features:
| Feature | Type | Description |
|---|---|---|
| person_age | numeric | Applicant age |
| person_income | numeric | Annual income (USD) |
| loan_amnt | numeric | Requested loan amount |
| loan_int_rate | numeric | Loan interest rate |
| loan_percent_income | numeric | Loan as % of income |
| cb_person_default_on_file | binary | Historical default flag |
| loan_grade | ordinal | Lender-assigned grade (A–G) |
| … | … | … |
Class distribution: 78% non-default · 22% default (moderate imbalance handled with class weights)
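The class weight can be derived directly from the class split above. As a rough sketch, the ratio of non-default to default records gives the value commonly passed to XGBoost's scale_pos_weight. The counts below are reconstructed from the stated 78% / 22% split and the 32,581-record total, so they are approximations rather than figures read from the dataset.

```python
# Approximate class counts, derived from the stated 78% / 22% split.
# These are illustrative; the exact counts come from the data itself.
total_records = 32_581
default_share = 0.22

n_default = round(total_records * default_share)
n_non_default = total_records - n_default

# Common heuristic: weight positives by (negatives / positives).
scale_pos_weight = n_non_default / n_default
print(f"defaults: {n_default}, non-defaults: {n_non_default}")
print(f"suggested scale_pos_weight: {scale_pos_weight:.2f}")
```

The result lands close to the 3.5 used in the XGBoost training configuration.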
Preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_risk_dataset.csv")

# Drop duplicates and handle nulls
df = df.drop_duplicates()
# Assign the result back; chained fillna(..., inplace=True) is deprecated in pandas 2.x
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())
df['person_emp_length'] = df['person_emp_length'].fillna(0)

# Encode categoricals (alphabetical order maps grades A–G to 0–6)
le = LabelEncoder()
df['loan_grade'] = le.fit_transform(df['loan_grade'])
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].map({'Y': 1, 'N': 0})

# Feature engineering (+1 in the denominator avoids division by zero)
df['debt_to_income'] = df['loan_amnt'] / (df['person_income'] + 1)
df['income_per_year_employed'] = df['person_income'] / (df['person_emp_length'] + 1)

X = df.drop('loan_status', axis=1)
# Assumption: any remaining text columns are one-hot encoded (no-op if there are none)
X = pd.get_dummies(X)
y = df['loan_status']

# Stratify so train and test keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Methodology
Three models were evaluated using 5-fold cross-validation:
| Model | CV AUC-ROC | Notes |
|---|---|---|
| Logistic Regression | 0.82 | Good baseline, less expressive |
| Random Forest | 0.88 | Strong, moderate interpretability |
| XGBoost | 0.91 | Best performance, chosen model |
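A comparison like the one in the table can be set up with scikit-learn's cross-validation helpers. The sketch below is illustrative only: it uses synthetic data from make_classification in place of the loan dataset, covers just the two scikit-learn models, and its hyperparameters are assumptions, so its scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the loan data, with a similar 78/22 class split.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, weights=[0.78, 0.22], random_state=42
)

# Stratified folds keep the default rate roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical settings, chosen for the demo rather than taken from the project.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean AUC-ROC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a gap between two models is larger than fold-to-fold noise.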
XGBoost Training
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, classification_report

model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=3.5,  # handle class imbalance (roughly 78 / 22)
    eval_metric='auc',
    early_stopping_rounds=30,  # constructor argument (xgboost >= 1.6); deprecated in fit()
    random_state=42
)

# Note: early stopping on the test set makes the reported score optimistic;
# a separate validation split is the safer choice.
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Probability of the positive class (default)
y_pred_proba = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.4f}")
Results
| Metric | Value |
|---|---|
| Accuracy | 89% |
| AUC-ROC | 0.91 |
| Recall (defaults) | 86% |
| Precision | 83% |
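To make the relationship between these metrics concrete, the sketch below computes accuracy, recall, and precision from confusion-matrix counts. The counts are hypothetical, chosen only to illustrate the formulas; they are not the project's actual test-set results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Recall: share of actual defaults the model caught.
        "recall": tp / (tp + fn),
        # Precision: share of predicted defaults that really defaulted.
        "precision": tp / (tp + fp),
    }

# Hypothetical counts for 1,000 test loans, 220 of them actual defaults.
metrics = classification_metrics(tp=180, fp=60, fn=40, tn=720)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With imbalanced classes, accuracy alone can look strong even when many defaults are missed, which is why recall and precision are reported alongside it.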
Feature Importance (SHAP)
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Bar chart of mean absolute SHAP value per feature
shap.summary_plot(shap_values, X_test, plot_type="bar")
Top predictive features:
1. loan_percent_income: strongest signal (high debt relative to income → higher default)
2. loan_int_rate: high rates correlate with riskier applicants
3. cb_person_default_on_file: historical behaviour is predictive
4. loan_grade: lender assessment provides useful signal
5. debt_to_income: engineered feature that added lift
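SHAP attributes each prediction to individual features using Shapley values. The sketch below computes exact Shapley values for a hypothetical three-feature risk score and checks that the attributions sum to the gap between the applicant's score and a baseline score. The toy_risk_score function, its coefficients, and the applicant and baseline values are all invented for illustration. TreeExplainer computes the same kind of attribution efficiently for tree ensembles; this brute-force version is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

FEATURES = ["loan_percent_income", "loan_int_rate", "cb_person_default_on_file"]

def toy_risk_score(x: dict) -> float:
    """Hypothetical risk score with one interaction term (not the real model)."""
    score = 0.5 * x["loan_percent_income"] + 0.02 * x["loan_int_rate"]
    score += 0.2 * x["cb_person_default_on_file"]
    score += 0.3 * x["loan_percent_income"] * x["cb_person_default_on_file"]
    return score

def value(coalition, applicant, baseline):
    """Score when only features in `coalition` take the applicant's values."""
    mixed = {f: (applicant[f] if f in coalition else baseline[f]) for f in FEATURES}
    return toy_risk_score(mixed)

def shapley_values(applicant, baseline):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(FEATURES)
    result = {}
    for feature in FEATURES:
        others = [f for f in FEATURES if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {feature}, applicant, baseline)
                without_f = value(set(subset), applicant, baseline)
                total += weight * (with_f - without_f)
        result[feature] = total
    return result

# Hypothetical reference point and applicant.
baseline = {"loan_percent_income": 0.15, "loan_int_rate": 10.0, "cb_person_default_on_file": 0}
applicant = {"loan_percent_income": 0.40, "loan_int_rate": 15.0, "cb_person_default_on_file": 1}

phi = shapley_values(applicant, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.4f}")

# Efficiency property: contributions sum to score(applicant) - score(baseline).
gap = toy_risk_score(applicant) - toy_risk_score(baseline)
print(f"sum of contributions = {sum(phi.values()):.4f}, score gap = {gap:.4f}")
```

Because the contributions add up to the prediction gap, a loan officer can read each value as that feature's share of why an applicant scores above or below the reference point.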
Insights & Impact
- Interpretability matters: SHAP plots allowed loan officers to understand why a prediction was made — not just the probability. This was critical for regulatory compliance.
- Feature engineering lifted AUC by ~0.02: The derived debt_to_income and income_per_year_employed features improved performance over raw features alone.
- Class imbalance needed explicit handling: Without scale_pos_weight, the model missed 30% of actual defaults.