Loan Default Prediction

Building a classification pipeline that predicts borrower default risk from financial and behavioural features — with interpretable outputs that help lenders make fairer, faster decisions.

<span class="case-meta-label">Domain</span>
<span class="case-meta-value">FinTech / Credit Risk</span>
<span class="case-meta-label">Stack</span>
<span class="case-meta-value">Python · scikit-learn · XGBoost · SHAP</span>
<span class="case-meta-label">Timeline</span>
<span class="case-meta-value">8 weeks</span>
<span class="case-meta-label">Status</span>
<span class="case-meta-value">✅ Complete</span>

The Problem

Loan default is one of the most costly risks in consumer lending. Traditional credit scoring relies on narrow signals (credit score, income) that miss important behavioural indicators — and often exclude borrowers from underserved communities who could repay.

Goal: Build a model that:
1. Predicts default probability with high accuracy
2. Is interpretable enough for loan officers to trust and explain
3. Can run in real time via an API


Data

The dataset (sourced from a public Kaggle credit risk competition) contains 32,581 loan records with 11 features. A selection of the most relevant ones:

| Feature | Type | Description |
| --- | --- | --- |
| person_age | numeric | Applicant age |
| person_income | numeric | Annual income (USD) |
| loan_amnt | numeric | Requested loan amount |
| loan_int_rate | numeric | Loan interest rate |
| loan_percent_income | numeric | Loan as % of income |
| cb_person_default_on_file | binary | Historical default flag |
| loan_grade | ordinal | Lender-assigned grade (A–G) |

Class distribution: 78% non-default · 22% default (moderate imbalance handled with class weights)
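
The `scale_pos_weight` value passed to XGBoost in the training step is commonly set to the ratio of negative to positive examples. A minimal sketch of that calculation, assuming a binary target where 1 marks a default. The helper name `neg_pos_ratio` and the toy labels are illustrative and not part of the project code:

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Return (#negatives / #positives) for a binary 0/1 label sequence."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive (default) examples in labels")
    return counts[0] / counts[1]


# Toy labels mirroring the 78% / 22% split described above
toy_labels = [0] * 78 + [1] * 22
ratio = neg_pos_ratio(toy_labels)
print(f"scale_pos_weight ~ {ratio:.2f}")
```

For a 78/22 split the ratio comes out near 3.5, which is in line with the value used for the chosen model.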

Preprocessing

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_risk_dataset.csv")

# Drop duplicates and handle nulls
# (assign the result back; chained inplace fillna is not applied reliably in newer pandas)
df = df.drop_duplicates()
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())
df['person_emp_length'] = df['person_emp_length'].fillna(0)

# Encode categoricals
# loan_grade is ordinal, so map A-G to 0-6 explicitly
grade_order = {grade: i for i, grade in enumerate('ABCDEFG')}
df['loan_grade'] = df['loan_grade'].map(grade_order)
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].map({'Y': 1, 'N': 0})

# Feature engineering
df['debt_to_income'] = df['loan_amnt'] / (df['person_income'] + 1)
df['income_per_year_employed'] = df['person_income'] / (df['person_emp_length'] + 1)

X = df.drop('loan_status', axis=1)
# One-hot encode any remaining string columns so the model receives numeric input
X = pd.get_dummies(X, drop_first=True, dtype=int)
y = df['loan_status']

# Stratify so both splits keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
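
One caveat with the preprocessing above: the median used to fill `loan_int_rate` is computed on the full dataset before the split, so test rows influence the fill value. A sketch of a leakage-free variant, using scikit-learn's `SimpleImputer` fitted on the training split only. The toy frame below is made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up toy data with a few missing interest rates
toy = pd.DataFrame({
    "loan_int_rate": [7.5, np.nan, 11.0, 13.5, np.nan, 9.0, 15.0, 10.0],
    "loan_status":   [0, 0, 1, 1, 0, 0, 1, 0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Fit the median on the training rows only, then apply it to both splits
imputer = SimpleImputer(strategy="median")
train_rate = imputer.fit_transform(train[["loan_int_rate"]])
test_rate = imputer.transform(test[["loan_int_rate"]])

print("median learned from train:", imputer.statistics_[0])
```

With a dataset of this size the difference is likely small, but fitting on the training split keeps the evaluation honest.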

Methodology

Three models were evaluated using 5-fold cross-validation:

| Model | CV AUC-ROC | Notes |
| --- | --- | --- |
| Logistic Regression | 0.82 | Good baseline, less expressive |
| Random Forest | 0.88 | Strong, moderate interpretability |
| XGBoost | 0.91 | Best performance, chosen model |
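
A comparison like this can be reproduced in outline with `cross_val_score`. The sketch below uses synthetic data from `make_classification` rather than the loan dataset, and covers only the two scikit-learn models. An `XGBClassifier` could be added to the dictionary in the same way. Scores from this toy setup are not comparable to the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with a roughly 78/22 class split (not the real data)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

# Stratified folds keep the class ratio stable across the 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: mean AUC-ROC {scores.mean():.3f} (+/- {scores.std():.3f})")
```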

XGBoost Training

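from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out part of the training data for early stopping,
# so the test set is used only for the final evaluation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=3.5,       # approx. negatives / positives (78 / 22)
    eval_metric='auc',
    early_stopping_rounds=30,   # constructor argument since XGBoost 1.6; removed from fit() in 2.0
    random_state=42
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False
)

# Final evaluation on the untouched test set
y_pred_proba = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.4f}")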

Results

<div class="result-number">89%</div>
<div class="result-label">Accuracy</div>
<div class="result-number">0.91</div>
<div class="result-label">AUC-ROC</div>
<div class="result-number">86%</div>
<div class="result-label">Recall (defaults)</div>
<div class="result-number">83%</div>
<div class="result-label">Precision</div>
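
For reference, accuracy, precision and recall are all derived from the confusion matrix at a chosen probability threshold. A small sketch with hypothetical counts (not the project's actual confusion matrix); the function name `classification_metrics` is illustrative:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall for the positive (default) class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for 1,000 test loans, chosen only for illustration
m = classification_metrics(tp=180, fp=40, tn=740, fn=40)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Lowering the threshold typically trades precision for recall, which matters here because a missed default is usually costlier than a declined good loan.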

Feature Importance (SHAP)

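import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Bar plot of the mean absolute SHAP value per feature (global importance)
shap.summary_plot(shap_values, X_test, plot_type="bar")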

Top predictive features:
1. loan_percent_income: strongest signal (high debt relative to income → higher default)
2. loan_int_rate: high rates correlate with riskier applicants
3. cb_person_default_on_file: historical behaviour is predictive
4. loan_grade: lender assessment provides useful signal
5. debt_to_income: engineered feature that added lift
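
The bar variant of `summary_plot` ranks features by their mean absolute SHAP value. A sketch of that ranking computed directly with NumPy, using a small made-up SHAP matrix (rows are applicants, columns are features) instead of real model output:

```python
import numpy as np

feature_names = ["loan_percent_income", "loan_int_rate", "loan_grade"]

# Made-up SHAP values: one row per applicant, one column per feature
toy_shap = np.array([
    [ 0.40, -0.10,  0.05],
    [-0.30,  0.20, -0.05],
    [ 0.50,  0.15,  0.10],
])

# Global importance = mean of |SHAP| over all rows
mean_abs = np.abs(toy_shap).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value first matters: a feature that pushes some applicants strongly up and others strongly down would otherwise average out to near zero.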


Insights & Impact

  • Interpretability matters: SHAP plots allowed loan officers to understand why a prediction was made — not just the probability. This was critical for regulatory compliance.
  • Feature engineering lifted AUC by ~0.02: The derived debt_to_income and income_per_year_employed features improved performance over raw features alone.
  • Class imbalance needed explicit handling: Without scale_pos_weight, the model missed 30% of actual defaults.
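
To make the interpretability point concrete, a per-applicant explanation can be reduced to a short list of "reason codes": the features whose SHAP values push the prediction most toward default. The helper `top_reasons` and the numbers below are hypothetical, shown only to illustrate the idea:

```python
def top_reasons(feature_names, shap_row, k=3):
    """Return up to k features with the largest positive SHAP contribution."""
    pairs = sorted(zip(feature_names, shap_row), key=lambda p: p[1], reverse=True)
    # Keep only features that increase the predicted default risk
    return [(name, value) for name, value in pairs[:k] if value > 0]


names = ["loan_percent_income", "loan_int_rate", "person_income", "loan_grade"]
row = [0.62, 0.21, -0.35, 0.08]  # made-up SHAP values for one applicant

reasons = top_reasons(names, row, k=2)
for name, value in reasons:
    print(f"{name} raises predicted default risk (SHAP {value:+.2f})")
```

A list like this is easier for a loan officer to relay to an applicant than a full SHAP plot, though the wording of each reason would need review against the lender's own compliance requirements.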

Future Improvements

