Why XGBoost still wins on tabular data

May 2025 · 6 min read

When I built the loan default prediction model, I ran a proper benchmark: logistic regression, random forest, LightGBM, XGBoost, and a simple tabular neural net (TabNet). The result was clear — XGBoost won. Not by a huge margin, but consistently, and with better interpretability than the neural net.

This surprised me. I expected the deep learning approach to shine given the feature count and dataset size. Here’s what I learned.

The benchmark setup

Dataset: 32,581 loan records, 11 features (mix of numeric + categorical)
Metric: AUC-ROC (class imbalance makes accuracy misleading)
Approach: 5-fold stratified cross-validation, tuned each model with Optuna
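To make the metric choice concrete, here is a small pure-Python sketch of why accuracy misleads under imbalance. The labels and scores are invented toy values (an assumed 80/20 split), not the loan data, and the `accuracy` and `roc_auc` helpers are illustrative names rather than anything from the benchmark code.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC-ROC as the chance that a random positive outranks a random negative.

    Ties count as half a win. This is O(n_pos * n_neg), fine for a toy example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels: 8 non-defaults (0) and 2 defaults (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that gives every loan the same low score and never flags a default.
constant_scores = [0.1] * 10
constant_preds = [0] * 10

# A model whose scores rank both defaults above every non-default.
ranking_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.6, 0.45]
ranking_preds = [int(s >= 0.5) for s in ranking_scores]  # fixed 0.5 cut-off

print("constant:", accuracy(y_true, constant_preds), roc_auc(y_true, constant_scores))
print("ranking: ", accuracy(y_true, ranking_preds), roc_auc(y_true, ranking_scores))
```

The useless constant model and the perfect ranker sit close together on accuracy, while AUC-ROC separates them cleanly because it scores the ranking rather than the thresholded labels.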

Model	Mean AUC-ROC	Std Dev	Training time
Logistic Regression	0.820	0.008	~2s
Random Forest	0.878	0.006	~45s
LightGBM	0.906	0.005	~18s
XGBoost	0.912	0.004	~35s
TabNet (neural)	0.893	0.011	~8 min

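XGBoost had the highest mean AUC and the lowest standard deviation across folds, and it trained far faster than the neural net (~35s vs ~8 min). LightGBM came within 0.006 AUC in roughly half the training time.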

Why does XGBoost perform so well?

Gradient boosting is a great fit for tabular structure. Most real-world tabular datasets have:

- Non-linear relationships between features
- Interactions between features (loan_percent_income AND loan_grade together are more predictive than separately)
- Class imbalance
- Missing values

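XGBoost handles all of these with little extra work: trees capture non-linearities and interactions through successive splits, missing values are sent down a default direction learned at each split, and scale_pos_weight reweights the minority class. Neural networks can learn these patterns, but they need more data, more tuning, and don’t handle missing values or imbalance as gracefully out of the box.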

Regularisation. XGBoost has L1 and L2 regularisation baked in. TabNet required significant dropout tuning to match it.
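As a rough illustration of what the two penalties do, the sketch below computes the value of a single leaf from the summed gradients and Hessians of the rows that land in it. It follows the standard second-order boosting formula as I understand XGBoost applies it, and it ignores the learning rate and max_delta_step. The function names and input numbers are made up for the example.

```python
def soft_threshold(grad_sum, reg_alpha):
    """Shrink the gradient sum toward zero by reg_alpha (the L1 effect)."""
    if grad_sum > reg_alpha:
        return grad_sum - reg_alpha
    if grad_sum < -reg_alpha:
        return grad_sum + reg_alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf value: L1 thresholds the numerator, L2 inflates the denominator."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)


# Hypothetical leaf: gradients sum to -4.0, Hessians sum to 3.0.
G, H = -4.0, 3.0
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=0.0))  # no regularisation
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=1.0))  # L2 only: smaller weight
print(leaf_weight(G, H, reg_alpha=1.0, reg_lambda=1.0))  # L1 + L2: smaller again
print(leaf_weight(G, H, reg_alpha=5.0, reg_lambda=1.0))  # L1 large enough to zero the leaf
```

L2 shrinks every leaf smoothly, while L1 can switch a weakly supported leaf off entirely, which is where the "sparsity" label comes from.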

Feature importance for free. XGBoost gives you gain-based importance and SHAP values with one extra library. For the loan project, this was essential — loan officers need to understand why a prediction was made, not just what it is.

When to use a neural net instead

I’m not saying “never use deep learning for tabular data.” There are cases where it wins:

Very large datasets (>500K rows) where XGBoost starts to struggle computationally
Features that have natural representations (embeddings for IDs, sequences)
When you need to do end-to-end learning with other modalities (images + tabular together)
When you want to learn representations for downstream tasks

But for the typical data science project (under 100K rows, structured features, classification or regression), XGBoost is my default, and an alternative has to beat it before I switch.

The tuning that actually moved the needle

Most XGBoost guides focus on n_estimators and max_depth. Here’s what actually made a difference in my benchmark:

params = {
    "n_estimators": 300,
    "max_depth": 5,           # Shallow trees → less overfitting
    "learning_rate": 0.05,    # Low LR + more trees > high LR + few trees
    "subsample": 0.8,         # Row subsampling reduces variance
    "colsample_bytree": 0.8,  # Feature subsampling (like random forests)
    "scale_pos_weight": 3.5,  # Critical for class imbalance
    "min_child_weight": 3,    # Prevents learning from tiny leaf nodes
    "reg_alpha": 0.1,         # L1: sparsity
    "reg_lambda": 1.0,        # L2: smoothness
}

scale_pos_weight was the single biggest win — without it, the model correctly classified 95% of non-defaults but missed 30% of actual defaults. Not useful for a credit risk model.
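A common starting point for this parameter, and the heuristic suggested in the XGBoost documentation, is the ratio of negative to positive examples. The sketch below computes it from a label list. The helper name and the class counts are hypothetical, picked so the ratio lands on 3.5.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negatives to positives, a usual first guess for scale_pos_weight."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples, the ratio is undefined")
    return n_neg / n_pos


# Hypothetical training labels: 700 non-defaults and 200 defaults.
labels = [0] * 700 + [1] * 200
print(suggested_scale_pos_weight(labels))
```

Treat the ratio as a first guess to tune around, and compute it on the training fold only so validation labels do not leak into the setting.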

Conclusion

XGBoost isn’t exciting. There’s no paper title waiting in “I used XGBoost.” But it’s reliable, interpretable, fast to iterate, and genuinely hard to beat on most tabular classification tasks.

Default to XGBoost. Justify deviating from it.

Read the full loan default prediction case study for implementation details.