Project: Default prediction (Logistic vs LDA)

ISLR “Default” dataset · R · Train/test split · Threshold trade-offs

Overview

Goal: predict whether a customer defaults (default = Yes) using balance, income, and student, and compare Logistic Regression against LDA.

Key idea: default is rare (~3–4%), so the probability cutoff you choose matters. Lower cutoffs flag more accounts as “high risk” and catch more true defaulters, at the cost of more false alarms and lower overall accuracy.
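To make the cutoff effect concrete, here is a minimal sketch on ten made-up accounts. The probabilities and outcomes are hypothetical values chosen for illustration, not data from the project, and summary_at is an illustrative helper name.

```r
# Hypothetical predicted default probabilities and true outcomes (illustration only).
prob  <- c(0.02, 0.04, 0.07, 0.08, 0.12, 0.30, 0.55, 0.03, 0.06, 0.60)
truth <- c("No", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No")

# Count flagged accounts and caught defaulters at one cutoff.
summary_at <- function(cutoff) {
  flagged <- prob > cutoff
  c(flagged = sum(flagged),
    caught  = sum(flagged & truth == "Yes"))
}

for (cutoff in c(0.50, 0.10, 0.05)) {
  s <- summary_at(cutoff)
  cat(sprintf("cutoff %.2f: %d flagged, %d of %d defaulters caught\n",
              cutoff, s[["flagged"]], s[["caught"]], sum(truth == "Yes")))
}
```

On this toy data, moving the cutoff from 0.50 to 0.05 raises the flagged count from 2 to 7 and the caught defaulters from 1 to 3, which is the same pattern the results table shows at scale.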

Method

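Reporting focuses on two quantities alongside accuracy: the predicted default rate (the share of test accounts flagged as likely defaulters) and the caught defaulter rate (the share of true defaulters that are flagged, also known as sensitivity or recall).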
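A rough sketch of how such a comparison could be set up in R follows. It is not the project's actual script: the 70/30 split, the seed, the strict ">" comparison, and the function names cutoff_metrics and compare_models are all assumptions made for illustration. It uses glm() from base R and lda() from the MASS package, and expects a data frame with the columns default, balance, income, and student.

```r
# Summary numbers for one probability cutoff (helper name is illustrative).
cutoff_metrics <- function(prob, truth, cutoff) {
  flagged   <- prob > cutoff       # predicted "Yes"
  defaulted <- truth == "Yes"      # actual "Yes"
  c(cutoff                 = cutoff,
    accuracy               = mean(flagged == defaulted),
    error                  = mean(flagged != defaulted),
    predicted_default_rate = mean(flagged),
    caught_defaulter_rate  = sum(flagged & defaulted) / sum(defaulted))
}

# Fit both models on a random training split and score the held-out rows.
# train_frac and seed are assumed values; the page does not state them.
compare_models <- function(data, train_frac = 0.7, seed = 1,
                           cutoffs = c(0.50, 0.10, 0.05)) {
  set.seed(seed)
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  train <- data[train_idx, ]
  test  <- data[-train_idx, ]

  logit_fit <- glm(default ~ balance + income + student,
                   data = train, family = binomial)
  lda_fit   <- MASS::lda(default ~ balance + income + student, data = train)

  # Estimated P(default = "Yes") for each test account.
  probs <- list(
    Logistic = predict(logit_fit, newdata = test, type = "response"),
    LDA      = predict(lda_fit, newdata = test)$posterior[, "Yes"]
  )

  # One row per model and cutoff.
  do.call(rbind, lapply(names(probs), function(m) {
    data.frame(model = m,
               t(sapply(cutoffs, cutoff_metrics,
                        prob = probs[[m]], truth = test$default)))
  }))
}

# Usage, if the ISLR package is installed:
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(compare_models(ISLR::Default))
}
```

Because the split is random, the numbers from this sketch will generally differ from the table in the Results section unless the same split happens to be reproduced.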

Results

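Baseline context: if we predicted “No default” for everyone, accuracy would be about 96.37% (since the test default rate is ~3.63%). Both models beat this by less than one percentage point at the 0.50 cutoff and fall below it at 0.10 and 0.05, so accuracy alone is a poor guide here.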

Model     Cutoff  Test accuracy  Test error  Predicted default rate  Caught defaulter rate
Logistic  0.50    97.20%         2.80%       1.50%                   32.11%
Logistic  0.10    93.53%         6.47%       8.10%                   72.48%
Logistic  0.05    89.97%         10.03%      12.53%                  84.40%
LDA       0.50    97.10%         2.90%       1.13%                   25.69%
LDA       0.10    93.40%         6.60%       8.23%                   72.48%
LDA       0.05    88.43%         11.57%      14.07%                  84.40%
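The columns of this table are linked by simple arithmetic, which gives a quick consistency check. Every error is either a false alarm (flagged but did not default) or a missed defaulter, so the error rate should equal the predicted default rate plus the true default rate minus twice the rate of correctly flagged defaulters. The sketch below applies that identity to the reported figures and also derives the share of flagged accounts that actually defaulted (precision), which the table does not report directly. Column names are illustrative.

```r
# Reported test-set figures from the results table, as fractions.
prev <- 0.0363  # test default rate
res <- data.frame(
  model   = rep(c("Logistic", "LDA"), each = 3),
  cutoff  = rep(c(0.50, 0.10, 0.05), times = 2),
  error   = c(0.0280, 0.0647, 0.1003, 0.0290, 0.0660, 0.1157),
  flagged = c(0.0150, 0.0810, 0.1253, 0.0113, 0.0823, 0.1407),
  caught  = c(0.3211, 0.7248, 0.8440, 0.2569, 0.7248, 0.8440)
)

# Correctly flagged defaulters as a share of all test accounts.
tp_rate <- prev * res$caught

# errors = false alarms + misses = (flagged - tp) + (prev - tp)
res$implied_error <- res$flagged + prev - 2 * tp_rate

# Share of flagged accounts that really defaulted.
res$precision <- tp_rate / res$flagged

print(res, digits = 3)
```

The implied error rates agree with the reported ones up to rounding. The derived precision for logistic regression falls from roughly 78% at the 0.50 cutoff to roughly 24% at 0.05, which quantifies the false-alarm cost of the lower cutoffs.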

Takeaways

Plots

Figure: Histogram of logistic regression predicted probabilities (test set). Vertical lines mark cutoffs 0.50 / 0.10 / 0.05.
Figure: Histogram of LDA predicted probabilities (test set), with cutoff lines. Most predictions are near 0 because default is rare.
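A plot of this kind can be drawn with base R graphics. The sketch below is one possible version, not the code behind the figures above: plot_prob_hist is a hypothetical helper, the bin width and colours are arbitrary choices, and the demo probabilities are random placeholders standing in for real model output.

```r
# Histogram of predicted probabilities with dashed vertical cutoff lines.
plot_prob_hist <- function(prob, cutoffs = c(0.50, 0.10, 0.05),
                           main = "Predicted probabilities (test set)") {
  h <- hist(prob, breaks = seq(0, 1, by = 0.02),
            main = main, xlab = "Predicted P(default = Yes)",
            col = "grey80", border = "white")
  abline(v = cutoffs, lty = 2, col = "red")
  invisible(h)  # return the histogram object for inspection
}

# Placeholder probabilities concentrated near 0, mimicking a rare outcome.
set.seed(1)
demo_prob <- rbeta(1000, shape1 = 0.5, shape2 = 12)
plot_prob_hist(demo_prob, main = "Demo: predicted probabilities")
```

With real model output, pass the vector of test-set probabilities in place of demo_prob.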

Files

Note: results depend on the random train/test split. A next step would be cross-validation or a validation set for selecting cutoffs.