Software Defect Prediction
ISEC 2026 Student Data Science Challenge · Team MEGALODON
Predicting which software modules are defect-prone from static code metrics alone — without ever running the code.

1st place — a leak-safe gradient-boosting ensemble that flags faulty modules from static metrics, with SHAP for interpretability.
The problem
Given only static code metrics (complexity, coupling, size, and the like), predict which software modules are likely to contain defects, without executing the code. The data was noisy, contained duplicated rows, and was imbalanced: most modules aren't faulty. When copies of the same row land on both sides of a validation split, the model is scored on rows it has effectively already seen, which makes it dangerously easy to build a model that looks great in cross-validation and fails in the real world.
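To make the duplicate problem concrete, here is a toy illustration, not the competition code. It assumes a deliberately naive "memorizer" model and synthetic, balanced labels that carry no real signal; the names `memorizer_predict`, `accuracy`, and the two-column metric rows are all hypothetical.

```python
# Toy illustration: duplicated rows inflate validation scores for a model
# that simply memorizes its training data. All data here is synthetic.

def memorizer_predict(train_rows, train_labels, row, default=0):
    """Return the label of an exact match in the training set, else a default."""
    lookup = dict(zip(train_rows, train_labels))
    return lookup.get(row, default)

def accuracy(train_idx, val_idx, rows, labels):
    """Fraction of validation rows the memorizer labels correctly."""
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    hits = sum(
        memorizer_predict(train_rows, train_labels, rows[i]) == labels[i]
        for i in val_idx
    )
    return hits / len(val_idx)

# 10 unique modules whose labels are unrelated to their metrics,
# with every row appearing twice (indices 10..19 duplicate 0..9).
unique_rows = [(i, 2 * i + 1) for i in range(10)]  # stand-ins for (complexity, size)
unique_labels = [i % 2 for i in range(10)]         # half "defective", no real signal
rows = unique_rows + unique_rows
labels = unique_labels + unique_labels

# Leaky split: every validation row has its twin in the training set.
leaky_acc = accuracy(range(0, 10), range(10, 20), rows, labels)

# Group-aware split: both copies of a module stay on the same side.
val_modules = [0, 1, 2, 3]
val_idx = val_modules + [m + 10 for m in val_modules]
train_idx = [i for i in range(20) if i not in val_idx]
honest_acc = accuracy(train_idx, val_idx, rows, labels)

print(f"leaky split accuracy: {leaky_acc:.2f}")
print(f"group-aware accuracy: {honest_acc:.2f}")
```

The leaky split reports a perfect score even though the labels are unpredictable by construction, while the group-aware split falls back to chance. Real boosters do not memorize this crudely, but the same mechanism can inflate their validation scores.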
Approach
A hybrid pipeline built for honest generalization, not leaderboard overfitting:
- KNN-based duplicate detection to find and handle near-identical rows before they could leak between train and validation folds.
- A gradient-boosting ensemble combining LightGBM, XGBoost, and CatBoost — three boosters with different splitting and regularization behavior, blended for robustness.
- GroupKFold cross-validation so related samples never straddled the train/validation boundary — the single most important guard against the leakage that quietly inflates defect-prediction scores.
- SHAP values to explain why a module was flagged, turning a black-box score into something a reviewer could act on.
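The first and third steps fit together roughly as follows. This is a dependency-free sketch under stated assumptions, not the team's actual pipeline: it uses a brute-force radius search plus union-find in place of a real KNN index, the `radius` value and the metric rows are made up, and `near_duplicate_groups` and `group_k_fold` are hypothetical helper names. In practice scikit-learn's `NearestNeighbors` and `GroupKFold` cover the same ground, and features would normally be scaled before measuring distances.

```python
import math
from collections import defaultdict

def near_duplicate_groups(rows, radius):
    """Give each row a group id; rows within `radius` (Euclidean) share a group.

    Brute-force O(n^2) neighbor search plus union-find, so chains of
    near-duplicates collapse into a single group.
    """
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if math.dist(rows[i], rows[j]) <= radius:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(rows))]

def group_k_fold(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group appears on both sides.

    Greedy balancing: largest groups first, each into the currently smallest fold.
    """
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)

    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), g)):
        min(folds, key=len).extend(members[g])

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
        yield train_idx, val_idx

# Hypothetical metric rows (assumed already scaled); radius is an assumption.
rows = [
    (10.0, 3.0), (10.0, 3.0),    # exact duplicates
    (10.05, 3.0),                # near-duplicate of the first two
    (42.0, 7.0), (42.0, 7.02),   # another near-duplicate pair
    (5.0, 1.0), (88.0, 12.0), (23.0, 4.0),
]
groups = near_duplicate_groups(rows, radius=0.1)
splits = list(group_k_fold(groups, n_splits=3))

for train_idx, val_idx in splits:
    # No duplicate group may straddle the train/validation boundary.
    assert not {groups[i] for i in train_idx} & {groups[i] for i in val_idx}
    print("validation rows:", val_idx)
```

Passing the duplicate-group ids as the grouping key is what ties the two steps together: the detector decides which rows count as "the same module", and the splitter guarantees those rows are always evaluated together.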
Result
1st place at the ISEC 2026 Student Data Science Challenge with Team MEGALODON.
What I learned
- Leakage discipline wins these competitions. The duplicate detection + GroupKFold combination mattered more than any single model.
- Interpretability is not optional for defect prediction — a flagged module is only useful if a human can see the reasoning and trust it.