Integrating Gradient Boosting and Generative Models: Hybrid Approach to Address Class Imbalance and Evaluation Gaps in Real-World Systems
Author(s)
Lau, Mary
Advisor
Gupta, Amar
Abstract
Anomaly detection remains a persistent challenge in machine learning due to extreme class imbalance, the high cost of false negatives, and the need to regulate false positives in real-world settings at scale. This thesis introduces Tail-end FPR Max Recall, a business-aware evaluation framework designed for such constrained environments. Using this framework, we benchmark LightGBM, a gradient boosting method known for its computational efficiency and predictive accuracy, on an imbalanced dataset, comparing its performance against standard academic evaluation criteria. Our results demonstrate that Tail-end FPR Max Recall fills critical gaps left by standard academic criteria, providing a more realistic assessment of model performance that aims to maximize recall while enforcing a false positive rate budget. Beyond benchmarking, we propose two strategies that incorporate deep learning methods to augment the already strong performance of gradient boosting: (1) using generative models to produce synthetic minority-class samples that outperform traditional oversampling techniques, and (2) using neural embeddings to improve feature representation for anomaly detection. Together, these contributions offer a methodology for evaluating and improving anomaly detection pipelines in domains where rare, high-impact events must be detected while meeting strict operational demands.
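The abstract describes the evaluation goal as maximizing recall while enforcing a false positive rate budget. The sketch below illustrates one plausible reading of that idea; it is not the thesis's definition of Tail-end FPR Max Recall, which this record does not spell out. The function name `max_recall_at_fpr`, the `fpr_budget` parameter, and the convention that a sample is flagged when its score is at or above the threshold are all assumptions made for illustration.

```python
import numpy as np


def max_recall_at_fpr(y_true, scores, fpr_budget):
    """Return (recall, threshold) for the best recall with FPR <= fpr_budget.

    Hypothetical helper: a sample is flagged as anomalous when
    score >= threshold. If no threshold satisfies the budget, nothing is
    flagged and (0.0, inf) is returned.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = int((~y_true).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sort by descending score so each prefix is the set flagged at a threshold.
    order = np.argsort(-scores, kind="stable")
    sorted_scores = scores[order]
    sorted_labels = y_true[order]
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)

    # Tied scores cannot be separated, so keep only the last index of each tie run.
    last_of_run = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    tp, fp, thresholds = tp[last_of_run], fp[last_of_run], sorted_scores[last_of_run]

    recall = tp / n_pos
    fpr = fp / n_neg
    within_budget = fpr <= fpr_budget
    if not within_budget.any():
        return 0.0, float("inf")

    # Among admissible thresholds, take the highest one reaching the best recall.
    best = int(np.argmax(np.where(within_budget, recall, -1.0)))
    return float(recall[best]), float(thresholds[best])


if __name__ == "__main__":
    # Toy labels and scores, invented purely to exercise the function.
    labels = [0, 0, 0, 0, 1, 0, 1, 1]
    model_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
    print(max_recall_at_fpr(labels, model_scores, fpr_budget=0.25))
```

In practice the scores would come from a trained classifier such as LightGBM evaluated on held-out data, and the budget would be set by the operational limit on false alarms rather than chosen arbitrarily as it is here.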
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology