Tree-based Data Replay for More Efficient LLM Continual Learning
Author(s)
Bailey, Brian
Advisor
Chase, Christina
Abstract
As Large Language Models (LLMs) gain popularity, they face a crucial challenge: updating their knowledge with new data while retaining what they have already learned, a process that demands considerable computational resources and time. This problem has previously been addressed with multiple approaches, including data replay and Elastic Weight Consolidation (EWC). This study introduces an evolutionary tree-based data replay method designed to make continual training of LLMs more efficient. It leverages the evolutionary relationships among domain-specific data to inform the replay strategy, selectively excluding similar data from the training of the current subdomain to improve efficiency. Initial experiments identified Mistral-7B as the appropriate model for this analysis. Subsequent tests assessed its performance under different data replay configurations, using perplexity as the primary performance measure. The results indicate that focused data replay maintains model performance while improving training efficiency: models trained under the restrictive replay condition, which excludes data from parent nodes, achieved perplexity scores within 1.5% of the baseline and reduced training time by up to 20%. An ablation study further established that a minimum replay ratio of 0.4:1 is needed to keep performance within 8.2% of the baseline. These findings suggest significant potential for structured data replay in improving continual learning for LLMs. Future work should explore data selection based on similarity metrics or automatic data categorization to enhance scalability and applicability.
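To make the replay strategy concrete, the following is a minimal Python sketch of tree-based replay selection under the restrictive condition described in the abstract: data from ancestor (parent) domains is excluded from the replay pool, and the replay set is capped by a replay ratio relative to the new subdomain's data. All names here (DomainNode, select_replay_data, replay_ratio) and the example domains are illustrative assumptions, not the thesis's actual implementation.

    # Hypothetical sketch of tree-based replay selection (not the thesis code).
    import random
    from dataclasses import dataclass, field

    @dataclass
    class DomainNode:
        name: str
        examples: list = field(default_factory=list)   # domain-specific training data
        parent: "DomainNode" = None
        children: list = field(default_factory=list)

        def add_child(self, child):
            child.parent = self
            self.children.append(child)
            return child

        def ancestors(self):
            """Names of all ancestor domains (parent, grandparent, ...)."""
            names, node = set(), self.parent
            while node is not None:
                names.add(node.name)
                node = node.parent
            return names

    def select_replay_data(current, seen, replay_ratio=0.4, seed=0):
        """Build a replay set for the current subdomain.

        Restrictive condition: data from ancestor nodes is excluded, and the
        replay set is capped at replay_ratio * len(current.examples),
        i.e. a 0.4:1 replay-to-new-data ratio by default.
        """
        excluded = current.ancestors() | {current.name}
        pool = [ex for node in seen if node.name not in excluded for ex in node.examples]
        budget = int(replay_ratio * len(current.examples))
        return random.Random(seed).sample(pool, min(budget, len(pool)))

    # Example (made-up domains): train on "genetics"; its parent "biology" is
    # excluded from replay, so replay data comes only from the sibling "ecology".
    root = DomainNode("biology", examples=[f"bio_{i}" for i in range(100)])
    genetics = root.add_child(DomainNode("genetics", examples=[f"gen_{i}" for i in range(100)]))
    ecology = root.add_child(DomainNode("ecology", examples=[f"eco_{i}" for i in range(50)]))
    replay = select_replay_data(genetics, seen=[root, ecology], replay_ratio=0.4)  # at most 40 items

In this sketch, lowering replay_ratio shrinks the replay budget, which is consistent with the ablation finding that ratios below 0.4:1 let performance drift further from the baseline.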
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences
Publisher
Massachusetts Institute of Technology