Study on LLMs for Promptagator-Style Dense Retriever Training
Author(s)
Gwon, Daniel; Jedidi, Nour; Lin, Jimmy
Publisher with Creative Commons License
Creative Commons Attribution
Terms of use
Abstract
Promptagator demonstrated that large language models (LLMs) with few-shot prompts can serve as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs to which users may not have access, or which they may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (≤14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will provide practitioners with reliable alternatives for synthetic data generation and offer insights into maximizing fine-tuning results for domain-specific applications. Our code is available at https://www.github.com/mitll/promptodile
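The core mechanism the abstract describes, prompting an LLM with a handful of (passage, query) demonstrations so that it generates a synthetic query for a new passage, can be sketched as below. This is a minimal illustration only: the prompt template and demonstration pairs are hypothetical, not the prompts used in the paper, and a real run would feed the assembled prompt to an open-source LLM (≤14B parameters, per the study).

```python
# Minimal sketch of a Promptagator-style few-shot prompt builder.
# Template and demonstrations are hypothetical, not the paper's actual prompts.

def build_fewshot_prompt(demonstrations, target_passage):
    """Assemble a few-shot prompt: k (passage, query) demonstrations,
    then the target passage, ending where the LLM should generate
    a new synthetic query."""
    parts = []
    for passage, query in demonstrations:
        parts.append(f"Passage: {passage}\nQuery: {query}\n")
    parts.append(f"Passage: {target_passage}\nQuery:")
    return "\n".join(parts)

# Toy demonstrations (hypothetical data):
demos = [
    ("Aspirin reduces fever and relieves mild pain.",
     "what does aspirin do"),
    ("The Nile is the longest river in Africa.",
     "longest river in africa"),
]
prompt = build_fewshot_prompt(
    demos, "Photosynthesis converts light into chemical energy."
)
print(prompt)
```

The completion the model produces for the trailing "Query:" becomes a synthetic (query, passage) training pair for fine-tuning the dense retriever.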
Description
CIKM ’25, Seoul, Republic of Korea
Date issued
2025-11-10
Department
Lincoln Laboratory
Publisher
ACM | Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Citation
Daniel Gwon, Nour Jedidi, and Jimmy Lin. 2025. Study on LLMs for Promptagator-Style Dense Retriever Training. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 4748–4752.
Version: Final published version
ISBN
979-8-4007-2040-6