Study on LLMs for Promptagator-Style Dense Retriever Training
Author(s)
Gwon, Daniel; Jedidi, Nour; Lin, Jimmy
Publisher with Creative Commons License
Creative Commons Attribution
Terms of use
Abstract
Promptagator demonstrated that large language models (LLMs) with few-shot prompts can serve as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs to which users may not have access, or which they may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (≤14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will provide practitioners with reliable alternatives for synthetic data generation and offer insights into maximizing fine-tuning results for domain-specific applications. Our code is available at https://www.github.com/mitll/promptodile
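The core mechanism the abstract describes, prompting an LLM with a handful of (passage, query) demonstrations so that it generates a synthetic query for a new passage, can be sketched as below. This is a minimal illustration only: the prompt template and demonstration pairs are hypothetical, not the prompts used in the paper, and a real run would feed the assembled prompt to an open-source LLM (≤14B parameters, per the study).

```python
# Minimal sketch of a Promptagator-style few-shot prompt builder.
# Template and demonstrations are hypothetical, not the paper's actual prompts.

def build_fewshot_prompt(demonstrations, target_passage):
    """Assemble a few-shot prompt: k (passage, query) demonstrations,
    then the target passage, ending where the LLM should generate
    a new synthetic query."""
    parts = []
    for passage, query in demonstrations:
        parts.append(f"Passage: {passage}\nQuery: {query}\n")
    parts.append(f"Passage: {target_passage}\nQuery:")
    return "\n".join(parts)

# Toy demonstrations (hypothetical data):
demos = [
    ("Aspirin reduces fever and relieves mild pain.",
     "what does aspirin do"),
    ("The Nile is the longest river in Africa.",
     "longest river in africa"),
]
prompt = build_fewshot_prompt(
    demos, "Photosynthesis converts light into chemical energy."
)
print(prompt)
```

The completion the model produces for the trailing "Query:" becomes a synthetic (query, passage) training pair for fine-tuning the dense retriever.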
Description
CIKM ’25, Seoul, Republic of Korea
Date issued
2025-11-10
Department
Lincoln Laboratory
Publisher
ACM | Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Citation
Daniel Gwon, Nour Jedidi, and Jimmy Lin. 2025. Study on LLMs for Promptagator-Style Dense Retriever Training. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 4748–4752.
Version: Final published version
ISBN
979-8-4007-2040-6