| dc.contributor.author | Gwon, Daniel | |
| dc.contributor.author | Jedidi, Nour | |
| dc.contributor.author | Lin, Jimmy | |
| dc.date.accessioned | 2025-12-11T20:01:25Z | |
| dc.date.available | 2025-12-11T20:01:25Z | |
| dc.date.issued | 2025-11-10 | |
| dc.identifier.isbn | 979-8-4007-2040-6 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/164284 | |
| dc.description | CIKM ’25, Seoul, Republic of Korea | en_US |
| dc.description.abstract | Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary, large-scale LLMs that users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales (≤14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will provide practitioners with reliable alternatives for synthetic data generation and offer insights for maximizing fine-tuning results in domain-specific applications. Our code is available at https://www.github.com/mitll/promptodile | en_US |
| dc.publisher | ACM\|Proceedings of the 34th ACM International Conference on Information and Knowledge Management | en_US |
| dc.relation.isversionof | https://doi.org/10.1145/3746252.3760960 | en_US |
| dc.rights | Creative Commons Attribution | en_US |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
| dc.source | Association for Computing Machinery | en_US |
| dc.title | Study on LLMs for Promptagator-Style Dense Retriever Training | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Daniel Gwon, Nour Jedidi, and Jimmy Lin. 2025. Study on LLMs for Promptagator-Style Dense Retriever Training. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). Association for Computing Machinery, New York, NY, USA, 4748–4752. | en_US |
| dc.contributor.department | Lincoln Laboratory | en_US |
| dc.identifier.mitlicense | PUBLISHER_POLICY | |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
| eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
| dc.date.updated | 2025-12-01T09:24:35Z | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The author(s) | |
| dspace.date.submission | 2025-12-01T09:24:35Z | |
| mit.license | PUBLISHER_CC | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |