Exploring Fine-Tuning Techniques for Removing Tamper-Resistant Safeguards for Open-Weight LLMs
Author(s)
Zhang, Sarah
Thesis PDF (2.665 MB)
Advisor
Kim, Yoon
Abstract
Open-source models present significant opportunities and risks, especially in dual-use scenarios where they can be repurposed for malicious tasks via adversarial fine-tuning. In this thesis, we evaluate the effectiveness of Tampering Attack Resistance (TAR), a safeguard designed to protect against such attacks, by probing its resilience to both full-parameter and parameter-efficient fine-tuning. Our experiments reveal that while TAR improves tamper resistance relative to models without safeguards, its protection is inconsistent: the same adversarial attack can succeed under some initializations and fail under others. This variability is a critical security risk, since even a single failure can allow a model to be exploited for harmful purposes. These findings highlight the limitations of current tamper-resistant safeguards and underscore the need for more robust defenses to ensure the safe and ethical deployment of open-source models.
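To make the full-parameter vs. parameter-efficient distinction concrete, the sketch below shows a minimal LoRA-style adapter in plain PyTorch: the original weights are frozen and only a small low-rank update is trained, which is why such attacks are cheap relative to full-parameter fine-tuning. This is an illustrative toy (the `LoRALinear` class, rank, and scaling here are assumptions for a single linear layer), not the thesis's actual attack code or model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (B A x) * scale."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original (safeguarded) weights
            p.requires_grad = False
        # Low-rank factors: A projects down, B projects back up; B starts at zero
        # so the wrapped layer initially matches the base layer exactly.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(16, 16))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # only the adapter is trainable
```

Only the adapter parameters (128 of 400 here) receive gradients, so an adversary fine-tuning with such adapters touches a small fraction of the weights, one reason the thesis evaluates TAR against parameter-efficient attacks separately from full-parameter ones.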
Date issued
2025-02
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology