Exploring Fine-Tuning Techniques for Removing Tamper-Resistant Safeguards for Open-Weight LLMs

Zhang, Sarah

dc.contributor.advisor	Kim, Yoon
dc.contributor.author	Zhang, Sarah
dc.date.accessioned	2025-04-14T14:06:35Z
dc.date.available	2025-04-14T14:06:35Z
dc.date.issued	2025-02
dc.date.submitted	2025-04-03T14:06:37.647Z
dc.identifier.uri	https://hdl.handle.net/1721.1/159116
dc.description.abstract	Open-source models present significant opportunities and risks, especially in dual-use scenarios where they can be repurposed for malicious tasks via adversarial fine-tuning. In this paper, we evaluate the effectiveness of Tampering Attack Resistance (TAR), a safeguard designed to protect against such adversarial attacks, by exploring its resilience to full-parameter and parameter-efficient fine-tuning. Our experiments reveal that while TAR enhances tamper resistance compared to models without safeguards, its protection is inconsistent: the same adversarial attack can succeed under some initializations and fail under others. This is a critical security risk, as even a single failure can lead to models being exploited for harmful purposes. These findings highlight the limitations of current tamper-resistant safeguards and emphasize the need for more robust safeguards to ensure the safe and ethical deployment of open-source models.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Exploring Fine-Tuning Techniques for Removing Tamper-Resistant Safeguards for Open-Weight LLMs
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science

Files in this item

Name: zhang-sjzhang-meng-eecs-2025-t ...
Size: 2.665Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses
