dc.contributor.advisor | Kim, Yoon | |
dc.contributor.author | Zhang, Sarah | |
dc.date.accessioned | 2025-04-14T14:06:35Z | |
dc.date.available | 2025-04-14T14:06:35Z | |
dc.date.issued | 2025-02 | |
dc.date.submitted | 2025-04-03T14:06:37.647Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/159116 | |
dc.description.abstract | Open-source models present significant opportunities and risks, especially in dual-use scenarios where they can be repurposed for malicious tasks via adversarial fine-tuning. In this paper, we evaluate the effectiveness of Tampering Attack Resistance (TAR), a safeguard designed to protect against such adversarial attacks, by exploring its resilience to full-parameter and parameter-efficient fine-tuning. Our experiments reveal that while TAR enhances tamper resistance compared to models without safeguards, it remains susceptible to variability. Specifically, we observe inconsistencies where the same adversarial attack can succeed under some initializations and fail under others. This is a critical security risk as even a single instance of failure can lead to models being exploited for harmful purposes. These findings highlight the limitations of current tamper-resistant safeguards and emphasize the need for more robust safeguards to ensure the safe and ethical deployment of open-source models. | |
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | In Copyright - Educational Use Permitted | |
dc.rights | Copyright retained by author(s) | |
dc.rights.uri | https://rightsstatements.org/page/InC-EDU/1.0/ | |
dc.title | Exploring Fine-Tuning Techniques for Removing
Tamper-Resistant Safeguards for Open-Weight LLMs | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |