MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Exploring Fine-Tuning Techniques for Removing Tamper-Resistant Safeguards for Open-Weight LLMs

Author(s)
Zhang, Sarah
Thumbnail
DownloadThesis PDF (2.665Mb)
Advisor
Kim, Yoon
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Metadata
Show full item record
Abstract
Open-source models present significant opportunities and risks, especially in dual-use scenarios where they can be repurposed for malicious tasks via adversarial fine-tuning. In this paper, we evaluate the effectiveness of Tampering Attack Resistance (TAR), a safeguard designed to protect against such adversarial attacks, by exploring its resilience to full-parameter and parameter-efficient fine-tuning. Our experiments reveal that while TAR enhances tamper resistance compared to models without safeguards, it remains susceptible to variability. Specifically, we observe inconsistencies where the same adversarial attack can succeed under some initializations and fail under others. This is a critical security risk as even a single instance of failure can lead to models being exploited for harmful purposes. These findings highlight the limitations of current tamper-resistant safeguards and emphasize the need for more robust safeguards to ensure the safe and ethical deployment of open-source models.
Date issued
2025-02
URI
https://hdl.handle.net/1721.1/159116
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.