
dc.contributor.advisor: Kim, Yoon
dc.contributor.author: Zhang, Sarah
dc.date.accessioned: 2025-04-14T14:06:35Z
dc.date.available: 2025-04-14T14:06:35Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-04-03T14:06:37.647Z
dc.identifier.uri: https://hdl.handle.net/1721.1/159116
dc.description.abstract: Open-source models present significant opportunities and risks, especially in dual-use scenarios where they can be repurposed for malicious tasks via adversarial fine-tuning. In this paper, we evaluate the effectiveness of Tampering Attack Resistance (TAR), a safeguard designed to protect against such adversarial attacks, by exploring its resilience to full-parameter and parameter-efficient fine-tuning. Our experiments reveal that while TAR enhances tamper resistance compared to models without safeguards, it remains susceptible to variability. Specifically, we observe inconsistencies where the same adversarial attack can succeed under some initializations and fail under others. This is a critical security risk as even a single instance of failure can lead to models being exploited for harmful purposes. These findings highlight the limitations of current tamper-resistant safeguards and emphasize the need for more robust safeguards to ensure the safe and ethical deployment of open-source models.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Exploring Fine-Tuning Techniques for Removing Tamper-Resistant Safeguards for Open-Weight LLMs
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


