Truthfulness in Large Language Models
Author(s)
Liu, Kevin
Advisor
Andreas, Jacob
Hadfield-Menell, Dylan
Abstract
Large language models (LLMs) have seen a rapid rise in utility, accessibility, and popularity, but there are still many areas in which they can improve. One such area is their truthfulness. We seek to improve the truthfulness of LLMs by probing their internal representations. We find that a linear probe on the last hidden layer representation can improve a model’s accuracy by reducing its confidence in incorrect answers. However, the probe is less effective when used to perturb the model and drive its behavior toward correct answers.
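To make the probing setup concrete, here is a minimal sketch of a linear probe trained on last-hidden-layer representations. The model ("gpt2"), the toy (answer, is_correct) examples, and the use of scikit-learn logistic regression as the probe are all illustrative assumptions, not the thesis's actual model, data, or training procedure.

```python
# Minimal sketch: train a linear probe on last-hidden-layer representations
# to predict whether a model-produced answer is correct.
# Assumptions: "gpt2" is a stand-in model; the labeled examples are toy data.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_hidden_state(text: str) -> np.ndarray:
    """Return the last hidden layer's representation of the final token."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states[-1] has shape (1, seq_len, hidden_dim)
    return out.hidden_states[-1][0, -1].numpy()

# Hypothetical labeled data: prompts with answers, tagged for correctness.
examples = [
    ("Q: What is the capital of France? A: Paris", 1),
    ("Q: What is the capital of France? A: Lyon", 0),
    # ... many more (text, is_correct) pairs in practice
]

X = np.stack([last_hidden_state(text) for text, _ in examples])
y = np.array([label for _, label in examples])

# The linear probe itself: logistic regression over frozen hidden states.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# probe.predict_proba(X)[:, 1] estimates P(answer is correct); such scores
# could be used to reduce a model's confidence in answers the probe flags.
```

Because the probe is linear and the hidden states are frozen, it is cheap to train and easy to interpret, which fits the abstract's framing: good at rescoring the model's confidence, but not by itself a mechanism for steering the model toward correct answers.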
Date issued
2023-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology