Truthfulness in Large Language Models

dc.contributor.advisor	Andreas, Jacob
dc.contributor.advisor	Hadfield-Menell, Dylan
dc.contributor.author	Liu, Kevin
dc.date.accessioned	2023-07-31T19:32:58Z
dc.date.available	2023-07-31T19:32:58Z
dc.date.issued	2023-06
dc.date.submitted	2023-06-06T16:34:59.367Z
dc.identifier.uri	https://hdl.handle.net/1721.1/151345
dc.description.abstract	Large language models (LLMs) have rapidly grown in utility, accessibility, and popularity, but they can still improve in many areas. One such area is truthfulness. We seek to improve the truthfulness of LLMs by probing their internal representations. We find that a linear probe on the last hidden layer representation can improve a model’s accuracy by reducing its confidence in incorrect answers. However, the probe is less effective when used to perturb the model to change its behavior and drive it towards correct answers.
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Truthfulness in Large Language Models
dc.type	Thesis
dc.description.degree	M.Eng.
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree	Master
thesis.degree.name	Master of Engineering in Electrical Engineering and Computer Science