Accurate Protein Function Prediction with Graph Transformer-Based Function Localization

Mitra, Shania

dc.contributor.advisor	Kellis, Manolis
dc.contributor.author	Mitra, Shania
dc.date.accessioned	2025-07-07T17:36:37Z
dc.date.available	2025-07-07T17:36:37Z
dc.date.issued	2025-05
dc.date.submitted	2025-05-20T21:15:16.575Z
dc.identifier.uri	https://hdl.handle.net/1721.1/159880
dc.description.abstract	Protein function prediction is a fundamental challenge in biology, crucial for understanding biological processes, disease mechanisms, and accelerating drug discovery. While computational methods leveraging sequence or structural information have advanced, accurately translating protein structure to function and pinpointing the specific residues responsible remain significant hurdles. Many existing deep learning approaches fall short, often relying on post-hoc analyses that lack specificity or fail to directly integrate functional site identification into the prediction process. In this study, we introduce the Protein Region Proposal Network (ProteinRPN), a novel graph-based deep learning framework designed to address these limitations. ProteinRPN is the first model to integrate the proactive identification of functional regions within the Gene Ontology term prediction pipeline. The core of the model is a Region Proposal Network module that processes protein structure graphs (residues as nodes, contacts as edges) to identify potential functional regions, termed anchors. These anchors are subsequently refined using a multi-stage process involving a novel differentiable node drop pooling layer that incorporates domain knowledge. A functional attention layer further enhances the representations of predicted functional nodes, and a Graph Multiset Transformer aggregates this localized information into a comprehensive graph-level embedding for final prediction. The model is optimized using a combination of a cross-entropy classification loss and supervised and self-supervised contrastive learning losses (SupCon and InfoNCE, respectively) for robust representation learning.
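As a rough illustration of the structure-graph input described above (residues as nodes, contacts as edges), the following sketch builds a directed edge list from per-residue coordinates. The function name `contact_graph`, the use of Cα-Cα distances, and the 10 Å cutoff are assumptions made for illustration; the abstract does not state how contacts are defined in ProteinRPN.

```python
import numpy as np


def contact_graph(ca_coords, cutoff=10.0):
    """Return a (2, n_edges) edge list for residues closer than `cutoff`.

    ca_coords: (n_residues, 3) array of C-alpha coordinates in angstroms.
    cutoff:    assumed distance threshold; hypothetical default, not from the thesis.
    """
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # A contact is any pair of distinct residues within the cutoff.
    adj = dist < cutoff
    np.fill_diagonal(adj, False)
    src, dst = np.nonzero(adj)
    return np.stack([src, dst])


# Three residues on a line: the first two are in contact, the third is isolated.
edges = contact_graph([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [20.0, 0.0, 0.0]])
```

Each undirected contact appears twice (once per direction), which is the edge-list layout most graph neural network libraries expect.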
Evaluated on standard benchmarks derived from the DeepFRI/HEAL datasets, ProteinRPN demonstrates state-of-the-art performance, consistently outperforming existing sequence-based and structure-based methods across all three Gene Ontology domains (Molecular Function, Biological Process, Cellular Component) based on standard CAFA metrics (Fmax, AUPR, Smin). Notably, ProteinRPN achieves significant improvements over strong baselines like HEAL, with AUPR (area under the precision-recall curve) gains of approximately 15.4% (BP), 8.5% (CC), and 1.3% (MF). Furthermore, ablation studies validate the contribution of each key component, particularly the region proposal mechanism. Qualitative analysis confirms the model’s ability to accurately localize known functional residues within protein structures, offering enhanced interpretability. By directly modeling and identifying functionally relevant structural regions, ProteinRPN presents a robust, interpretable, and high-performing approach to structure-based protein function prediction. This work contributes a novel framework that bridges the gap between structural information and functional annotation, offering potential for deeper biological insights and advancing computational tools for understanding the proteome.
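For readers unfamiliar with the Fmax metric named above, the sketch below follows the commonly used protein-centric definition: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over annotated proteins, and Fmax is the best resulting F1. The function name `fmax`, the threshold grid, and the toy inputs are assumptions for illustration; the thesis's own evaluation code may differ in detail (for example, in how Gene Ontology ancestors are propagated).

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax over a grid of score thresholds.

    y_true:  (n_proteins, n_terms) binary annotation matrix.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)  # assumed grid

    n_true = y_true.sum(axis=1)
    annotated = n_true > 0
    if not annotated.any():
        return 0.0

    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        tp = (pred & y_true).sum(axis=1)
        n_pred = pred.sum(axis=1)
        covered = n_pred > 0  # proteins with at least one prediction at t
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp[annotated] / n_true[annotated]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: two proteins, two terms, predictions that separate cleanly.
perfect = fmax([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])
# One protein with two true terms, only one of which is ever predicted.
partial = fmax([[1, 1]], [[0.9, 0.0]])
```

In the second toy case precision is 1 and recall is 0.5 at every threshold that yields a prediction, so the best F1 is 2/3.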
dc.publisher	Massachusetts Institute of Technology
dc.rights	In Copyright - Educational Use Permitted
dc.rights	Copyright retained by author(s)
dc.rights.uri	https://rightsstatements.org/page/InC-EDU/1.0/
dc.title	Accurate Protein Function Prediction with Graph Transformer-Based Function Localization
dc.type	Thesis
dc.description.degree	S.M.
dc.contributor.department	Massachusetts Institute of Technology. Center for Computational Science and Engineering
mit.thesis.degree	Master
thesis.degree.name	Master of Science in Computational Science and Engineering

Files in this item

Name: mitra-shania-csesm-ccse-2025-thesis ...
Size: 3.199Mb
Format: PDF
Description: Thesis PDF

This item appears in the following Collection(s)

Graduate Theses