| dc.description.abstract | Protein function prediction is a fundamental challenge in biology, crucial for understanding biological processes, disease mechanisms, and accelerating drug discovery. While computational methods leveraging sequence or structural information have advanced, accurately translating protein structure to function and pinpointing the specific residues responsible remain significant hurdles. Many existing deep learning approaches fall short, often relying on post-hoc analyses that lack specificity or fail to directly integrate functional site identification into the prediction process. In this study, we introduce the Protein Region Proposal Network (ProteinRPN), a novel graph-based deep learning framework designed to address these limitations. ProteinRPN is the first model to integrate the proactive identification of functional regions within the Gene Ontology term prediction pipeline. The core of the model is a Region Proposal Network module that processes protein structure graphs (residues as nodes, contacts as edges) to identify potential functional regions, termed anchors. These anchors are subsequently refined using a multi-stage process involving a novel differentiable node drop pooling layer that incorporates domain knowledge. A functional attention layer further enhances the representations of predicted functional nodes, and a Graph Multiset Transformer aggregates this localized information into a comprehensive graph-level embedding for final prediction. The model is optimized using a combination of a cross-entropy classification loss and supervised and self-supervised contrastive learning losses (SupCon and InfoNCE, respectively) for robust representation learning. 
Evaluated on standard benchmarks derived from the DeepFRI/HEAL datasets, ProteinRPN demonstrates state-of-the-art performance, consistently outperforming existing sequence-based and structure-based methods across all three Gene Ontology domains (Molecular Function, Biological Process, Cellular Component) based on standard CAFA metrics (Fmax, AUPR, Smin). Notably, ProteinRPN achieves significant improvements over strong baselines like HEAL, with gains in AUPR (area under the precision-recall curve) of approximately 15.4% (BP), 8.5% (CC), and 1.3% (MF). Furthermore, ablation studies validate the contribution of each key component, particularly the region proposal mechanism. Qualitative analysis confirms the model’s ability to accurately localize known functional residues within protein structures, offering enhanced interpretability. By directly modeling and identifying functionally relevant structural regions, ProteinRPN presents a robust, interpretable, and high-performing approach to structure-based protein function prediction. This work contributes a novel framework that bridges the gap between structural information and functional annotation, offering potential for deeper biological insights and advancing computational tools for understanding the proteome. | |