Efficient Segment Anything on the Edge
Author(s)
Stiles, Nicole
Advisor
Han, Song
Abstract
The Segment-Anything Model (SAM) is a vision foundation model that enables promptable, zero-shot image segmentation. SAM-based models have a wide range of applications, including autonomous driving, medical image segmentation, VR, and data annotation. However, SAM models are highly computationally intensive and lack a flexible prompting mechanism. On an NVIDIA A100 GPU, SAM runs at 11 FPS, falling short of real-time performance and precluding deployment on edge devices. To address both the latency and prompt-flexibility constraints, we introduce GazeSAM, a new real-time gaze-prompted image segmentation model. GazeSAM uses face and gaze detection to determine the direction of a user's gaze, object detection to find candidate objects of interest, depth estimation to separate those candidates from the background, and image segmentation to generate masks. The final output is a mask segmenting the object at the focus of the user's gaze. By performing algorithmic optimizations, employing inference engines, and applying FP16 and INT8 quantization, we achieve a 24x speedup over the baseline FP32 PyTorch implementation: GazeSAM runs at over 30 FPS on an RTX 4070 GPU, enabling real-time performance.
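The pipeline described above composes several off-the-shelf models; the step that ties them together is choosing, among the detected candidates, the foreground object under the gaze point. The sketch below is a minimal, hypothetical illustration of that selection logic, not the thesis code: the `Candidate` type, `select_gazed_object`, and the depth margin are all assumptions made for the example.

```python
# Hypothetical sketch of a GazeSAM-style object-selection step: given a
# gaze point, detected boxes, and per-box depth estimates, pick the
# foreground object the user is looking at. Names/thresholds illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    box: tuple          # (x1, y1, x2, y2) in pixels
    mean_depth: float   # average estimated depth inside the box

def select_gazed_object(gaze_xy, candidates, bg_depth, margin=0.1):
    """Return the candidate under the gaze point that is not background."""
    gx, gy = gaze_xy
    hits = [c for c in candidates
            if c.box[0] <= gx <= c.box[2] and c.box[1] <= gy <= c.box[3]]
    # Depth estimation separates foreground objects from the background:
    # discard candidates whose depth is within `margin` of the background.
    fg = [c for c in hits if c.mean_depth < bg_depth * (1 - margin)]
    if not fg:
        return None
    # Prefer the nearest (smallest-depth) object at the gaze focus.
    return min(fg, key=lambda c: c.mean_depth)

# Example: two overlapping boxes under the gaze; the nearer one wins.
cands = [Candidate((100, 80, 300, 260), 2.1),
         Candidate((120, 100, 500, 400), 5.8)]
obj = select_gazed_object((150, 150), cands, bg_depth=6.0)
print(obj.box if obj else "no foreground object at gaze point")
# The selected box would then be passed to SAM as a box prompt.
```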
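As a rough illustration of the FP16 half of the quantization claim (the inference engines and INT8 calibration the abstract mentions are beyond a sketch), a plain-PyTorch half-precision inference pass looks like the following. The toy model and input shapes are assumptions, and a CUDA GPU is required:

```python
# Illustrative only: FP16 inference with stock PyTorch on a toy model.
# The thesis additionally uses inference engines and INT8 quantization;
# this sketch shows only the cast of weights and activations to FP16.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval().half().cuda()                  # cast parameters to FP16

x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)
with torch.inference_mode():            # no autograd bookkeeping
    y = model(x)
print(y.dtype)                          # torch.float16
```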
Date issued
2024-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology