Learning 3D Representations from Data
Author(s)
Wang, Yue
Advisor
Solomon, Justin M.
Abstract
Deep learning has achieved tremendous progress and success in processing images and natural language. Deep models enable human-level perception, photorealistic image generation, and conversational language understanding. Despite this progress, existing deep models still fail to meet the demands of robotics. Several factors contribute to this gap. First, existing computer vision algorithms primarily target 2D images. These algorithms are extremely good at recognizing objects in an image, but they fail to reason about 3D geometry. Second, the current success in the 2D domain is mainly due to advances in convolutional neural networks (CNNs). However, CNNs do not generalize to arbitrary data modalities such as point clouds. Finally, 3D annotations are scarce and hard to obtain: annotating 3D data usually requires more human effort than annotating images, which hinders supervised learning from 3D data. Therefore, learning 3D representations from data remains challenging and demands further study.
This thesis investigates how to learn representations from 3D data efficiently and effectively, aiming to design 3D learning algorithms that understand geometry with minimal supervision. First, we propose a general point cloud network, termed Dynamic Graph Convolutional Neural Network (DGCNN), that learns a latent graph structure from sensory inputs; the induced structure improves feature learning from point clouds. Unlike prior works that focus on global features, DGCNN views local geometry as the key to point cloud feature learning. Second, we study how DGCNN enables high-level semantic reasoning tasks such as shape segmentation and 3D object detection. To that end, we propose a multi-view-based object detection model that learns complementary features by projecting point clouds onto virtual views. In addition, our follow-up work Object DGCNN leverages DGCNN to model object relations and enables a post-processing-free object detection pipeline with state-of-the-art performance on multiple benchmarks. Third, we generalize these point cloud models to tackle low-level motion estimation problems such as point cloud registration. The proposed Deep Closest Point (DCP) architecture combines a traditional optimization pipeline with deep learning. In subsequent work, the Partial Registration Network (PRNet) uses shape registration as a proxy task to enable self-supervised learning from point clouds. Finally, this thesis addresses a critical application -- scene understanding for autonomous driving. These studies collectively facilitate 3D deep learning in a broad range of scenarios in visual computing.
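To make the local-geometry idea behind DGCNN concrete, the following is a minimal PyTorch sketch of a single EdgeConv-style layer: neighbors are found by k-nearest-neighbor search in feature space (so the graph is recomputed, i.e. dynamic, at every layer), edge features pair each center point with its offsets to neighbors, and a shared MLP with max aggregation produces per-point outputs. This is an illustrative simplification under assumed hyperparameters (e.g. k=20), not the implementation released with the thesis.

import torch
import torch.nn as nn

def knn(x, k):
    # x: (B, C, N) point features; returns (B, N, k) indices of the k nearest
    # neighbors in feature space, recomputed per layer ("dynamic" graph).
    inner = -2 * torch.matmul(x.transpose(2, 1), x)          # (B, N, N)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)              # (B, 1, N)
    dist = -xx - inner - xx.transpose(2, 1)                  # negative squared distance
    return dist.topk(k=k, dim=-1).indices                    # (B, N, k)

class EdgeConv(nn.Module):
    # One EdgeConv layer: builds edge features [x_i, x_j - x_i] over the
    # k-NN graph and max-pools over each neighborhood.
    def __init__(self, in_ch, out_ch, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):                                    # x: (B, C, N)
        B = x.shape[0]
        idx = knn(x, self.k)                                 # (B, N, k)
        batch = torch.arange(B, device=x.device).view(B, 1, 1)
        pts = x.transpose(2, 1)                              # (B, N, C)
        neighbors = pts[batch, idx]                          # (B, N, k, C)
        center = pts.unsqueeze(2).expand(-1, -1, self.k, -1) # (B, N, k, C)
        edge = torch.cat([center, neighbors - center], dim=-1)  # (B, N, k, 2C)
        edge = edge.permute(0, 3, 1, 2)                      # (B, 2C, N, k)
        return self.mlp(edge).max(dim=-1).values             # (B, out_ch, N)

Stacking several such layers on raw coordinates (C=3 at the input) yields the point features used for segmentation and detection.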
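Similarly, the way Deep Closest Point combines optimization with learning can be summarized in a short sketch: soft correspondences are read off from feature similarity, and the rigid motion is then recovered in closed form with a differentiable SVD (orthogonal Procrustes) step. This is a hedged sketch under simplifying assumptions (single-head soft matching, no attention-based feature refinement), not the thesis implementation.

import torch

def svd_head(src, tgt, src_feat, tgt_feat):
    # src, tgt: (B, 3, N) point clouds; src_feat, tgt_feat: (B, C, N) features.
    # Soft matching: each source point maps to a convex combination of targets.
    scores = torch.matmul(src_feat.transpose(2, 1), tgt_feat)               # (B, N, N)
    soft_tgt = torch.matmul(tgt, torch.softmax(scores, dim=2).transpose(2, 1))  # (B, 3, N)

    # Closed-form rigid alignment (Kabsch / orthogonal Procrustes) of src onto soft_tgt.
    src_c = src - src.mean(dim=2, keepdim=True)
    tgt_c = soft_tgt - soft_tgt.mean(dim=2, keepdim=True)
    H = torch.matmul(src_c, tgt_c.transpose(2, 1))                          # (B, 3, 3)
    U, S, Vt = torch.linalg.svd(H)
    # Reflection fix: force det(R) = +1.
    d = torch.det(torch.matmul(Vt.transpose(2, 1), U.transpose(2, 1)))
    D = torch.diag_embed(torch.stack([torch.ones_like(d), torch.ones_like(d), d], dim=1))
    R = Vt.transpose(2, 1) @ D @ U.transpose(2, 1)
    t = soft_tgt.mean(dim=2, keepdim=True) - torch.matmul(R, src.mean(dim=2, keepdim=True))
    return R, t

Because every step above is differentiable, the registration loss can be backpropagated through the SVD into the feature network, which is what lets registration serve as a self-supervised proxy task in PRNet.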
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology