Pursuing Mid-level Perception from Casual Videos
Author(s)
Zhang, Zhoutong
Advisor
Freeman, William T.
Abstract
This thesis summarizes a series of explorations around a central theme: How can we learn mid-level perception from collections of casually shot videos? To avoid the reader’s disappointment, I would like to be frank at the start: the contents within are only first steps toward solving the problem. Specifically, a major part of this thesis addresses the problem of recovering depth and ego-motion despite the dynamics in the video, which is only part of the mid-level perception problem.
Why those in particular? First of all, they are the pillars of 3D understanding for an agent that can move and interact with the dynamic world. In a narrower sense, they correspond to the "mid-level vision" in Marr’s perception theory, where 2.5D sketches are recovered from processed image signals. If we add the flexibility of motion, then the task also includes recovering the ego-motion, i.e. the trajectory of the viewer through time. In addition, depth and ego-motion recovery have the potential to help solve other mid-level vision tasks. In this thesis, we show that we can solve the video version of the checkershadow illusion [1] even when both the observer and the checker are moving simultaneously. This is done by building a 3D representation of the scene that is split into persistent and transient effects, which is only possible with the recovered depth and ego-motion.
Recovering depth and camera ego-motion from videos with unrestricted object motion and ego-motion is quite challenging. The first chapter of the thesis introduces the problem, briefly reviews past works, and demonstrates how and why they fail to solve the problem robustly. The second chapter addresses a partial form of the problem: given a video with known camera ego-motion, how to recover reliable depth maps even when there is significant object motion in the scene. The third chapter addresses the full problem, presenting a solution that jointly recovers depth and camera ego-motion from casually shot videos.
It remains to ask: why the ambitious title? Why not a more specific one, and end the thesis there? Perhaps unconventionally, I would like to think of this thesis as a starting milestone for a topic I feel committed and excited to pursue, rather than an end, a mere wrap-up of what I did during my graduate studies. Therefore, the last chapter, named "Video Canonicalization", is dedicated to an ongoing pursuit that aims to provide a structure for analyzing different works and clarifying the design dimensions of solving mid-level vision problems using videos. Some parts of this chapter may seem half-baked, with rudimentary experiments and examples that merely aim to prove the concept. Hopefully these will mature into future projects that better bear the title.
Finally, I would like to cite, though not in its exact form, Patrick Winston’s remarks when I entered MIT: "There’s only one thing I can promise you after your journey at MIT: you will find the thing you are truly excited about, which will drive you for the future. If not, I’ll come to you and you will be in trouble with me." I’m really glad that this turned out to be true, but sad that he will never come to us even if it wasn’t.
Date issued
2022-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology