An invariance-based account of feedforward categorization in a realistic model of the ventral visual pathway
Author(s): Mutch, James Vincent
Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences.
Advisor: Tomaso A. Poggio
For the recognition of general objects in natural scenes, the current top-performing computer vision models owe a debt to visual neuroscience. The hierarchical architecture of convolutional networks (CNNs), and of related models such as HMAX, mimics that of the ventral stream of visual cortex. In essence, these models apply the model of Hubel and Wiesel recursively, alternating layers of 'simple' cells, which are tuned to certain local features, and 'complex' cells, which pool the outputs of simple cells within a local region. With recent advances in deep learning, emphasis for many tasks in vision and speech has moved away from so-called 'hand-designed' models and toward big data and high-throughput computing, with models learning from millions of labeled examples. Yet CNNs only learn their features: the weights of connections in the network. All other aspects of the network (size, connectivity, response functions, etc.) are unlearned architectural choices made by their designers. Vision has not yet been reduced to a pure learning problem; human insight into the nature of visual problems continues to be important. To design a good vision system, one still has to understand vision. And, as evidenced by their performance on many complex visual tasks, natural vision systems still 'understand' vision better than we do; there is still much to be learned from them. Our work is based on the HMAX model, which places greater weight on biological realism. Our goals are threefold: to better understand the ventral stream algorithm, to better understand the visual problem it solves, and to improve the performance of artificial vision systems. In this work we take two main approaches. The first, i-theory, is an ongoing effort to explain the good performance of hierarchical models in terms of a formal theory of invariance to transformations.
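The alternation of simple and complex cells described above can be sketched in a few lines. This is an illustrative toy, not the thesis's model: the Gaussian radial-basis tuning function, the template shapes, and the 2×2 max-pooling window are all assumptions chosen for brevity.

```python
import numpy as np

def simple_layer(image, templates, sigma=1.0):
    """Simple cells: each unit responds to the similarity between a local
    image patch and a stored template. A Gaussian radial basis function
    is used here as the tuning function (an assumption for illustration)."""
    n, k, _ = templates.shape          # templates: (n_templates, k, k)
    h, w = image.shape
    out = np.zeros((n, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = image[i:i + k, j:j + k]
            d2 = ((templates - patch) ** 2).sum(axis=(1, 2))
            out[:, i, j] = np.exp(-d2 / (2 * sigma ** 2))
    return out

def complex_layer(responses, pool=2):
    """Complex cells: max-pool each simple-cell map over a local spatial
    neighbourhood, yielding tolerance to small translations."""
    n, h, w = responses.shape
    ph, pw = h // pool, w // pool
    r = responses[:, :ph * pool, :pw * pool].reshape(n, ph, pool, pw, pool)
    return r.max(axis=(2, 4))
```

Stacking further simple/complex pairs on the output of `complex_layer` gives the recursive Hubel-and-Wiesel hierarchy the abstract refers to.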
We provide a reinterpretation of V1 simple and complex cells in the context of i-theory as computing a high-dimensional, locally translation-invariant signature for the contents of a V1 receptive field. We describe a simple algorithm for learning these signatures which can extend without modification to the learning of higher-order representations for V2 and beyond. The algorithm yields model V1 cells having a good fit to data from several animal species. We also demonstrate that a precondition of i-theory, covariance, can hold in upper layers, even for transformations not anticipated in the training of lower layers. The second approach concerns retinal resolution: no current hierarchical object recognition model incorporates realistic retinal resolution. Incorporating this detail forces a reevaluation of the role of the ventral stream's feedforward core in the larger task of scene understanding, as well as of many details of the model itself, particularly with respect to scale. We investigate the optimal shape of the input window used to select a subset of the visual information available in a scene for processing in a single feedforward pass, defined as a region in (x, y, λ); the handling of the λ dimension within the hierarchy; and the problem of clutter. Our main experimental results are (1) spatial wavelengths too small for the retina to perceive across the entire object do not play a significant role in the no-clutter case, but confer robustness in the presence of clutter, and (2) preservation by the hierarchy of information about the relative scale (distance along λ) of feature activations is more important than current models reflect.
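The i-theory signature mentioned above can be illustrated with a minimal sketch, assuming the simplest case: the transformation group is circular translation along one axis, the template's orbit is its set of shifted copies, and pooling is done with order-discarding statistics (mean and max here). This is not the thesis's algorithm, only an instance of the general recipe: project onto an orbit, then pool.

```python
import numpy as np

def signature(patch, template, shifts=None):
    """Illustrative i-theory-style signature: project the patch onto every
    circularly shifted copy of the template (the template's orbit under
    horizontal translation), then pool the projections with statistics
    that discard order (mean and max). Pooling over the full orbit makes
    the result exactly invariant to circular shifts of the patch."""
    if shifts is None:
        shifts = range(template.shape[1])  # full cyclic group along axis 1
    projections = np.array(
        [(patch * np.roll(template, dx, axis=1)).sum() for dx in shifts]
    )
    return np.array([projections.mean(), projections.max()])
```

Because shifting the patch only permutes the set of projections, the pooled statistics do not change; this is the sense in which the signature is translation-invariant while remaining selective for the template's content.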
Thesis: Ph.D., Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 2017. Cataloged from PDF version of thesis. "September 2016." Includes bibliographical references (pages 115-118).
Department: Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences.