Visual concepts and compositional voting

Wang, Jianyu; Zhang, Zhishuai; Xie, Cihang; Zhou, Yuyin; Premachandran, Vittal; Zhu, Jun; Xie, Lingxi; Yuille, Alan L.

Download: CBMM-Memo-087.pdf (3.373Mb)

Abstract

It is very attractive to formulate vision in terms of pattern theory [26], where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real-world images is very challenging and is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black boxes which are hard to interpret and, as we will show, can easily be fooled by adding occluding objects. It is natural to wonder whether by better understanding deep networks we can extract building blocks which can be used to develop pattern theoretic models. This motivates us to study the internal feature vectors of a deep network using images of vehicles from the PASCAL3D+ dataset with the scale of objects fixed. We use clustering algorithms, such as K-means, to study the population activity of the features and extract a set of visual concepts which we show are visually tight and correspond to semantic parts of the vehicles. To analyze this in more detail, we annotate these vehicles by their semantic parts to create a new dataset which we call VehicleSemanticParts, and evaluate visual concepts as unsupervised semantic part detectors. Our results show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines. We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this analysis, we use the visual concepts as building blocks for a simple pattern theoretical model, which we call compositional voting. In this model several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods like Support Vector Machines and deep networks trained specifically for semantic part detection.
Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called Vehicle Occlusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.
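
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the feature vectors from one layer of a network have already been collected into an (N, D) array, one row per spatial position; the function name `extract_visual_concepts`, the L2 normalization, the random initialization, and the iteration count are all illustrative assumptions, not details taken from the memo.

```python
import numpy as np


def extract_visual_concepts(features, num_concepts, num_iters=20, seed=0):
    """Cluster feature vectors with plain K-means (Lloyd's algorithm).

    features: (N, D) array, one row per feature vector (hypothetical input).
    Returns (centers, labels): (K, D) cluster centers and (N,) assignments.
    """
    rng = np.random.default_rng(seed)

    # Assumption: L2-normalize each vector so clustering depends on
    # direction rather than magnitude.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    x = features / np.maximum(norms, 1e-12)

    # Initialize centers from randomly chosen data points (copied by indexing).
    centers = x[rng.choice(len(x), size=num_concepts, replace=False)]

    for _ in range(num_iters):
        # Squared Euclidean distance from every vector to every center: (N, K).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its members; keep it if empty.
        for k in range(num_concepts):
            members = x[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # Final assignment so labels agree with the returned centers.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)
```

Under these assumptions, each returned center plays the role of one candidate visual concept, and the image patches assigned to it can be inspected to judge whether the cluster is visually tight.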

Date issued

2018-03-27

URI

http://hdl.handle.net/1721.1/115182

Publisher

Center for Brains, Minds and Machines (CBMM)

Series/Report no.

CBMM Memo Series;087

Collections

CBMM Memo Series