This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. CBMM Memo No. 061 July 31, 2017 Full interpretation of minimal images Guy Ben-Yosef, Liav Assif, Shimon Ullman Abstract The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition. 1 Title: Full interpretation of minimal images. Author names and affiliations: Guy Ben-Yosef1,2*,3, Liav Assif1, Shimon Ullman1,3 1. Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel. 2. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 3. Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. * Present address. Corresponding author: shimon.ullman@weizmann.ac.il Word count for manuscript: 12,066 Abstract: The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition. 2 Keywords: Image interpretation; Minimal images; Parts and relations; Top-down processing; 1. Introduction Humans can recognize in images not only objects (e.g., a person) and their major parts (e.g., head, torso, limbs), but also multiple semantic components and structures at a fine level of detail (e.g., shirt, collar, zipper, pocket, cuffs etc.), as in Fig. 1A. Identifying detailed components of the objects in the image is an essential part of the visual process, contributing to the understanding of the surrounding scene and its potential meaning to the viewer (Sec. 6.1). Although this capacity is of fundamental importance in human perception and cognition, current understanding of the processes involved in detailed image interpretation is limited. From the modeling perceptive, existing models cannot deal well with the full problem of detailed image interpretation, and, as discussed below, the limitations are of fundamental nature. Computational models of object recognition and categorization have made significant advances in recent years, demonstrating consistently improving results in recognizing thousands of natural object categories in complex natural scenes (Sec. 2). However, existing models cannot provide a detailed interpretation of a scene’s components in a way that will approximate human perception. For example, for a given image such as Fig. 1A, existing models can correctly decide if the image contains a person (e.g., Csurka et al., 2004; Simonyan & Zisserman, 2015), and can locate a bounding box around the body (e.g., Dalal & Triggs, 2005; Girshick et al., 2014). At a more refined level, current algorithms can provide an approximate segmentation of the body figure (e.g., Long et al., 2015), and can locate image region containing the main body parts, such as the torso region, the face, or the legs (e.g., Chen et. al., 2017; Vedaldi et al., 2014), or keypoints at the joints (e.g., Chen & Yuille, 2014; Wei. et al., 2016). However, existing computational models cannot achieve the accuracy and richness of the local interpretation of image components perceived by a human observer (e.g., as in Fig. 1B). 3 To clarify the terminology, by the term ‘visual interpretation’ we refer to a mapping between entities in the images and entities in the world (such as objects, object categories, object parts at different levels, and other physical entities). For instance, within a face image, a particular image contour may correspond to, say, the mouth’s upper lip. The contour is an image component, the upper-lip is a semantic component in the outside world, and the interpretation process maps between the two. 1.1. Local image interpretation Producing a detailed interpretation of an object's image is a challenging task, since a full object may contain a large number of identifiable components in highly variable configurations. We approach this task by decomposing the full object or scene image into smaller, local, regions containing recognizable object components. There are several advantages to perform the interpretation first in local regions, and then combine the results. First, as exemplified in Fig. 1B, in such local regions the task of full interpretation is still possible (Torralba, 2009; Ullman et al., 2016), but it becomes more tractable, since the number of semantic recognizable components is highly reduced. As will be shown (Sec. 5), reducing the number of components plays a key factor in Figure 1. (A). Humans can identify a large number of semantic features and parts in an object image. In the image of a walking person, features like the suit’s pocket, tie’s knot, left shoe, or the right ear, are easily identified by humans, among many others. (B). A detailed interpretation of a small image region, as identified by human observers. In small local regions, the number of semantic components is significantly smaller than in full images, and variability is reduced. (C). When the local region becomes too limited, human observers can no longer recognize and interpret its content when presented on its own (Ullman et al., 2016). C A Left neck contour Left tie contour Tie knot Right neck contour Right tie contour Suit-shirt contour Shoulder contour Local region recognized by humans, along with detailed internal interpretation: Eyebrow Shoulder Pocket Shoe Cuff Knee joint Belt’s Buckle Tie knot Local region that is no longer recognizable: B 4 effective interpretation. At the same time, when the interpretation region becomes too limited, observers can no longer interpret or even identify its content, as illustrated in Fig. 1C (Ullman et al., 2016). The goal of the model is therefore to apply the interpretation process to local regions that are small, yet interpretable on their own by human observers. A second advantage of applying the interpretation locally is that variability of configurations taken from the same object class, but limited to local regions, is often significantly lower compared with complete object images. For example, the full horse images in Fig. 2 (taken from the ‘horse’ category in ImageNet, Deng et al., 2012, a common benchmark for evaluating object recognition models) are quite different from each other, but can become significantly more similar at the level of local regions. This well-known advantage of local regions, which has been used in part-based recognition models, is extended below to define minimal recognition configurations. Finally, as will be discussed in the next section, the image of a single object typically contains multiple, partially overlapping regions, where each one can be interpreted on its own. Due to this redundancy, performing the interpretation locally and then combining the results increases the robustness of the full process to local occlusions and distortions. 1.2. Minimal configurations In performing local interpretation, how should an object image be divided into local regions? The approach we take in this study is to develop and test the interpretation model on regions that can be interpreted on their own by human observers, but at the same time are as limited as possible. We used for this purpose a set of local recognizable Figure 2. Complete horse images taken from ImageNet object recognition benchmark (Deng et al., 2012), and a small recognizable region that is interpretable (similar to Fig. 4A), next to each complete horse image illustrating the reduced variability in small recognizable region vs. the complete object image. 5 images derived by a recent study of minimal recognizable images (Ullman et al., 2016). We briefly describe below how these images were obtained, and then explain the reasons for using these local images in developing and testing the interpretation model. A ‘minimal configuration’ (also termed Minimal Recognizable Configuration, or MIRC) is defined as an image patch that can be reliably recognized by human observers, which is minimal in the sense that further reduction by either size or resolution makes the patch unrecognizable. To discover minimal configurations, an image patch was presented to observers: if it was recognizable, 5 descendants were generated: four by small (20%) cropping at one of the corners, or and one by reducing resolution (by 20%) of the original patch. A recognizable patch is identified as a ‘minimal configuration’ if none of its 5 descendants reach recognition criterion (set to 50%, results are insensitive to this setting). A search started with images from different object classes (Fig. 3A), and identified their minimal configurations over all possible positions, sizes and resolutions. Each subject saw a single patch only from each original image, requiring over 15,000 subjects. Testing was therefore done online using Amazon’s Mechanical Turk platform (MTurk), combined with laboratory controls. At the end of the search, each object class was covered by multiple minimal configurations at different positions and sizes. Minimal configurations were on average about 15 image samples in size; some contained local object parts, others were more global views at a reduced resolution. Examples of identified minimal configurations are shown in the top row of Fig. 3B. 0.93 0.03 0.64 0.03 0.79 0.88 0.79 0.79 0.79 0.00 0.04 0.00 0.0 0.13 0.79 0.16 0.69 0.17 Figure 3. Minimal configurations adapted from Ullman et al. (2016). (A). The search for minimal images started from different object images (8 shown here), each composed of 50x50 image samples. (B). Top row: minimal images discovered by the search. Bottom row: sub-minimal configurations, which are slightly reduced versions of the images on top. Numbers below each image show correct recognition rate by 30 human observers. Small changes to the local image at the minimal configuration level can have large effect on recognition. A data set of such pairs is used below for modeling the interpretation of local regions. B A Plane Ship Fly Eagle Horse Eye Suit Glasses 6 A notable aspect of the results for the purpose of the current study is the presence of a sharp transition for almost all minimal configurations from a recognizable to a non- recognizable minimal image: a surprisingly small change at the minimal-configuration level can make it unrecognizable. Examples are shown in Fig. 3B, bottom row, together with their respective recognition rates. The loss of recognition when the image is sufficiently reduced and features are removed is expected, but the sharp drop at the minimal level is remarkable, and consistent across many examples. It was used below to identify informative properties and relations for the interpretation process. It was also found that the large gap in human recognition rate between minimal and sub-minimal images is not reproduced by current computational models of human object recognition (Serre et al., 2007) and recent deep network models (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015). As shown below (Sec. 5.2), the full interpretation model can provide at least a partial explanation to this sharp drop in recognition. 1.3. Recognition and interpretation With respect to local interpretation, recognition tests of minimal images showed that although the minimal images are ‘atomic’ in the sense that their partial images become unrecognizable, humans can consistently recognize multiple semantic features and parts within them. It was noted (Ullman et al., 2016) that recognition and interpretation of minimal images go hand in hand in the sense that under the tested conditions (unlimited viewing time), when subjects correctly recognized a minimal image, they were also able to provide an internal interpretation of multiple internal components. Since in minimal images all the available information is, by definition, crucial for recognition, we propose in the model below that all the interpreted components of minimal images also contribute to their recognition. As described further below (Sec. 4.3, 5.2, 6.4), in the model, the full interpretation process contributes to accurate recognition, since a potential false detection can be rejected if it does not have the expected internal interpretation. For the purpose of modeling human visual interpretation, our initial focus is on the interpretation of minimal images, for the following reasons. First, they provide a useful test set for the model: since they are interpretable by humans, a theory of human image interpretation should be applicable to such configurations. Second, we use minimal 7 and sub-minimal pairs with a large gap in recognizability and interpretability as a source for inferring useful features for the interpretation of minimal images (Sec. 4). Before describing the model, we briefly describe past work related to visual object interpretation. 2. Related work on visual object interpretation Visual recognition can take place at different levels of details, from full objects and their main parts, to fine details of objects' structure. In modeling human visual perception, as well as in computer vision, much of the work to date has focused on relatively coarse levels, rather than full object interpretation considered here. For example, a leading biological model of the human object recognition system, the HMAX model (Riesenhuber & Poggio, 1999; Serre et al., 2007) produces as its output general category labels of full objects, rather than a detailed interpretation. Other biologically inspired models of recognition use features based on unit responses along the ventral cortical hierarchy (e.g., Murphy & Finkel, 2007; Rodriguez-Sanchez & Tsotsos, 2011), but their focus is again on shape representation and object recognition rather than full interpretation. Some models of human vision, such as the Recognition by Components (RBC) model of human object categorization (Biederman, 1987), deal with both objects and parts, but the parts are limited to a small number of 3-D major components, and do not provide a detailed object interpretation. A model for human image interpretation (Epshtein et al., 2008) was shown to provide partial image interpretation by a combination of bottom-up with top-down processing. The model uses a hierarchy of informative image patches to represent object parts at multiple levels. The current model also uses a combination of bottom-up and top- down processing, but it provides a significantly richer interpretation, and based on computational and psychophysical considerations, it uses an extended set of elements and relations. A preliminary version of the model was described in Ben-Yosef et al. (2015). The current model extends the early version in the use of minimal images (rather than local image regions), in testing on multiple classes, and in comparisons with human vision. 8 In computer vision, there has been a rapid progress in different aspects of object and scene recognition, based primarily on deep convolutional neural networks and related methods (He et al., 2016; Hinton, 2007; Krizhevsky et al., 2012; LeCun et al., 2015; Simonyan & Zisserman, 2015; Yamins et al., 2014). Such methods have also been adapted successfully for image segmentation, namely, the delineation of image regions belonging to different objects. For example, recent algorithms (e.g., Dai et al., 2016; Hariharan et al. 2015; Long et al., 2015) can identify image regions belonging to different objects in the PASCAL (Everingham et al., 2010) or CoCo (Lin et al., 2014) benchmarks; however, they do not locate the precise object boundaries, and do not identify the object’s semantic components. A number of studies have begun to address the problem of a fuller object interpretation, including methods for part-based detectors, object parsing, and methods for so-called fine-grained recognition. Recent examples include modeling objects by their main parts, for example an airplane’s nose, tail, or wing (Vedaldi et al., 2014), or modeling human-body parts such as the head, shoulder, elbow, or wrist (e.g., Felzenszwalb et al., 2010; Girshick et al., 2015). Related models provide segmentation at the level of object parts rather than complete objects (applied e.g. to animal body parts such as head, leg, torso, or tail, e.g., Azizpour & Laptev, 2012; Chen et al., 2017). Another form of interpretation has been the detection of key-points within an object, such as key-points of the human body (e.g., Andriluka et al., 2014; Chen & Yuille, 2014; Tompson et al., 2015) and within the human face (e.g., Yang et al., 2015; Xiao et al., 2016). 2.1 Structured models and top-down processing The goal of interpretation models, such as those above, is to identify the semantic structure in an image region. The model is usually given during learning a set of training images together with their interpretation, i.e., a set of semantic elements within each image, and the goal of the model is to identify similar elements in a novel image. In a correct interpretation, the internal components are expected to be arranged in certain consistent configurations, which are often characterized in the model by a set of spatial relations between components. The task of producing the semantic interpretation can 9 therefore be naturally approached in terms of locating within an image region a set of elements (primitives), arranged in a configuration that satisfies relevant relations. The term ‘relations’ also includes here properties of single elements (e.g., the curvature, location, or size of a contour), which can be considered as unary relations. The model described in this work belongs to this general approach of structured models. There is a rich history to the use of structural models in the computational study of vision, including visual recognition and interpretation. Models differ in the shape components used to create structured configurations, the relations used to represent configurations, and the algorithms used to learn structures from image examples, and to identify similar structure in novel images. Basic shape components used in past structural models include edge and boundary elements, including contours (e.g., Brooks, 1983), contour-pairs (e.g., Brooks, 1983; Ferrari et al., 2010) and boundary fragments (e.g., Opelt et al., 2006); image patches, regions and their descriptors (e.g., Fei-Fei et al., 2006; Felzenszwalb & Huttenlocher, 2005; Hanson & Riseman, 1978; Todorovic & Ahuja, 2006; Zhu & Mumford, 2006) complete hierarchies of increasingly complex contour or region combinations and their descriptors (e.g., Fidler & Leonardis, 2007; Ommer & Buhmann, 2007; Siddiqi et al., 1999; Zhu et al., 2009), obtained by grouping and segmentation processes; as well as 3-D surfaces and volumes (e.g., Hanson & Riseman, 1978; Marr & Nishihara, 1978). The relations used in these models were mostly simple, in particular, the expected location within a reference frame and relative displacement (e.g., Chen & Yuille, 2014; Chen et al., 2017; Fei-Fei et al., 2006; Felzenszwalb & Huttenlocher, 2005; Fidler & Leonardis, 2007; Felzenszwalb et al., 2010; Ferrari et al., 2010; Ommer & Buhmann, 2007), but a few used more complex relations such as co-termination (Ferrari et al., 2010), parallelism of elements (Zhu & Mumford, 2006), and containment (Todorovic & Ahuja, 2006). In terms of algorithms used to learn, and then identify, image structures, closest to our model are methods developed and used in the field of machine vision under the 10 general term ‘structured prediction’ such as Structured Support Vector Machine (Joachims et al., 2008), and Conditional Random Field (Lafferty et al., 2001), combined with deep network algorithms (e.g., Chen & Yuille, 2014). These models are given the set of possible relations to use, and they learn the specific parameters from examples. Other methods used in related past models include probabilistic graphical models (e.g. Epshtein et al., 2008; Fei-Fei et al., 2006; Jin & Geman, 2006), and stochastic grammars (e.g., Zhu & Mumford, 2007; Zhu et al., 2009). A basic distinction between methods is that some rely on purely bottom-up processing (e.g. Felzenszwalb & Huttenlocher, 2005; Krizhevsky et al., 2012; Riesenhuber & Poggio, 1999), while others combine bottom-up with top-down processes (e.g. Epshtein et al., 2008; Fei-Fei et al., 2006; Zhu & Mumford, 2006). 2.2 Focus of the current model compared with past models None of the past models mentioned above implemented semantic interpretation at the level considered in this paper, but several, in particular models using hierarchical representations and grammar models, incorporate descriptions at multiple levels and could possibly be extended to include semantic parts at all levels (e.g. Ommer & Buhmann, 2007; Marr & Nishihara, 1978; Zhu & Mumford, 2006; Zhu et al., 2009). In formulating a structural model for image interpretation, the main aspects to consider are the components used to describe the structure, relations between them, and methods for learning the underlying image structure and finding similar structures in novel images. Our main focus in the current work is on the identification of informative components and relations, using the set of minimal images combined with sub-minimal images and hard-negative examples. In most past visual models that deal with image structures, informative relations have been limited to a small set of simple relations, in particular relative displacement between components. As elaborated below, results of the present modeling show that the capacity to provide full interpretation requires the use of features and relations, which go beyond those used in most past models. In comparing the present model with alternative approaches, of particular interest are comparisons with recent bottom-up, network-based models (Sec 5.2, Sec. 6.4). With 11 the recent success of such models in various visual tasks, comparisons are useful to explore the possible limitations of purely bottom-up methods, and the potential contribution of structural models, which often naturally employ, as the current model does, a combination of bottom-up and top-down processes. Network models are a useful form for formulating models of visual processes, but our results suggest that exclusively feed-forward network models are unlikely to be sufficient for detailed image interpretation. Recent deep nets modeling include extensions beyond feed-forward processing, such as recurrent and LSTM nets (e.g., Devil et al., 2011; Mnih et al., 2014, for recognition, or Xiao et al., 2016, for facial landmark detection), but not for detailed local interpretation. It will be of interest to study in the future network models, which can efficiently incorporate the computations used by the current model for that task of detailed image interpretation (Sec. 6.5). 3. Model description Our interpretation scheme has two main components: in the learning stage, it learns the semantic structure of an image region in a supervised manner, and in the interpretation stage, it identifies the learned structure in similar image regions. These two stages are described in the rest of this section (combined with the appendices, which supply more technical details). 3.1 Learning setup The learning stage derives the semantic structure of an object region based on positive examples coming from class images, and negative examples derived by the system from similar but non-class images. We first describe how these training examples are obtained, and then how the region’s semantic structure is learned from them. Positive examples are supplied manually during a preparation stage as a set of image regions with their interpretation, namely, the semantic elements that should be identified and localized. Since the goal is to model humans' ability to obtain a detailed local interpretation, the target set of semantic primitives to identify was collected for different minimal images using human observers. The semantic features to be identified by the model, e.g. 'ear', 'eye', 'tie knot' etc., were features that human observers label consistently in minimal images, verified using a Mechanical Turk procedure (see 12 examples in Fig. 4, top row, and Appendix A for procedure details). The average number of consistently identified elements within a single minimal configuration was 8. To capture the recognized internal components fully as perceived by humans, the primitive elements in the model were divided into three types: two-dimensional (2-D) regions, 1-D contours, and points (0-D). Example sets of primitives for modeling the interpretation of minimal images are shown in Fig. 4, bottom row. For instance, a point-type primitive may describe the eye in the horse head model (Fig. 4A), and a contour-type primitive describes borders such as the borders of the tie in the man-in-suit (Fig. 4B). Larger semantic features marked by observers such as the ship’s ‘bow’ region or the tie’s ‘knot’, were marked as region primitives (outlined squares in Fig. 4, bottom row). The three types of primitives are also supported by psychophysical and physiological studies (e.g., Attneave, 1954; Pasupathy & Connor, 1999). Given the semantic elements identified by humans in a minimal image of class C (e.g., a horse-head), we prepared a set of annotated images, in which the semantic components (denoted 𝑃" below) were marked manually (with automatic refinement). Examples for such annotations are shown in Fig. 5A. The unsupervised learning of components and relations is considered briefly in the final discussion (Sec. 6.2). Figure 4. Human interpretation of minimal configurations. (Top row). All components that were identified consistently by human observers (Appendix A). (Bottom row). In the interpretation model the components are represented by three types of primitives: points, contours, regions, together with relations between them. For each column, the identified components on the top panel are plotted in different colors on the bottom panel, and by either a point, a contour, or a region (an outlined square). Ear Mane contour Eye Head upper contour Mouth Neck Neck lower contour Head lower contour Shoulder contour Suit-shirt contour Neck right contour Neck left contour Tie knot Tie left contour Tie right contour Deck upper contour Bow contour Water-ship contour Mast left contour Mast Mast right contour Bow A B C D Eyebrow upper contour Iris Iris contour Sclera Lid upper contour Lid lower contour Eyebrow lower contour E Top tube’s lower contour Fork left contour Bottom tube’s upper contour Head tube contour Tire lower contour Tire upper contour Fork right contour Tire lower contour Tire upper contour Bottom tube’s lower contour 13 Having a set of interpretation examples, the learning process next searches automatically for negative interpretation examples – these are non-class images that are potentially confusable with class images. The procedure for identifying so-called ‘hard negatives’ (e.g., Azizpour et al., 2012; Felzenszwalb et al., 2010) (detailed in Sec. 4.3), starts from a large set of random non-class examples, and then iterates over two steps: applying interpretation, and finding non-class examples with high interpretation score (which is produced by the interpretation algorithm), then adding them to the training set and re-training the interpretation model. 3.2 Learning the semantic structure: For a minimal configuration C, we define its semantic structure 𝑆" as a pair of two sets: the set of semantic components 𝑃" mentioned in Sec. 3.1 (also called below the ‘primitives’), and a set of relations between primitives, denoted by 𝑅" , namely 𝑆" = < 𝑃", 𝑅" >. We include properties of a single primitive as a relation with a single argument. A basic problem at this stage is therefore to learn a set of relations that are useful for identifying configurations, namely, which appear in the positive class examples, and distinguish them from configurations found in the similar but non-class negative examples. The relevant relations for a given image are selected automatically during learning from an Figure 5. Stages in the interpretation scheme, with horse-head as an example. (A). Point, contour, and region primitives that represent the identified parts (cf. Fig. 4) are annotated in training examples (several shown here), and are used to learn an interpretation model, which combines the primitives with relations between them. (B). Results of the interpretation model for 3 novel examples of the horse head minimal configuration. Interpretation model Learning stage: Inference stage: input: Model results: A B 14 initial candidate set of potentially informative and useful relations to compute (see Sec. 4 on how this set was obtained). For instance, whether the relation 'containment' between pairs of primitives should be included in 𝑅" , all potential pairs of primitives are examined, using the positive and negative examples, to test if one primitive is consistently contained within the other. (See Appendix. B.1 for how the contribution of a relation to the final interpretation was measured.) Each of the relations used in the interpretation scheme is given an index, e.g. the relation ‘containment’ may have the index ‘4’. Following selection, the set of all informative relations identified in a given minimal image C are represented by the vector 𝑅" . Each element in 𝑅" specifies a relation, and its relevant components. For example, the 3rd component of 𝑅" (i.e., 𝑅"(+)) could be the triplet (4, 5, 7). This triplet means that relation 4, which is ‘containment’, holds between components 5 and 7 in the local image model, specifying that component 5 in the local model should be contained inside component 7. Similarly, the element in position 4 in 𝑅" (i.e., 𝑅"(-)) can be a ‘straightness’ (unary) relation of primitive index 2, etc. Relations in our model could be either binary, e.g., ‘containment’, or represented by a scalar, e.g., the property ‘location’, specifying the location of a component within the local image. A detailed description of the learning model and procedure based on ‘structured learning’ framework (e.g. Shalev-Shwarts & Ben-David, 2014) is given in Appendix B. For a novel image, the vector representation 𝑅" of the image structure is derived as described in the next section, and then used for final interpretation decision. 3.3. Interpretation of a novel image In this section we assume that a local image region has been identified as a likely candidate of a particular object or object part, and the current task is to produce an internal interpretation of the candidate region, and make a final decision about its identity. More details of the algorithm are given in Appendix B, and we also describe later (Sec. 6.4) how the initial detection and full interpretation are integrated together in a combined scheme of a bottom-up stage identifying likely candidates (e.g. by a DNN classifier trained for the task), followed by a top-down interpretation and validation stage. 15 The interpretation process starts with a candidate region and its proposed classification (e.g., that it contains a horse-head). The process then uses the learned model of the region’s structure to identify within the region a structure that best approximates the learned one. This process proceeds in two main stages. The first is a search for local primitives, namely points, contours, and regions in the image, to serve as potential candidates for different components of the expected structure. The second stage searches for a configuration of the components that best matches the learned structure. The first stage identifies in the image candidate primitives for the model components, as described in Appendix B1. This process includes local edge detection and grouping edges into local contours. Properties and relation between components are then computed as described in Appendix C. The second stage searches for the best configuration of candidates, by a structured prediction algorithm described in Appendix B1 and B2. To match a given image configuration to the learned structure, we compute the relations in 𝑅" for this configuration, and then use a compatibility scoring function Relations Input 𝑰 Detected candidates: Explore possible configurations: Output: The most compatible configuration Compute relations in 𝑹𝑪 Figure 6: An overview of the interpretation process of a novel image. From left to right, Input image. Detected candidates: of the primitive components, examples for 3 candidates of each primitive are shown. Configurations: examples of possible configurations of detected primitives (denoted by 𝜋 in Appendix B); the one at the bottom is the optimal one. Computing relations: compute the relations in 𝑅𝐶 for each candidate configuration (the vector 𝜙𝑆(𝐼, 𝜋) in Appendix B). A compatibility score: a scoring function (𝑔(𝜙6(𝐼, 𝜋); 𝑤) in Appendix B) is computed for each configuration. The configuration with highest score is returned as final interpretation; the highest score is returned as the ‘interpretation score’. Relations Relations 16 based on a random forest classifier (Breiman, 2001, Appendix B), which produces a number that evaluates the degree to which the configuration is a correct interpretation of the input image. The interpretation scheme finally selects the highest-scoring configuration, and returns both this configuration (termed ‘interpretation’) and its score (termed ‘interpretation score’). A search among multiple configurations is feasible due to the small number of primitives in the local region. This overall process is illustrated in Fig. 6. A detailed description of the scoring procedure and the optimization part (i.e., finding the most compatible configuration) is given in Appendix B. 4. Useful relations for interpretation Producing an interpretation of an image region requires the localization of its participating components, and verifying their correct configuration. The model verifies the structure using inter-elements relations, and a natural question is therefore which relations are useful in modeling local semantic structures. The visual system is known to be sensitive to a range of spatial properties and relations between components such as curvature, straightness, proximity, relative displacement, collinearity, inclusion, bisection, and others, which have been studied both perceptually and physiologically (see review in Sec. 4.1 below). It is unknown, however, which relations play a significant role in the task of visual interpretation. In this section we describe the methods we used to identify informative relations for interpretation, which were then included in the set of interpretation relations used by the model. In contrast with the richness of relations that can be efficiently perceived by the visual system (Sec. 4.1), the majority of models for image recognition and interpretation have been based on a limited number of basic relations. Recognition models based on deep networks obtain high performance in basic categorization, but when the task requires a more detailed interpretation, e.g. identifying keypoints in human pose estimation, performance often improves by explicitly incorporating inter-element relations, in particular relative displacement and orientations, using e.g. CRF models (Chen et al., 2017; Chen & Yuille, 2014; Wei et al., 2016;). We next examined the set of relations which are informative for the full interpretation of local images. 17 The availability of minimal images allowed us to examine whether basic relations used in previous schemes are sufficient for producing an accurate interpretation by the interpretation model. Minimal configurations are by construction non-redundant visual patterns, and therefore their recognition and interpretation depend on the effective use of all the available visual information. It consequently becomes of interest to examine the performance of a model that uses a limited set of relations when applied to the interpretation of minimal images. To this end, we constructed a version of the interpretation scheme, where the set of relations was limited to displacement and proximity relations. Performance for this version proved insufficient compared with human interpretation (see more details in Sec. 5). This limitation motivated the search for additional informative relations, which were shown to improve the interpretation of minimal images. It is worth noting that since minimal images contain small sets of components, it becomes more feasible to use in the model inter-element relations that are more complex and more computationally demanding than used in past models. We describe in Sections 4.1-4.4 below the process of identifying informative relations for the interpretation process. Previous psychophysical and physiological studies have proposed a number of relations that the visual system is sensitive to. These provided an initial set of candidate relations, and each relation was evaluated by measuring its contribution to the interpretation model applied to a test set of minimal configurations, combined with sub-minimal configurations (Sec. 4.2) and hard-negative examples (Sec. 4.3). We finally describe the relations that were found to be informative for learning interpretation. We also consider (Sec. 6.2) how a more complete set of informative interpretation relations could be learned and refined over time. 4.1 Relevant visual relations in past literature The study of relations between elements in the visual field dates back at least to the Gestalt school and its principles of perceptual organization (Wertheimer, 1923). These principles were based on relations that group visual elements together to be perceived as coherent units, and included proximity, similarity, connectivity, symmetry, and continuity between dots, contours, or regions. Psychophysical experiments since have shown that the human visual system is effortlessly sensitive to a range of spatial 18 properties and relations between visual elements. Such relations include: parallelism and symmetry (e.g., Feldman, 2007; Machilsen et al., 2009; Stahl & Wang, 2008), curvature and convexity (e.g., Foster et al., 1993; Murphy & Finkel, 2007; Rodriguez-Sanchez & Tsotsos, 2011), connectedness of blobs (e.g., Palmer & Rock, 1994), and connectedness of contours (e.g., Elder & Zucker, 1996; Elder et al., 2003; Jacobs, 1996; Jolicoeur et al., 1986), continuity of contours (e.g., Kanizsa, 1979), co-linearity (e.g., Field et al., 1993) and co-circularity (e.g., Parent & Zucker, 1989) of contours, relative length of lines and contours (e.g., Saarela et al., 2009), bisection (e.g., Westheimer et al., 2001), and inclusion (Ullman, 1984). For many of these relations, it remains unclear whether they are being formed at early stages of visual perception in a bottom-up manner (e.g., Field et al., 1993; Kanizsa, 1979; Parent & Zucker, 1989) or at later stages, applied in a top-down manner to early visual representations (e.g., Jolicoeur et al., 1986; Roelfsema et al., 1998; Ullman, 1984). It is also still unclear which of the relations perceived effortlessly by humans play also a direct role in recognition and interpretation. The computational test described below evaluated directly the contribution of different relations to the interpretation of minimal image. To search for informative relations for interpretation, we started with a list of visual relations identified in past studies listed above, called the ‘candidate relations’, and tested their contribution to the interpretation process applied to minimal and sub-minimal images and hard-negative examples, as discussed next (Sections 4.2-4.3). 4.2 Useful relations from minimal vs. sub-minimal images The sharp drop in humans’ ability to recognize and interpret a minimal configuration when the image is slightly reduced (Ullman et al., 2016), provided a tool for identifying useful relations for modeling human interpretation. A minimal image was compared with its similar, but unrecognized sub-image, to identify either a missing component (e.g., a contour), a missing region feature, or a relation (e.g., connected contours that become unconnected), which were present in the minimal image but not in the sub-minimal configuration. Examples are illustrated in Fig. 7, where pairs of minimal vs. sub-minimal configurations are shown (columns 1-2), along with the sets of internal semantic components that were identified by human observers in the minimal images (column 3). By using the human annotations, we found if any components in the minimal image were 19 missing in its sub-minimal image. Using the set of candidate relations, we identified relations that are satisfied in the minimal but not the sub-minimal image. The missing component or relation may not be unique, and in such cases we evaluated a number of alternatives. The examples in Fig. 7 include the existence of the left-side tie contour (7A), connectedness of the two horse muzzle contours (7B), high-curvature meeting of contours (7C), and characteristic texture in the water region (7D). These features have been shown in the past to be informative for recognition (e.g., Foster et al., 1993; Murphy & Finkel, 2007; Pasupathy & Connor, 1999; Rodriguez-Sanchez & Tsotsos, 2011). We next evaluated for each of the missing components or relations, how consistent it is among other examples of minimal images, and how informative it is for the interpretation Figure 7. Inferring relations between internal components with large contribution to recognition and interpretation. Minimal and sub-minimal pairs (columns 1,2, recognition rate shown below the images), are shown with internal components recognized by humans in the minimal images (column 3). To identify useful components and relations for interpretation, we compared the minimal and sub-minimal images. Using the identified components, we found if any component in (1) are missing in (2). Using the set of candidate relations, we identified relations that are satisfied in (1) but not in (2). The contribution of each missing component or relation was then evaluated using training examples (see text). When necessary, several alternatives were evaluated. Examples of informative components and relations are shown in column 4. Examples of additional MIRC / sub-MIRC pairs in the training set with the same missing component or relation, with its effect on recognition, are shown in columns 5,6. Inferred components and relations illustrated in the figure are: missing contour element (in A), connectedness of two contours (B), contours meet at high curvature (C), and characteristic texture in a region (bounded by the red contour and image border) in (D). 3. Identified components: 4. Tested feature/relation 1. Minimal Configuration: 2. Sub-minimal configuration: 5. Minimal Configuration: 6. Sub-minimal configuration: A 0.03 0.64 0.97 0.10 C 0.00 0.79 0.90 0.27 D 0.14 0.67 0.59 0.10 0.83 0.07 B 0.33 1.00 20 process, using our full data set of training examples. We start by testing for consistency in the set of minimal and sub-minimal pairs of the same class namely, finding additional pairs separated by the same component or relation (Fig. 7, columns 5-6). As an initial filtering stage, components or relations playing a role in at least 3 additional pairs were kept for the next stage, in which they were tested by their contribution to the performance of the interpretation algorithm. Each relation (similarly for candidate components) was tested by adding it to the set of relations (namely, to the relations 𝑅"), training a new interpretation algorithm, and measuring the difference in interpretation performance with and without this relation. In more details, to test how informative is a given relation to the interpretation process, we have trained and compared two alternative versions of the interpretation model. The first version, (termed ‘basic’), included a limited set of relations commonly used in the visual structure modeling literature (Sec. 2), namely, unary relations based on local texture and shape appearance, and binary ones based on the relative displacement of components. The basic model is then compared with a second interpretation model (termed ‘augmented’), where the basic set of relations is augmented with the relation we wish to test. Performance of both models was evaluated on a data set, which included for each of the minimal images in Fig. 4, a set of 120 positive examples, and 8000 negative (non-class) examples, split between training and validation sets. Performance of the two models was compared by classification by the random forest classifier (using the Out-Of- Bag test for strength of random forest features, Brieman, 2001, Appendix B), to assess the contribution of each new relation. Relations that improved random forest classification average precision by 1% or more (found in pilot experiments to be significant), were incorporated in our final extended set of relations. The extended set was subsequently used in the overall evaluation of the model, applied to the interpretation and recognition of minimal and sub-minimal images (Sec. 5). Fig. 7 illustrates the process for example relations, which were found to be informative for interpreting the corresponding minimal images. 4.3 Useful relations from ‘hard negative’ examples In addition to the sub-minimal images test discussed above, which compared images from the same class, a complementary source for identifying useful relations for full 21 interpretation is a comparison of minimal configurations with ‘hard’ non-class examples, which are difficult in the sense that they are confusable with true class examples by current computational models (a deep net model, Simonyan & Zisserman, 2015, and a human recognition model, Serre et al., 2007). Such a comparison can identify components and relations that are informative for human recognition and interpretation, but are missing from current models. We describe next how hard-negative examples were generated and how they were used to identify useful relations for interpretation. To identify hard negative examples for a given minimal image, we trained a deep CNN- based classifier. We used the 19-layer CNN model described in Simonyan & Zisserman (2015), adjusted to recognize minimal images as follows: we fine-tuned the network to classify regions at the size of minimal images, and then used an intermediate layer (conv3_4, the 8th convolutional layer) as a descriptor for a final SVM classifier. We chose this intermediate layer based on its classification performance, compared to other network layers, and to complete end-to-end fine-tuning of the network.). We trained the classifier using 120 examples of the minimal image (see Sec. 5 for how these examples were obtained), and a large set of negative examples (200,000 local regions cropped and rescaled from various non-class images). We then applied the classifier on a validation set (equal in size to the training set), and finally retained the 4000 non-class image regions with the highest detection scores. These are the hard-negative examples, used in the search for informative relations. Similar to the use of sub-minimal images described above, the search proceeds along the following steps. We start with the ‘basic’ interpretation model as defined in Sec. 4.2 and iterate over the following procedure: i) Keep the k hard-negative images that received the highest interpretation score (since images later required MTurk tests, we used the limit k=40). ii) Confirm (using MTurk testing) that these negative examples are not confusable for human observers. (Examples that were also difficult for humans were removed from the set in practice, no more than 2 examples were removed at this stage). iii) Compare the interpretation produced by the model for the images collected in (i), to human annotations of the corresponding minimal image examples. As in 22 Sec. 4.2, identify components or relations (from the list of candidate relations) present in the positive examples but not in the hard negatives. iv) For each such a relation, test its contribution to the interpretation model by the difference in random forest classification with and without this addition, as in Sec. 4.2. v) Once relations from all hard negative images were tested, and the contributing subset was added to the relations set, train a new version of the interpretation model and repeat the search from step (i), to discover additional relations from hard negatives of the new version. We iterated this procedure until no new contributing relations were found (at most 3 iterations were needed per class). Figure 8: Useful relations for interpretation extracted from ‘hard negative’ examples. Columns show (left to right): minimal images with their human interpretation, non-class examples with high detection score and their human recognition rate, interpretation applied to the negative example by the model. Differences in components or relations are identified and evaluated, see text. Column 4 shows relations found to be informative for the interpretation model. They include: high straightness of two contours, typical of man-made objects (in A), connectedness of two contours through the ear region (in B), connectedness of two contours through a tie knot region (in C), coherent texture between the two shirt parts, see text (in D). The identified relations were used to reject hard negatives, examples in the last two columns. Identified components: ‘Hard negative’: Model interpretation: Useful feature: ‘Hard negative’: Model interpretation: 0.02 0.02 0.00 0. 00 A C D B 0.94 0.90 0.00 0.02 0.03 0.00 23 Fig. 8 illustrates examples of hard negatives discovered and used to identify informative relations, and the process of finding these relations. Examples include ‘highly-straight’ contours (typical for man-made objects) in the horse head (e.g., the red and yellow contours in Fig. 8A), the connectedness of horse head contours through the ear region (red and cyan contours in Fig. 8B), sharp corners at the tie knot’s (cyan and magenta contours, connected inside the brown square in Fig. 8C), and coherent visual texture (or intensity level) between the two shirt parts (the area that is left to the red contour and the area that is bounded by the contours in cyan and yellow in Fig. 8D). Relation Operands Description Relation Operands Description 1 All primitives Location and relative location: for all primitives, and for all pairs of primitives in the structure. 8 Contour, Contour Length ratio between two contours 2 Point Strength of intensity maxima/minima, center-surround filter responses at a point location. 9 Contour, Contour Parallelism between two contours 3 Contour Deviation from line/circular arc: in particular for man-made objects. 10 Region, Region Coherent visual appearance similar appearance/texture features in region i and in region j 4 Contour Visual appearance along contour distribution of visual appearance/texture features along contour. 11 Contour, Point Cover of a point by a contour: if a contour i covers a point j. For ‘cover’ refer to appendix C. 5 Region Visual appearance inside a region distribution of visual appearance/texture features in a region 12 Contour, Region Contour ends in a region: if a contour i ends in a region j. 6 Contour, Contour Relative location of contour endings: between endings of two different contours 13 Point, Region Containment: if point i is inside region j 7 Contour, Contour Continuity: smooth continuation between two given contour endings. 14 Contour, Contour, Region Contour Bridging: Testing whether two disconnected contour elements can be bridged (linked in the edge map). Table 1. Relations that were found informative for the learning process, by the method and criteria in Sec. 4.2 and 4.3. See implementation details for relation procedures in Appendix C. Relations 1,4 and 5 form the ‘basic set’ of relations (widely used in previous visual structure approaches, see Sec. 2), relations 1-14 are part of the ‘extended set’ of relations. 4.4 The final set of relations The final set of relations, obtained by comparing MIRCs to both sub-MIRCs and hard negatives, includes unary relations (properties), binary relations, and relations among three or more primitives. Relations in this set are composed of basic relations as listed in Sec. 4.2, extended with candidate relations which proved to contribute to the recognition and interpretation accuracy by the computational experiments in Sec. 4.2 and 4.3. Relations in the model range from low-complexity ones such as computing relative location between primitives, to higher complexity procedures such as computing the continuity, bridging (connectedness), or parallelism of contours. Table 1 lists relations with the highest contribution, as measured in Sec. 4.2 and 4.3. Below, we refer to the set of relations in Table 1 as the 'extended set' of relations, and we distinguish it from the ‘basic set’ of relations, namely relations 1,4 and 5 in Table 1, which are based on local 24 appearance, location, and displacement, and were wieldy used in previous visual structure approaches (see Sec. 2). Technical details for implementing the relation procedures, including their higher order versions that include more than two primitives, are discussed in Appendix C. We further compared the relative contribution of individual binary vs. unary relations to successful interpretation, by comparing the model performance with and without the relation in question (Sec., 4.2, Appendix B1). By this measure, binary relations contribute more on average to reducing ambiguity than the unary ones; however, some unary relations have high contribution as well (e.g., ‘straightness’, ‘intensity minimum/maximum’). 5. Experimental evaluation So far, we have identified the useful components and relations when tested individually. We next combined all of them in the full interpretation model (as described in Sec. 3) and tested its performance. The full set of relations for the trained model was composed of the extended set of relations listed in Table 1. To evaluate the full interpretation model, we performed experiments to assess (i) the interpretation correctness on novel images, (ii) the ability of the interpretation model to predict human recognition at the level of minimal image, and (iii) the contribution of informative relations included in the model to human recognition, using modified minimal images. Training of the model was obtained as described in Sec. 3, with annotated examples of minimal images, and non-class (negative) examples. To get positive class examples for the minimal image we wanted to model (e.g., 'horse-head'), we collected fully-viewed object images from known data sets (Flicker, Google images, ImageNet), and manually extracted from each image a local region at the position and size similar to the discovered minimal image (Ullman et al., 2016). The minimal image examples used for training were in slightly higher resolution than the minimal images found in Ullman et al., 2016 (image resolution was increased by factor of 1.5), since we found that using this scale during training improved the model results when applied to novel images. 25 To have ground truth for the interpretation, two human subjects provided annotation of the set of primitives for all examples (one annotator used for ground truth, the other for measuring consistency, details in Appendix A). Negative (non-class) examples for training were collected automatically from cropped windows in non-class images at similar size to the minimal image. To get hard negative examples, we trained a deep CNN classifier (Simonyan & Zisserman, 2015), as described in Sec. 4.3, and collected images that received high recognition scores. We next turn to describe our three testing procedures, in Sec. 5.1-5.3 below. 5.1 Comparing model output to human interpretation The interpretations produced by the model were compared with the ground truth annotations supplied by the human annotators. Since the model is novel in terms of Figure 9: Interpretation results for minimal images belonging to (counter-clockwise) a horse-head, a man in a suit, a bike, and an eye. 26 producing full interpretation, it cannot be compared directly with any existing alternative models. However, we made our set of annotations publicly available, and the current model provides a baseline to also evaluate future results. To assess the role of the extended relations derived in Sec. 4.2 and 4.3, we compared results from two versions of our model, which differed in the relations included in the model: one using only the basic, and the other using the extended set of relations. Fig. 9 shows examples of the interpretations produced by the model with the extended set for novel test images. To assess the interpretations, we matched the model output to human annotations for multiple examples. Our training set contained 120 positive examples, and 25,000 negative examples for each interpretation model. Our test set contained 480 examples for the horse head minimal image (Fig. 4A), 330 examples for the man-in-suit minimal image (Fig. 4B), and 120 of the eye (Fig. 4D) and the bike (Fig. 4E) minimal images. We automatically matched the ground truth annotated primitives to the interpretation output by the so-called Jaccard index, (Tan et. al., 2006), which is a commonly used similarity measure for comparing automatic detection results (high Jaccard means similar interpretations). This index compares the similarity of two regions, by the area of the regions’ intersection divided by area of their union, and was adapted to compare the accuracy of detecting region, contour, and point primitives, as illustrated in Fig. 10, and explained in more details in Appendix D. Table 2 shows results for the basic and extended relation sets, as well as agreement between different human annotators, which can serve as an upper bound for comparing interpretation performance. Interpretation using the extended set of relations was significantly closer to the ‘ground truth’ human interpretation compared with the use of basic set of relations (𝑃 <4.99×10?@@ for all primitives in 4 classes, n=33, one-tailed paired t test). However, the agreement between the model and ground truth interpretations was still lower than the Figure 10: Quantitative evaluation of the model interpretation results. We compared interpretation results to human annotations based on the Jaccard measure similarity criteria: for regions, contours, and points (see Appendix D for details). Human annotations Algorithm output 27 agreement between different human interpretations (𝑃 < 1.14×10?@+ for all primitives in 4 classes, n=33, one-tailed paired t test). 5.2 Interpretation for predicting minimal and sub-minimal images The link between interpretation and recognition, as discussed in Sec. 1.3, suggests that the interpretation score (which is a part of the model output) may be used as a part of the human recognition process at the minimal image level. In particular, it Basic Extended Humans Basic Extended Humans Horse-head Man-In-Suit Ear region 0.11 0.37 0.60 Knot region 0.62 0.66 0.74 Mouth region 0.69 0.76 0.85 Left tie contour 0.48 0.55 0.72 Neck region 0.55 0.68 0.74 Right tie contour 0.47 0.53 0.72 Upper head contour 0.44 0.69 0.84 Suit-shirt contour 0.64 0.73 0.83 Mane contour 0.34 0.61 0.79 Shoulder contour 0.50 0.63 0.66 Lower head contour 0.46 0.66 0.79 Left neck contour 0.49 0.65 0.84 Lower neck contour 0.32 0.63 0.74 Right neck contour 0.39 0.49 0.77 Eye point 0.29 0.49 0.60 All primitives Man-In-Suit 0.51 Basic 0. 61 Compound 0.75 All primitives 0.40 0. 61 0.75 Eye Bike Iris region 0.39 0.56 0.79 Fork region 0.72 0.73 0.80 Lower lid contour 0.47 0.62 0.73 Tire lower contour (left side) 0.68 0.75 0.86 Cornea contour 0.33 0.60 0.81 Tire lower contour (left side) 0.62 0.75 0.90 Upper lid contour 0.41 0.64 0.74 Bottom tube’s upper contour 0.59 0.74 0.86 Lower eyebrow contour 0.51 0.64 0.83 Bottom tube’s lower contour 0.54 0.70 0.87 Upper eyebrow contour 0.45 0.51 0.81 Top tube’s lower contour 0.36 0.43 0.84 Sclera point 0.56 0.54 0.79 Head tube contour 0.49 0.60 0.85 All primitives 0.44 0.59 0.78 Tire upper contour(right side) 0.50 0.62 0.81 Tire lower contour(right side) 0.53 0.57 0.78 Fork left contour 0.60 0.68 0.82 Fork right contour 0.59 0.71 0.83 All primitives 0.56 0.66 0.84 Table 2. Accuracy of the interpretation results, comparing the basic model, extended model, and human annotators. Accuracy is measured by the average Jaccard index between the model interpretation and ground truth supplied by human annotations. For comparison, human accuracy is measured by the agreement, measured by the Jaccard index, between the human annotators. is interesting to compare the interpretation scores for minimal and sub-minimal images, to assess the usefulness of interpretation for recognition. In human perception, there is a sharp drop in recognition rates at the minimal image level: a small change to the image can have drastic effects on recognition rate (Sec. 1.2, Ullman et al., 2016). This sharp drop was not reproduced by computational models of recognition, and it therefore becomes of interest to examine whether the internal interpretation of minimal image may provide a basis for this perceptual sensitivity. It is possible, for example, that even small changes to a minimal image could disrupt the presence of key elements and their relations. To test this possibility, we measured the gap between human recognition rates for minimal and sub-minimal images (via MTurk search on new image examples) and 28 compared it to the gap predicted by two models: the current interpretation model, and a classifier based on deep convolutional networks (very-deep CNN, Simonyan & Zisserman, 2015), trained on minimal image examples, as in Sec. 4.3. Our test set included 12 examples of minimal images and 20 examples of sub-minimal images for each of two minimal image categories: the horse-head (Fig. 4A) and man-in-suit (Fig. 4B). The average gap measured between human recognition rates for minimal images and for sub-minimal images was 0.75 for the horse head, 0.74 for man-in-suit. This sharp gap in human recongition at the minimal image level was compared with the computational models as described next. Figure 11: Recognition of minimal and sub-minimal images. Recognition gaps between minimal and sub-minimal images were computed for two recognition models, a 19-layer feed-forward CNN classifier, and the interpretation model (trained on a similar set of examples, see details in Sec. 4.3 for CNN, and 5.1 for interpretation). Results were compared with the large gaps that characterize human recognition of minimal images. (A). The panel shows recognition scores (y-axis) of minimal images (blue dots) and sub-minimal images (red), as computed by the CNN classifier. Each column (marked 1-20 on the x-axis) shows scores of one pair of minimal and its sub-minimal image. (A single minimal image can have more than one sub-minimal image). Green dashed line represents the human recognition rate, and recognition gap for the model is derived from the number of blue and red points above the threshold green line (see text for details). (B). Similar to A, but computing the recognition gap for the interpretation Model. (C, D). Similar to A,B but applied to the horse-head images. The simulation results show that the sharp drop in human recognition between minimal and their sub-minimal images, is reproduces by the interpretation model but not by the deep CNN model. (E). Examples of minimal and sub-minimal pairs and their interpretation, produced by the model. The interpretation of the minimal images is more accurate compared with the sub-minimal ones. The arrows show the corresponding score of each image by the interpretation and CNN models. Note that the scores of the minimal and sub-minimal images are more separated by the interpretation model compared with the CNN model. CNN19 model Interpretation CNN19 model Interpretation Humans: 0.03 Humans: 0.94 Humans: 0.13 Humans: 0.91 minimal sub-minimal threshold A B C D E CN N1 9 sc or e: CN N1 9 sc or e: In te rp re ta tio n sc or e: In te rp re ta tio n sc or e: Pair index Pair index Pair index Pair index 29 To compute the recall gap of models, the model’s classification score was compared against an acceptance threshold, and scores above threshold were considered class detections. For each model, we set the acceptance threshold to match the human recognition rate. For example, for the man-in-suit, the average human recognition across all 12 examples was 0.88, and the model threshold was set so that 11/12 examples will be accepted (see Fig. 11A). Recognition rate for the sub-minimal images was then derived from the fraction of sub-minimal images exceeding the threshold, and the difference in recognition rates defines the model’s recognition gap. The scores of the CNN model for the minimal and sub-minimal images on the test sets are shown in Fig. 11A,C. The gaps computed for the horse-head and man-in-suit were 0.20, and 0.37, respectively, both considerably smaller than the human recognition gap. The second model tested the interpretation trained as in Sec. 5.1, with the extended set of relations. Interpretation scores are shown in Fig. 11B,D, along with the interpretation examples of minimal and sub-minimal pair from each category. The average interpretation gap was 0.75 for the horse-head and 0.76 for the man-in-suit, closely similar to the gaps measured for humans. The differences in recognition gap between the CNN and interpretation models were highly significant (𝑃 < 2.44×10?- for horse-head, 𝑃 < 5.7×10?+ for man-in-suit, n=20, Fisher’s exact test). The difference is likely to arise because the interpretation model incorporates class-specific properties and relations that are not included in the CNN model. We discuss this difference further in Sec. 6.3, 6.4 below. 5.3 Testing predicted relations via intervention on minimal images The interpretation model includes informative relations between components, which were identified using the data sets of sub-minimal images and hard negatives. The model predicts that disrupting these relations should reduce the ability of human observers to recognize and interpret minimal images. To further verify the role of these 30 relations, we used direct intervention (Pearl, 2009) on minimal images, testing whether removing specific relations from the minimal image will decrease human recognition. For this purpose, we created transformed versions of the minimal images, in which specific relations were selectively manipulated. The transformed versions were then tested psychophysically via the MTurk. The transformations applied to minimal images included rendering sketches, including rendering k-color cartoons (k≤5), and re-coloring a small set of pixels (number of re-colored pixels ≤ 4), examples in Fig 12. Such sketches are typically highly recognizable, and are similar in terms of perceptual and brain responses to natural images (Walther et al., 2011). To create sketches, we traced contours of the original MIRC image, either manually as in Fig. 12 A, B (right column), and C, or semi-manually using straight lines, as in Fig. 12 B, middle column. Cartoon sketches are similar, but using a small number of grey-levels (≤5) for the regions (12D). Re-coloring images were done with interactive graphics design tools (Irfan, Photoshop). For all sketches, we kept all Figure 12: 'Intervention': Testing informative relations via transformed minimal images. (A-C). Rendering sketches from images (D). Creating k-color cartoons. (E,F). Re-coloring a small set of pixels ( ≤ 4, pointed by the red arrow ) with the same color of their neighboring pixels. In a transformed image, a relation is removed to test its predicted role in human perception. Relations tested: sharp curvature in the tie contour (in A), high contour straightness (in B), containment of a point in bounded contours (in C), coherent color/texture in the two parts (in D), minimum intensity (in E), and maximum intensity (in F). Relation not included: Minimal image: Relation included: 0.57 0.14 A 0.74 0.23 B 0.91 0.23 D 0.32 0.67 E 0.19 0.58 F Minimal image: Relation not included: Relation included: 0.89 0. 48 C 31 contours or segments in the minimal image that are used as primitives in the interpretation model, and verified that the sketched images were still recognizable (e.g., Fig. A-D, middle column). In the sketch images, a specific contour or a region can be selectively modified, with minimal or no change to other image parts. We created a modified version for each sketch, where selected contours or regions were changed based on the tested property or relation (e.g., Fig. A-D, right column). Since we know how a relation is computed in the model, we can change contours or regions such that this relation will no longer be detected. We then tested whether the specific disruption of a single relation will cause a significant drop in MIRC recognition as predicted by the model. The tested relations were taken from the set of the most informative relations in the relations set (Table 1). For each tested relation, we first applied a manipulation which removes the relation from the model relations vector (the computed 𝑅") while keeping the rest of the relations intact (the model can provide interpretation for both natural and sketched minimal images). Each relation was tested using five different pairs of manipulated and non- manipulated versions, and the average human recognition drop for each relation was measured. Example results are shown in Fig. 12. Fig. 12A-D used sketches from minimal images. The sketched versions eliminated specific relations in the representations: sharp curvature (12A, the tie knot, cf. Fig. 8C), high straightness measure (12B, bike contours, cf. Fig. 8A), containment of a point in region (12C, bird’s eye) and the coherent appearance (in intensity or texture) between two regions (12D, cf. Fig. 8D). In Fig. 12E- F, a local change was introduced to disrupt the model property of minimal (12E) or maximal (12F) local intensity. The change was induced by re-coloring 3-4 pixels, to match the average intensity of their neighboring pixels. For all tested relations in Fig 12, the manipulation resulted in a significant drop in human recognition rate. (For example, Fig. 12A, 5 image pairs, average drop = 0.41, 𝑃 <2.46×10?-, n=5, one-tailed paired t test. In similar one-tailed paired t tests for Fig. B-F, 𝑃 < 0.0052 for all cases). In summary, the results show a sharp drop in recognition 32 following intervention to eliminate a relation predicted by the model to be highly informative for the interpretation of the relevant minimal image. This agreement between the model and human recognition supports the proposed role of the tested relations in human recognition and interpretation of minimal images. 5.4 Dealing with variability The interpretation process identifies features in the image that correspond to semantic components in the stored MIRC model. This correspondence between model and image features may be disrupted in several ways (or their combination). It may not be one-to- one because the image either lacks a model feature, or it may have additional ones. Alternatively, a feature in the model may be replaced by a different one in the image. To be robust, the interpretation process is required to deal with such changes, which can be caused by natural image variations. Because of its local nature and the limited number of components, image variability in MIRCs is reduced to a minimum. In addition, the interpretation model can cope with significant image variability as described below. With respect to losing a feature, in the case of a MIRC, such a reduction will render it a sub- MIRC, which cannot be reliably recognized. This means that to recognize an object, we need at least one of its multiple MIRCs to be preserved. The constraint can be relaxed in a more complex scheme, where two or more MIRCs that are below recognition can be combined. This raises an interesting empirical question about human vision, testing whether the integrity of at least one MIRCs in an object image is a requirement for recognition. We suspect that this may be the case, but the question is open for empirical studies. Additional features in the image, beyond the features included in the MIRC model, can be tolerated by the current scheme, since the algorithm searches for a configuration in the image that matches the model, without requiring to match all the image features. With respect to replacing a model feature by a different image feature, the current scheme can tolerate significant changes between the model and image features. It can allow the replacement of a feature by another and can allow a feature to change within a broad range of parameters (e.g. a range of orientation, curvatures etc.) This is obtained in the model in two ways. First, the model learns to use abstract relations, which allow 33 considerable variation. For example, the tie contour (Fig. 9) are required to be roughly parallel and end at the knot region, but are allowed to change considerably in orientation and be either straight or curved. Such abstract relations allow for variability of primitives’ shape and appearance, and for some degree of many-to-one matching (e.g., the bike tubes in Fig. 9). Second, the random forest representation allows for several possible correct configurations of primitives and relations, which can be captured by its different trees. A configuration of primitive candidates is represented by a vector of candidates’ properties and relations (the ‘relations vector’), which is then given to each tree in the random forest (Appendix B). If different features, or a range of parameters were present during training, it becomes likely that different relations vectors, which represent allowed variations of the correct interpretation will get high score. The range of variations allowed in this manner may be more restricted than, e.g., in general stochastic grammars. When the range of variations are too large, then an additional MIRC model will be required to cover the full range. 6. Discussion and implications In this work, we described a model for local image interpretation, applied to minimal recognizable images. The ultimate goal of full image interpretation is to recognize meaningful semantic components anywhere in the image, but we used minimal images for the development and testing of the model for two reasons. First, local interpretation reduces the number of components and the complexity of the model, and second, using a data set of minimal and sub-minimal images is useful for identifying informative components and relations, which play a part in the interpretation process. The interpretation model was shown to produce reliable interpretation of local image regions. It also helps to explain the sharp drop in recognition between minimal and sub-minimal images, which is characteristic of human observers, but not reproduced by current bottom-up computational models. It will be interesting to further test in the future the agreement between human recognition errors of difficult images and errors made by recognition models, with and without an interpretation stage. 34 Similar to other cognitive and computational models, interpretation is defined in the model in terms of a local structure, composed of components, properties, and relations. Our empirical testing of properties and relations proposed in past studies, showed that a number of them contributed to the performance of the model (Table 1). In comparison, restricting the relations to relative displacements between components (‘basic’ relations, 1, 4, 5 in Table 1), which are commonly used in computational models, proved insufficient for reliable interpretation. Consistent with this computational evidence, a subset of the relations used by the model were tested and found to directly affect human recognition, as human recognition of modified minimal images, where tested relations were excluded, dropped significantly. Taken together, the role of the components and relations incorporated in the interpretation model is supported by three complementary sources of evidence: their contribution to correct interpretation by the model, the effect they have on the sharp difference in recognition between MIRCs and sub-MIRCs, and the effects of their selective elimination from minimal images on human recognition of these images. Future work in modeling the interpretation process should go beyond the interpretation of local regions discussed in this study, towards the interpretation of full, natural images. The interpretation of full images is likely to be goal-directed, namely, providing detailed interpretation of regions of interest, rather than uniformly across the image. Minimal images, at multiple scales, can provide a natural starting point for the fuller interpretation process, because they can be reliably recognized and interpreted on their own, independent of the surrounding context, and can subsequently help in further disambiguation and interpretation of nearby regions. 6.1 Detailed interpretation for complex visual tasks Full interpretation of semantic components at the level produced by the current model can play a useful role for extracting meaning from complex configurations, arising in tasks such as recognizing actions or social interactions between agents. The reason is that the exact meaning of an image may depend on fine localization of object parts and the relations between relevant parts, as illustrated in Fig. 13. In particular, the recognition of agents’ social interactions by computational models has proven difficult, and is still a 35 largely open problem. It will be of interest to extend in the future the current work, to study the role of detailed image interpretation in complex scenes, including the recognition of social interactions. 6.2 Learning relations In the current model, relations between components of the local interpretation are used to identify the correct structure. There are two main questions regarding the relations used for the purpose of interpretation. The first is the full set of relations that are useful for the task, and the second is identifying informative relations for a particular local structure (e.g., horse-head). Since the set of so-called ‘basic’ relations proved insufficient, we evaluated a larger set of relations, using minimal, sub-minimal, and difficult non-class images. The resulting set is not necessarily complete, and future studies may identify additional relevant relations. In terms of the human visual systems, such relations could be in part pre-existing in the visual system, and in part learned from visual experience. Regarding the identification of informative relations for a novel class of images, the approach in the model was to examine the full set of possible relations, and identify the informative ones using positive and negative examples, where the negative examples came from high-scoring non-class examples. It will be of interest to examine in the future the possibility of replacing this search by network learning models, based on positive and negative examples, but without using an explicit set of possible relations. The issue of unsupervised learning of semantic components is left for future studies, we only note that some components may be learned based on their independent motion within the image (e.g. an eye or mouth within a face), or based on points of contact between an agent and an object (such as a cup-handle or door-knob). Figure 13. Examples of fine interpretation in recognizing human actions and interactions. (A). Recognizing petting vs. feeding a horse (Yao et al., 2011) depends on the exact location of the human hand on the horse muzzle. (B). Whether the hand is touching the knot or not, determines the action of ‘fixing a tie’. (C). The hands contact locations provide important cues for recognizing a ‘hug’ interaction between the agents. A B C 36 6.3 Interpretation and Top-Down processing Our model suggests that the relations required for a detailed interpretation are in part considerably more complex than spatial relations used in current recognition models (Sec. 2). Furthermore, the experimental results show that the relations used for interpretation are often class-specific, in the sense that the most informative relations for the interpretation of a given class often depend on the class. This does not mean that a given relation R is specific to a single class, but that it is typically used in the representation of some classes, and not others. This is illustrated in Table 3, which shows the most informative relations found by the model for the interpretation of 4 different classes of minimal images. Since the subsets of informative relations are class-dependent, it will be computationally efficient to compute the more complex relations selectively, in a class-specific manner, rather than computing all possible relations for all candidate classes. In addition, even when R is computed for a given class, it can be computed between some components, but not others. In such a scheme, the interpretation process will be naturally divided into two main stages. The first is a bottom-up recognition stage, similar to current feed-forward models. This stage will lead to the activation of one or several objects classes, but without detailed object interpretation. The activated classes will then trigger a top-down process for the computation of further class-specific components and relations required for a detailed interpretation. The interpretation will also be used for validation of the activated classes in the first stage, by rejecting bottom- up detections which do not have the expected interpretation. Future studies could explore this two-stage proposal further by psychophysical and physiological methods. For example, since the accurate recognition of minimal images depends in the model on its internal interpretation, the top-down component predicts that the reliable recognition and interpretation of minimal images will be a relatively slow process compared with a single feed-forward pass. Horse-head Man-in-suit Eye Bike Intensity minimum (at the eye point) Contour appearance (along the tie) Deviation from circular (lid upper contour) Parallelism (tube contours) Contour Bridging of the mane and mouth upper contours Region appearance (suit region) Cover of point by contour (sclera by lid contour) Continuity (tire upper contours) Contour Bridging (at the mouth) Contour ending in region (tie contour in knot region) Relative contour endings (lower lid and the iris contours) Region appearance (wheel region) Table 3. Top 3 informative relations found for the different class models of minimal images 37 A successful recognition scheme should be able to tolerate natural variations in image transformations, such as changes in position, scale and orientation, combined with distortions and occlusion. In the two-stage model, invariance to image transformations is determined by both the first, bottom-up stage, and by the following top-down stage. Regarding translation and rotation, invariance in the model depends primarily on the bottom-up stage. Current bottom-up models can identify and localize candidate objects or parts at different positions, and can tolerate a range of rotations, which depends on the variability encountered in training. In terms of scale, motivated in part by the human variable resolution in image sampling and representation, our model analyzes the image at multiple scales, from which candidate MIRC regions are derived. Regarding occlusion, the minimality of MIRCs implies that both humans and the model will be affected by occlusion, since occlusion can turn a MIRCs into its sub-minimal version. However, large object occlusions will be tolerated, as long as at least one of the multiple MIRCs remains visible. Tolerance to deformations arises in the model from two sources: First, MIRCs are often limited to local object regions, which are inherently more tolerant to deformations than larger regions. Second, as mentioned above (Sec. 5.4), the use of abstract properties and relations also contributes to tolerance in the face of image deformations. 6.4 Before and after MIRCs interpretation The focus in this work is on modeling the full interpretation of minimal images on their own, but it is also of interest to consider briefly the broader process within which this interpretation process takes places. As described above, our model suggests a view in which the recognition and interpretation of a larger, natural image, with multiple objects and clutter, is a process that includes a bottom-up and a top-down stage. In our computational simulations, the bottom-up stage was modeled using existing CNN models (e.g. VGG-19, adjusted and fine-tuned to detect minimal images, same as in Sec. 4.3). These models can reliably recognize multiple objects, and locate them e.g., by bounding boxes (Girshick et al., 2014), or by approximate segmentation (Long et al., 2015). Such bottom-up models are not necessarily sufficient models for the early stages of human vision, but they produce adequate responses for our top-down interpretation stage. When 38 trained for MIRCs recognition, their accuracy is well below human performance (Ullman et al., 2016), but they were sufficient in our simulations to identify candidate regions in the image for a given MIRC (Fig. 14, A-E). When dealing with a full image, a natural next step will involve the selection of a MIRC (or a cluster of related MIRCs) by some attentional process, and applying the top-down stage. The outcome of the second stage is a detailed interpretation, combined with a validation of the proposed category for the region. To test such validation in the horse-head minimal image example, we have compared the precision-recall curve between a CNN-based classifier (very-deep CNN, Simonyan & Zisserman, 2015, trained as in Sec. 4.3) and the interpretation model. Our test set included 120 positive horse-head minimal image examples, and 200,000 negative examples taken from non-class images, similar to the ones used for training. We measured validation by the capability of interpretation to disambiguate ‘hard negative’ examples, namely non-class examples that received high CNN classification score. Since all positive examples were recognized by human observers, we set a threshold for CNN score that allows 100% recall, and applied interpretation to all positive and negative examples that passed this filter (all 120 positive example, 1943 negative examples). We compared the average precision for the CNN and interpretation on this set. The results shown in Fig. 14F, indicate improvement in MIRC recognition when applying Interpretation as a second stage following CNN detection. Starting the process at the MIRCs level is potentially useful, because they are reliably recognizable on their own, and do not depend on additional supporting context to be recognized and interpreted (see Fig. 14E, as demonstration for a MIRC computation in a cluttered scene). We suggest that subsequent stages include at least two main components (and probably additional ones): integration and expansion. The integration process combines different MIRCs, in particular MIRCs that belong to the same object. An object in the image will typically be covered by multiple MIRCs at different locations and scales (Ullman et al., 2016), and their integration can lead to a robust and detailed representation of the object. This initial recognition and interpretation can next ‘spread out’ to surrounding regions, which are less recognizable on their own, but can use the 39 context of the already-recognized MIRCs for disambiguation. The expansion can add information about the scene, for example, about the interaction of the object depicted in the MIRC with another object nearby (as e.g., in Fig. 14G-I). 6.5 Interpretation by network models Recognition models based on deep convolutional networks have shown to produce high-accuracy results in object classification and promising results in related tasks, such as segmentation (e.g., Long et al., 2015). The current model combines network algorithms with other methods to extract complex relations and identify the final structure. Similar combinations have been used recently by other models that extract complex structures (e.g. human pose, Chen & Yuille, 2014, combining CNN with a subsequent conditional random field stage; Lake et al., 2015, in the domain of written Figure 14. Before and after Interpretation: Detection, Interpretation, Validation, and Expansion. (A). Detection: A bottom-up detector (here based on RCNN, details in Sec. 4.3) finds candidates for the horse-head MIRCs. Pink boxes denote top 3 candidates (B-D). Interpretation and Validation: The interpretation applied to each candidate, to produce an interpretation combined with a confidence score. (E). Another example of a horse-head MIRC detection and interpretation in a cluttered scene. (F). Precision- recall curves for CNN19 alone (magenta, average precision (ap) = 0.78), and with interpretation confidence score (blue, ap=0.87). Classification scores in (F) computed for the positive and negative horse-head minimal image examples in Sec. 5.1. (G-I). Expansion: The initial interpretation can next ‘expand’ to surrounding regions. In image region (G), slightly extended from the horse-head MIRC region, the expanded interpretation is to a region containing a portion of the human hand, which together with the horse-head is sufficient to recognize the human-horse interaction. In this image region humans can recognize the hand (I) and ‘feeding’ interaction (G), but the hand on its own is not recognized (H). A 0.28 0.05 0.92 0.56 0.45 -0.10 INTERP score: CNN19 score: B C D E F G Hand/Arm = 0.91 Hand/Arm: 0.2 Horse feeding: 0.66 Human recognition rate: H I 40 characters). We found that existing feed-forward network models have limited accuracy when applied to the interpretation of minimal images. Our evaluation trained a recent semantic segmentation network (Long et al., 2015) to identify interpretation components of minimal images. The accuracy of the resulting interpretation was closer to the ‘basic’ version of our interpretation model, compared with the full version of our model, which uses the extended set of relations (Sec. 5.1). It is plausible, however, that extended network models, such as models using recurrence and memory, will cope more successfully with local interpretation. It will be of interest to develop such models in future work, and compare network structures that prove successful for local interpretation, as perceived by humans, and compare models with aspects of cortical circuitry in the visual system, e.g. in terms of using recurrent and feedback connectivity. Acknowledgements: We thank Daniel Harari for sharing psychophysics data and for help with data collection, and anonymous reviewers for helpful comments and references. This work was supported by ERC Advanced Grant “Digital Baby”, the EU’s Horizon 2020 research and innovation program under grant agreement No. 720270, Israeli Science Foundation grant 320/16, and the Center for Brains, Minds and Machines, funded by NSF Science and Technology Centers Award CCF-1231216. Appendix A. Psychophysics experimental methods A.1. Labeling all semantic components in a minimal image: This experiment was used for identifying semantic elements, which humans can consistently identify in minimal images. Subjects (n=30) were presented with a minimal image in which a red arrow pointed to a location in the image (e.g., the horse eye, or the center of the mouth region), and were asked to name the indicated location. Similarly, a contour was marked in red on the image, and subjects produced two labels for the two sides of the contours (e.g., tie and shirt). In both cases subjects were asked to also name the object they saw in the image (without the markings). To map the scope of ‘full’ human-level interpretation, we put the red arrows and contours at multiple image locations, and tested their consistent labeling. We considered a recognized component if 41 more than 50% of human tags were consistent. Presentation time was unlimited, and the subjects responded by typing the labels. All experiments and procedures were approved by the institutional review boards of the Weizmann Institute of Science, Rehovot, Israel. All participants gave informed consent before starting the experiments. A.2. Annotating point, contour, and region components in minimal image examples: Subjects (N=2) were presented with examples of the semantic components found for a given minimal image by the experiment in Appendix A.1 (annotated by points, contours, and regions, as in Fig. 4B), and were asked to produce similar annotations in novel examples. Annotators were given partially overlapping sets of examples from each class, which together covered the complete training and testing sets. At least 50 examples from each class were annotated by two different subjects, and were used to test consistency in human annotations (see Table 2). The annotated images served as the ‘ground truth’ in evaluating the performance of the interpretation model (Sec. 5.1, and Table 2). Appendix B. The learning model and procedure B.1. A structured learning model based on random forest The problem of local interpretation can be viewed as an instance of so-called ‘structured learning’ (e.g. Shalev-Shwarts & Ben-David, 2014). As described in Sec. 3.2, given a structure 𝑆" consisting of a set of primitives 𝑃" , and a vector 𝑅" of relations between them, we wish to learn an interpretation function 𝑓6 that finds the structure 𝑆" (denoted 𝑆 below for simplicity) in an image I 𝑓6 𝐼 = 𝜋 where I is the object image, and 𝜋 is not just a class label, but a full assignment, which is in our case a mapping between components in the structure 𝑆 and points, contours, and regions in the image I. We refer to 𝜋 as an ‘assignment’, since it assigns to any primitive in the model 𝑆, a counterpart in the image, identified by 𝜋F. 𝜋 is then a vector 𝜋 =[𝜋@, 𝜋H, … , 𝜋J], where N is the number of primitives in the model 𝑆. For example, if the minimal image is the horse head, and the primitives set in 𝑆 includes, among others, the horse eye (primitive index = 1, type = point), and the horse mane contour (primitive 42 index = 5, type = contour), then, 𝜋@ is a point in I assigned to the horse’s eye, and 𝜋L is a contour in I assigned to the horse’s mane. It is common to express the function 𝑓6 using a (learnable) scoring function 𝑔 𝐼, 𝜋; 𝑤 , which measures the compatibility between the model structure 𝑆, and the corresponding structure identified in the image. The additional variables 𝑤 are parameters of the interpretation function, described below. 𝑓6(𝐼) then takes the form: 1) 𝑓M 𝐼; 𝑤 = 𝑎𝑟𝑔maxS { 𝑔 𝐼, 𝜋; 𝑤 }, namely, given an image I (with parameters w already fixed), find the assignment 𝜋 into I that has the highest compatibility with the model structure 𝑆. The goal of the function 𝑓6 is then to find the configuration of elements within the image I, which is as compatible as possible with the model structure 𝑆. The function 𝑔 in our interpretation measures the compatibility between properties and relations specified by the structure 𝑆 of the model, and the same properties and relations computed for the corresponding image elements, identified by the assignment 𝜋. This compatibility is computed as follows. Given an assignment 𝜋 of the model primitives to the image I, we denote the results of measuring all the model relations in the specific image I by the vector 𝜙6(𝜋, 𝐼). Following the example in Sec. 3.2, position 3 in the vector 𝜙6(𝜋, 𝐼) could be ‘true’ (or 1), indicating that primitive 5 is contained in primitive 7, and position 4 could be 0.9 indicating the degree of straightness for primitive 2. The relations vector 𝜙6 𝐼, 𝜋 is then used to measure the compatibility of the image structure with the model structure. This is obtained in our model by a random forest algorithm (Amit & Geman, 1996; Breiman, 2001), which is learned from training examples. A random forest is a non-linear model composed of a set of classification trees: 𝑡@, 𝑡H, … , 𝑡W, … , 43 where 𝑡W is the j-th tree in a forest. The parameters w in this model (in the definition of 𝑓6 and 𝑔) are the queries in the tree nodes, and a standard learning procedure for random forests (Breiman, 2001) is used to set these parameters based on training examples. Each tree is applied to the relations vector 𝜙6 𝐼, 𝜋 to produce a decision whether the given assignment, represented by 𝜋, is consistent with a class structure or not (i.e., the relations vector 𝜙6(𝐼, 𝜋) was classified as 1 or 0). Finally, the function 𝑔 returns the average of all tree votes: 2) 𝑔 𝐼, 𝜋, 𝑤 = @X 𝑡W(XWY@ 𝜙6 𝐼, 𝜋 ), where 𝐾 is the number of trees in the forest. The assignment we seek is the one that maximizes the value of this expression, and the value of g for this assignment is the corresponding ‘interpretation score’. An effective optimization search is described in Appendix B.2 below. The random forest algorithm also provides a method for evaluating the individual contribution of each of the relations in the model to the learning process. This is obtained by removing a single relation in 𝜙6 𝐼, 𝜋 in all vectors in our data, and measuring the interpretation correctness (score) by the random forest with and without this relation. (Referred to as the ‘Out of bag estimate’ for strength of random forest features, Brieman, 2001). We used this method in Sec. 4 to derive a set of relations, which are useful for the interpretation process. ‘Informative’ relations in Sec. 4 are measured by the difference in the performance of the model (the interpretation score) with and without the relation in question. B.2. Detecting primitive candidates and an effective optimization search We describe below how we implemented the calculation of 𝑓6 (Eq. 2), namely, derive the best assignment 𝜋 for a given image I. Our implementation includes two stages: (i) finding k (k = 10) candidates for each primitive in 𝑆, and (ii) seeking the candidate combination that forms the best assignment. In more details, the two stages are i. Primitive candidates: For primitives of type ’point’ and ’region’ we find candidates in a bottom-up manner: for ’point’, we consider all pixels in the minimal image, 44 and for ’region’ we take all image windows of the region size in a ‘sliding window’ search. For type ’contour’ we find the candidates in a top-down manner, as follows: We project ground truth annotated contours on an edge map (Arbelaez et al., 2011), to get edge contour fragments similar in their location and shape to the ground truth ones. We then used connected pairs of fragments (by the Kovesi edge linking toolbox, 2000) as candidates for the contour primitive. We rank all candidates of point, contour, and region types by their unary relations in 𝑅" , and keep the top k for each primitive. Unary relations used for ranking include visual appearance of regions and contours (relations 4 and 5 in Table 1), and intensity minima/maxima of points (relation 2 in Table 1). ii. Finding the best assignment: Given an image I, a trained model 𝑤, and a set of candidates for each primitive in 𝑃" , we run over different configurations of candidates in a coordinate descent manner (Bertsekas, 1999). We start with a random configuration, and then optimize successively one candidate at a time. Specifically, the procedure is: 1) Start with a random configuration of primitive candidates 𝜋 =[𝜋@, 𝜋H, … , 𝜋F, … , 𝜋J]. 2) Repeat until g converges: For each primitive i, go over all candidates 𝜋F[ and update: • 𝜋[ = [𝜋@, 𝜋H, … , 𝜋F[, … , 𝜋J] • 𝜋 ← 𝑎𝑟𝑔𝑚𝑎𝑥{𝑔 𝐼, 𝜋, 𝑤 , 𝑔 𝐼, 𝜋[, 𝑤 } 3) Return 𝜋. Such a procedure is guaranteed to converge to a local optimum (Bertsekas, 1999; a similar optimization search was used for Hopfield networks, Hopfield, 1982). Experimentally, because the search space in minimal images is limited due to small number of primitives, 3 initiations of the procedure were usually sufficient to get good convergence. The final assignment, together with contour grouping and bridging (Appendix C below), identify all the image components which correspond to the MIRC model. 45 Appendix C. Details of computing relation Table 1 in Sec. 4.4 contains the extended set of relations used in our models. In this appendix, we add technical details about the computational procedures for computing the different relations. For all procedures described here, x,y represent the coordinates of the image plane. All procedures were implemented in MATLAB, code is available from the authors. Containment: Given a pixel point 𝑥, 𝑦 and a set of pixels comprising a region 𝑅, we return true if the point is in the region, i.e., 𝑥, 𝑦 ∈ 𝑅. 𝑅 can be either a single region primitive, or a region bounded by two (or more) contour primitives. Contour ends in a region: Given an end point pixel 𝑥@", 𝑦@" of a contour 𝐶, and a set of region pixels 𝑅, we return ‘true’ if the end point is in the region, i.e., 𝑥@", 𝑦@" ∈ 𝑅. Parallelism: Given two contours, 𝐶a and 𝐶b, we compute a binary mask 𝑀: 𝑀 𝑥, 𝑦 = 1 𝑖𝑓 𝑥, 𝑦 ∈ 𝐶a 𝑜𝑟 𝑥, 𝑦 ∈ 𝐶b 𝑀 𝑥, 𝑦 = 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒. We then compute the distance transform map (Maurer et al., 2003) for 𝑀, denoted 𝐷𝑇{𝑀}, followed by a non-maxima suppression to get the ridges R of DT{M}. The ridges R is the set of pixels that are at equal distance from both contours. The two contours are considered parallel if the variance of R is close to zero. We exclude cases where the size of R is small. We thus return ‘true’ if 𝑉𝑎𝑟 𝑅 < 𝜀, where 𝜀 is a threshold close to zero (we chose empirically 𝜀 = 0.2). Continuity of contours: Given a contour 𝐶a with one of its endings: [𝑥@"m, 𝑦@"m], and a contour 𝐶b with one of its ending: [𝑥@"n, 𝑦@"n], we estimate the local orientations at the endings, namely 𝜃@"m and 𝜃@"n, and use them to compute the completion path between 𝑥@"m, 𝑦@"m, 𝜃@"m and 𝑥@"n, 𝑦@"n, 𝜃@"n (Ben-Yosef & Ben-Shahar, 2012). We consider ‘good continuation’ between the two contours if the completed path does not contain inflection points. We return ‘true’ if the number of inflection points in the path equals to zero. 46 Bridging contours: Given a contour 𝐶a with one of its endings [𝑥@"m, 𝑦@"m], a contour 𝐶b with one of its endings [𝑥@"n, 𝑦@"n], and the image 𝐼 from which the two contours are extracted, we test for an image contour connecting them. We compute the UCM map (an edge map, Arbelaez et al., 2011) for 𝐼 and define a graph 𝐺 =< 𝑉, 𝐸 >, where 𝑉 is the set all pixels in the UCM map, namely 𝑣F ∈ 𝑉 ∶ 𝑈𝐶𝑀(𝑣F) > 𝜏, 𝜏 is a UCM threshold (𝜏=0.1), and 𝐸 is a set of weighted edges. An edge 𝑒 ∈ 𝐸 is put for each pair of pixels in 𝑉 that are immediate image neighbors. The weight of an edge 𝑒 =𝑣F, 𝑣W is defined as the difference in UCM levels between pixels: 𝑤 𝑒 = 𝑈𝐶𝑀(𝑣W) − 𝑈𝐶𝑀(𝑣F) (The graph 𝐺 is computed in a pre-process stage.) We return the shortest weighted path in 𝐺 (if exists) between [𝑥@"m, 𝑦@"m] and 𝑥@"n, 𝑦@"n . The bridging procedure was also extended in two versions: (i) finding a path in G that is the most consistent with the ways contours 𝐶a and 𝐶b are connected in positive train images, and (ii) finding a path in G that is constrained to pass through region primitive. Visual appearance inside regions or along contours: Given a candidate image region 𝑅a for a primitive 𝑅 in the model, we match the distribution of the visual appearance features in 𝑅a and in the training examples of 𝑅. Visual appearance features were ‘visual words’ features (Arandjelovic & Zisserman, 2013), and deep neural network features (top layer of a fully convolutional network, Long et al., 2015). For a contour candidate, we used a similar match of visual appearance features, this time along a thin region surrounding the contour. The visual appearance relations were used in our scheme as follows: suppose that human interpretation psychophysics suggests a region element (e.g., the tie knot region in Fig. 4B) as one of the primitives in the interpretation model. We then produce descriptors for both the visual words and deep CNN features, which serve as potential unary relations for this element. These descriptions have been used successfully in past models, and we found empirically that both can be useful for the 47 current task. We then evaluate which of these unary relations is more informative for interpretation and use it as a part of the MIRC’s model. Coherent visual appearance: Given two candidate image regions 𝑅a and 𝑅b, we match the distribution of the visual appearance features in these two regions. Visual appearance features were ‘visual words’ features (Arandjelovic & Zisserman, 2013), and deep neural network features (Long et al., 2015). 𝑅a or 𝑅b could be either a single region primitive, or a region bounded by two (or more) contour primitives. Cover of a point by a contour: Given a pixel point 𝑥, 𝑦 and a contour C, we project C on the X-axis of the image plane, and return ‘true’ if x is within the range of projection. We composed procedures for different directions of cover, namely for a contour covers a point from top or from bottom. Similar ‘cover’ procedures were also for the Y axis. Appendix D. Evaluating similarity between elements: points, contours, and regions This process was used for evaluating the correctness of the interpretation produced by the model (Sec. 5.1). For two regions, A and B, the standard Jaccard measure ( 𝐴 ∩ 𝐵 / 𝐴 ∪ 𝐵 , Tan et al., 2006) was used. For two points, we construct a small square region around each point (size of 12% of the minimal image), and then evaluate the Jaccard index of these regions. For two contours, we used a simple extension of the Jaccard index to contours, by extending the contours into tube shaped regions (tube width was 4% of the minimal image) and measure the Jaccard index between these regions. References 1. Attneave, F. (1954). Some informational aspects of visual perception. Psychological review, 61(3), 183-193. 2. Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural computation, 9(7), 1545-1588. 3. Azizpour, H., & Laptev, I. (2012). Object detection using strongly-supervised deformable part models. Proceedings of the European Conference on Computer Vision, 836-849. 4. Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2d human pose estimation: New benchmark and state of the art analysis. Proceedings of the IEEE Conference on computer Vision and Pattern Recognition ,3686-3693. 5. Arandjelovic, R., & Zisserman, A. (2013). All about VLAD. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1578-1585. 6. Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5), 898- 916. 48 7. Ben-Yosef, G., & Ben-Shahar, O. (2012). A tangent bundle theory for visual curve completion. IEEE transactions on pattern analysis and machine intelligence, 34(7), 1263-1280. 8. Ben-Yosef, G., Assif, L., Harari, D., & Ullman, S. (2015). A model for full local image interpretation. Proceedings of the annual meeting of the Cognitive Science Society, 220-225. 9. Bertsekas, D. P. (1999). Nonlinear programming (pp. 1-60). Belmont: Athena scientific. 10. Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological review, 94(2), 115-147. 11. Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32. 12. Brooks R. (1983). Model-based 3-D interpretations of 2-D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):140-150. 13. Chen, X., & Yuille, A. L. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. Advances in Neural Information Processing Systems, 1736-1744. 14. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. L. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE transactions on pattern analysis and machine intelligence, PP(99), 1-1. doi: 10.1109/TPAMI.2017.2699184 15. Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. Proceedings of the workshop on statistical learning in computer vision, European Conference on Computer Vision 1(1), 1-2. 16. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 886-893. 17. [dataset] Deng, J., Berg, A., Satheesh, S., Su, H., Khosla, A., & Fei-Fei, L. (2012). Imagenet large scale visual recognition competition 2012 (ILSVRC2012). net.org/challenges/LSVRC/2012/. 18. Denil, M., Bazzani, L., Larochelle, H., & de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural computation, 24(8), 2151-2184. 19. Elder, J. H., Krupnik, A., & Johnston, L. A. (2003). Contour grouping with prior models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6), 661-674. 20. Elder, J., & Zucker, S. (1996). Computing contour closure (1996). European Conference on Computer Vision, 399-412. 21. Epshtein, B., Lifshitz, I., & Ullman, S. (2008). Image interpretation by a single bottom-up top- down cycle. Proceedings of the National Academy of Sciences, 105(38), 14298-14303. 22. [dataset] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (voc) challenge. International journal of computer vision, 88(2), 303- 338. 23. Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4), 594-611. 24. Feldman, J. (2007). Formation of visual “objects” in the early computation of spatial relations. Perception & Psychophysics, 69(5), 816-827. 25. Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International journal of computer vision, 61(1), 55-79. 26. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9), 1627-1645. 27. Ferrari, V., Jurie, F., & Schmid, C. (2010). From images to shape models for object detection. International journal of computer vision, 87(3), 284-303. 28. Fidler, S., & Leonardis, A. (2007). Towards scalable representations of object categories: Learning a hierarchy of parts. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-8. 29. Foster, D. H., Simmons, D. R., & Cook, M. J. (1993). The cue for contour-curvature discrimination. Vision research, 33(3), 329-341. 30. Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local “association field”. Vision research, 33(2), 173-193. 31. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 580-587. 49 32. Girshick, R., Iandola, F., Darrell, T., & Malik, J. (2015). Deformable part models are convolutional neural networks. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 437-446. 33. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. 34. Hanson, A. & Riseman, E. (1978). Visions: A computer vision system for interpreting scenes. In A. Hanson and E. Riseman, editors, Computer Vision Systems, pages 303-334. Academic Press, New York, NY. 35. Hinton, G. E. (2007). Learning multiple layers of representation. Trends in cognitive sciences, 11(10), 428-434. 36. Jacobs, D. W. (1996). Robust and efficient detection of salient convex groups. IEEE transactions on pattern analysis and machine intelligence, 18(1), 23-37. 37. Jin, Y., & Geman, S. (2006). Context and hierarchy in a probabilistic image model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2145-2152. 38. Joachims, T., Hofmann, T., Yue, Y., & Yu, C. N. (2009). Predicting structured objects with support vector machines. Communications of the ACM, 52(11), 97-104. 39. Jolicoeur, P., Ullman, S., & Mackay, M. (1986). Curve tracing: A possible basic operation in the perception of spatial relations. Memory & Cognition, 14(2), 129-140. 40. Kanizsa, G. (1979). Organization in vision: Essays on Gestalt perception. Praeger Publishers. 41. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 1097-1105. 42. Kovesi, P. D. (2000). MATLAB and Octave functions for computer vision and image processing. Online: http://www. csse. uwa. edu. au/~ pk/Research/MatlabFns/# match. 43. Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338. 44. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. 45. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the International Conference on Machine Learning, 282-289. 46. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 3431-3440. 47. [dataset] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision, 740-755. 48. Machilsen, B., Pauwels, M., & Wagemans, J. (2009). The role of vertical mirror symmetry in visual shape detection. Journal of Vision, 9(12), 11-11. 49. Maurer, C. R., Qi, R., & Raghavan, V. (2003). A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2), 265-270. 50. Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London B: Biological Sciences, 200(1140), 269-294. 51. Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. Advances in neural information processing systems, 2204-2212. 52. Murphy, T. M., & Finkel, L. H. (2007). Shape representation by a network of V4-like cells. Neural Networks, 20(8), 851-867. 53. Ommer, B., & Buhmann, J. M. (2007, June). Learning the compositional nature of visual objects. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 48. 54. Opelt, A., Pinz, A., & Zisserman, A. (2006). Incremental learning of object detectors using a visual shape alphabet. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3-10. 55. Palmer, S., & Rock, I. (1994). Rethinking perceptual organization: The role of uniform connectedness. Psychonomic bulletin & review, 1(1), 29-55. 56. Pasupathy, A., & Connor, C. E. (1999). Responses to contour features in macaque area V4. Journal of Neurophysiology, 82(5), 2490-2502. 57. Pearl, J. (2009). Causality. Cambridge university press. 50 58. Parent, P., & Zucker, S. W. (1989). Trace inference, curvature consistency, and curve detection. IEEE Transactions on pattern analysis and machine intelligence, 11(8), 823-839. 59. Rodríguez-Sánchez, A. J., & Tsotsos, J. K. (2011). The importance of intermediate representations for the modeling of 2d shape detection: Endstopping and curvature tuned computations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4321-4326. 60. Roelfsema, P. R., Lamme, V. A., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395(6700), 376-381. 61. Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11), 1019-1025. 62. Saarela, T. P., Sayim, B., Westheimer, G., & Herzog, M. H. (2009). Global stimulus configuration modulates crowding. Journal of Vision, 9(2), 1-11. 63. Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15), 6424-6429. 64. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press. 65. Siddiqi, K., Shokoufandeh, A., Dickinson, S. J., & Zucker, S. W. (1999). Shock Graphs and Shape Matching. International Journal of Computer Vision, 35(1), 13-32. 66. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. 67. Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining (Vol. 1). Boston: Pearson Addison Wesley. 68. Stahl, J. S., & Wang, S. (2008). Globally optimal grouping for symmetric closed boundaries by combining boundary and region information. IEEE transactions on pattern analysis and machine intelligence, 30(3), 395-411. 69. Todorovic, S., & Ahuja, N. (2006). Extracting subimages of an unknown category from a set of images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 927- 934. 70. Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. Advances in neural information processing systems,1799-1807. 71. Torralba, A. (2009). How many pixels make an image?. Visual neuroscience, 26(01), 123-131. 72. Ullman, S., Assif, L., Fetaya, E., & Harari, D. (2016). Atoms of recognition in human and computer vision. Proceedings of the National Academy of Sciences, 113(10), 2744-2749. 73. Ullman, S. (1984). Visual routines. Cognition, 18(1-3), 97-159. 74. Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619-8624. 75. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/ (2008) 76. Vedaldi, A., Mahendran, S., Tsogkas, S., Maji, S., Girshick, R., Kannala, J., ... & Taskar, B. (2014). Understanding objects in detail with fine-grained attributes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3622-3629. 77. Walther, D. B., Chai, B., Caddigan, E., Beck, D. M., & Fei-Fei, L. (2011). Simple line drawings suffice for functional MRI decoding of natural scene categories. Proceedings of the National Academy of Sciences, 108(23), 9661-9666. 78. Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt. II. Psychological Research, 4(1), 301-350. 79. Westheimer, G., Crist, R. E., Gorski, L., & Gilbert, C. D. (2001). Configuration specificity in bisection acuity. Vision research, 41(9), 1133-1138. 80. Wei, S. E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional Pose Machines. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4724-4732. 81. Xia, F., Wang, P., Chen, L. C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. European Conference on Computer Vision, 648-663. 82. Xiao, S., Feng, J., Xing, J., Lai, H., Yan, S., & Kassim, A. (2016). Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks. European Conference on Computer Vision, 57-72. 51 83. Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., & Fei-Fei, L. (2011, November). Human action recognition by learning bases of action attributes and parts. Proceedings of the IEEE International Conference on Computer Vision, 1331-1338. 84. Yang, S., Luo, P., Loy, C. C., & Tang, X. (2015). From facial parts responses to face detection: A deep learning approach. Proceedings of the IEEE International Conference on Computer Vision, 3676-3684. 85. Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 114-128. 86. Zhu, S. C., & Mumford, D. (2007). A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision, 2(4), 259-362.