CBMM Memo Series

pAI/MSc: ML Theory Research with Humans on the Loop

2026-03-25T00:00:00Z

pAI/MSc: ML Theory Research with Humans on the Loop Abdelmoneum, Mahmoud; Beneventano, Pierfrancesco; Poggio, Tomaso We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

Multiplicative Regularization Generalizes Better Than Additive Regularization

2025-07-02T00:00:00Z

Multiplicative Regularization Generalizes Better Than Additive Regularization Dubach, Rafael; Abdallah, Mohamed S.; Poggio, Tomaso We investigate the effectiveness of multiplicative versus additive (L2) regularization in deep neural networks, focusing on convolutional neural networks for classification. While additive methods constrain the sum of squared weights, multiplicative regularization directly penalizes the product of layerwise Frobenius norms, a quantity theoretically linked to tighter Rademacher-based generalization bounds. Through experiments on binary classification tasks in a controlled setup, we observe that multiplicative regularization consistently yields wider margin distributions, stronger rank suppression in deeper layers, and improved robustness to label noise. Under 20% label corruption, multiplicative regularization preserves margins that are 5.2% higher and achieves 3.59% higher accuracy compared to additive regularization in our main network architecture. Furthermore, multiplicative regularization achieves a 3.53% boost in test performance for multiclass classification compared to additive regularization. Our analysis of training dynamics shows that directly constraining the global product of norms leads to flatter loss landscapes that correlate with greater resilience to overfitting. These findings highlight the practical benefits of multiplicative penalties for improving generalization and stability in deep models.
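The two penalties contrasted in this abstract can be illustrated with a minimal sketch. It assumes each layer is a plain weight matrix stored as a list of rows; the function names and the toy weights are hypothetical and are not taken from the memo, which may normalize or weight the terms differently.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in matrix for w in row))

def additive_penalty(layers):
    """Additive (L2) penalty: sum of squared weights over all layers."""
    return sum(frobenius_norm(W) ** 2 for W in layers)

def multiplicative_penalty(layers):
    """Multiplicative penalty: product of the layerwise Frobenius norms."""
    return math.prod(frobenius_norm(W) for W in layers)

# Toy two-layer network (illustrative weights only).
layers = [
    [[3.0, 4.0]],              # Frobenius norm 5
    [[1.0, 2.0], [2.0, 0.0]],  # Frobenius norm 3
]

# Rescale the first layer by 2 and the second by 1/2. For a ReLU network this
# leaves the computed function unchanged, so it is a useful sanity check.
rescaled = [
    [[2.0 * w for w in row] for row in layers[0]],
    [[0.5 * w for w in row] for row in layers[1]],
]

add_before, add_after = additive_penalty(layers), additive_penalty(rescaled)
mul_before, mul_after = multiplicative_penalty(layers), multiplicative_penalty(rescaled)
```

In this toy case the product of norms is the same before and after rescaling, while the sum of squared weights changes. That invariance is one reason a product-of-norms quantity is a natural capacity measure for positively homogeneous networks.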

Position: A Theory of Deep Learning Must Include Compositional Sparsity

2025-07-02T00:00:00Z

Position: A Theory of Deep Learning Must Include Compositional Sparsity Danhofer, David A.; D'Ascenso, Davide; Dubach, Rafael; Poggio, Tomaso Overparametrized Deep Neural Networks (DNNs) have demonstrated remarkable success in a wide variety of domains too high-dimensional for classical shallow networks subject to the curse of dimensionality. However, open questions remain about the fundamental principles that govern the learning dynamics of DNNs. In this position paper we argue that what drives their success is the ability of DNNs to exploit the compositionally sparse structure of the target function. As such, DNNs can leverage the property that most practically relevant functions can be composed from a small set of constituent functions, each of which relies only on a low-dimensional subset of all inputs. We show that this property is shared by all efficiently Turing-computable functions and is therefore highly likely present in all current learning problems. While some promising theoretical insights on questions concerned with approximation and generalization exist in the setting of compositionally sparse functions, several important questions on the learnability and optimization of DNNs remain. Completing the picture of the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial, and even general, intelligence.

On efficiently computable functions, deep networks and sparse compositionality

2025-02-01T00:00:00Z

On efficiently computable functions, deep networks and sparse compositionality Poggio, Tomaso In previous papers [4, 6] we have claimed that for each function which is efficiently Turing computable there exists a deep and sparse network which approximates it arbitrarily well. We also claimed a key role for compositional sparsity in this result. Though the general claims are correct some of our statements may have been imprecise and thus potentially misleading. In this short paper we wish to formally restate our claims and provide definitions and proofs.

Self-Assembly of a Biologically Plausible Learning Circuit

2024-12-28T00:00:00Z

Self-Assembly of a Biologically Plausible Learning Circuit Liao, Qianli; Ziyin, Liu; Gan, Yulu; Cheung, Brian; Harnett, Mark; Poggio, Tomaso Over the last four decades, the amazing success of deep learning has been driven by the use of Stochastic Gradient Descent (SGD) as the main optimization technique. The default implementation for the computation of the gradient for SGD is backpropagation, which, with its variations, is used to this day in almost all computer implementations. From the perspective of neuroscientists, however, the consensus is that backpropagation is unlikely to be used by the brain. Though several alternatives have been discussed, none is so far supported by experimental evidence. Here we propose a circuit for updating the weights in a network that is biologically plausible, works as well as backpropagation, and leads to verifiable predictions about the anatomy and the physiology of a characteristic motif of four plastic synapses between ascending and descending cortical streams. A key prediction of our proposal is a surprising property of self-assembly of the basic circuit, emerging from initial random connectivity and heterosynaptic plasticity rules.

On Generalization Bounds for Neural Networks with Low Rank Layers

2024-10-11T00:00:00Z

On Generalization Bounds for Neural Networks with Low Rank Layers Pinto, Andrea; Rangamani, Akshay; Poggio, Tomaso While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain under-explored. In this paper, we apply a chain rule for Gaussian complexity (Maurer, 2016a) to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank and spectral norm constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep nets exhibiting neural collapse.

Formation of Representations in Neural Networks

2024-10-07T00:00:00Z

Formation of Representations in Neural Networks Ziyin, Liu; Chuang, Isaac; Galanti, Tomer; Poggio, Tomaso Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations to universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that the breaking of CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory demonstrating that the balance between gradient noise and regularization is crucial for the emergence of the canonical representation. The CRH and PAH lead to an exciting possibility of unifying key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.

On the Power of Decision Trees in Auto-Regressive Language Modeling

2024-09-27T00:00:00Z

On the Power of Decision Trees in Auto-Regressive Language Modeling Gan, Yulu; Galanti, Tomer; Poggio, Tomaso; Malach, Eran Originally proposed for handling time series data, Auto-regressive Decision Trees (ARDTs) have not yet been explored for language modeling. This paper delves into both the theoretical and practical applications of ARDTs in this new context. We theoretically demonstrate that ARDTs can compute complex functions, such as simulating automata, Turing machines, and sparse circuits, by leveraging "chain-of-thought" computations. Our analysis provides bounds on the size, depth, and computational efficiency of ARDTs, highlighting their surprising computational power. Empirically, we train ARDTs on simple language generation tasks, showing that they can learn to generate coherent and grammatically correct text on par with a smaller Transformer model. Additionally, we show that ARDTs can be used on top of transformer representations to solve complex reasoning tasks. This research reveals the unique computational abilities of ARDTs, aiming to broaden the architectural diversity in language model development.

For HyperBFs AGOP is a greedy approximation to gradient descent

2024-07-13T00:00:00Z

For HyperBFs AGOP is a greedy approximation to gradient descent Gan, Yulu; Poggio, Tomaso The Average Gradient Outer Product (AGOP) provides a novel approach to feature learning in neural networks. We applied both AGOP and Gradient Descent to learn the matrix M in the Hyper Basis Function Network (HyperBF) and observed very similar performance. We show formally that AGOP is a greedy approximation of gradient descent.
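For readers unfamiliar with the quantity, the AGOP of a scalar predictor f over samples x_1, ..., x_n is the average of the outer products of the gradients of f at each sample. The sketch below computes it with central finite differences on a toy function. The helper names, the toy function and the samples are assumptions made for illustration; this is not the HyperBF setup or the training procedure studied in the memo.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

def agop(f, samples):
    """Average Gradient Outer Product: mean over samples of grad f(x) grad f(x)^T."""
    d = len(samples[0])
    M = [[0.0] * d for _ in range(d)]
    for x in samples:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                M[i][j] += g[i] * g[j] / len(samples)
    return M

def toy_predictor(x):
    """Toy predictor that depends only on the first input coordinate."""
    return x[0] ** 2

samples = [[1.0, 5.0], [2.0, -3.0]]
M = agop(toy_predictor, samples)
```

Because the toy predictor ignores the second coordinate, the resulting matrix has mass only in its top-left entry, which is the sense in which AGOP highlights the input directions a predictor actually uses.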

Compositional Sparsity of Learnable Functions

2024-02-08T00:00:00Z

Compositional Sparsity of Learnable Functions Poggio, Tomaso; Fraser, Maia Neural networks have demonstrated impressive success in various domains, raising the question of what fundamental principles underlie the effectiveness of the best AI systems and quite possibly of human intelligence. This perspective argues that compositional sparsity, or the property that a compositional function has "few" constituent functions, each depending on only a small subset of inputs, is a key principle underlying successful learning architectures. Surprisingly, all functions that are efficiently Turing computable have a compositional sparse representation. Furthermore, deep networks that are also sparse can exploit this general property to avoid the "curse of dimensionality". This framework suggests interesting implications about the role that machine learning may play in mathematics.

The Janus effects of SGD vs GD: high noise and low rank

2023-12-21T00:00:00Z

The Janus effects of SGD vs GD: high noise and low rank Xu, Mengjia; Galanti, Tomer; Rangamani, Akshay; Rosasco, Lorenzo; Poggio, Tomaso It was always obvious that SGD has higher fluctuations at convergence than GD. It has also been often reported that SGD in deep RELU networks has a low-rank bias in the weight matrices. A recent theoretical analysis linked SGD noise with the low-rank bias induced by the SGD updates associated with small minibatch sizes [1]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep RELU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the components of the matrix W corresponding to the null space of the data matrix X converge to zero for both SGD and GD, provided the regularization term is non-zero (in the case of square loss; for exponential loss the result holds independently of regularization). The convergence rate, however, is exponential for SGD, and linear for GD. Thus SGD has a much stronger bias than GD towards solutions for weight matrices W with high fluctuations and low rank, provided the initialization is from a random matrix (but not if W is initialized as a zero matrix). Thus SGD under exponential loss, or under the square loss with non-zero regularization, shows the coupled phenomenon of low rank and asymptotic noise.

A Homogeneous Transformer Architecture

2023-09-18T00:00:00Z

A Homogeneous Transformer Architecture Gan, Yulu; Poggio, Tomaso While the Transformer architecture has made a substantial impact in the field of machine learning, it is unclear what purpose each component serves in the overall architecture. Heterogeneous nonlinear circuits such as multi-layer RELU networks are interleaved with layers of soft-max units. We introduce here a homogeneous architecture based on Hyper Radial Basis Function (HyperBF) units. Evaluations on CIFAR10, CIFAR100, and Tiny ImageNet demonstrate a performance comparable to standard vision transformers.

Skip Connections Increase the Capacity of Associative Memories in Variable Binding Mechanisms

2023-06-27T00:00:00Z

Skip Connections Increase the Capacity of Associative Memories in Variable Binding Mechanisms Xie, Yi; Li, Yichen; Rangamani, Akshay The flexibility of intelligent behavior is fundamentally attributed to the ability to separate and assign structural information from content in sensory inputs. Variable binding is the atomic computation that underlies this ability. In this work, we investigate the implementation of variable binding via pointers of assemblies of neurons, which are sets of excitatory neurons that fire together. The Assembly Calculus is a framework that describes a set of operations to create and modify assemblies of neurons. We focus on the project (which creates assemblies) and reciprocal-project (which performs variable binding) operations and study the capacity of networks in terms of the number of assemblies that can be reliably created and retrieved. We find that assembly calculus networks implemented through Hebbian plasticity resemble associative memories in their structure and behavior. However, for networks with N neurons per brain area, the capacity of variable binding networks (0.01N) is an order of magnitude lower than the capacity of assembly creation networks (0.22N). To alleviate this drop in capacity, we propose a skip connection between the input and variable assembly, which boosts the capacity to a similar order of magnitude (0.1N) as the Project operation, while maintaining its biological plausibility.

Feature learning in deep classifiers through Intermediate Neural Collapse

2023-02-27T00:00:00Z

Feature learning in deep classifiers through Intermediate Neural Collapse Rangamani, Akshay; Lindegaard, Marius; Galanti, Tomer; Poggio, Tomaso In this paper, we conduct an empirical study of the feature learning process in deep classifiers. Recent research has identified a training phenomenon called Neural Collapse (NC), in which the top-layer feature embeddings of samples from the same class tend to concentrate around their means, and the top layer’s weights align with those features. Our study aims to investigate if these properties extend to intermediate layers. We empirically study the evolution of the covariance and mean of representations across different layers and show that as we move deeper into a trained neural network, the within-class covariance decreases relative to the between-class covariance. Additionally, we find that in the top layers, where the between-class covariance is dominant, the subspace spanned by the class means aligns with the subspace spanned by the most significant singular vector components of the weight matrix in the corresponding layer. Finally, we discuss the relationship between NC and Associative Memories.

SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks

2023-02-14T00:00:00Z

SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks Galanti, Tomer; Siegel, Zachary; Gupte, Aparna; Poggio, Tomaso In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.

Norm-Based Generalization Bounds for Compositionally Sparse Neural Network

2023-02-14T00:00:00Z

Norm-Based Generalization Bounds for Compositionally Sparse Neural Network Galanti, Tomer; Xu, Mengjia; Galanti, Liane; Poggio, Tomaso In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and, empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.

Compositional Sparsity: a framework for ML

2022-10-10T00:00:00Z

Compositional Sparsity: a framework for ML Poggio, Tomaso The main claim of this perspective is that compositional sparsity of the target function, which corresponds to the task to be learned, is the key principle underlying machine learning. I prove that under restrictions of smoothness of the constituent functions, sparsity of the compositional target functions naturally leads to sparse deep networks for approximation, optimization and generalization. This is the case of most CNNs in current use, in which the known sparse graph of the target function is reflected in the sparse connectivity of the network. When the graph of the target function is unknown, I conjecture that transformers are able to implement a flexible version of sparsity (selecting which input tokens interact in the MLP layer), through the self-attention layers. Surprisingly, the assumption of compositional sparsity of the target function is not restrictive in practice, since for computable functions with Lipschitz continuous derivatives compositional sparsity is equivalent to efficient computability, that is computability in polynomial time.

Understanding the Role of Recurrent Connections in Assembly Calculus

2022-07-06T00:00:00Z

Understanding the Role of Recurrent Connections in Assembly Calculus Rangamani, Akshay; Xie, Yi In this note, we explore the role of recurrent connections in Assembly Calculus through a number of experiments conducted on models with and without recurrent connections. We observe that assemblies can be formed even in the absence of recurrent connections, but also find that models with recurrent connections are more robust to noisy inputs. We also investigate the spectral structure of the synaptic weights and find intriguing similarities between models of neural assemblies and associative memories.

System identification of neural systems: If we got it right, would we know?

2022-07-02T00:00:00Z

System identification of neural systems: If we got it right, would we know? Han, Yena; Poggio, Tomaso; Cheung, Brian Various artificial neural networks developed by engineers have been evaluated as models of the brain, such as the ventral stream in the primate visual cortex. After being trained on large datasets, the network outputs are compared to recordings of biological neurons. Good performance in reproducing neural responses is taken as validation for the model. This system identification approach is different from the traditional ways to test theories and associated models in the natural sciences. Furthermore, it lacks a clear foundation in terms of theory and empirical validation. Here we begin characterizing some of these emerging approaches: what do they tell us? To address this question, we benchmark their ability to correctly identify a model by replacing the brain recordings with recordings from a known ground truth model. We evaluate commonly used identification techniques such as neural regression (linear regression on a population of model units) and centered kernel alignment (CKA). Even in the setting where the correct model is among the candidates, we find that the performance of these approaches at system identification is quite variable; it also depends significantly on factors independent of the ground truth architecture, such as scoring function and dataset.
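As a concrete reference for one of the scoring functions named in this abstract, the following is a minimal sketch of centered kernel alignment in its commonly used linear form. It assumes two representation matrices with one row per stimulus and one column per unit, and computes the ratio of the squared Frobenius norm of X^T Y to the product of the Frobenius norms of X^T X and Y^T Y after column centering. The function names and toy data are hypothetical, and the memo may use a different kernel or estimator.

```python
import math

def center_columns(X):
    """Subtract the column mean from every column of X (list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA between two representations of the same stimuli."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator

# Toy responses of two units to four stimuli (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [3.0, 5.0]]
# Same representation with units swapped and responses scaled by 3.
Y = [[0.0, 3.0], [3.0, 0.0], [3.0, 6.0], [15.0, 9.0]]
# An unrelated single-unit representation.
Z = [[1.0], [-1.0], [1.0], [-1.0]]

score_same = linear_cka(X, X)
score_permuted = linear_cka(X, Y)
score_other = linear_cka(X, Z)
```

Linear CKA is invariant to orthogonal transformations and isotropic rescaling of either representation, so the permuted and rescaled copy scores the same as the original. The sketch does not guard against constant columns, which would make the denominator zero.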

PCA as a defense against some adversaries

2022-03-30T00:00:00Z

PCA as a defense against some adversaries Aparne, Gupta; Banburski, Andrzej; Poggio, Tomaso Neural network classifiers are known to be highly vulnerable to adversarial perturbations in their inputs. Under the hypothesis that adversarial examples lie outside of the sub-manifold of natural images, previous work has investigated the impact of principal components in data on adversarial robustness. In this paper we show that there exists a very simple defense mechanism in the case where adversarial images are separable in a previously defined $(k,p)$ metric. This defense is very successful against the popular Carlini-Wagner attack, but less so against some other common attacks like FGSM. It is interesting to note that the defense is still successful for relatively large perturbations.

SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks

2022-03-28T00:00:00Z

SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks Galanti, Tomer; Poggio, Tomaso We analyze deep ReLU neural networks trained with mini-batch stochastic gradient descent and weight decay. We prove that the source of the SGD noise is an implicit low rank constraint across all of the weight matrices within the network. Furthermore, we show, both theoretically and empirically, that when training a neural network using Stochastic Gradient Descent (SGD) with a small batch size, the resulting weight matrices are expected to be of small rank. Our analysis relies on a minimal set of assumptions and the neural networks may include convolutional layers, residual connections, as well as batch normalization layers.

Incorporating Rich Social Interactions Into MDPs

2022-02-07T00:00:00Z

Incorporating Rich Social Interactions Into MDPs Tejwani, Ravi; Kuo, Yen-Ling; Shu, Tianmin; Stankovits, Bennett; Gutfreund, Dan; Tenenbaum, Joshua B.; Katz, Boris; Barbu, Andrei Much of what we do as humans is engage socially with other agents, a skill that robots must also eventually possess. We demonstrate that a rich theory of social interactions originating from microsociology and economics can be formalized by extending a nested MDP where agents reason about arbitrary functions of each other’s hidden rewards. This extended Social MDP allows us to encode the five basic interactions that underlie microsociology: cooperation, conflict, coercion, competition, and exchange. The result is a robotic agent capable of executing social interactions zero-shot in new environments; like humans it can engage socially in novel ways even without a single example of that social interaction. Moreover, the judgments of these Social MDPs align closely with those of humans when considering which social interaction is taking place in an environment. This method both sheds light on the nature of social interactions, by providing concrete mathematical definitions, and brings rich social interactions into a mathematical framework that has proven to be natural for robotics, MDPs.

Trajectory Prediction with Linguistic Representations

2022-03-09T00:00:00Z

Trajectory Prediction with Linguistic Representations Kuo, Yen-Ling; Huang, Xin; Barbu, Andrei; McGill, Stephen G.; Katz, Boris; Leonard, John J.; Rosman, Guy Language allows humans to build mental models that interpret what is happening around them resulting in more accurate long-term predictions. We present a novel trajectory prediction model that uses linguistic intermediate representations to forecast trajectories, and is trained using trajectory samples with partially-annotated captions. The model learns the meaning of each of the words without direct per-word supervision. At inference time, it generates a linguistic description of trajectories which captures maneuvers and interactions over an extended time interval. This generated description is used to refine predictions of the trajectories of multiple agents. We train and validate our model on the Argoverse dataset, and demonstrate improved accuracy results in trajectory prediction. In addition, our model is more interpretable: it presents part of its reasoning in plain language as captions, which can aid model development and can aid in building confidence in the model before deploying it.

Neural Regression, Representational Similarity, Model Zoology Neural Taskonomy at Scale in Rodent Visual Cortex

2021-12-06T00:00:00Z

Neural Regression, Representational Similarity, Model Zoology Neural Taskonomy at Scale in Rodent Visual Cortex Conwell, Colin; Mayo, David; Buice, Michael A.; Katz, Boris; Alvarez, George A.; Barbu, Andrei How well do deep neural networks fare as models of mouse visual cortex? A majority of research to date suggests results far more mixed than those produced in the modeling of primate visual cortex. Here, we perform a large-scale benchmarking of dozens of deep neural network models in mouse visual cortex with both representational similarity analysis and neural regression. Using the Allen Brain Observatory’s 2-photon calcium-imaging dataset of activity in over 6,000 reliable rodent visual cortical neurons recorded in response to natural scenes, we replicate previous findings and resolve previous discrepancies, ultimately demonstrating that modern neural networks can in fact be used to explain activity in the mouse visual cortex to a more reasonable degree than previously suggested. Using our benchmark as an atlas, we offer preliminary answers to overarching questions about levels of analysis, questions about the properties of models that best predict the visual system overall and questions about the mapping between biological and artificial representations. Our results provide a reference point for future ventures in the deep neural network modeling of mouse visual cortex, hinting at novel combinations of mapping method, architecture, and task to more fully characterize the computational motifs of visual representation in a species so central to neuroscience, but with a perceptual physiology and ecology markedly different from the ones we study in primates.

Social Interactions as Recursive MDPs

2021-11-08T00:00:00Z

Social Interactions as Recursive MDPs Tejwani, Ravi; Kuo, Yen-Ling; Shu, Tianmin; Katz, Boris; Barbu, Andrei While machines and robots must interact with humans, providing them with social skills has been a largely overlooked topic. This is mostly a consequence of the fact that tasks such as navigation, command following, and even game playing are well-defined, while social reasoning still mostly remains a pre-theoretic problem. We demonstrate how social interactions can be effectively incorporated into MDPs (Markov decision processes) by reasoning recursively about the goals of other agents. In essence, our method extends the reward function to include a combination of physical goals (something agents want to accomplish in the configuration space, a traditional MDP) and social goals (something agents want to accomplish relative to the goals of other agents). Our Social MDPs allow specifying reward functions in terms of the estimated reward functions of other agents, modeling interactions such as helping or hindering another agent (by maximizing or minimizing the other agent’s reward) while balancing this with the actual physical goals of each agent. Our formulation allows for an arbitrary function of another agent’s estimated reward structure and physical goals, enabling more complex behaviors such as politely hindering another agent or aggressively helping them. Extending Social MDPs in the same manner as I-POMDPs (interactive partially observed Markov decision processes) would enable interactions such as convincing another agent that something is true.
To what extent the Social MDPs presented here and their potential Social POMDPs variant account for all possible social interactions is unknown, but having a precise mathematical model to guide questions about social interactions has both practical value (we demonstrate how to make zero-shot social inferences and one could imagine chatbots and robots guided by Social MDPs) and theoretical value by bringing the tools of MDP that have so successfully organized research around navigation to shed light on what social interactions really are given their extreme importance to human well-being and human civilization.

Compositional Networks Enable Systematic Generalization for Grounded Language Understanding

2021-11-07T00:00:00Z

Compositional Networks Enable Systematic Generalization for Grounded Language Understanding Kuo, Yen-Ling; Katz, Boris; Barbu, Andrei Humans are remarkably flexible when understanding new sentences that include combinations of concepts they have never encountered before. Recent work has shown that while deep networks can mimic some human language abilities when presented with novel sentences, systematic variation uncovers the limitations in the language-understanding abilities of networks. We demonstrate that these limitations can be overcome by addressing the generalization challenges in the gSCAN dataset, which explicitly measures how well an agent is able to interpret novel linguistic commands grounded in vision, e.g., novel pairings of adjectives and nouns. The key principle we employ is compositionality: that the compositional structure of networks should reflect the compositional structure of the problem domain they address, while allowing other parameters to be learned end-to-end. We build a general-purpose mechanism that enables agents to generalize their language understanding to compositional domains. Crucially, our network has the same state-of-the-art performance as prior work while generalizing its knowledge when prior work does not. Our network also provides a level of interpretability that enables users to inspect what each part of networks learns. Robust grounded language understanding without dramatic failures and without corner cases is critical to building safe and fair robots; we demonstrate the significant role that compositionality can play in achieving that goal.

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

2021-08-30T00:00:00Z

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset Palmer, Ian; Rouditchenko, Andrew; Barbu, Andrei; Katz, Boris; Glass, James Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. Lastly, we show baseline results on image retrieval and audio retrieval tasks. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting.

Compositional RL Agents That Follow Language Commands in Temporal Logic

2021-07-19T00:00:00Z

Compositional RL Agents That Follow Language Commands in Temporal Logic Kuo, Yen-Ling; Barbu, Andrei; Katz, Boris We demonstrate how a reinforcement learning agent can use compositional recurrent neural networks to learn to carry out commands specified in linear temporal logic (LTL). Our approach takes as input an LTL formula, structures a deep network according to the parse of the formula, and determines satisfying actions. This compositional structure of the network enables zero-shot generalization to significantly more complex unseen formulas. We demonstrate this ability in multiple problem domains with both discrete and continuous state-action spaces. In a symbolic domain, the agent finds a sequence of letters that satisfy a specification. In a Minecraft-like environment, the agent finds a sequence of actions that conform to a formula. In the Fetch environment, the robot finds a sequence of arm configurations that move blocks on a table to fulfill the commands. While most prior work can learn to execute one formula reliably, we develop a novel form of multi-task learning for RL agents that allows them to learn from a diverse set of tasks and generalize to a new set of diverse tasks without any additional training. The compositional structures presented here are not specific to LTL, thus opening the path to RL agents that perform zero-shot generalization in other compositional domains.

Measuring Social Biases in Grounded Vision and Language Embeddings

2021-06-06T00:00:00Z

Measuring Social Biases in Grounded Vision and Language Embeddings Ross, Candace; Barbu, Andrei; Katz, Boris We generalize the notion of measuring social biases in word embeddings to visually grounded word embeddings. Biases are present in grounded embeddings, and indeed seem to be equally or more significant than for ungrounded embeddings. This is despite the fact that vision and language can suffer from different biases, which one might hope could attenuate the biases in both. Multiple ways exist to generalize metrics measuring bias in word embeddings to this new setting. We introduce the space of generalizations (Grounded-WEAT and Grounded-SEAT) and demonstrate that three generalizations answer different yet important questions about how biases, language, and vision interact. These metrics are used on a new dataset, the first for grounded bias, created by augmenting standard linguistic bias benchmarks with 10,228 images from COCO, Conceptual Captions, and Google Images. Dataset construction is challenging because vision datasets are themselves very biased. The presence of these biases in systems will begin to have real-world consequences as they are deployed, making carefully measuring bias and then mitigating it critical to building a fair society.

Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas

2020-10-25T00:00:00Z

Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas Kuo, Yen-Ling; Katz, Boris; Barbu, Andrei We demonstrate a reinforcement learning agent which uses a compositional recurrent neural network that takes as input an LTL formula and determines satisfying actions. The input LTL formulas have never been seen before, yet the network performs zero-shot generalization to satisfy them. This is a novel form of multi-task learning for RL agents where agents learn from one diverse set of tasks and generalize to a new set of diverse tasks. The formulation of the network enables this capacity to generalize. We demonstrate this ability in two domains. In a symbolic domain, the agent finds a sequence of letters that is accepted. In a Minecraft-like environment, the agent finds a sequence of actions that conform to the formula. While prior work could learn to execute one formula reliably given examples of that formula, we demonstrate how to encode all formulas reliably. This could form the basis of new multi-task agents that discover sub-tasks and execute them without any additional training, as well as agents that follow more complex linguistic commands. The structures required for this generalization are specific to LTL formulas, which opens up an interesting theoretical question: what structures are required in neural networks for zero-shot generalization to different logics?

Deep compositional robotic planners that follow natural language commands

2020-05-31T00:00:00Z

Deep compositional robotic planners that follow natural language commands Kuo, Yen-Ling; Katz, Boris; Barbu, Andrei We demonstrate how a sampling-based robotic planner can be augmented to learn to understand a sequence of natural language commands in a continuous configuration space to move and manipulate objects. Our approach combines a deep network structured according to the parse of a complex command that includes objects, verbs, spatial relations, and attributes, with a sampling-based planner, RRT. A recurrent hierarchical deep network controls how the planner explores the environment, determines when a planned path is likely to achieve a goal, and estimates the confidence of each move to trade off exploitation and exploration between the network and the planner. Planners are designed to have near-optimal behavior when information about the task is missing, while networks learn to exploit observations which are available from the environment, making the two naturally complementary. Combining the two enables generalization to new maps, new kinds of obstacles, and more complex sentences that do not occur in the training set. Little data is required to train the model despite it jointly acquiring a CNN that extracts features from the environment as it learns the meanings of words. The model provides a level of interpretability through the use of attention maps allowing users to see its reasoning steps despite being an end-to-end model. This end-to-end model allows robots to learn to follow natural language commands in challenging continuous environments.

PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception

2021-03-19T00:00:00Z

PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception Netanyahu, Aviv; Shu, Tianmin; Katz, Boris; Barbu, Andrei; Tenenbaum, Joshua B. The ability to perceive and reason about social interactions in the context of physical environments is core to human social intelligence and human-machine cooperation. However, no prior dataset or benchmark has systematically evaluated physically grounded perception of complex social interactions that go beyond short actions, such as high-fiving, or simple group activities, such as gathering. In this work, we create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions by including social concepts such as helping another agent. PHASE consists of 2D animations of pairs of agents moving in a continuous space generated procedurally using a physics engine and a hierarchical planner. Agents have a limited field of view, and can interact with multiple objects, in an environment that has multiple landmarks and obstacles. Using PHASE, we design a social recognition task and a social prediction task. PHASE is validated with human experiments demonstrating that humans perceive rich interactions in the social events, and that the simulated agents behave similarly to humans. As a baseline model, we introduce a Bayesian inverse planning approach, SIMPLE (SIMulation, Planning and Local Estimation), which outperforms state-of-the-art feedforward neural networks. We hope that PHASE can serve as a difficult new challenge for developing new models that can recognize complex social interactions.

Learning a natural-language to LTL executable semantic parser for grounded robotics

2020-11-16T00:00:00Z

Learning a natural-language to LTL executable semantic parser for grounded robotics Wang, Christopher; Ross, Candace; Kuo, Yen-Ling; Katz, Boris; Barbu, Andrei Children acquire their native language with apparent ease by observing how language is used in context and attempting to use it themselves. They do so without laborious annotations, negative examples, or even direct corrections. We take a step toward robots that can do the same by training a grounded semantic parser, which discovers latent linguistic representations that can be used for the execution of natural-language commands. In particular, we focus on the difficult domain of commands with a temporal aspect, whose semantics we capture with Linear Temporal Logic, LTL. Our parser is trained with pairs of sentences and executions as well as an executor. At training time, the parser hypothesizes a meaning representation for the input as a formula in LTL. Three competing pressures allow the parser to discover meaning from language. First, any hypothesized meaning for a sentence must be permissive enough to reflect all the annotated execution trajectories. Second, the executor — a pretrained end-to-end LTL planner — must find that the observed trajectories are likely executions of the meaning. Finally, a generator, which reconstructs the original input, encourages the model to find representations that conserve knowledge about the command. Together these ensure that the meaning is neither too general nor too specific. Our model generalizes well, being able to parse and execute both machine-generated and human-generated commands, with near-equal accuracy, despite the fact that the human-generated sentences are much more varied and complex with an open lexicon. The approach presented here is not specific to LTL: it can be applied to any domain where sentence meanings can be hypothesized and an executor can verify these meanings, thus opening the door to many applications for robotic agents.

Transformer Module Networks for Systematic Generalization in Visual Question Answering

2022-02-03T00:00:00Z

Transformer Module Networks for Systematic Generalization in Visual Question Answering Yamada, Moyuru; D'Amario, Vanessa; Takemoto, Kentaro; Boix, Xavier; Sasaki, Tomotake Transformer-based models achieve great performance on Visual Question Answering (VQA). However, when we evaluate them on systematic generalization, i.e., handling novel combinations of known concepts, their performance degrades. Neural Module Networks (NMNs) are a promising approach for systematic generalization that consists of composing modules, i.e., neural networks that each tackle a sub-task. Inspired by Transformers and NMNs, we propose Transformer Module Network (TMN), a novel Transformer-based model for VQA that dynamically composes modules into a question-specific Transformer network. TMNs achieve state-of-the-art systematic generalization performance in three VQA datasets, namely, CLEVR-CoGenT, CLOSURE and GQA-SGL, in some cases improving more than 30% over standard Transformers.

Three approaches to facilitate DNN generalization to objects in out-of-distribution orientations and illuminations

2022-01-26T00:00:00Z

Three approaches to facilitate DNN generalization to objects in out-of-distribution orientations and illuminations Sakai, Akira; Sunagawa, Taro; Madan, Spandan; Suzuki, Kanata; Katoh, Takashi; Kobashi, Hiromichi; Pfister, Hanspeter; Sinha, Pawan; Boix, Xavier; Sasaki, Tomotake The training data distribution is often biased towards objects in certain orientations and illumination conditions. While humans have a remarkable capability of recognizing objects in out-of-distribution (OoD) orientations and illuminations, Deep Neural Networks (DNNs) severely suffer in this case, even when large amounts of training examples are available. In this paper, we investigate three different approaches to improve DNNs in recognizing objects in OoD orientations and illuminations. Namely, these are (i) training much longer after convergence of the in-distribution (InD) validation accuracy, i.e., late-stopping, (ii) tuning the momentum parameter of the batch normalization layers, and (iii) enforcing invariance of the neural activity in an intermediate layer to orientation and illumination conditions. Each of these approaches substantially improves the DNN’s OoD accuracy (more than 20% in some cases). We report results in four datasets: two datasets are modified from the MNIST and iLab datasets, and the other two are novel (one of 3D rendered cars and another of objects taken from various controlled orientations and illumination conditions). These datasets allow us to study the effects of different amounts of bias and are challenging as DNNs perform poorly in OoD conditions. Finally, we demonstrate that even though the three approaches focus on different aspects of DNNs, they all tend to lead to the same underlying neural mechanism to enable OoD accuracy gains: individual neurons in the intermediate layers become more selective to a category and also invariant to OoD orientations and illuminations. We anticipate this study to be a basis for further improvement of deep neural networks’ OoD generalization performance, which is highly demanded to achieve safe and fair AI applications.

Image interpretation by iterative bottom-up top-down processing

2021-11-01T00:00:00Z

Image interpretation by iterative bottom-up top-down processing Ullman, Shimon; Assif, Liav; Strugatski, Alona; Vatashsky, Ben-Zion; Levi, Hila; Netanyahu, Aviv; Yaari, Adam Scene understanding requires the extraction and representation of scene components, such as objects and their parts, people, and places, together with their individual properties, as well as relations and interactions between them. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (‘counter-streams’ structure). The BU-TD model extracts and recognizes scene constituents with their selected properties and relations, and uses them to describe and understand the image. The scene representation is constructed by the iterative use of three components. The first model component is a bottom-up stream that extracts selected scene elements, properties and relations. The second component (‘cognitive augmentation’) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the top-down stream, in the form of a TD instruction, instructing the model what task to perform next. The top-down stream then guides the BU visual stream to perform the selected task in the next cycle. During this process, the visual representations extracted from the image can be combined with relevant non-visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world. We show how the BU-TD model composes complex visual tasks from sequences of steps, invoked by individual TD instructions. In particular, we describe how a sequence of TD instructions is used to extract structures of interest from the scene, including an algorithm to automatically select the next TD instruction in the sequence. The selection of a TD instruction depends in general on the goal, the image, and on information already extracted from the image in previous steps. The sequence of TD instructions is therefore not a fixed sequence determined at the start, but an evolving program (or ‘visual routine’) that depends on the goal and the image. The extraction process is shown to have favourable properties in terms of combinatorial generalization, generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of human vision, and suggest directions for using the BU-TD scheme for integrating visual and cognitive components in the process of scene understanding.

From Marr’s Vision to the Problem of Human Intelligence

2021-09-01T00:00:00Z

From Marr’s Vision to the Problem of Human Intelligence Poggio, Tomaso

The Effects of Image Distribution and Task on Adversarial Robustness

2021-02-18T00:00:00Z

The Effects of Image Distribution and Task on Adversarial Robustness Kunhardt, Owen; Deza, Arturo; Poggio, Tomaso In this paper, we propose an adaptation of the area under the curve (AUC) metric to measure the adversarial robustness of a model over a particular ε-interval [ε_0, ε_1] (interval of adversarial perturbation strengths) that facilitates unbiased comparisons across models when they have different initial ε_0 performance. This can be used to determine how adversarially robust a model is to different image distributions or tasks (or some other variable), and/or to measure how robust a model is compared to other models. We used this adversarial robustness metric on models of MNIST, CIFAR-10, and a Fusion dataset (CIFAR-10 + MNIST), where trained models performed either a digit or object recognition task using a LeNet, ResNet50, or a fully connected network (FullyConnectedNet) architecture, and found the following: 1) CIFAR-10 models are inherently less adversarially robust than MNIST models; 2) both the image distribution and task that a model is trained on can affect the adversarial robustness of the resultant model; 3) pretraining with a different image distribution and task sometimes carries over the adversarial robustness induced by that image distribution and task in the resultant model. Collectively, our results imply non-trivial differences of the learned representation space of one perceptual system over another given its exposure to different image statistics or tasks (mainly objects vs digits). Moreover, these results hold even when model systems are equalized to have the same level of performance, or when exposed to approximately matched image statistics of fusion images but with different tasks.

Cross-validation Stability of Deep Networks

2021-02-09T00:00:00Z

Cross-validation Stability of Deep Networks Banburski, Andrzej; De La Torre, Fernanda; Plant, Nishka; Shastri, Ishana; Poggio, Tomaso Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution however does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of “high capacity” features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.

From Associative Memories to Deep Networks

2021-01-12T00:00:00Z

From Associative Memories to Deep Networks Poggio, Tomaso About fifty years ago, holography was proposed as a model of associative memory. Associative memories with similar properties were soon after implemented as simple networks of threshold neurons by Willshaw and Longuet-Higgins. In these pages I will show that today’s deep nets are an incremental improvement of the original associative networks. Thinking about deep learning in terms of associative networks provides a more realistic and sober perspective on the promises of deep learning and on its role in eventually understanding human intelligence. As a bonus, this discussion also uncovers connections with several interesting topics in applied math: random features, random projections, neural ensembles, randomized kernels, memory and generalization, vector quantization and hierarchical vector quantization, random vectors and orthogonal basis, NTK and radial kernels.

Dreaming with ARC

2020-11-23T00:00:00Z

Dreaming with ARC Banburski, Andrzej; Ghandi, Anshula; Alford, Simon; Dandekar, Sylee; Chin, Peter; Poggio, Tomaso Current machine learning algorithms are highly specialized to whatever it is they are meant to do, e.g., playing chess, picking up objects, or object recognition. How can we extend this to a system that could solve a wide range of problems? We argue that this can be achieved by a modular system, one that can adapt to solving different problems by changing only the modules chosen and the order in which those modules are applied to the problem. The recently introduced ARC (Abstraction and Reasoning Corpus) dataset serves as an excellent test of abstract reasoning. Suited to the modular approach, the tasks depend on a set of human Core Knowledge inbuilt priors. In this paper we implement these priors as the modules of our system. We combine these modules using neural-guided program synthesis.

Implicit dynamic regularization in deep networks

2020-08-17T00:00:00Z

Implicit dynamic regularization in deep networks Poggio, Tomaso; Liao, Qianli Square loss has been observed to perform well in classification tasks, at least as well as cross-entropy. However, a theoretical justification is lacking. Here we develop a theoretical analysis for the square loss that also complements the existing asymptotic analysis for the exponential loss.

On the Capability of Neural Networks to Generalize to Unseen Category-Pose Combinations

2020-07-17T00:00:00Z

On the Capability of Neural Networks to Generalize to Unseen Category-Pose Combinations Madan, Spandan; Henry, Timothy; Dozier, Jamell; Ho, Helen; Bhandari, Nishchal; Sasaki, Tomotake; Durand, Fredo; Pfister, Hanspeter; Boix, Xavier Recognizing an object’s category and pose lies at the heart of visual understanding. Recent works suggest that deep neural networks (DNNs) often fail to generalize to category-pose combinations not seen during training. However, it is unclear when and how such generalization may be possible. Does the number of combinations seen during training impact generalization? Is it better to learn category and pose in separate networks, or in a single shared network? Furthermore, what are the neural mechanisms that drive the network’s generalization? In this paper, we answer these questions by analyzing state-of-the-art DNNs trained to recognize both object category and pose (position, scale, and 3D viewpoint) with quantitative control over the number of category-pose combinations seen during training. We also investigate the emergence of two types of specialized neurons that can explain generalization to unseen combinations: neurons selective to category and invariant to pose, and vice versa. We perform experiments on MNIST extended with position or scale, the iLab dataset with vehicles at different viewpoints, and a challenging new dataset for car model recognition and viewpoint estimation that we introduce in this paper, the Biased-Cars dataset. Our results demonstrate that as the number of combinations seen during training increases, networks generalize better to unseen category-pose combinations, facilitated by an increase in the selectivity and invariance of individual neurons. We find that learning category and pose in separate networks compared to a shared one leads to an increase in such selectivity and invariance, as separate networks are not forced to preserve information about both category and pose. This enables separate networks to significantly outperform shared ones at predicting unseen category-pose combinations.

Loss landscape: SGD can have a better view than GD

2020-07-01T00:00:00Z

Loss landscape: SGD can have a better view than GD Poggio, Tomaso; Cooper, Yaim Consider a loss function $L = \sum_{i=1}^{n} \ell_i^2$ with $\ell_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms and scalar output. Assume the network is overparametrized that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are assumed to interpolate the training data (e.g. the minimum of $L$ is zero). If GD converges, it will converge to a critical point of $L$, namely a solution of $\sum_{i=1}^{n} \ell_i \nabla \ell_i = 0$. There are two kinds of critical points - those for which each term of the above sum vanishes individually, and those for which the expression only vanishes when all the terms are summed. The main claim in this note is that while GD can converge to both types of critical points, SGD can only converge to the first kind, which include all global minima.
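
The two kinds of critical points can be written out explicitly. The following is only a sketch in the abstract's notation: the explicit weight vector $w$, the step size $\eta$, and the single-sample form of the SGD update are assumptions introduced here for illustration and are not stated in the abstract.

```latex
\begin{align*}
  L(w) &= \sum_{i=1}^{n} \ell_i(w)^2,
  \qquad \ell_i(w) = f(x_i; w) - y_i, \\
  \nabla L(w) &= 2 \sum_{i=1}^{n} \ell_i(w)\, \nabla \ell_i(w)
  && \text{(GD is stationary when this sum is zero)}, \\
  w_{t+1} &= w_t - 2\eta\, \ell_{i_t}(w_t)\, \nabla \ell_{i_t}(w_t)
  && \text{(assumed single-sample SGD step, index } i_t \text{ drawn at random)}, \\
  w_{t+1} &= w_t \ \text{for every choice of } i_t
  \iff \ell_i(w_t)\, \nabla \ell_i(w_t) = 0 \ \text{for all } i .
\end{align*}
```

Under these assumptions, a point where the sum vanishes only through cancellation between nonzero terms is stationary for GD but is moved by any single SGD step, whereas every global minimum (all $\ell_i = 0$) makes each term vanish individually.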

Biologically Inspired Mechanisms for Adversarial Robustness

2020-06-23T00:00:00Z

Biologically Inspired Mechanisms for Adversarial Robustness Vuyyuru Reddy, Manish; Banburski, Andrzej; Plant, Nishka; Poggio, Tomaso A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations. We verify that these two mechanisms do not suffer from gradient obfuscation and study their contribution to adversarial robustness through ablation studies.

Hierarchically Local Tasks and Deep Convolutional Networks

2020-06-24T00:00:00Z

Hierarchically Local Tasks and Deep Convolutional Networks Deza, Arturo; Liao, Qianli; Banburski, Andrzej; Poggio, Tomaso The main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown that there is an exponential advantage of deep convolutional-like networks in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Among all the possible hierarchically local tasks in vision, text and speech we explore a few of them experimentally by studying how they are affected by disrupting locality in the input images. We also discuss a taxonomy of tasks ranging from local, to hierarchically local, to global and make predictions about the type of networks required to perform efficiently on these different types of tasks.

For interpolating kernel machines, the minimum norm ERM solution is the most stable

2020-06-22T00:00:00Z

For interpolating kernel machines, the minimum norm ERM solution is the most stable Rangamani, Akshay; Rosasco, Lorenzo; Poggio, Tomaso We study the average CVloo stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm has the best CVloo stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error follows a double descent curve.
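
As a rough illustration of the objects named in this abstract, the sketch below computes a minimum-norm interpolating kernel solution with a pseudoinverse and reports the condition number of the empirical kernel matrix. The Gaussian kernel, the bandwidth `sigma`, the synthetic data, and the helper names are assumptions chosen for the example; they are not taken from the memo.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Z (assumed kernel)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def min_norm_interpolant(X, y, sigma=1.0):
    """Ridgeless kernel regression: coefficients c = K^+ y, so f(x) = k(x, X) c."""
    K = gaussian_kernel(X, X, sigma)
    coef = np.linalg.pinv(K) @ y  # pseudoinverse selects the minimum-norm solution
    return K, coef

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

K, coef = min_norm_interpolant(X, y)
train_residual = float(np.max(np.abs(K @ coef - y)))  # near zero when the fit interpolates
cond_K = float(np.linalg.cond(K))                     # condition number of the kernel matrix

print(f"max training residual: {train_residual:.2e}")
print(f"condition number of K: {cond_K:.2f}")
```

A larger `cond_K` means the coefficients react more strongly to perturbations of `y`, which is the sense in which the abstract ties stability to the condition number.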

An Exit Strategy from the Covid-19 Lockdown based on Risk-sensitive Resource Allocation

2020-04-15T00:00:00Z

An Exit Strategy from the Covid-19 Lockdown based on Risk-sensitive Resource Allocation Shalev-Shwartz, Shai; Shashua, Amnon We propose an exit strategy from the COVID-19 lockdown, which is based on risk-sensitive levels of social distancing. At the heart of our approach is the realization that the most effective, yet limited in number, resources should protect those at high risk rather than be applied uniformly across the population. By generalizing the SEIR model to mixed populations, and based on existing data in Israel, we present an analysis of the maximal load on the health system and the total mortality. We argue that risk-sensitive resource allocation combined with risk-sensitive levels of social distancing makes it possible to lower the overall mortality toll in parallel with the resumption of economic activity.

Do Neural Networks for Segmentation Understand Insideness?

2020-04-04T00:00:00Z

Do Neural Networks for Segmentation Understand Insideness? Villalobos, Kimberly; Štih, Vilim; Ahmadinejad, Amineh; Sundaram, Shobhita; Dozier, Jamell; Francl, Andrew; Azevedo, Frederico; Sasaki, Tomotake; Boix, Xavier The insideness problem is an image segmentation modality that consists of determining which pixels are inside and outside a region. Deep Neural Networks (DNNs) excel in segmentation benchmarks, but it is unclear whether they have the ability to solve the insideness problem, as it requires evaluating long-range spatial dependencies. In this paper, the insideness problem is analyzed in isolation, without texture or semantic cues, such that other aspects of segmentation do not interfere in the analysis. We demonstrate that DNNs for segmentation with few units have sufficient complexity to solve insideness for any curve. Yet, such DNNs have severe difficulty learning general solutions. Only recurrent networks trained with small images learn solutions that generalize well to almost any curve. Recurrent networks can decompose the evaluation of long-range dependencies into a sequence of local operations, and learning with small images alleviates the common difficulties of training recurrent networks with a large number of unrolling steps.

Can we Contain Covid-19 without Locking-down the Economy?

2020-03-26T00:00:00Z

Can we Contain Covid-19 without Locking-down the Economy? Shalev-Shwartz, Shai; Shashua, Amnon We present an analysis of a risk-based selective quarantine model where the population is divided into low and high-risk groups. The high-risk group is quarantined until the low-risk group achieves herd-immunity. We tackle the question of whether this model is safe, in the sense that the health system can contain the number of low-risk people that require severe ICU care (such as life support systems).

Stable Foundations for Learning: a foundational framework for learning theory in both the classical and modern regime.

2020-03-25T00:00:00Z

Stable Foundations for Learning: a foundational framework for learning theory in both the classical and modern regime. Poggio, Tomaso We consider here the class of supervised learning algorithms known as Empirical Risk Minimization (ERM). The classical theory by Vapnik and others characterizes universal consistency of ERM in the classical regime in which the architecture of the learning network is fixed and n, the number of training examples, goes to infinity. According to the classical theory, the minimizer of the empirical risk is consistent if the hypothesis space has finite complexity. We do not have a similar general theory for the modern regime of interpolating regressors and over-parameterized deep networks, in which d > n and d/n remains constant as n goes to infinity. In this note I propose the outline of such a theory based on the specific notion of CVloo stability of the learning algorithm with respect to perturbations of the training set. The theory shows that for interpolating regressors and separating classifiers (either kernel machines or deep ReLU networks) (1) minimizing CVloo stability minimizes the expected error, and (2) the most stable solutions are minimum norm solutions. The hope is that this approach may lead to a unified theory encompassing both the modern regime and the classical one.

Double descent in the condition number

2019-12-04T00:00:00Z

Double descent in the condition number Poggio, Tomaso; Kur, Gil; Banburski, Andrzej In solving a system of n linear equations in d variables Ax=b, the condition number of the (n,d) matrix A measures how much errors in the data b affect the solution x. Bounds of this type are important in many inverse problems. An example is machine learning where the key task is to estimate an underlying function from a set of measurements at random points in a high dimensional space and where low sensitivity to error in the data is a requirement for good predictive performance. Here we report the simple observation that when the columns of A are random vectors, the condition number of A is highest, that is worse, when d=n, that is when the inverse of A exists. An overdetermined system (n>d) and especially an underdetermined system (n
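The observation that the condition number peaks at d = n can be checked numerically. The sketch below is an illustration under stated assumptions, not the memo's experiment: entries of A are drawn i.i.d. standard Gaussian, n is fixed at 50, and the median 2-norm condition number over 50 random draws is reported for several values of d. The function name and parameter values are illustrative.

```python
import numpy as np

def median_condition_number(n, d, trials=50, seed=0):
    """Median 2-norm condition number of random (n, d) Gaussian matrices."""
    rng = np.random.default_rng(seed)
    conds = [np.linalg.cond(rng.standard_normal((n, d))) for _ in range(trials)]
    return float(np.median(conds))

n = 50
results = {d: median_condition_number(n, d) for d in (10, 25, 50, 100, 250)}

for d, kappa in results.items():
    print(f"d={d:4d}  d/n={d / n:4.1f}  median cond={kappa:10.2f}")
```

With these settings the median condition number is largest for the square case d = n and smaller on both the overdetermined and the underdetermined side.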

Hippocampal Remapping as Hidden State Inference

2019-08-22T00:00:00Z

Hippocampal Remapping as Hidden State Inference Sanders, Honi; Wilson, Matthew A.; Gershman, Samuel J. Cells in the hippocampus tuned to spatial location (place cells) typically change their tuning when an animal changes context, a phenomenon known as remapping. A fundamental challenge to understanding remapping is the fact that what counts as a “context change” has never been precisely defined. Furthermore, different remapping phenomena have been classified on the basis of how much the tuning changes after different types and degrees of context change, but the relationship between these variables is not clear. We address these ambiguities by formalizing remapping in terms of hidden state inference. According to this view, remapping does not directly reflect objective, observable properties of the environment, but rather subjective beliefs about the hidden state of the environment. We show how the hidden state framework can resolve a number of puzzles about the nature of remapping.

Brain Signals Localization by Alternating Projections

2019-08-29T00:00:00Z

Brain Signals Localization by Alternating Projections Adler, Amir; Wax, Mati; Pantazis, Dimitrios We present a novel solution to the problem of localization of brain signals. The solution is sequential and iterative, and is based on minimizing the least-squares (LS) criterion by the alternating projection (AP) algorithm, well known in the context of array signal processing. Unlike existing solutions belonging to the linearly constrained minimum variance (LCMV) and to the multiple-signal classification (MUSIC) families, the algorithm is applicable even in the case of a single sample and in the case of synchronous sources. The performance of the solution is demonstrated via simulations.

Theoretical Issues in Deep Networks

2019-08-17T00:00:00Z

Theoretical Issues in Deep Networks Poggio, Tomaso; Banburski, Andrzej; Liao, Qianli While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about the approximation power of deep networks, the dynamics of optimization by gradient descent, and good out-of-sample performance, that is, why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal. In approximation theory both shallow and deep networks are known to approximate any continuous function on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing minimization of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turn out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit L2 norm constraint. In particular, the dynamics of the typical, unconstrained gradient descent converge to the same critical points as the constrained problem. Thus, there is implicit regularization in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite N (number of training examples) with respect to the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite N remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite (in terms of complexity control) for the good performance of deep overparametrized ReLU classifiers.

An analysis of training and generalization errors in shallow and deep networks

2019-05-30T00:00:00Z

An analysis of training and generalization errors in shallow and deep networks Mhaskar, H.N.; Poggio, Tomaso This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

Biologically-plausible learning algorithms can scale to large datasets

2018-11-08T00:00:00Z

Biologically-plausible learning algorithms can scale to large datasets Xiao, Will; Chen, Honglin; Liao, Qianli; Poggio, Tomaso The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP's weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target-propagation (TP) and feedback alignment (FA) on MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.
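The difference between BP, feedback alignment, and sign-symmetry lies in which matrix carries the error signal backward through a layer. The sketch below illustrates this for a single linear layer y = W x. It is a simplified illustration, not the authors' training code: the sign-symmetry feedback is assumed to have unit magnitudes (sign(W) transposed), and the feedback-alignment matrix B is assumed to be any fixed matrix with the same shape as W. The function name is hypothetical.

```python
import numpy as np

def backward_input_grad(W, grad_out, mode, B=None):
    """Propagate grad_out = dL/dy back through y = W @ x to obtain dL/dx."""
    if mode == "bp":
        feedback = W.T                # exact transpose: symmetric weights
    elif mode == "sign_symmetry":
        feedback = np.sign(W).T       # same signs as W, magnitudes set to 1 (assumption)
    elif mode == "feedback_alignment":
        feedback = B.T                # fixed matrix unrelated to W
    else:
        raise ValueError(f"unknown mode: {mode}")
    return feedback @ grad_out

W = np.array([[2.0, -3.0],
              [0.5,  4.0]])
grad_out = np.array([1.0, 1.0])

g_bp = backward_input_grad(W, grad_out, "bp")
g_ss = backward_input_grad(W, grad_out, "sign_symmetry")
g_fa = backward_input_grad(W, grad_out, "feedback_alignment", B=np.eye(2))

print("BP:            ", g_bp)
print("sign-symmetry: ", g_ss)
print("feedback align:", g_fa)
```

Only the backward pass changes; the forward pass still uses W in all three modes.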

What am I searching for?

2018-07-31T00:00:00Z

What am I searching for? Zhang, Mengmi; Feng, Jiashi; Lim, Joo Hwee; Zhao, Qi; Kreiman, Gabriel Can we infer intentions and goals from a person's actions? As an example of this family of problems, we consider here whether it is possible to decipher what a person is searching for by decoding their eye movement behavior. We conducted two human psychophysics experiments on object arrays and natural images where we monitored subjects' eye movements while they were looking for a target object. Using as input the pattern of "error" fixations on non-target objects before the target was found, we developed a model (InferNet) whose goal was to infer what the target was. "Error" fixations share similar features with the sought target. The InferNet model uses a pre-trained 2D convolutional architecture to extract features from the error fixations and computes a 2D similarity map between the error fixation and all locations across the search image by modulating the search image via convolution across layers. InferNet consolidates the modulated response maps across layers via max pooling to keep track of the sub-patterns highly similar to features at error fixations and integrates these maps across all error fixations. InferNet successfully identifies the subject's goal and outperforms all the competitive null models, even without any object-specific training on the inference task.

Spatiotemporal interpretation features in the recognition of dynamic images

2018-11-21T00:00:00Z

Spatiotemporal interpretation features in the recognition of dynamic images Ben-Yosef, Guy; Kreiman, Gabriel; Ullman, Shimon Objects and their parts can be visually recognized and localized from purely spatial information in static images and also from purely temporal information as in the perception of biological motion. Cortical regions have been identified, which appear to specialize in visual recognition based on either static or dynamic cues, but the mechanisms by which spatial and temporal information are integrated are only poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by the identification of minimal spatiotemporal configurations: these are short videos in which objects and their parts, along with an action being performed, can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art computational models for recognition from dynamic images based on deep 2D and 3D convolutional networks cannot replicate human recognition in these configurations. Action recognition in minimal spatiotemporal configurations is invariably accompanied by full human interpretation of the internal components of the image and their inter-relations. We hypothesize that this gap is due to mechanisms for a full spatiotemporal interpretation process, which in human vision is an integral part of recognizing dynamic events, but is not sufficiently represented in current DNNs.

Single units in a deep neural network functionally correspond with neurons in the brain: preliminary results

2018-11-02T00:00:00Z

Single units in a deep neural network functionally correspond with neurons in the brain: preliminary results Arend, Luke; Han, Yena; Schrimpf, Martin; Bashivan, Pouya; Kar, Kohitij; Poggio, Tomaso; DiCarlo, James J.; Boix, Xavier Deep neural networks have been shown to predict neural responses in higher visual cortex. The mapping from the model to a neuron in the brain occurs through a linear combination of many units in the model, leaving open the question of whether there also exists a correspondence at the level of individual neurons. Here we show that there exist many one-to-one mappings between single units in a deep neural network model and neurons in the brain. We show that this correspondence at the single-unit level is ubiquitous among state-of-the-art deep neural networks, and grows more pronounced for models with higher performance on a large-scale visual recognition task. Comparing matched populations (in the brain and in a model), we demonstrate a further correspondence at the level of the population code: stimulus category can be partially decoded from real neural responses using a classifier trained purely on a matched population of artificial units in a model. This provides a new point of investigation for phenomena which require fine-grained mappings between deep neural networks and the brain.

Biologically-Plausible Learning Algorithms Can Scale to Large Datasets

2018-09-27T00:00:00Z

Biologically-Plausible Learning Algorithms Can Scale to Large Datasets Xiao, Will; Chen, Honglin; Liao, Qianli; Poggio, Tomaso The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this “weight transport problem” (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP’s weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target-propagation (TP) and feedback alignment (FA) on MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.

Classical generalization bounds are surprisingly tight for Deep Networks

2018-07-11T00:00:00Z

Classical generalization bounds are surprisingly tight for Deep Networks Liao, Qianli; Miranda, Brando; Hidary, Jack; Poggio, Tomaso Deep networks are usually trained and tested in a regime in which the training classification error is not a good predictor of the test error. Thus the consensus has been that generalization, defined as convergence of the empirical to the expected error, does not hold for deep networks. Here we show that, when normalized appropriately after training, deep networks trained on exponential type losses show a good linear dependence of test loss on training loss. The observation, motivated by a previous theoretical analysis of overparameterization and overfitting, not only demonstrates the validity of classical generalization bounds for deep learning but suggests that they are tight. In addition, we also show that the bound of the classification error by the normalized cross entropy loss is empirically rather tight on the data sets we studied.

Theory IIIb: Generalization in Deep Networks

2018-06-29T00:00:00Z

Theory IIIb: Generalization in Deep Networks Poggio, Tomaso; Liao, Qianli; Miranda, Brando; Banburski, Andrzej; Hidary, Jack The general features of the optimization problem for the case of overparametrized nonlinear networks have been clear for a while: SGD selects with high probability global minima vs local minima. In the overparametrized case, the key question is not optimization of the empirical risk but optimization with a generalization guarantee. In fact, a main puzzle of deep neural networks (DNNs) revolves around the apparent absence of “overfitting”, defined as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is superficially surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and the absence of explicit regularization. Several recent efforts, including our previous versions of this technical report, strongly suggest that good test performance of deep networks depends on constraining the norm of their weights. Here we prove that: • the loss functions of deep ReLU networks under square loss and logistic loss on a compact domain are invex functions; • for such loss functions any equilibrium point is a global minimum; • convergence is fast, the minima are close to the origin; • the global minima have in general degenerate Hessians for which there is no direct control of the norm, apart from initialization close to the origin; • a simple variation of gradient descent techniques called norm-minimizing (NM) gradient descent guarantees minimum norm minimizers under both the square loss and the exponential loss, independently of initial conditions. A convenient norm for a deep network is the product of the Frobenius norms of the weight matrices. Control of the norm by NM ensures generalization for regression (because of the associated control of the Rademacher complexity). Margin bounds ensure control of classification error by maximization of the margin of f̃ (the classifier with normalized Frobenius norms) obtained by the minimization of an exponential-type loss by NM iterations. This replaces previous versions of Theory IIIa and Theory IIIb, updating several vague or incorrect statements.
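The product of Frobenius norms and the normalized classifier f̃ can be illustrated on a small bias-free ReLU network. This is a minimal sketch under stated assumptions, not the memo's NM algorithm: the network has no biases and a linear last layer, so rescaling each weight matrix to unit Frobenius norm rescales the output by exactly the product of the norms. The function names and the toy weights are illustrative.

```python
import numpy as np

def product_of_frobenius_norms(weights):
    """Product over layers of the Frobenius norm of each weight matrix."""
    return float(np.prod([np.linalg.norm(W, "fro") for W in weights]))

def normalize_layers(weights):
    """Rescale every weight matrix to unit Frobenius norm."""
    return [W / np.linalg.norm(W, "fro") for W in weights]

def relu_net(weights, x):
    """Bias-free ReLU network; the last layer is linear."""
    h = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

weights = [np.array([[3.0, 4.0], [0.0, 0.0]]),   # Frobenius norm 5
           np.array([[6.0, 8.0]])]               # Frobenius norm 10
x = np.array([1.0, 2.0])

rho = product_of_frobenius_norms(weights)         # 5 * 10 = 50
f = relu_net(weights, x)                          # unnormalized output
f_tilde = relu_net(normalize_layers(weights), x)  # normalized network output

print("rho =", rho, " f =", f, " rho * f_tilde =", rho * f_tilde)
```

Because ReLU is positively homogeneous, f equals rho times f_tilde, so the sign of the output (and hence the classification) is determined by the normalized network alone.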

Deep Regression Forests for Age Estimation

2018-06-01T00:00:00Z

Deep Regression Forests for Age Estimation Shen, Wei; Guo, Yilu; Wang, Yan; Zhao, Kai; Wang, Bo; Yuille, Alan L. Age estimation from facial images is typically cast as a nonlinear regression problem. The main challenge of this problem is that the facial feature space w.r.t. ages is inhomogeneous, due to the large variation in facial appearance across different persons of the same age and the non-stationary property of aging patterns. In this paper, we propose Deep Regression Forests (DRFs), an end-to-end model, for age estimation. DRFs connect the split nodes to a fully connected layer of a convolutional neural network (CNN) and deal with inhomogeneous data by jointly learning input-dependent data partitions at the split nodes and data abstractions at the leaf nodes. This joint learning follows an alternating strategy: first, by fixing the leaf nodes, the split nodes as well as the CNN parameters are optimized by back-propagation; then, by fixing the split nodes, the leaf nodes are optimized by iterating a step-size free update rule derived from Variational Bounding. We verify the proposed DRFs on three standard age estimation benchmarks and achieve state-of-the-art results on all of them.

Multi-stage Multi-recursive-input Fully Convolutional Networks for Neuronal Boundary Detection

2017-10-01T00:00:00Z

Multi-stage Multi-recursive-input Fully Convolutional Networks for Neuronal Boundary Detection Shen, Wei; Wang, Bin; Jiang, Yuan; Wang, Yan; Yuille, Alan L. In the field of connectomics, neuroscientists seek to identify cortical connectivity comprehensively. Neuronal boundary detection from Electron Microscopy (EM) images is often done to assist the automatic reconstruction of neuronal circuits. But the segmentation of EM images is a challenging problem, as it requires the detector to be able to detect both filament-like thin and blob-like thick membranes, while suppressing the ambiguous intracellular structure. In this paper, we propose multi-stage multi-recursive-input fully convolutional networks to address this problem. The multiple recursive inputs for one stage, i.e., the multiple side outputs with different receptive field sizes learned from the lower stage, provide multi-scale contextual boundary information for the consecutive learning. This design is biologically plausible, as it resembles the way a human visual system compares different possible segmentation solutions to resolve ambiguous boundaries. Our multi-stage networks are trained end-to-end. They achieve promising results on two publicly available EM segmentation datasets, the mouse piriform cortex dataset and the ISBI 2012 EM dataset.

Theory of Deep Learning IIb: Optimization Properties of SGD

2017-12-27T00:00:00Z

Theory of Deep Learning IIb: Optimization Properties of SGD Zhang, Chiyuan; Liao, Qianli; Rakhlin, Alexander; Miranda, Brando; Golowich, Noah; Poggio, Tomaso In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability, like the classical Langevin equation, on large-volume, “flat” minima, selecting flat minimizers which are with very high probability also global minimizers.

Scene Graph Parsing as Dependency Parsing

2018-05-10T00:00:00Z

Scene Graph Parsing as Dependency Parsing Wang, Yu-Siang; Liu, Chenxi; Zeng, Xiaohui; Yuille, Alan L. In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.

Recurrent Multimodal Interaction for Referring Image Segmentation

2018-05-10T00:00:00Z

Recurrent Multimodal Interaction for Referring Image Segmentation Liu, Chenxi; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Yuille, Alan L. In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e. referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.

Image interpretation above and below the object level

2018-05-10T00:00:00Z

Image interpretation above and below the object level Ben-Yosef, Guy; Ullman, Shimon Computational models of vision have advanced in recent years at a rapid rate, rivaling in some areas human-level performance. Much of the progress to date has focused on analyzing the visual scene at the object level – the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper image understanding both ‘below’ the object level, such as identifying and localizing object parts and sub-parts, as well as ‘above’ the object level, such as identifying object relations, and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, their components, properties, and inter-relations, a process referred to here as ‘image interpretation’. In this paper we describe recent directions, based on human and computer vision studies, towards human-like image interpretation, beyond the reach of current schemes, both below the object level, as well as some aspects of image interpretation at the level of meaningful configurations beyond the recognition of individual objects, in particular, interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called 'minimal images', and at both levels recognition depends on combining ‘bottom-up’ processing, proceeding from low to higher levels of a processing hierarchy, together with ‘top-down’ processing, proceeding from high to lower stages of visual analysis.

Deep Nets: What have they ever done for Vision?

2018-05-10T00:00:00Z

Deep Nets: What have they ever done for Vision? Yuille, Alan L.; Liu, Chenxi This is an opinion paper about the strengths and weaknesses of Deep Nets. They are at the center of recent progress on Artificial Intelligence and are of growing importance in Cognitive Science and Neuroscience since they enable the development of computational models that can deal with a large range of visually realistic stimuli and visual tasks. They have clear limitations but they also have enormous successes. There is also gradual, though incomplete, understanding of their inner workings. It seems unlikely that Deep Nets in their current form will be the best long-term solution either for building general purpose intelligent machines or for understanding the mind/brain, but it is likely that many aspects of them will remain. At present Deep Nets do very well on specific types of visual tasks and on specific benchmarked datasets. But Deep Nets are much less general purpose, flexible, and adaptive than the human visual system. Moreover, methods like Deep Nets may run into fundamental difficulties when faced with the enormous complexity of natural images. To illustrate our main points, while keeping the references small, this paper is slightly biased towards work from our group.

Visual concepts and compositional voting

2018-03-27T00:00:00Z

Visual concepts and compositional voting Wang, Jianyu; Zhang, Zhishuai; Xie, Cihang; Zhou, Yuyin; Premachandran, Vittal; Zhu, Jun; Xie, Lingxi; Yuille, Alan L. It is very attractive to formulate vision in terms of pattern theory [26], where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real world images is very challenging and is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black-boxes which are hard to interpret and, as we will show, can easily be fooled by adding occluding objects. It is natural to wonder whether by better understanding deep networks we can extract building blocks which can be used to develop pattern theoretic models. This motivates us to study the internal feature vectors of a deep network using images of vehicles from the PASCAL3D+ dataset with the scale of objects fixed. We use clustering algorithms, such as K-means, to study the population activity of the features and extract a set of visual concepts which we show are visually tight and correspond to semantic parts of the vehicles. To analyze this in more detail, we annotate these vehicles by their semantic parts to create a new dataset which we call VehicleSemanticParts, and evaluate visual concepts as unsupervised semantic part detectors. Our results show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines. We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this analysis, we use the visual concepts as building blocks for a simple pattern theoretical model, which we call compositional voting. In this model several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods like Support Vector Machines and deep networks trained specifically for semantic part detection. Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called Vehicle Occlusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.

DeepVoting: A Robust and Explainable Deep Network for Semantic Part Detection under Partial Occlusion

2018-06-19T00:00:00Z

DeepVoting: A Robust and Explainable Deep Network for Semantic Part Detection under Partial Occlusion Zhang, Zhishuai; Xie, Cihang; Wang, Jianyu; Xie, Lingxi; Yuille, Alan L. In this paper, we study the task of detecting semantic parts of an object, e.g., a wheel of a car, under partial occlusion. We propose that all models should be trained without seeing occlusions while being able to transfer the learned knowledge to deal with occlusions. This setting alleviates the difficulty in collecting an exponentially large dataset to cover occlusion patterns and is more essential. In this scenario, the proposal-based deep networks, like RCNN-series, often produce unsatisfactory results, because both the proposal extraction and classification stages may be confused by the irrelevant occluders. To address this, [25] proposed a voting mechanism that combines multiple local visual cues to detect semantic parts. The semantic parts can still be detected even though some visual cues are missing due to occlusions. However, this method is manually designed, and is thus hard to optimize in an end-to-end manner. In this paper, we present DeepVoting, which incorporates the robustness shown by [25] into a deep network, so that the whole pipeline can be jointly optimized. Specifically, it adds two layers after the intermediate features of a deep network, e.g., the pool-4 layer of VGGNet. The first layer extracts the evidence of local visual cues, and the second layer performs a voting mechanism by utilizing the spatial relationship between visual cues and semantic parts. We also propose an improved version DeepVoting+ by learning visual cues from context outside objects. In experiments, DeepVoting achieves significantly better performance than several baseline methods, including Faster-RCNN, for semantic part detection under occlusion. In addition, DeepVoting enjoys explainability as the detection results can be diagnosed via looking up the voting cues.

Single-Shot Object Detection with Enriched Semantics

2018-06-19T00:00:00Z

Single-Shot Object Detection with Enriched Semantics Zhang, Zhishuai; Qiao, Siyuan; Xie, Cihang; Shen, Wei; Wang, Bo; Yuille, Alan L. We propose a novel single shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16 based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.

Detecting Semantic Parts on Partially Occluded Objects

2017-09-04T00:00:00Z

Detecting Semantic Parts on Partially Occluded Objects Wang, Jianyu; Xie, Cihang; Zhang, Zhishuai; Zhu, Jun; Xie, Lingxi; Yuille, Alan L. In this paper, we address the task of detecting semantic parts on partially occluded objects. We consider a scenario where the model is trained using non-occluded images but tested on occluded images. The motivation is that there is an infinite number of occlusion patterns in the real world, which cannot be fully covered in the training data. So the models should be inherently robust and adaptive to occlusions instead of fitting / learning the occlusion patterns in the training data. Our approach detects semantic parts by accumulating the confidence of local visual cues. Specifically, the method uses a simple voting method, based on log-likelihood ratio tests and spatial constraints, to combine the evidence of local cues. These cues are called visual concepts, which are derived by clustering the internal states of deep networks. We evaluate our voting scheme on the VehicleSemanticPart dataset with dense part annotations. We randomly place two, three or four irrelevant objects onto the target object to generate testing images with various occlusions. Experiments show that our algorithm outperforms several competitors in semantic part detection when occlusions are present.

Constant Modulus Algorithms via Low-Rank Approximation

2018-04-12T00:00:00Z

Constant Modulus Algorithms via Low-Rank Approximation Adler, Amir; Wax, Mati We present a novel convex-optimization-based approach to the solutions of a family of problems involving constant modulus signals. The family of problems includes the constant modulus and the constrained constant modulus, as well as the modified constant modulus and the constrained modified constant modulus. The usefulness of the proposed solutions is demonstrated for the tasks of blind beamforming and blind multiuser detection. The performance of these solutions, as we demonstrate by simulated data, is superior to existing methods.

An analysis of training and generalization errors in shallow and deep networks

2018-02-20T00:00:00Z

An analysis of training and generalization errors in shallow and deep networks Mhaskar, Hrushikesh; Poggio, Tomaso An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we explain this phenomenon when each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that approximation by trigonometric polynomials is a “role model” for many other processes of approximation that have inspired many theoretical constructions also in the context of approximation by neural and RBF networks. In this paper, we argue that the maximum loss functional is necessary to measure the generalization error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error, and how much error to expect at which test data. An interesting feature of our new method is that the variance in the training data is no longer an insurmountable lower bound on the generalization error.

Theory of Intelligence with Forgetting: Mathematical Theorems Explaining Human Universal Forgetting using “Forgetting Neural Networks”

2017-12-05T00:00:00Z

Theory of Intelligence with Forgetting: Mathematical Theorems Explaining Human Universal Forgetting using “Forgetting Neural Networks” Cano-Córdoba, Felipe; Sarma, Sanjay; Subirana, Brian In [42] we suggested that any memory stored in the human/animal brain is forgotten following the Ebbinghaus curve – in this follow-on paper, we define a novel algebraic structure, a Forgetting Neural Network, as a simple mathematical model based on assuming parameters of a neuron in a neural network are forgotten using the Ebbinghaus forgetting curve. We model neural networks in Sobolev spaces using [35] as our departure point and demonstrate four novel theorems of Forgetting Neural Networks: theorem of non-instantaneous forgetting, theorem of universal forgetting, curse of forgetting theorem, and center of mass theorem. We also prove the novel decreasing inference theorem which we feel is relevant beyond Ebbinghaus forgetting: compositional deep neural networks cannot arbitrarily combine low level “features” – meaning only certain arrangements of features calculated in intermediate levels can show up in higher levels. This proof leads us to present the possibly most efficient representation of neural networks’ “minimal polynomial basis layer” (MPBL) since our basis construct can generate n polynomials of order m using only 2m + 1 + n neurons. As we briefly discuss in the conclusion, there are about 10 similarities between forgetting neural networks and human forgetting and our research elicits more questions than it answers and may have implications for neuroscience research including our understanding of how babies learn (or, perhaps, forget), including what we call the baby forgetting conjecture.

Theory of Deep Learning III: explaining the non-overfitting puzzle

2017-12-30T00:00:00Z

Theory of Deep Learning III: explaining the non-overfitting puzzle Poggio, Tomaso; Kawaguchi, Kenji; Liao, Qianli; Miranda, Brando; Rosasco, Lorenzo; Boix, Xavier; Hidary, Jack; Mhaskar, Hrushikesh THIS MEMO IS REPLACED BY CBMM MEMO 90 A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for “low noise” datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
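
The linear-network property that this abstract builds on can be checked numerically. The following is a minimal sketch, not code from the memo: it assumes a plain underdetermined least-squares problem, zero initialization, and a fixed step size, and all function names and the toy data are illustrative. Under those assumptions, gradient descent approaches the minimum norm interpolating solution, and stopping early yields a solution of smaller norm.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gradient_descent(X, y, lr=0.05, steps=5000):
    """Minimize 0.5 * ||Xw - y||^2 by gradient descent, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        residuals = [dot(w, row) - target for row, target in zip(X, y)]
        # Gradient is X^T (Xw - y).
        grad = [sum(r * row[j] for r, row in zip(residuals, X))
                for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def minimum_norm_solution(X, y):
    """Closed form X^T (X X^T)^{-1} y, written out for a two-row X only."""
    a, b, d = dot(X[0], X[0]), dot(X[0], X[1]), dot(X[1], X[1])
    det = a * d - b * b
    alpha0 = (d * y[0] - b * y[1]) / det
    alpha1 = (a * y[1] - b * y[0]) / det
    return [alpha0 * x0 + alpha1 * x1 for x0, x1 in zip(X[0], X[1])]


# Toy problem (assumed for illustration): 2 equations, 3 unknowns.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

w_gd = gradient_descent(X, y)
w_early = gradient_descent(X, y, steps=10)
w_min = minimum_norm_solution(X, y)

if __name__ == "__main__":
    print("gradient descent:", [round(v, 6) for v in w_gd])
    print("minimum norm    :", [round(v, 6) for v in w_min])
    print("squared norm after 10 steps:", round(dot(w_early, w_early), 4))
```

The early-stopped iterate has a smaller norm than the converged one, which is the sense in which the iteration count acts as a regularization parameter in this linear toy case.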

3D Object-Oriented Learning: An End-to-end Transformation-Disentangled 3D Representation

2017-12-31T00:00:00Z

3D Object-Oriented Learning: An End-to-end Transformation-Disentangled 3D Representation Liao, Qianli; Poggio, Tomaso We provide a more detailed explanation of the ideas behind a recent paper on “Object-Oriented Deep Learning” [1] and extend it to handle 3D inputs/outputs. Similar to [1], every layer of the system takes in a list of “objects/symbols”, processes it and outputs another list of objects/symbols. In this report, the properties of the objects/symbols are extended to contain 3D information, including 3D orientations (i.e., rotation quaternion or yaw, pitch and roll) and one extra coordinate dimension (z-axis or depth). The resultant model is a novel end-to-end interpretable 3D representation that systematically factors out common 3D transformations such as translation and 3D rotation. As first proposed by [1] and discussed in more detail in [2], it offers a “symbolic disentanglement” solution to the problem of transformation invariance/equivariance. To demonstrate the effectiveness of the model, we show that it can achieve perfect performance on the task of 3D invariant recognition by training on one rotation of a 3D object and testing on 3D rotations (i.e., at arbitrary angles of yaw, pitch and roll). Furthermore, in a more realistic case where depth information is not given (similar to viewpoint invariant object recognition from 2D vision) our model generalizes reasonably well to novel viewpoints while ConvNets fail to generalize.

Exact Equivariance, Disentanglement and Invariance of Transformations

2017-12-31T00:00:00Z

Exact Equivariance, Disentanglement and Invariance of Transformations Liao, Qianli; Poggio, Tomaso Invariance, equivariance and disentanglement of transformations are important topics in the field of representation learning. Previous models like Variational Autoencoder [1] and Generative Adversarial Networks [2] attempted to learn disentangled representations from data with varying levels of success. Convolutional Neural Networks are approximately equivariant and invariant (if pooling is performed) to input translations. In this report, we argue that the recently proposed Object-Oriented Learning framework [3] offers a new solution to the problem of Equivariance, Invariance and Disentanglement: it systematically factors out common transformations like translation and rotation in inputs and achieves “exact equivariance” to these transformations; that is, when the input is translated and/or rotated by some amount, the output and all intermediate representations of the network are also translated and rotated by exactly the same amount. The transformations are “exactly disentangled” in the sense that the translations and rotations can be read out directly from a few known variables of the system without any approximation. Invariance can be achieved by reading other variables that are known not to be affected by the transformations. No learning is needed to achieve these properties. Exact equivariance and disentanglement are useful properties that augment the expressive power of neural networks. We believe it will enable new applications including but not limited to precise visual localization of objects and measuring of motion and angles.

Object-Oriented Deep Learning

2017-10-31T00:00:00Z

Object-Oriented Deep Learning Liao, Qianli; Poggio, Tomaso We investigate an unconventional direction of research that aims at converting neural networks, a class of distributed, connectionist, sub-symbolic models into a symbolic level with the ultimate goal of achieving AI interpretability and safety. To that end, we propose Object-Oriented Deep Learning, a novel computational paradigm of deep learning that adopts interpretable “objects/symbols” as a basic representational atom instead of N-dimensional tensors (as in traditional “feature-oriented” deep learning). For visual processing, each “object/symbol” can explicitly package common properties of visual objects like its position, pose, scale, probability of being an object, pointers to parts, etc., providing a full spectrum of interpretable visual knowledge throughout all layers. It achieves a form of “symbolic disentanglement”, offering one solution to the important problem of disentangled representations and invariance. Basic computations of the network include predicting high-level objects and their properties from low-level objects and binding/aggregating relevant objects together. These computations operate at a more fundamental level than convolutions, capturing convolution as a special case while being significantly more general than it. All operations are executed in an input-driven fashion, thus sparsity and dynamic computation per sample are naturally supported, complementing recent popular ideas of dynamic networks and may enable new types of hardware accelerations. We experimentally show on CIFAR-10 that it can perform flexible visual processing, rivaling the performance of ConvNet, but without using any convolution. Furthermore, it can generalize to novel rotations of images that it was not trained for.

On the Forgetting of College Academics: at "Ebbinghaus Speed"?

2017-06-20T00:00:00Z

On the Forgetting of College Academics: at "Ebbinghaus Speed"? Subirana, Brian; Bagiati, Aikaterini; Sarma, Sanjay How important are Undergraduate College Academics after graduation? How much do we actually remember after we leave the college classroom, and for how long? Taking a look at major University ranking methodologies one can easily observe they consistently lack any objective measure of what content knowledge and skills students retain from college education in the long term. Is there any rigorous scholarly published evidence on retention of long-term unused academic content knowledge? We have found no such evidence based on a preliminary literature review. Furthermore, findings in all research papers reviewed in this study were consistent with the following assertion: the Ebbinghaus forgetting curve [Ebbinghaus 1880-1885] is a fundamental law of human nature – in fact, of the whole animal kingdom and applies to memory of all types: verbal, visual, abstract, social and autobiographical. This fundamental law of nature, when examined within the context of academic learning retention, manifests itself as an exponential curve halving memory saliency about every two years (what we call "Ebbinghaus Speed"). This paper presents the research group’s initial hypothesis and conjectures for college level education programming and curriculum development, suggestions for instructional design enhancing learning durability, as well as future research directions.
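
To make the "halving about every two years" claim concrete, here is a minimal sketch of an exponential retention curve. It is an illustration rather than the authors' model: the function name `retention` and the exact two-year half-life default are assumptions, since the abstract only says "about every two years".

```python
def retention(years_elapsed: float, half_life_years: float = 2.0) -> float:
    """Fraction of memory saliency left after `years_elapsed` years.

    Assumes pure exponential decay with a fixed half-life (hypothetical
    default of 2 years, taken from the abstract's "about every two years").
    """
    if years_elapsed < 0:
        raise ValueError("years_elapsed must be non-negative")
    return 0.5 ** (years_elapsed / half_life_years)


if __name__ == "__main__":
    for years in (0, 2, 4, 10):
        print(f"{years:>2} years after graduation: {retention(years):.3f}")
```

Under this assumed half-life, roughly 3% of the original saliency would remain ten years after the material was last used.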

Do Deep Neural Networks Suffer from Crowding?

2017-06-26T00:00:00Z

Do Deep Neural Networks Suffer from Crowding? Volokitin, Anna; Roig, Gemma; Poggio, Tomaso Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new version of DCNNs which 1) is multi-scale and 2) has convolution filters whose size changes depending on the eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers, if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks, when trained on targets in isolation, we find that recognition accuracy of the networks decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and flankers also plays a role and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding.

Symmetry Regularization

2017-05-26T00:00:00Z

Symmetry Regularization Anselmi, Fabio; Evangelopoulos, Georgios; Rosasco, Lorenzo; Poggio, Tomaso The properties of a representation, such as smoothness, adaptability, generality, equivariance/invariance, depend on restrictions imposed during learning. In this paper, we propose using data symmetries, in the sense of equivalences under transformations, as a means for learning symmetry-adapted representations, i.e., representations that are equivariant to transformations in the original space. We provide a sufficient condition to enforce the representation, for example the weights of a neural network layer or the atoms of a dictionary, to have a group structure and specifically the group structure in an unlabeled training set. By reducing the analysis of generic group symmetries to permutation symmetries, we devise an analytic expression for a regularization scheme and a permutation invariant metric on the representation space. Our work provides a proof of concept on why and how to learn equivariant representations, without explicit knowledge of the underlying symmetries in the data.

On the Robustness of Convolutional Neural Networks to Internal Architecture and Weight Perturbations

2017-04-03T00:00:00Z

On the Robustness of Convolutional Neural Networks to Internal Architecture and Weight Perturbations Cheney, Nicholas; Schrimpf, Martin; Kreiman, Gabriel Deep convolutional neural networks are generally regarded as robust function approximators. So far, this intuition is based on perturbations to external stimuli such as the images to be classified. Here we explore the robustness of convolutional neural networks to perturbations to the internal weights and architecture of the network itself. We show that convolutional networks are surprisingly robust to a number of internal perturbations in the higher convolutional layers but the bottom convolutional layers are much more fragile. For instance, Alexnet shows less than a 30% decrease in classification performance when randomly removing over 70% of weight connections in the top convolutional or dense layers but performance is almost at chance with the same perturbation in the first convolutional layer. Finally, we suggest further investigations which could continue to inform the robustness of convolutional networks to internal perturbations.

Musings on Deep Learning: Properties of SGD

2017-04-04T00:00:00Z

Musings on Deep Learning: Properties of SGD Zhang, Chiyuan; Liao, Qianli; Rakhlin, Alexander; Sridharan, Karthik; Miranda, Brando; Golowich, Noah; Poggio, Tomaso [previously titled "Theory of Deep Learning III: Generalization Properties of SGD"] In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.

Theory II: Landscape of the Empirical Risk in Deep Learning

2017-03-30T00:00:00Z

Theory II: Landscape of the Empirical Risk in Deep Learning Poggio, Tomaso; Liao, Qianli Previous theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least for the most successful Deep Convolutional Neural Networks (DCNNs) for visual processing, practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a small degree of "overparametrization". In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers -- in the case of classification -- have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released). We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of DCNN's empirical loss surface, which might not be as complicated as people commonly believe.

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

2017-03-01T00:00:00Z

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning Lotter, William; Kreiman, Gabriel; Cox, David While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning—leveraging unlabeled examples to learn about the structure of a domain — remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network (“PredNet”) architecture that is inspired by the concept of “predictive coding” from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.

Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets

2017-03-13T00:00:00Z

Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets Tacchetti, Andrea; Voinea, Stephen; Evangelopoulos, Georgios The complexity of a learning task is increased by transformations in the input space that preserve class identity. Visual object recognition for example is affected by changes in viewpoint, scale, illumination or planar transformations. While drastically altering the visual appearance, these changes are orthogonal to recognition and should not be reflected in the representation or feature encoding used for learning. We introduce a framework for weakly supervised learning of image embeddings that are robust to transformations and selective to the class distribution, using sets of transforming examples (orbit sets), deep parametrizations and a novel orbit-based loss. The proposed loss combines a discriminative, contrastive part for orbits with a reconstruction error that learns to rectify orbit transformations. The learned embeddings are evaluated in distance metric-based tasks, such as one-shot classification under geometric transformations, as well as face verification and retrieval under more realistic visual variability. Our results suggest that orbit sets, suitably computed or observed, can be used for efficient, weakly-supervised learning of semantically relevant image embeddings.

Full interpretation of minimal images

2017-02-08T00:00:00Z

Full interpretation of minimal images Ben-Yosef, Guy; Assif, Liav; Ullman, Shimon The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss implications of full interpretation to difficult visual tasks, such as recognizing human activities or interactions, which are beyond the scope of current models of visual recognition.

Learning Mid-Level Auditory Codes from Natural Sound Statistics

2017-01-25T00:00:00Z

Learning Mid-Level Auditory Codes from Natural Sound Statistics Mlynarski, Wiktor; McDermott, Josh Interaction with the world requires an organism to transform sensory signals into representations in which behaviorally meaningful properties of the environment are made explicit. These representations are derived through cascades of neuronal processing stages in which neurons at each stage recode the output of preceding stages. Explanations of sensory coding may thus involve understanding how low-level patterns are combined into more complex structures. Although models exist in the visual domain to explain how mid-level features such as junctions and curves might be derived from oriented filters in early visual cortex, little is known about analogous grouping principles for mid-level auditory representations. We propose a hierarchical generative model of natural sounds that learns combinations of spectrotemporal features from natural stimulus statistics. In the first layer the model forms a sparse convolutional code of spectrograms using a dictionary of learned spectrotemporal kernels. To generalize from specific kernel activation patterns, the second layer encodes patterns of time-varying magnitude of multiple first layer coefficients. Because second-layer features are sensitive to combinations of spectrotemporal features, the representation they support encodes more complex acoustic patterns than the first layer. When trained on corpora of speech and environmental sounds, some second-layer units learned to group spectrotemporal features that occur together in natural sounds. Others instantiate opponency between dissimilar sets of spectrotemporal features. Such groupings might be instantiated by neurons in the auditory cortex, providing a hypothesis for mid-level neuronal computation.

Measuring and modeling the perception of natural and unconstrained gaze in humans and machines

2016-11-28T00:00:00Z

Measuring and modeling the perception of natural and unconstrained gaze in humans and machines Harari, Daniel; Gao, Tao; Kanwisher, Nancy; Tenenbaum, Joshua; Ullman, Shimon Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sources of information for the underlying perceptual processes? These questions pose a challenge from both empirical and computational perspectives, due to the complexity of the visual input in real-life situations. Here we empirically measure human accuracy in perceiving the gaze direction of others in lifelike scenes, and study computationally the sources of information and representations underlying this cognitive capacity. We show that humans perform better in face-to-face conditions compared with recorded conditions, and that this advantage is not due to the availability of input dynamics. We further show that humans still perform well when only the eyes-region is visible, rather than the whole face. We develop a computational model, which replicates the pattern of human performance, including the finding that the eyes-region contains, on its own, the required information for estimating both head orientation and direction of gaze. Consistent with neurophysiological findings on task-specific face regions in the brain, the learned computational representations reproduce perceptual effects such as the Wollaston illusion, when trained to estimate direction of gaze, but not when trained to recognize objects or faces.

Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality?

2016-11-23T00:00:00Z

Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality? Poggio, Tomaso; Mhaskar, Hrushikesh; Rosasco, Lorenzo; Miranda, Brando; Liao, Qianli [formerly titled "Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality: a Review"] The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

Where do hypotheses come from?

2016-10-24T00:00:00Z

Where do hypotheses come from? Dasgupta, Ishita; Schulz, Eric; Gershman, Samuel J. Why are human inferences sometimes remarkably close to the Bayesian ideal and other times systematically biased? One notable instance of this discrepancy is that tasks where the candidate hypotheses are explicitly available result in close to rational inference over the hypothesis space, whereas tasks requiring the self-generation of hypotheses produce systematic deviations from rational inference. We propose that these deviations arise from algorithmic processes approximating Bayes' rule. Specifically in our account, hypotheses are generated stochastically from a sampling process, such that the sampled hypotheses form a Monte Carlo approximation of the posterior. While this approximation will converge to the true posterior in the limit of infinite samples, we take a small number of samples as we expect that the number of samples humans take is limited by time pressure and cognitive resource constraints. We show that this model recreates several well-documented experimental findings such as anchoring and adjustment, subadditivity, superadditivity, the crowd within as well as the self-generation effect, the weak evidence, and the dud alternative effects. Additionally, we confirm the model's prediction that superadditivity and subadditivity can be induced within the same paradigm by manipulating the unpacking and typicality of hypotheses, in 2 experiments.
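
The core idea, that a handful of sampled hypotheses gives a noisy Monte Carlo approximation of the posterior while many samples recover it, can be sketched as follows. This is a simplified illustration and not the paper's model: it assumes independent draws directly from a known posterior (the paper's sampling process may differ), and the hypothesis names and probabilities are made up for the example.

```python
import random


def sample_based_posterior(posterior, n_samples, rng):
    """Approximate a discrete posterior by the relative frequency of
    hypotheses in `n_samples` independent draws from it."""
    hypotheses = list(posterior)
    weights = [posterior[h] for h in hypotheses]
    draws = rng.choices(hypotheses, weights=weights, k=n_samples)
    return {h: draws.count(h) / n_samples for h in hypotheses}


if __name__ == "__main__":
    # Hypothetical posterior over three candidate explanations.
    true_posterior = {"flu": 0.6, "cold": 0.3, "allergy": 0.1}
    rng = random.Random(0)
    for n in (5, 50, 5000):
        estimate = sample_based_posterior(true_posterior, n, rng)
        print(n, {h: round(p, 3) for h, p in estimate.items()})
```

With only a few samples, low-probability hypotheses are often missing from the estimate altogether, which is one way a resource-limited sampler can deviate systematically from the Bayesian ideal.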

Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning

2016-10-19T00:00:00Z

Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning Liao, Qianli; Kawaguchi, Kenji; Poggio, Tomaso We systematically explored a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed, i.e., recurrent and convolutional) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization is well-performing, simple to implement, fast to compute, more biologically-plausible and thus ideal for GPU or hardware implementations.
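
As a rough illustration of normalizing by different orders of statistical moments, the sketch below centers a list of activations and divides by the p-th root of the mean p-th absolute moment. The abstract does not give the formula, so this exact form, the function name `lp_normalize`, and the `eps` stabilizer are assumptions; the memo's own definition may differ.

```python
def lp_normalize(values, p=1.0, eps=1e-8):
    """Center `values`, then divide by (mean(|x - mean|^p))^(1/p).

    Assumed form for illustration only: p = 2 gives standard-deviation
    scaling, p = 1 gives mean-absolute-deviation scaling.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    scale = (sum(abs(c) ** p for c in centered) / n) ** (1.0 / p)
    return [c / (scale + eps) for c in centered]


if __name__ == "__main__":
    activations = [1.0, 2.0, 3.0, 4.0, 10.0]
    print("L1:", [round(v, 3) for v in lp_normalize(activations, p=1)])
    print("L2:", [round(v, 3) for v in lp_normalize(activations, p=2)])
```

In this assumed form the p = 1 case needs only absolute values and a mean, with no squares or square roots, which is consistent with the abstract's remark that L1 normalization is simple and fast to compute.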

Anchoring and Agreement in Syntactic Annotations

2016-09-21T00:00:00Z

Anchoring and Agreement in Syntactic Annotations Berzak, Yevgeni; Huang, Yan; Barbu, Andrei; Korhonen, Anna; Katz, Boris Published in the Proceedings of EMNLP 2016 We present a study on two key characteristics of human syntactic annotations: anchoring and agreement. Anchoring is a well-known cognitive bias in human decision making, where judgments are drawn towards preexisting values. We study the influence of anchoring on a standard approach to creation of syntactic resources where syntactic annotations are obtained via human editing of tagger and parser output. Our experiments demonstrate a clear anchoring effect and reveal unwanted consequences, including overestimation of parsing performance and lower quality of annotations in comparison with human-based annotations. Using sentences from the Penn Treebank WSJ, we also report systematically obtained inter-annotator agreement estimates for English dependency parsing. Our agreement results control for parser bias, and are consequential in that they are on par with state of the art parsing performance for English newswire. We discuss the impact of our findings on strategies for future annotation efforts and parser evaluations.

Deep vs. shallow networks : An approximation theory perspective

2016-08-12T00:00:00Z

Deep vs. shallow networks : An approximation theory perspective Mhaskar, Hrushikesh; Poggio, Tomaso The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function – the ReLU function – used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

The infancy of the human brain

2015-10-07T00:00:00Z

The infancy of the human brain Dehaene-Lambertz, G.; Spelke, Elizabeth S. The human infant brain is the only known machine able to master a natural language and develop explicit, symbolic, and communicable systems of knowledge that deliver rich representations of the external world. With the emergence of non-invasive brain imaging, we now have access to the unique neural machinery underlying these early accomplishments. After describing early cognitive capacities in the domains of language and number, we review recent findings that underline the strong continuity between human infants’ and adults’ neural architecture, with notably early hemispheric asymmetries and involvement of frontal areas. Studies of the strengths and limitations of early learning, and of brain dynamics in relation to regional maturational stages, promise to yield a better understanding of the sources of human cognitive achievements.

Universal Dependencies for Learner English

2016-08-01T00:00:00Z

Universal Dependencies for Learner English Berzak, Yevgeni; Kenney, Jessica; Spadine, Carolyn; Wang, Jing Xian; Lam, Lucia; Mori, Keiko Sophie; Garza, Sebastian; Katz, Boris We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language.

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

2016-06-10T00:00:00Z

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities Berzak, Yevgeni; Barbu, Andrei; Harari, Daniel; Katz, Boris; Ullman, Shimon Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing sentences to be disambiguated in a unified fashion across the different ambiguity types.

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL

2016-06-05T00:00:00Z

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL Berzak, Yevgeni; Reichart, Roi; Katz, Boris This work examines the impact of crosslinguistic transfer on grammatical errors in English as a Second Language (ESL) texts. Using a computational framework that formalizes the theory of Contrastive Analysis (CA), we demonstrate that language specific error distributions in ESL writing can be predicted from the typological properties of the native language and their relation to the typology of English. Our typology driven model makes it possible to obtain accurate estimates of such distributions without access to any ESL data for the target languages. Furthermore, we present a strategy for adjusting our method to low-resource languages that lack typological documentation using a bootstrapping approach which approximates native language typology from ESL texts. Finally, we show that our framework is instrumental for linguistic inquiry seeking to identify first language factors that contribute to a wide range of difficulties in second language acquisition.

View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation

2016-06-03T00:00:00Z

View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation Leibo, Joel Z.; Liao, Qianli; Freiwald, Winrich; Anselmi, Fabio; Poggio, Tomaso The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth-rotations [33, 32, 23, 13]. Current computational models of object recognition, including recent deep learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cells operations [46, 8, 44, 29]. While simulations of these models recapitulate the ventral stream’s progression from early view-specific to late view-tolerant representations, they fail to generate the most salient property of the intermediate representation for faces found in the brain: mirror-symmetric tuning of the neural population to head orientation [16]. Here we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules can provide approximate invariance at the top level of the network. While most of the learning rules do not yield mirror-symmetry in the mid-level representations, we characterize a specific biologically-plausible Hebb-type learning rule that is guaranteed to generate mirror-symmetric tuning to faces at intermediate levels of the architecture.

Probing the compositionality of intuitive functions

2016-05-26T00:00:00Z

Probing the compositionality of intuitive functions Schulz, Eric; Tenenbaum, Joshua B.; Duvenaud, David; Speekenbrink, Maarten; Gershman, Samuel J. How do people learn about complex functional structure? Taking inspiration from other areas of cognitive science, we propose that this is accomplished by harnessing compositionality: complex structure is decomposed into simpler building blocks. We formalize this idea within the framework of Bayesian regression using a grammar over Gaussian process kernels. We show that participants prefer compositional over non-compositional function extrapolations, that samples from the human prior over functions are best described by a compositional model, and that people perceive compositional functions as more predictable than their non-compositional but otherwise similar counterparts. We argue that the compositional nature of intuitive functions is consistent with broad principles of human cognition.

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

2016-04-12T00:00:00Z

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex Liao, Qianli; Poggio, Tomaso We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such a RNN, although having orders of magnitude fewer parameters, leads to a performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.
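The equivalence noted in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the memo's architecture: the residual update `h + relu(W h)` is an assumed simple block (real ResNet blocks typically add convolutions and normalization), and all function names and the small weight matrix are hypothetical.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(W, h):
    """One assumed residual update: h + relu(W h)."""
    return [x + y for x, y in zip(h, relu(matvec(W, h)))]

def resnet_forward(weights, h):
    """Deep ResNet: one residual block per entry of `weights`."""
    for W in weights:
        h = residual_block(W, h)
    return h

def rnn_forward(W, h, steps):
    """Shallow recurrent unit: the same residual update applied `steps` times."""
    for _ in range(steps):
        h = residual_block(W, h)
    return h

W = [[0.5, -0.2], [0.1, 0.3]]  # hypothetical shared weights
h0 = [1.0, 2.0]
depth = 4

deep_shared = resnet_forward([W] * depth, h0)  # deep ResNet with weight sharing
shallow_rnn = rnn_forward(W, h0, depth)        # one recurrent layer unrolled in time
print(deep_shared == shallow_rnn)
```

The ResNet stores `depth` weight matrices in general, while the recurrent version stores one; when the matrices are tied, the two computations coincide step by step.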

Building machines that learn and think like people

2016-04-01T00:00:00Z

Building machines that learn and think like people Lake, Brenden M.; Ullman, Tomer D.; Tenenbaum, Joshua B.; Gershman, Samuel J. Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.

Learning Real and Boolean Functions: When Is Deep Better Than Shallow

2016-03-08T00:00:00Z

Learning Real and Boolean Functions: When Is Deep Better Than Shallow Mhaskar, Hrushikesh; Liao, Qianli; Poggio, Tomaso We describe computational tasks - especially in vision - that correspond to compositional/hierarchical functions. While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower VC-dimension as well as the number of training parameters. This leads to the question of approximation by sparse polynomials (in the number of independent parameters) and, as a consequence, by deep networks. We also discuss connections between our results and learnability of sparse Boolean functions, settling an old conjecture by Bengio.

Foveation-based Mechanisms Alleviate Adversarial Examples

2016-01-19T00:00:00Z

Foveation-based Mechanisms Alleviate Adversarial Examples Lou, Yan; Boix, Xavier; Roig, Gemma; Poggio, Tomaso; Zhao, Qi We show that adversarial examples, i.e., the visually imperceptible perturbations that cause Convolutional Neural Networks (CNNs) to fail, can be alleviated with a mechanism based on foveations, that is, applying the CNN in different image regions. To see this, first, we report results in ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation.

Fast, invariant representation for human action in the visual system

2016-01-06T00:00:00Z

Fast, invariant representation for human action in the visual system Isik, Leyla; Tacchetti, Andrea; Poggio, Tomaso The ability to recognize the actions of others from visual input is essential to humans' daily lives. The neural computations underlying action recognition, however, are still poorly understood. We use magnetoencephalography (MEG) decoding and a computational model to study action recognition from a novel dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by five actors at five viewpoints. We show for the first time that actor- and view-invariant representations for action arise in the human brain as early as 200 ms. We next extend a class of biologically inspired hierarchical computational models of object recognition to recognize actions from videos and explain the computations underlying our MEG findings. This model achieves 3D viewpoint-invariance by the same biologically inspired computational mechanism it uses to build invariance to position and scale. These results suggest that robustness to complex transformations, such as 3D viewpoint invariance, does not require special neural architectures, and further provide a mechanistic explanation of the computations driving invariant action recognition.

How Important is Weight Symmetry in Backpropagation?

2015-11-29T00:00:00Z

How Important is Weight Symmetry in Backpropagation? Liao, Qianli; Leibo, Joel Z.; Poggio, Tomaso Gradient backpropagation (BP) requires symmetric feedforward and feedback connections: the same weights must be used for forward and backward passes. This “weight transport problem” [1] is thought to be one of the main reasons for BP’s biological implausibility. Using 15 different classification datasets, we systematically study to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.’s demonstration [2] but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter: the more concordant signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100% concordant signs, we were able to achieve the same or even better performance than SGD; (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) [3] and/or a “Batch Manhattan” (BM) update rule.
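A feedback matrix of the kind described in the abstract above (random magnitudes, signs fully concordant with the feedforward weights) might be built as in the following sketch. The uniform magnitude range, the function names, and the example weights are assumptions made for illustration; the normalizations the abstract calls indispensable (Batch Normalization, Batch Manhattan) are not shown.

```python
import random

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_concordant_feedback(W, rng):
    """Build B with the shape of W transposed: random magnitudes, signs copied from W."""
    rows, cols = len(W), len(W[0])
    return [[sign(W[i][j]) * rng.uniform(0.1, 1.0) for i in range(rows)]
            for j in range(cols)]

def backward_signal(B, delta):
    """Propagate an output error delta to the previous layer using B in place of W^T."""
    return [sum(b * d for b, d in zip(row, delta)) for row in B]

rng = random.Random(0)
W = [[0.4, -1.2, 0.7], [-0.3, 0.9, -0.5]]  # hypothetical layer: 2 outputs, 3 inputs
B = sign_concordant_feedback(W, rng)       # 3 x 2 feedback matrix
delta = [1.0, -2.0]                        # hypothetical output error
print(backward_signal(B, delta))
```

Only the backward pass changes: the forward pass still uses `W`, while the error is routed through `B`, so the exact feedforward magnitudes never need to be transported to the feedback path.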

Group Invariant Deep Representations for Image Instance Retrieval

2016-01-11T00:00:00Z

Group Invariant Deep Representations for Image Instance Retrieval Morère, Olivier; Veillard, Antoine; Lin, Jie; Petta, Julie; Chandrasekhar, Vijay; Poggio, Tomaso Most image instance retrieval pipelines are based on comparison of vectors known as global image descriptors between a query image and the database images. Due to their success in large scale image classification, representations extracted from Convolutional Neural Networks (CNN) are quickly gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors for image instance retrieval. While CNN-based descriptors are generally remarked for good retrieval performance at lower bitrates, they nevertheless present a number of drawbacks including the lack of robustness to common object transformations such as rotations compared with their interest point based FV counterparts. In this paper, we propose a method for computing invariant global descriptors from CNNs. Our method implements a recently proposed mathematical theory for invariance in a sensory cortex modeled as a feedforward neural network. The resulting global descriptors can be made invariant to multiple arbitrary transformation groups while retaining good discriminativeness. Based on a thorough empirical evaluation using several publicly available datasets, we show that our method is able to significantly and consistently improve retrieval results every time a new type of invariance is incorporated. We also show that our method, which has few parameters, is not prone to overfitting: improvements generalize well across datasets with different properties with regard to invariances. Finally, we show that our descriptors are able to compare favourably to other state-of-the-art compact descriptors in similar bitranges, exceeding the highest retrieval results reported in the literature on some datasets. A dedicated dimensionality reduction step (quantization or hashing) may be able to further improve the competitiveness of the descriptors.

I-theory on depth vs width: hierarchical function composition

2015-12-29T00:00:00Z

I-theory on depth vs width: hierarchical function composition Poggio, Tomaso; Anselmi, Fabio; Rosasco, Lorenzo Deep learning networks with convolution, pooling and subsampling are a special case of hierarchical architectures, which can be represented by trees (such as binary trees). Hierarchical as well as shallow networks can approximate functions of several variables, in particular those that are compositions of low dimensional functions. We show that the power of a deep network architecture with respect to a shallow network is rather independent of the specific nonlinear operations in the network and depends instead on the behavior of the VC-dimension. A shallow network can approximate compositional functions with the same error as a deep network but at the cost of a VC-dimension that is exponential instead of quadratic in the dimensionality of the function. To complete the argument we argue that there exist visual computations that are intrinsically compositional. In particular, we prove that recognition invariant to translation cannot be computed by shallow networks in the presence of clutter. Finally, a general framework that includes the compositional case is sketched. The key condition that allows tall, thin networks to be nicer than short, fat networks is that the target input-output function must be sparse in a certain technical sense.

UNSUPERVISED LEARNING OF VISUAL STRUCTURE USING PREDICTIVE GENERATIVE NETWORKS

2015-12-15T00:00:00Z

UNSUPERVISED LEARNING OF VISUAL STRUCTURE USING PREDICTIVE GENERATIVE NETWORKS Lotter, William; Kreiman, Gabriel; Cox, David The ability to predict future states of the environment is a central pillar of intelligence. At its core, effective prediction requires an internal model of the world and an understanding of the rules by which the world changes. Here, we explore the internal models developed by deep neural networks trained using a loss based on predicting future frames in synthetic video sequences, using an Encoder-Recurrent-Decoder framework (Fragkiadaki et al., 2015). We first show that this architecture can achieve excellent performance in visual sequence prediction tasks, including state-of-the-art performance in a standard “bouncing balls” dataset (Sutskever et al., 2009). We then train on clips of out-of-the-plane rotations of computer-generated faces, using both mean-squared error and a generative adversarial loss (Goodfellow et al., 2014), extending the latter to a recurrent, conditional setting. Despite being trained end-to-end to predict only pixel-level information, our Predictive Generative Networks learn a representation of the latent variables of the underlying generative process. Importantly, we find that this representation is naturally tolerant to object transformations, and generalizes well to new tasks, such as classification of static images. Similar models trained solely with a reconstruction loss fail to generalize as effectively. We argue that prediction can serve as a powerful unsupervised loss for learning rich internal representations of high-level object features.

Holographic Embeddings of Knowledge Graphs

2015-11-16T00:00:00Z

Holographic Embeddings of Knowledge Graphs Nickel, Maximilian; Rosasco, Lorenzo; Poggio, Tomaso Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
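The circular correlation operator mentioned in the abstract above can be written directly from its definition, [a ⋆ b]_k = Σ_i a_i · b_((i+k) mod d). The sketch below is illustrative only: the scoring function (a sigmoid of the dot product between a relation vector and the composed pair) is stated here as an assumption rather than taken from the abstract, and the embedding values are hypothetical.

```python
import math

def circular_correlation(a, b):
    """Circular correlation: result[k] = sum_i a[i] * b[(i + k) mod d]."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]

def hole_score(relation, subject, obj):
    """Assumed scoring form: sigmoid(relation . (subject correlated with obj))."""
    composed = circular_correlation(subject, obj)
    logit = sum(r * c for r, c in zip(relation, composed))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 3-dimensional embeddings, for illustration only.
e_s = [1.0, 0.0, 2.0]
e_o = [3.0, 1.0, 0.0]
r = [0.1, -0.2, 0.05]

composed = circular_correlation(e_s, e_o)
print(composed)
print(circular_correlation(e_o, e_s))
print(hole_score(r, e_s, e_o))
```

For these vectors the composition is `[3.0, 7.0, 2.0]` in one order and `[3.0, 2.0, 7.0]` in the other, which shows that the operator is not commutative and can therefore distinguish the subject and object roles. The composed vector has the same dimension as its inputs, and this direct form costs O(d²) operations; an FFT-based computation would reduce that to O(d log d).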

Predicting Actions Before They Occur

2015-10-26T00:00:00Z

Predicting Actions Before They Occur Vaziri-Pashkam, Maryam; Cormiea, Sarah; Nakayama, Ken Humans are experts at reading others’ actions in social contexts. They efficiently process others’ movements in real-time to predict intended goals. Here we designed a two-person reaching task to investigate real-time body reading in a naturalistic setting. Two subjects faced each other separated by a plexiglass screen. One (Attacker) was instructed to tap one of two targets on the screen and the other (Blocker) was told to tap the same target as quickly as possible. Reaction times were fast, much faster than reaction times to a dot projected on the screen moving in the same manner. This suggests Blockers use subtle preparatory movements of Attackers to predict their goal. Next, using video recordings of an Attacker, we showed that removing the preparatory cues slows reaction times and changing them could trick the Blockers into choosing the wrong target. We then occluded various body parts of the Attacker and showed that reaction times slow down only when most of the body of the Attacker is occluded. This suggests that preparatory cues are distributed over the body of the Attacker. We saw no evidence of learning during the experiment as reaction times remained constant over the duration of the session. Taken together, these results suggest that in social contexts humans are able to use their knowledge of the biomechanical constraints on the human body to efficiently process preparatory cues from the body of their interaction partner in order to predict their intentions well before movement begins.

Notes on Hierarchical Splines, DCLNs and i-theory

2015-09-29T00:00:00Z

Notes on Hierarchical Splines, DCLNs and i-theory Poggio, Tomaso; Rosasco, Lorenzo; Shashua, Amnon; Cohen, Nadav; Anselmi, Fabio We define an extension of classical additive splines for multivariate function approximation that we call hierarchical splines. We show that the case of hierarchical, additive, piece-wise linear splines includes present-day Deep Convolutional Learning Networks (DCLNs) with linear rectifiers and pooling (sum or max). We discuss how these observations together with i-theory may provide a framework for a general theory of deep networks.

Deep Convolutional Networks are Hierarchical Kernel Machines

2015-08-05T00:00:00Z

Deep Convolutional Networks are Hierarchical Kernel Machines Anselmi, Fabio; Rosasco, Lorenzo; Tan, Cheston; Poggio, Tomaso We extend i-theory to incorporate not only pooling but also rectifying nonlinearities in an extended HW module (eHW) designed for supervised learning. The two operations roughly correspond to invariance and selectivity, respectively. Under the assumption of normalized inputs, we show that appropriate linear combinations of rectifying nonlinearities are equivalent to radial kernels. If pooling is present, an equivalent kernel also exists. Thus present-day DCNs (Deep Convolutional Networks) can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of eHW modules minimize memory requirements while computing a selective and invariant representation.

Parsing Occluded People by Flexible Compositions

2015-06-01T00:00:00Z

Parsing Occluded People by Flexible Compositions Chen, Xianjie; Yuille, Alan L. This paper presents an approach to parsing humans when there is significant occlusion. We model humans using a graphical model which has a tree structure building on recent work [32, 6] and exploit the connectivity prior that, even in presence of occlusion, the visible nodes form a connected subtree of the graphical model. We call each connected subtree a flexible composition of object parts. This involves a novel method for learning occlusion cues. During inference we need to search over a mixture of different flexible models. By exploiting part sharing, we show that this inference can be done extremely efficiently requiring only twice as many computations as searching for the entire object (i.e., not modeling occlusion). We evaluate our model on the standard benchmarked “We Are Family” Stickmen dataset and obtain significant performance improvements over the best alternative algorithms.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

2015-05-07T00:00:00Z

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Huang, Zhiheng; Yuille, Alan L. In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Semantic Part Segmentation using Compositional Model combining Shape and Appearance

2015-06-08T00:00:00Z

Semantic Part Segmentation using Compositional Model combining Shape and Appearance Wang, Jianyu; Yuille, Alan L. In this paper, we study the problem of semantic part segmentation for animals. This is more challenging than standard object detection, object segmentation and pose estimation tasks because semantic parts of animals often have similar appearance and highly varying shapes. To tackle these challenges, we build a mixture of compositional models to represent the object boundary and the boundaries of semantic parts. And we incorporate edge, appearance, and semantic part cues into the compositional model. Given part-level segmentation annotation, we develop a novel algorithm to learn a mixture of compositional models under various poses and viewpoints for certain animal classes. Furthermore, a linear complexity algorithm is offered for efficient inference of the compositional model using dynamic programming. We evaluate our method for horse and cow using a newly annotated dataset on Pascal VOC 2010 which has pixelwise part labels. Experimental results demonstrate the effectiveness of our method.

Complexity of Representation and Inference in Compositional Models with Part Sharing

2015-05-05T00:00:00Z

Complexity of Representation and Inference in Compositional Models with Part Sharing Yuille, Alan L.; Mottaghi, Roozbeh This paper performs a complexity analysis of a class of serial and parallel compositional models of multiple objects and shows that they enable efficient representation and rapid inference. Compositional models are generative and represent objects in a hierarchically distributed manner in terms of parts and subparts, which are constructed recursively by part-subpart compositions. Parts are represented more coarsely at higher levels of the hierarchy, so that the upper levels give coarse summary descriptions (e.g., there is a horse in the image) while the lower levels represent the details (e.g., the positions of the legs of the horse). This hierarchically distributed representation obeys the executive summary principle, meaning that a high level executive only requires a coarse summary description and can, if necessary, get more details by consulting lower level executives. The parts and subparts are organized in terms of hierarchical dictionaries which enable part sharing between different objects, allowing efficient representation of many objects. The first main contribution of this paper is to show that compositional models can be mapped onto a parallel visual architecture similar to that used by bio-inspired visual models such as deep convolutional networks but more explicit in terms of representation, hence enabling part detection as well as object detection, and suitable for complexity analysis. Inference algorithms can be run on this architecture to exploit the gains caused by part sharing and executive summary. Effectively, this compositional architecture enables us to perform exact inference simultaneously over a large class of generative models of objects. The second contribution is an analysis of the complexity of compositional models in terms of computation time (for serial computers) and numbers of nodes (e.g., “neurons”) for parallel computers. In particular, we compute the complexity gains by part sharing and executive summary and their dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level of the hierarchy, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can enable linear processing on parallel computers.

Towards a Programmer’s Apprentice (Again)

2015-04-03T00:00:00Z

Towards a Programmer’s Apprentice (Again) Shrobe, Howard; Katz, Boris; Davis, Randall Programmers are loath to interrupt their workflow to document their design rationale, leading to frequent errors when software is modified, often much later and by different programmers. A Programmer’s Assistant could interact with the programmer to capture and preserve design rationale, in a natural way that would make rationale capture “cost less than it’s worth”, and could also detect common flaws in program design. Such a programmer’s assistant was not practical when it was first proposed decades ago, but advances over the years make now the time to revisit the concept, as our prototype shows.

On Invariance and Selectivity in Representation Learning

2015-03-23T00:00:00Z

On Invariance and Selectivity in Representation Learning Anselmi, Fabio; Rosasco, Lorenzo; Poggio, Tomaso We discuss data representations that can be learned automatically from data, are invariant to transformations, and are at the same time selective, in the sense that two points have the same representation only if one is a transformation of the other. The mathematical results here sharpen some of the key claims of i-theory, a recent theory of feedforward processing in sensory cortex.

A Review of Relational Machine Learning for Knowledge Graphs

2015-03-23T00:00:00Z

A Review of Relational Machine Learning for Knowledge Graphs Nickel, Maximilian; Murphy, Kevin; Tresp, Volker; Gabrilovich, Evgeniy Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.

A Nonparametric Bayesian Approach to Uncovering Rat Hippocampal Population Codes During Spatial Navigation

2014-12-01T00:00:00Z

A Nonparametric Bayesian Approach to Uncovering Rat Hippocampal Population Codes During Spatial Navigation Linderman, Scott W.; Johnson, Matthew J.; Wilson, Matthew A.; Chen, Zhe Rodent hippocampal population codes represent important spatial information about the environment during navigation. Several computational methods have been developed to uncover the neural representation of spatial topology embedded in rodent hippocampal ensemble spike activity. Here we extend our previous work and propose a nonparametric Bayesian approach to infer rat hippocampal population codes during spatial navigation. To tackle the model selection problem, we leverage a nonparametric Bayesian model. Specifically, to analyze rat hippocampal ensemble spiking activity, we apply a hierarchical Dirichlet process-hidden Markov model (HDP-HMM) using two Bayesian inference methods, one based on Markov chain Monte Carlo (MCMC) and the other based on variational Bayes (VB). We demonstrate the effectiveness of our Bayesian approaches on recordings from a freely-behaving rat navigating in an open field environment. We find that MCMC-based inference with Hamiltonian Monte Carlo (HMC) hyperparameter sampling is flexible and efficient, and outperforms VB and MCMC approaches with hyperparameters set by empirical Bayes. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.

Representation Learning in Sensory Cortex: a theory

2014-11-14T00:00:00Z

Representation Learning in Sensory Cortex: a theory Anselmi, Fabio; Poggio, Tomaso We review and apply a computational theory of the feedforward path of the ventral stream in visual cortex based on the hypothesis that its main function is the encoding of invariant representations of images. A key justification of the theory is provided by a theorem linking invariant representations to small sample complexity for recognition – that is, invariant representations allow learning from very few labeled examples. The theory characterizes how an algorithm that can be implemented by a set of “simple” and “complex” cells – an “HW module” – provides invariant and selective representations. The invariance can be learned in an unsupervised way from observed transformations. Theorems show that invariance implies several properties of the ventral stream organization, including the eccentricity dependent lattice of units in the retina and in V1, and the tuning of its neurons. The theory requires two stages of processing: the first, consisting of retinotopic visual areas such as V1, V2 and V4 with generic neuronal tuning, leads to representations that are invariant to translation and scaling; the second, consisting of modules in IT, with class- and object-specific tuning, provides a representation for recognition with approximate invariance to class specific transformations, such as pose (of a body, of a face) and expression. In the theory the main function of the ventral stream is the unsupervised learning of “good” representations that reduce the sample complexity of the final supervised learning stage.

When Computer Vision Gazes at Cognition

2014-12-12T00:00:00Z

When Computer Vision Gazes at Cognition Gao, Tao; Harari, Daniel; Tenenbaum, Joshua; Ullman, Shimon Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about the human ability to discriminate a third person's gaze directed towards objects that are further away, especially in unconstrained cases where the looker can move her head and eyes freely. In this paper we address this question by jointly exploring human psychophysics and a cognitively motivated computer vision model, which can detect the 3D direction of gaze from 2D face images. The synthesis of behavioral study and computer vision yields several interesting discoveries. (1) Human accuracy of discriminating targets 8°-10° of visual angle apart is around 40% in a free looking gaze task; (2) The ability to interpret the gaze of different lookers varies dramatically; (3) This variance can be captured by the computational model; (4) Humans outperform the current model significantly. These results collectively show that the acuity of human joint attention is indeed highly impressive, given the computational challenge of the natural looking task. Moreover, the gap between human and model performance, as well as the variability of gaze interpretation across different lookers, requires further understanding of the underlying mechanisms utilized by humans for this challenging task.

Abstracts of the 2014 Brains, Minds, and Machines Summer School

2014-09-26T00:00:00Z

Abstracts of the 2014 Brains, Minds, and Machines Summer School Amir, Nadav; Besold, Tarek R.; Camoriano, Rafaello; Erdogan, Goker; Flynn, Thomas; Gillary, Grant; Gomez, Jesse; Herbert-Voss, Ariel; Hotan, Gladia; Kadmon, Jonathan; Linderman, Scott W.; Liu, Tina T.; Marantan, Andrew; Olson, Joseph; Orchard, Garrick; Pal, Dipan K.; Pasquale, Giulia; Sanders, Honi; Silberer, Carina; Smith, Kevin A.; de Brito, Carols Stein N.; Suchow, Jordan W.; Tessler, M. H.; Viejo, Guillaume; Walker, Drew; Wehbe, Leila A compilation of abstracts from the student projects of the 2014 Brains, Minds, and Machines Summer School, held at Woods Hole Marine Biological Lab, May 29 - June 12, 2014.

Unsupervised learning of clutter-resistant visual representations from natural videos

2015-04-27T00:00:00Z

Unsupervised learning of clutter-resistant visual representations from natural videos Liao, Qianli; Leibo, Joel Z; Poggio, Tomaso Populations of neurons in inferotemporal cortex (IT) maintain an explicit code for object identity that also tolerates transformations of object appearance e.g., position, scale, viewing angle [1, 2, 3]. Though the learning rules are not known, recent results [4, 5, 6] suggest the operation of an unsupervised temporal-association-based method e.g., Foldiak’s trace rule [7]. Such methods exploit the temporal continuity of the visual world by assuming that visual experience over short timescales will tend to have invariant identity content. Thus, by associating representations of frames from nearby times, a representation that tolerates whatever transformations occurred in the video may be achieved. Many previous studies verified that such rules can work in simple situations without background clutter, but the presence of visual clutter has remained problematic for this approach. Here we show that temporal association based on large class-specific filters (templates) avoids the problem of clutter. Our system learns in an unsupervised way from natural videos gathered from the internet, and is able to perform a difficult unconstrained face recognition task on natural images (Labeled Faces in the Wild [8]).

Learning An Invariant Speech Representation

2014-06-15T00:00:00Z

Learning An Invariant Speech Representation Evangelopoulos, Georgios; Voinea, Stephen; Zhang, Chiyuan; Rosasco, Lorenzo; Poggio, Tomaso Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.
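The orbit-projection step described in this abstract can be illustrated with a minimal toy sketch. It assumes a 1-D signal and uses cyclic shifts as the transformation group, which is a stand-in chosen for illustration and not the acoustic transformations studied in the memo. The function names (`orbit`, `signature`), the bin count, and the random templates are all hypothetical.

```python
import numpy as np

def orbit(template):
    """All cyclic shifts of a 1-D template (toy transformation group, an assumption)."""
    return np.stack([np.roll(template, s) for s in range(len(template))])

def signature(x, templates, bins=8):
    """Concatenate, per template, the histogram of projections of x onto the template orbit."""
    x = x / np.linalg.norm(x)
    parts = []
    for t in templates:
        t = t / np.linalg.norm(t)
        projections = orbit(t) @ x  # one dot product per transformed template
        # Unit-norm vectors keep every projection inside [-1, 1].
        hist, _ = np.histogram(projections, bins=bins, range=(-1.0, 1.0))
        parts.append(hist / hist.sum())  # empirical 1-D probability distribution
    return np.concatenate(parts)

rng = np.random.default_rng(0)
templates = rng.standard_normal((4, 16))  # hypothetical stored templates
x = rng.standard_normal(16)               # hypothetical input segment

sig_x = signature(x, templates)
sig_shifted = signature(np.roll(x, 5), templates)
```

Shifting the input only permutes the set of projections onto each orbit, so the two histograms agree. That is the sense in which the representation is invariant to the group used to build the orbits.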

Neural tuning size is a key factor underlying holistic face processing

2014-06-14T00:00:00Z

Neural tuning size is a key factor underlying holistic face processing Tan, Cheston; Poggio, Tomaso Faces are a class of visual stimuli with unique significance, for a variety of reasons. They are ubiquitous throughout the course of a person’s life, and face recognition is crucial for daily social interaction. Faces are also unlike any other stimulus class in terms of certain physical stimulus characteristics. Furthermore, faces have been empirically found to elicit certain characteristic behavioral phenomena, which are widely held to be evidence of “holistic” processing of faces. However, little is known about the neural mechanisms underlying such holistic face processing. In other words, for the processing of faces by the primate visual system, the input and output characteristics are relatively well known, but the internal neural computations are not. The main aim of this work is to further the fundamental understanding of what causes the visual processing of faces to be different from that of objects. In this computational modeling work, we show that a single factor – “neural tuning size” – is able to account for three key phenomena that are characteristic of face processing, namely the Composite Face Effect (CFE), Face Inversion Effect (FIE) and Whole-Part Effect (WPE). Our computational proof-of-principle provides specific neural tuning properties that correspond to the poorly-understood notion of holistic face processing, and connects these neural properties to psychophysical behavior. Overall, our work provides a unified and parsimonious theoretical account for the disparate empirical data on face-specific processing, deepening the fundamental understanding of face processing.

Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding

2014-06-15T00:00:00Z

Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding Mottaghi, Roozbeh; Fidler, Sanja; Yuille, Alan L.; Urtasun, Raquel; Parikh, Devi Recent trends in image understanding have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we “plug-in” human subjects for each of the various components in a state-of-the-art conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve scene understanding by focusing research efforts on various individual tasks.

The Genesis Story Understanding and Story Telling System A 21st Century Step toward Artificial Intelligence

2014-06-10T00:00:00Z

The Genesis Story Understanding and Story Telling System A 21st Century Step toward Artificial Intelligence Winston, Patrick Henry Story understanding is an important differentiator of human intelligence, perhaps the most important differentiator. The Genesis system was built to model and explore aspects of story understanding using simply expressed, 20-100 sentence stories drawn from sources ranging from fairy tales to Shakespeare’s plays. I describe Genesis at work as it reflects on its reading, searching for concepts, reads stories with controllable allegiances and cultural biases, models personality traits, answers basic questions about why and when, notes concept onsets, anticipating trouble, calculates similarity using concepts, models question-driven interpretation, aligns similar stories for analogical reasoning, develops summaries, and tells and persuades using a reader model. I conclude with thoughts on how Genesis would describe people in pictures and video, thus engaging with the CBMM challenge problem.

Parsing Semantic Parts of Cars Using Graphical Models and Segment Appearance Consistency

2014-06-13T00:00:00Z

Parsing Semantic Parts of Cars Using Graphical Models and Segment Appearance Consistency Lu, Wenhao; Lian, Xiaochen; Yuille, Alan L. This paper addresses the problem of semantic part parsing (segmentation) of cars, i.e., assigning every pixel within the car to one of the parts (e.g., body, window, lights, license plates and wheels). We formulate this as a landmark identification problem, where a set of landmarks specifies the boundaries of the parts. A novel mixture of graphical models is proposed, which dynamically couples the landmarks to a hierarchy of segments. When modeling pairwise relations between landmarks, this coupling enables our model to exploit the local image contents in addition to spatial deformation, an aspect that most existing graphical models ignore. In particular, our model enforces appearance consistency between segments within the same part. Parsing the car, including finding the optimal coupling between landmarks and segments in the hierarchy, is performed by dynamic programming. We evaluate our method on a subset of PASCAL VOC 2010 car images and on the car subset of 3D Object Category dataset (CAR3D). We show good results and, in particular, quantify the effectiveness of using the segment appearance consistency in terms of accuracy of part localization and segmentation.

Computational role of eccentricity dependent cortical magnification

2014-06-06T00:00:00Z

Computational role of eccentricity dependent cortical magnification Poggio, Tomaso; Mutch, Jim; Isik, Leyla We develop a sampling extension of M-theory focused on invariance to scale and translation. Quite surprisingly, the theory predicts an architecture of early vision with increasing receptive field sizes and a high resolution fovea — in agreement with data about the cortical magnification factor, V1 and the retina. From the slope of the inverse of the magnification factor, M-theory predicts a cortical “fovea” in V1 in the order of 40 by 40 basic units at each receptive field size — corresponding to a foveola of size around 26 minutes of arc at the highest resolution, ≈6 degrees at the lowest resolution. It also predicts uniform scale invariance over a fixed range of scales independently of eccentricity, while translation invariance should depend linearly on spatial frequency. Bouma’s law of crowding follows in the theory as an effect of cortical area-by-cortical area pooling; the Bouma constant is the value expected if the signature responsible for recognition in the crowding experiments originates in V2. From a broader perspective, the emerging picture suggests that visual recognition under natural conditions takes place by composing information from a set of fixations, with each fixation providing recognition from a space-scale image fragment — that is an image patch represented at a set of increasing sizes and decreasing resolutions.

Simultaneous whole‐animal 3D imaging of neuronal activity using light‐field microscopy

2014-05-18T00:00:00Z

Simultaneous whole‐animal 3D imaging of neuronal activity using light‐field microscopy Prevedel, Robert; Yoon, Young-Gyu; Hoffmann, Maximilian; Pak, Nikita; Wetzstein, Gordon; Kato, Saul; Schrödel, Tina; Raskar, Ramesh; Zimmer, Manuel; Boyden, Edward S.; Vaziri, Alipasha High-speed, large-scale three-dimensional (3D) imaging of neuronal activity poses a major challenge in neuroscience. Here we demonstrate simultaneous functional imaging of neuronal activity at single-neuron resolution in an entire Caenorhabditis elegans and in larval zebrafish brain. Our technique captures the dynamics of spiking neurons in volumes of ~700 μm × 700 μm × 200 μm at 20 Hz. Its simplicity makes it an attractive tool for high-speed volumetric calcium imaging. Notes: Robert Prevedel*, Young‐Gyu Yoon*, Maximilian Hoffmann, Nikita Pak, Gordon Wetzstein, Saul Kato, Tina Schrödel, Ramesh Raskar, Manuel Zimmer, Edward S Boyden** & Alipasha Vaziri** (* equal contributions, ** co-corresponding authors)

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts

2014-06-10T00:00:00Z

Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts Chen, Xianjie; Mottaghi, Roozbeh; Liu, Xiaobai; Fidler, Sanja; Urtasun, Raquel; Yuille, Alan L. Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different “detectability” patterns caused by deformations, occlusion and/or low resolution. We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves state-of-the-art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

The Secrets of Salient Object Segmentation

2014-06-13T00:00:00Z

The Secrets of Salient Object Segmentation Li, Yin; Hou, Xiaodi; Koch, Christof; Rehg, James M.; Yuille, Alan L. In this paper we provide an extensive evaluation of fixation prediction and salient object segmentation algorithms as well as statistics of major datasets. Our analysis identifies a serious design flaw of existing salient object benchmarks, called the dataset design bias, which arises from overemphasizing stereotypical concepts of saliency. The dataset design bias not only creates a discomforting disconnection between fixations and salient object segmentation, but also misleads algorithm design. Based on our analysis, we propose a new high quality dataset that offers both fixation and salient object segmentation ground-truth. With fixations and salient objects being presented simultaneously, we are able to bridge the gap between fixations and salient objects, and propose a novel method for salient object segmentation. Finally, we report significant benchmark progress on three existing datasets of segmenting salient objects.

Robust Estimation of 3D Human Poses from a Single Image

2014-06-10T00:00:00Z

Robust Estimation of 3D Human Poses from a Single Image Wang, Chunyu; Wang, Yizhou; Lin, Zhouchen; Yuille, Alan L.; Gao, Wen Human pose estimation is a key step to action recognition. We propose a method of estimating 3D human poses from a single image, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information. Moreover, current 2D pose estimators are usually inaccurate, which may cause errors in the 3D estimation. We address the challenges in three ways: (i) We represent a 3D pose as a linear combination of a sparse set of bases learned from 3D human skeletons. (ii) We enforce limb length constraints to eliminate anthropomorphically implausible skeletons. (iii) We estimate a 3D pose by minimizing the L1-norm error between the projection of the 3D pose and the corresponding 2D detection. The L1-norm loss term is robust to inaccurate 2D joint estimations. We use the alternating direction method (ADM) to solve the optimization problem efficiently. Our approach outperforms the state of the art on three benchmark datasets.

Seeing is Worse than Believing: Reading People’s Minds Better than Computer-Vision Methods Recognize Actions

2015-12-10T00:00:00Z

Seeing is Worse than Believing: Reading People’s Minds Better than Computer-Vision Methods Recognize Actions Barbu, Andrei; Barrett, Daniel P.; Chen, Wei; Narayanaswamy, Siddharth; Xiong, Caiming; Corso, Jason J.; Fellbaum, Christiane D.; Hanson, Catherine; Hanson, Stephen Jose; Helie, Sebastien; Malaia, Evguenia; Pearlmutter, Barak A.; Siskind, Jeffrey Mark; Talavage, Thomas Michael; Wilbur, Ronnie B. We had human subjects perform a one-out-of-six class action recognition task from video stimuli while undergoing functional magnetic resonance imaging (fMRI). Support-vector machines (SVMs) were trained on the recovered brain scans to classify actions observed during imaging, yielding average classification accuracy of 69.73% when tested on scans from the same subject and of 34.80% when tested on scans from different subjects. An apples-to-apples comparison was performed with all publicly available software that implements state-of-the-art action recognition on the same video corpus with the same cross-validation regimen and same partitioning into training and test sets, yielding classification accuracies between 31.25% and 52.34%. This indicates that one can read people’s minds better than state-of-the-art computer-vision methods can perform action recognition.

The Compositional Nature of Event Representations in the Human Brain

2014-07-14T00:00:00Z

The Compositional Nature of Event Representations in the Human Brain Barbu, Andrei; Narayanaswamy, Siddharth; Xiong, Caiming; Corso, Jason J.; Fellbaum, Christiane D.; Hanson, Catherine; Hanson, Stephen Jose; Helie, Sebastien; Malaia, Evguenia; Pearlmutter, Barak A.; Siskind, Jeffrey Mark; Talavage, Thomas Michael; Wilbur, Ronnie B. How does the human brain represent simple compositions of constituents: actors, verbs, objects, directions, and locations? Subjects viewed videos during neuroimaging (fMRI) sessions from which sentential descriptions of those videos were identified by decoding the brain representations based only on their fMRI activation patterns. Constituents (e.g., fold and shirt) were independently decoded from a single presentation. Independent constituent classification was then compared to joint classification of aggregate concepts (e.g., fold-shirt); results were similar as measured by accuracy and correlation. The brain regions used for independent constituent classification are largely disjoint and largely cover those used for joint classification. This allows recovery of sentential descriptions of stimulus videos by composing the results of the independent constituent classifiers. Furthermore, classifiers trained on the words one set of subjects think of when watching a video can recognize sentences a different subject thinks of when watching a different video.

Concepts in a Probabilistic Language of Thought

2014-06-14T00:00:00Z

Concepts in a Probabilistic Language of Thought Goodman, Noah D.; Tenenbaum, Joshua B.; Gerstenberg, Tobias Knowledge organizes our understanding of the world, determining what we expect given what we have already seen. Our predictive representations have two key properties: they are productive, and they are graded. Productive generalization is possible because our knowledge decomposes into concepts—elements of knowledge that are combined and recombined to describe particular situations. Gradedness is the observable effect of accounting for uncertainty—our knowledge encodes degrees of belief that lead to graded probabilistic predictions. To put this a different way, concepts form a combinatorial system that enables description of many different situations; each such situation specifies a distribution over what we expect to see in the world, given what we have seen. We may think of this system as a probabilistic language of thought (PLoT) in which representations are built from language-like composition of concepts and the content of those representations is a probability distribution on world states. The purpose of this chapter is to formalize these ideas in computational terms, to illustrate key properties of the PLoT approach with a concrete example, and to draw connections with other views of conceptual structure. Note: The book chapter is reprinted courtesy of The MIT Press, from the forthcoming edited collection “The Conceptual Mind: New Directions in the Study of Concepts” edited by Eric Margolis and Stephen Laurence, print date Spring 2015.

A role for recurrent processing in object completion: neurophysiological, psychophysical and computational evidence.

2014-04-26T00:00:00Z

A role for recurrent processing in object completion: neurophysiological, psychophysical and computational evidence. Tang, Hanlin; Buia, Calin; Madsen, Joseph R.; Anderson, William S.; Kreiman, Gabriel Recognition of objects from partial information presents a significant challenge for theories of vision because it requires spatial integration and extrapolation from prior knowledge. We combined neurophysiological recordings in human cortex with psychophysical measurements and computational modeling to investigate the mechanisms involved in object completion. We recorded intracranial field potentials from 1,699 electrodes in 18 epilepsy patients to measure the timing and selectivity of responses along human visual cortex to whole and partial objects. Responses along the ventral visual stream remained selective despite showing only 9-25% of the object. However, these visually selective signals emerged ~100 ms later for partial versus whole objects. The processing delays were particularly pronounced in higher visual areas within the ventral stream, suggesting the involvement of additional recurrent processing. In separate psychophysics experiments, disrupting this recurrent computation with a backward mask at ~75 ms significantly impaired recognition of partial, but not whole, objects. Additionally, computational modeling shows that the performance of a purely bottom-up architecture is impaired by heavy occlusion and that this effect can be partially rescued via the incorporation of top-down connections. These results provide spatiotemporal constraints on theories of object recognition that involve recurrent processing to recognize objects from partial information.

A normalization model of visual search predicts single trial human fixations in an object search task.

2014-04-25T00:00:00Z

A normalization model of visual search predicts single trial human fixations in an object search task. Miconi, Thomas; Groomes, Laura; Kreiman, Gabriel When searching for an object in a scene, how does the brain decide where to look next? Theories of visual search suggest the existence of a global attentional map, computed by integrating bottom-up visual information with top-down, target-specific signals. Where, when and how this integration is performed remains unclear. Here we describe a simple mechanistic model of visual search that is consistent with neurophysiological and neuroanatomical constraints, can localize target objects in complex scenes, and predicts single-trial human behavior in a search task among complex objects. This model posits that target-specific modulation is applied at every point of a retinotopic area selective for complex visual features and implements local normalization through divisive inhibition. The combination of multiplicative modulation and divisive normalization creates an attentional map in which aggregate activity at any location tracks the correlation between input and target features, with relative and controllable independence from bottom-up saliency. We first show that this model can localize objects in both composite images and natural scenes and demonstrate the importance of normalization for successful search. We next show that this model can predict human fixations on single trials, including error and target-absent trials. We argue that this simple model captures non-trivial properties of the attentional system that guides visual search in humans.
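The combination of multiplicative target modulation and divisive normalization described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the model from the memo: the feature array, the target weights, the normalization pool (a sum over feature channels at each location), the constant `sigma`, and the function name `attention_map` are all hypothetical.

```python
import numpy as np

def attention_map(features, target, sigma=0.1):
    """Toy attentional map from an (H, W, C) non-negative feature array.

    features: responses of C feature channels at each location (assumed input).
    target:   (C,) target-specific modulation weights (assumed input).
    sigma:    small constant that keeps the division well defined.
    """
    modulated = features * target   # multiplicative, target-specific modulation
    pool = features.sum(axis=-1)    # local normalization pool across channels
    return modulated.sum(axis=-1) / (sigma + pool)  # divisive normalization

features = np.zeros((2, 2, 3))
features[0, 0] = [1.0, 0.0, 0.0]   # weak response that matches the target
features[1, 1] = [0.0, 5.0, 5.0]   # strong (salient) response that does not
target = np.array([1.0, 0.1, 0.1])

amap = attention_map(features, target)
peak = np.unravel_index(np.argmax(amap), amap.shape)
unnormalized = (features * target).sum(axis=-1)
```

In this toy input the two locations tie before normalization, because the salient location makes up in magnitude what it lacks in similarity. After division by the local pool, the peak falls on the location whose features match the target, which illustrates the relative independence from bottom-up saliency that the abstract refers to.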

Reconstructing Native Language Typology from Foreign Language Usage

2014-04-25T00:00:00Z

Reconstructing Native Language Typology from Foreign Language Usage Berzak, Yevgeni; Reichart, Roi; Katz, Boris Linguists and psychologists have long been studying cross-linguistic transfer, the influence of native language properties on linguistic performance in a foreign language. In this work we provide empirical evidence for this process in the form of a strong correlation between language similarities derived from structural features in English as Second Language (ESL) texts and equivalent similarities obtained directly from the typological features of the native languages. We leverage this finding to recover native language typological similarity structure directly from ESL text, and perform prediction of typological features in an unsupervised fashion with respect to the target languages. Our method achieves 72.2% accuracy on the typology prediction task, a result that is highly competitive with equivalent methods that rely on typological resources.

Sensitivity to Timing and Order in Human Visual Cortex.

2014-04-25T00:00:00Z

Sensitivity to Timing and Order in Human Visual Cortex. Singer, Jedediah M.; Madsen, Joseph R.; Anderson, William S.; Kreiman, Gabriel Visual recognition takes a small fraction of a second and relies on the cascade of signals along the ventral visual stream. Given the rapid path through multiple processing steps between photoreceptors and higher visual areas, information must progress from stage to stage very quickly. This rapid progression of information suggests that fine temporal details of the neural response may be important to how the brain encodes visual signals. We investigated how changes in the relative timing of incoming visual stimulation affect the representation of object information by recording intracranial field potentials along the human ventral visual stream while subjects recognized objects whose parts were presented with varying asynchrony. Visual responses along the ventral stream were sensitive to timing differences between parts as small as 17 ms. In particular, there was a strong dependency on the temporal order of stimulus presentation, even at short asynchronies. This sensitivity to the order of stimulus presentation provides evidence that the brain may use differences in relative timing as a means of representing information.

Seeing What You’re Told: Sentence-Guided Activity Recognition In Video

2014-05-29T00:00:00Z

Seeing What You’re Told: Sentence-Guided Activity Recognition In Video Siddharth, Narayanaswamy; Barbu, Andrei; Siskind, Jeffrey Mark We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium, not only for top-down and bottom-up integration, but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) in the form of whole sentential descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different manners.

The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex

2015-04-26T00:00:00Z

The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex Leibo, Joel Z; Liao, Qianli; Anselmi, Fabio; Poggio, Tomaso Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system’s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.

Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines?

2014-03-27T00:00:00Z

Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? Liao, Qianli; Leibo, Joel Z; Mroueh, Youssef; Poggio, Tomaso The standard approach to unconstrained face recognition in natural photographs is via a detection, alignment, recognition pipeline. While that approach has achieved impressive results, there are several reasons to be dissatisfied with it, among them is its lack of biological plausibility. A recent theory of invariant recognition by feedforward hierarchical networks, like HMAX, other convolutional networks, or possibly the ventral stream, implies an alternative approach to unconstrained face recognition. This approach accomplishes detection and alignment implicitly by storing transformations of training images (called templates) rather than explicitly detecting and aligning faces at test time. Here we propose a particular locality-sensitive hashing based voting scheme which we call “consensus of collisions” and show that it can be used to approximate the full 3-layer hierarchy implied by the theory. The resulting end-to-end system for unconstrained face recognition operates on photographs of faces taken under natural conditions, e.g., Labeled Faces in the Wild (LFW), without aligning or cropping them, as is normally done. It achieves a drastic improvement in the state of the art on this end-to-end task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (no outside training data). It also performs well on two newer datasets, similar to LFW, but more difficult: LFW-jittered (new here) and SUFR-W.

A Deep Representation for Invariance And Music Classification

2015-05-03T00:00:00Z

A Deep Representation for Invariance And Music Classification Zhang, Chiyuan; Evangelopoulos, Georgios; Voinea, Stephen; Rosasco, Lorenzo; Poggio, Tomaso Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream; modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?

2014-03-12T00:00:00Z

Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n → ∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n → 1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a "good" representation for supervised learning, characterized by small sample complexity (n). We consider the case of visual object recognition though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, I, in terms of empirical distributions of the dot-products between I and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition, and that this representation may be continuously learned in an unsupervised way during development and visual experience.