Received: 11 June 2021 Revised: 13 October 2021 Accepted: 12 November 2021
DOI: 10.1002/ail2.60
LETTER
Reframing explanation as an interactive medium: The
EQUAS (Explainable QUestion Answering System) project
William Ferguson1 | Dhruv Batra2 | Raymond Mooney3 | Devi Parikh2 |
Antonio Torralba4 | David Bau4 | David Diller1 | Josh Fasching1 |
Jaden Fiotto-Kaufman1 | Yash Goyal2 | Jeff Miller1 | Kerry Moffitt1 |
Alex Montes de Oca1 | Ramprasaath R. Selvaraju5 | Ayush Shrivastava2 |
Jialin Wu3 | Stefan Lee6
1 Raytheon BBN Technologies, Cambridge, Massachusetts, USA
2 Georgia Tech, Atlanta, Georgia, USA
3 University of Texas at Austin, Austin, Texas, USA
4 Massachusetts Institute of Technology-CSAIL, Cambridge, Massachusetts, USA
5 Salesforce.com, San Francisco, California, USA
6 Oregon State University, Corvallis, Oregon, USA
Correspondence
William Ferguson, Raytheon BBN Technologies, Cambridge, MA, USA.
Email: wferguson@bbn.com
Funding information
Air Force Research Laboratory, Grant/Award Number: FA8750-18-C-0004; Defense Advanced Research Projects Agency
Abstract
This letter is a retrospective analysis of our team's research for the Defense Advanced Research Projects Agency
Explainable Artificial Intelligence project. Our initial approach was to use salience maps, English sentences, and
lists of feature names to explain the behavior of deep-learning-based discriminative systems, with particular focus
on visual question answering systems. We found that presenting static explanations along with answers led to limited
positive effects. By exploring various combinations of machine and human explanation production and consumption, we
evolved a notion of explanation as an interactive process that takes place usually between humans and artificial
intelligence systems but sometimes within the software system. We realized that by interacting via explanations
people could task and adapt machine learning (ML) agents. We added affordances for editing explanations and modified
the ML system to act in accordance with the edits to produce an interpretable interface to the agent. Through this
interface, editing an explanation can adapt a system's performance to new, modified purposes. This deep tasking,
wherein the agent knows its objective and the explanation for that objective, will be critical to enable higher
levels of autonomy.
KEYWORDS
explainable artificial intelligence (XAI), human/computer interaction (HCI), tasking and
adapting agents, visual question answering (VQA)
1 | INTRODUCTION
In this letter, the EQUAS (Explainable QUestion Answering System) team shares insights from our three-and-a-half-year
effort in Defense Advanced Research Projects Agency's Explainable Artificial Intelligence (XAI) program. Through our
early research, we learned that simply presenting explanations in the form of static chunks of information (such as
heatmaps, feature lists, and diagrams) had a limited but mostly positive effect on human/machine interaction. But if
explanations can be augmented to support human interaction (for example, by adding editing or exploratory
affordances), we found that the utility of explanation rises considerably. Indeed, our research suggests that
explanation-based interaction creates the common ground that is needed for meaningful human–machine collaboration.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided
the original work is properly cited.
© 2021 Raytheon BBN Technologies Corp. Applied AI Letters published by John Wiley & Sons Ltd.
Applied AI Letters. 2021;2:e60. wileyonlinelibrary.com/journal/ail2
The research began with visual question answering (VQA), salience maps, and named features. Early on, we found
that presenting simple explanations helped people with estimating system competence and predicting its success on
particular examples, but the latter effect was slight. Seeking stronger effects, we explored other possible roles for expla-
nation in enhancing human interaction with machine learning (ML) systems and within the internal operations of ML
systems. We looked at ML systems using explanations to teach people about the world, rather than about the ML
system (section 2). We used humans' explanations of their choices to help an ML system perform better (section 3). To
explore the possible impact of global explanation, we looked at the effect of a human-generated explanation on other
humans (section 4). In sections 5 and 6, we look at the value of systems using their explanatory capacity internally
to enhance or expand their functionality.
In our later work, we explored ways that interacting with explanations can enable people to task and adapt ML
agents. Our reasoning went like this: Human interaction with any system requires an interpretable interface. An inter-
pretable interface to an ML agent can be built by adding editing affordances to its statically presented explanations and
then augmenting the ML system to act in accordance with any modifications that a human interactant makes. In this
way, editing an explanation can adapt the system to some new, modified purpose (section 8).
Furthermore, ML systems can predict their own future behavior (a kind of planning). If this prediction can be
explained and the explanation made editable as above, then the system's “plan” and the rationale for that plan
can be changed. This can allow the agent to be effectively and robustly tasked. (Preliminary work on this technique
appears in section 7.) We posit that this “deep tasking,” wherein the agent knows both its objective and the explanation
for that objective, will be critical to enable higher levels of autonomy.
In the following sections, we will sample the journey that we took to map out explanatory competence. Along the
way, we explored every combination of explanations that are produced and consumed by the system and/or by humans
(see Table 1). Sections 2, 3, 5, and 6 report work that has been published elsewhere but that illustrates distinct
combinations of explanation consumption and production. Sections 4, 7, and 8 report original work, with the latter two
still in the formative stages, as we began to explore explanation as an interactive medium.
2 | CONTRASTIVE EXPLANATION FOR TEACHING
In this work, we studied discriminative explanations of the form “For input X, why did the model predict Y instead of
Z?” One way to answer this question is through counterfactual reasoning, that is, by asking how the input should change
minimally such that the outcome changes from Y to Z. In the context of image classification, given a “query”
image I1 for which a classification model predicts class c1, a counterfactual visual explanation identifies how I1 could
change such that the system would output a different specified class c2.
TABLE 1 The subprojects described in this letter. The producer and consumer columns show a fairly exhaustive exploration
of the options for the consumers and producers of explanation and a progression from the initial focus of the system
producing explanations for people to the notion of the explanation as the medium that supports their interaction
Section | Summary | Explanation producer | Explanation consumer
2 | System generates contrastive explanations to teach people to identify bird types | System | Human
3 | People generate attention maps to teach the system what to attend to | Human | System
4 | People (our team) generate a single, global explanation that helps people predict system performance | Human | Human
5 | System (implicitly) explains how a person wants it to change and then uses implicit self-explanation to alter itself | System | System
6 | System is trained on human explanations to generate and evaluate its own explanations to help it perform and explain better | Both | System
7 | System generates explanations of alternative ways to accomplish a (navigation) goal; human picks best explanation and the system executes it | System | Both
8 | System explains how it would identify a new aircraft type from one example; people explain why the system would make errors and the system tries to use those explanations to improve its sensor | Both | Both
FIGURE 1 Our approach explains why the query image (left) was classified as 1 (top row) or Eared Grebe (bottom row) rather than 4 or
Horned Grebe. Our method finds paired critical regions (red boxes) in a distractor image (middle) and the query image such that if the
highlighted region in the distractor image replaces the highlighted region in the query image, the entire, resulting composite image (right)
would be classified more confidently with the type of the distractor image. This means that the pair of small regions is central for
distinguishing the two types
To generate these counterfactual visual explanations, we developed a technique where we first selected a “dis-
tractor” image I2 that the model predicted as class c2. Then, we identified spatial regions in I1 and I2 such that replacing
the identified region in I1 with the identified region in I2 would push the model toward classifying I1 as c2 (refer to
Figure 1). We applied our approach to multiple image classification datasets, which generated qualitative results dis-
playing the interpretability and discriminative nature of our counterfactual explanations. More details about the
approach and results can be found in Goyal et al.1
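The region-replacement idea described above can be illustrated with a minimal sketch. It assumes each image is already encoded as a short list of spatial cell features and that the classifier is a toy average-pool plus linear layer; the function names, the feature values, and the exhaustive single-swap search are illustrative assumptions, not the procedure of Goyal et al.

```python
import math
from itertools import product


def class_probs(cells, weights):
    """Toy classifier (assumption): average-pool cell features, linear layer, softmax."""
    n = len(cells)
    dim = len(cells[0])
    pooled = [sum(cell[d] for cell in cells) / n for d in range(dim)]
    logits = [sum(w * x for w, x in zip(w_c, pooled)) for w_c in weights]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_single_swap(query_cells, distractor_cells, weights, target_class):
    """Try every (query cell, distractor cell) pair and keep the swap that
    gives the highest probability for the target class."""
    best = None
    for i, j in product(range(len(query_cells)), range(len(distractor_cells))):
        edited = list(query_cells)
        edited[i] = distractor_cells[j]  # paste distractor region j over query region i
        p = class_probs(edited, weights)[target_class]
        if best is None or p > best[2]:
            best = (i, j, p)
    return best


# Hypothetical 3-cell feature maps with 2 feature channels each.
query = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1]]        # classified as class 0
distractor = [[0.0, 1.0], [0.2, 3.0], [0.1, 0.9]]   # classified as class 1
weights = [[1.0, 0.0], [0.0, 1.0]]                  # one weight row per class

i, j, p = best_single_swap(query, distractor, weights, target_class=1)
print(f"replace query cell {i} with distractor cell {j}: P(class 1) = {p:.3f}")
```

The pair of indices returned plays the role of the paired critical regions in Figure 1: the distractor cell that carries the strongest evidence for the target class, and the query cell whose replacement moves the prediction furthest.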
We investigated if our counterfactual explanations could help in teaching a fine-grained bird classification task to
lay people. To evaluate this, we designed a machine teaching interface (Figure 2) where we first trained the subjects for
the fine-grained task using our counterfactual explanations and then tested them on new instances. We compared our
human study with two baselines, which only differed in terms of the feedback shown to human subjects—either no
explanation or a noncounterfactual, feature-attribution explanation generated via GradCAM2 in place of our counter-
factual explanations.
The mean test accuracy with counterfactual explanations was 78.77%; with only GradCAM explanations, it was
74.29%; the mean test accuracy without any explanations was 71.09%. Therefore, our studies indicated that our counter-
factual explanations were more effective in this machine teaching task as compared to noncounterfactual feature attri-
bution explanations or no explanations at all.
3 | LEVERAGING EXPLANATIONS TO MAKE VISION AND LANGUAGE
MODELS MORE GROUNDED
Today's state-of-the-art deep models, especially for vision and language tasks, are known to rely heavily on superficial
correlations in training data.3,4 As a result, these models are often biased by language priors, and do not make
predictions that are sufficiently grounded in the image content.5-7 This section describes the work that effectively
inverts the popular process of explaining to humans what parts of an image a model is paying attention to by generating
salience maps. We did this by collecting ground truth data from humans about which regions of images were most important
during task performance and used that data during training as a kind of explanation to guide the model's attention
within the image. This improved the model's performance via more robust visual grounding.
FIGURE 2 Our machine teaching interface. During the training phase (A), if the participants choose an incorrect class, they are shown
feedback (B) highlighting the fine-grained differences between the two classes. At test time (C), they must classify the birds from memory
This work established a generic approach called Human Importance-aware Network Tuning (HINT) by extending
insights gained from earlier work on Gradient-weighted Class Activation Mapping (Grad-CAM).2 Grad-CAM uses the
gradient information flowing into the last convolutional layer of a convolutional neural network to assign importance
values to each neuron and thereby generate a salience map. HINT encourages deep networks to be sensitive to the same
input regions as humans by optimizing the alignment between human attention maps and gradient-based network
importance—ensuring that models learn not just to look at but rather rely on visual concepts that humans found rele-
vant for a task when making predictions. We applied HINT to VQA and image captioning tasks and outperformed top
approaches on splits that penalize over-reliance on language priors using human attention demonstrations for just 6%
of the training data.
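As a rough illustration of what aligning human attention with network importance can mean, the sketch below computes a pairwise ranking penalty over image regions. It assumes that per-region importance scores are already available for both the human annotation and the network (for example, from a Grad-CAM-style gradient signal); the function name and the exact form of the penalty are assumptions made for illustration and may differ from the loss used in HINT.

```python
def ranking_alignment_loss(human_importance, network_importance):
    """Sum, over all region pairs that humans rank a above b, how far the
    network's scores put b above a (zero when the orderings agree)."""
    loss = 0.0
    n = len(human_importance)
    for a in range(n):
        for b in range(n):
            if human_importance[a] > human_importance[b]:
                # Humans consider region a more important than region b;
                # penalize the network only if it scores b higher than a.
                loss += max(0.0, network_importance[b] - network_importance[a])
    return loss


# Hypothetical importance scores for three image regions.
human = [0.7, 0.2, 0.1]
aligned_network = [0.5, 0.3, 0.2]      # same ordering as the human scores
misaligned_network = [0.1, 0.3, 0.6]   # reversed ordering

print(ranking_alignment_loss(human, aligned_network))
print(ranking_alignment_loss(human, misaligned_network))
```

A penalty of this kind depends only on the relative ordering of regions, so it does not require the human and network scores to share a scale.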
We evaluated HINT using VQA-CP,8 a restructuring of VQAv2 designed such that the answer distribution
in the training set differs significantly from that of the test set. Without proper visual grounding, models trained on
this dataset will generalize poorly to the test distribution. The HINTed model improved significantly over its base
architecture alone, gaining several percentage points in overall accuracy. Furthermore, it outperformed existing
approaches based on the same base architecture (46.73 vs 41.17), setting a new state of the art for this problem at the
time of publication. We do note that our approach used additional supervision in the form of human attention maps for
6% of training images. Please refer to Selvaraju et al.9 for more details.
4 | GLOBAL EXPLANATION
Although most of the EQUAS (and XAI) work focuses on “local” explanations—regarding one specific task instance at
a time and how the model performed it—we also explored global explanation: an overall explanation of how a model
works in general. As an initial test as to whether global explanation is possible and useful, we hand-built a textual,
global explanation for a VQA system.
Using the bottom–up top–down VQA system10 and VQA v2 dataset, we established an exploratory interface to allow
easy and powerful exploration of the questions, images, ground truth, and model answers to all entries in the dataset.
This interactive dataset exploration mechanism offered many sorting and display options, which allowed insights into
the general flavor of typical questions, and also what kinds of questions proved to be particularly straightforward or par-
ticularly challenging for the system.
We then worked with the database explorer to derive the following global explanation:
1. If the answer is a sport, an animal, or a location then the system will almost always be right.
2. If answering requires reading letters or numbers, the system will almost always be wrong.
3. The system usually gets “how many” questions right if the answer is 2 (or sometimes 1) and is usually wrong
otherwise.
4. If these rules do not apply, use your best judgment knowing that the system is right a little more often than it is
wrong.
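The four rules above can be read as a small decision procedure. The sketch below encodes them in that spirit; the function and parameter names are hypothetical, the rules are assumed to apply in the order listed, and the softer wording of rules 3 and 4 ("sometimes 1", "best judgment") is collapsed into fixed choices purely for illustration.

```python
def predict_system_correct(answer_type=None, needs_reading=False,
                           is_counting=False, count_answer=None):
    """Return True if the global explanation suggests the VQA system will be right.

    The caller (a human reader of the image and question) supplies the
    question properties; none of them are computed here.
    """
    if answer_type in {"sport", "animal", "location"}:  # rule 1
        return True
    if needs_reading:                                    # rule 2
        return False
    if is_counting:                                      # rule 3 (treating only 2 as reliable)
        return count_answer == 2
    return True  # rule 4: the system is right a little more often than wrong


print(predict_system_correct(answer_type="animal"))
print(predict_system_correct(needs_reading=True))
print(predict_system_correct(is_counting=True, count_answer=5))
```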
Then, in a Mechanical Turk experiment, subjects saw an image and a question and were asked whether they
thought the system would answer correctly. After they made their prediction, we told them whether their prediction
was correct (but we did not tell them what the system answered, in order to minimize their ability to establish knowl-
edge about system performance details on their own). For the control case, that was all. For the experimental case, we
initially showed them the global explanation and we repeated this explanation on each screen where we asked them to
predict system accuracy. We ran about 200 distinct subjects in both the test and control conditions. No subject could be
in both conditions, since we used the same image question pairs, so all results were between subjects.
Subjects' accuracy rose from 61.4% in the control condition to 63.9% in the experimental condition—a small but sta-
tistically significant improvement suggesting that global approaches may warrant additional inquiry. This work was
also our exploration of human generated explanations for human consumption. Only global explanations make sense
to construct by hand since a single, static explanation* applies to the entire system's behavior.
5 | REWRITING RULES IN A GENERATIVE MODEL
While many explanation methods focus on explaining a single prediction at a time, we can also investigate how a user
can understand and manipulate the general rules of a model. For instance, can a user directly manipulate and change
the internals of a generative model for images by understanding its structure directly, and without training the model
on any new images? The goal is not to alter just a single image, but to edit the generalized computational rules encoded
within the model, even when training data are unavailable. For example, can a person directly manipulate a trained
model's rule for the appearance of the top of a tower so that all the towers have trees growing from them? See Figure 3.
To enable such direct model editing, we developed a method11 for rewriting a single layer of a deep generative
model as a linear associative memory. Our method views the weights of a layer as a matrix storage (an optimal linear
associative memory12) that associates vector input keys with vector output values and allows insertion of a new memory
by performing an error-minimizing rank-one change in the weights of the matrix. To enable a user to perform such an
update in an interpretable way, we created a user interface (see Figure 4) that allows a user to write into a specific
memory by delineating a handful of examples that were used to infer the location and the new content for the memory
to replace within the layer.
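To make the idea of inserting a memory through a rank-one weight change concrete, the sketch below applies the simplest such update to a small matrix. It assumes keys are uncorrelated (identity key covariance), in which case the smallest change to W that maps a new key to a new value is W + (v - Wk)k^T / (k^T k). The published method uses key statistics and a constrained optimization, so this is a simplified stand-in rather than that algorithm; all names are illustrative.

```python
def matvec(W, k):
    """Multiply matrix W (list of rows) by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]


def rank_one_insert(W, key, value):
    """Return W' = W + (value - W key) key^T / (key^T key), so that W' key == value.

    Assumption: identity key covariance. Directions orthogonal to `key`
    are left unchanged by this update.
    """
    residual = [v - o for v, o in zip(value, matvec(W, key))]
    norm_sq = sum(x * x for x in key)
    return [[w + r * x / norm_sq for w, x in zip(row, key)]
            for row, r in zip(W, residual)]


# Hypothetical 2x2 layer: initially the identity mapping.
W = [[1.0, 0.0], [0.0, 1.0]]
new_key = [1.0, 0.0]
new_value = [0.0, 2.0]

W_edited = rank_one_insert(W, new_key, new_value)
print(matvec(W_edited, new_key))        # the new association
print(matvec(W_edited, [0.0, 1.0]))     # an orthogonal key, unaffected
```

Because the change is rank one, a single association is rewritten while inputs that do not overlap with the chosen key keep their original outputs, which is what lets an edit generalize without retraining.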
We demonstrated our method on Progressive Generative Adversarial Network (GAN)13 and StyleGAN14 models
trained to model a variety of image datasets, and we benchmarked our method against prior methods for propagating
edits from one image to other images using users' evaluations of the realism of edited human face images. Our method
provides more realistic changes than the previous leading image-propagation editing method, “neural best buddies”15:
90.2% of our edits of human face outputs are considered more realistic. Our method also reveals the structure of the
model. For example, a human face model separately memorizes rules for parts of child faces and parts of adult faces, so
that editing a single memorized rule can edit child faces without modifying the faces of adults at all.
6 | IMPROVING VQA AND ITS EXPLANATIONS BY COMPARING
COMPETING EXPLANATIONS
We have also developed a novel framework that uses explanations for competing answers to help VQA systems select
the correct answer.16 Instead of using explanations to elucidate reasons for the system's answer to a human user, this
approach uses them to allow the system itself to more deeply examine the rationales for competing potential answers,
and reweight them based on this additional information. We have shown that this improves both system accuracy and
the quality of its explanations as evaluated by humans.
*The idea for testing a hand-generated text explanation came from Gary Klein, a fellow researcher in the XAI program.
FIGURE 3 Our method enables a user to edit a generalized rule in a generative model, and can be used successively to remove
watermarks, alter an existing pattern such as the density of crowds, or insert a new rule such as trees growing out of tops of towers. Since the
user is changing the model's rules, the change affects all images generated by the model in the same semantic way
Our framework is end-to-end trainable and therefore applicable to any differentiable VQA system. It learns better
representations for the questions and visual content by training to retrieve explanations as well as answers. It also
learns to rate its confidence in image, question, answer, and explanation quadruples. Our experiments showed
improvements for our approach applied to up–down (UpDn)10 and Learning Cross-Modality Encoder Representations
from Transformers (LXMERT)17 on the VQA-X dataset,18 which comes with human textual explanations for the
answers.
After the base VQA system computes the top-k answers, our approach retrieves the most supportive explanations
for each answer from the training set to construct the set of competing explanations. These explanations are used to
help generate explanations for the current question. We learn to predict verification scores that indicate how well the
retrieved or generated explanations support the predictions given the input question and visual content. The final
answer is determined by jointly considering both the original answer probabilities and these verification scores. An
example of the system comparing textual explanations and using them to reweight competing answers to a VQA prob-
lem is shown in Figure 5. Details of the neural architecture and how it is trained are given in the full paper.16
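The final reweighting step can be illustrated with a minimal sketch. The text above states only that answer probabilities and verification scores are considered jointly; combining them by multiplication and renormalizing is an assumption made here, and the function name and the numbers are hypothetical rather than taken from Figure 5.

```python
def rerank_answers(answer_probs, verification_scores):
    """Combine each candidate answer's original probability with the
    verification score of its explanations, then renormalize.

    Assumption: the two signals are combined by a simple product.
    """
    combined = {a: answer_probs[a] * verification_scores[a] for a in answer_probs}
    total = sum(combined.values())
    return {a: s / total for a, s in combined.items()}


# Hypothetical top-2 answers: the base model slightly prefers "No",
# but the explanations retrieved for "Yes" are better supported.
probs = {"No": 0.55, "Yes": 0.45}
verification = {"No": 0.30, "Yes": 0.80}

final = rerank_answers(probs, verification)
best = max(final, key=final.get)
print(best, final)
```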
With respect to question-answering accuracy, we improved the original UpDn and LXMERT by 4.5 and 1.2 percent-
age points, respectively. UpDn benefited more from using competing explanations than LXMERT, but both improve.
By using transformers, LXMERT already created better, but less flexible, representations that were harder to improve
upon by using explanations.
FIGURE 4 Flow of user interface that allows a user to modify a single memorized association within a deep model. The user selects a
region of an image (A) containing a new pattern they wish to insert in a new place in the model. The user selects several examples of
contexts (B, C) where they wish the new pattern to appear. Our method inserts the new key-value pair in a layer of the model in order to
change one rule (D, E)
We evaluated the generated explanations using both automatic evaluation metrics comparing them to human expla-
nations, and human evaluation employing crowdsourced judges. We compared to our previous state of the art VQA
explanation system19 as a baseline. Explanations from our new approach achieved better automatic evaluation scores,
and more importantly, higher human ratings. In particular, human judges rated our explanations as good as or better
than human explanations 55.6% of the time, whereas the baseline scored 49.5% by this metric.
7 | INSTRUCTION FOLLOWING EXPERIMENT WITH HUMAN IN
THE LOOP
In this section, we study the performance of instruction-following navigation agents with a human in the loop. Vision
and language navigation (VLN)20 is an instantiation of the instruction-following task where an agent is placed in a
photo-realistic reconstruction of an indoor environment. The agent is given a natural language navigation instruction
and asked to follow the trajectory. In this work, we examined the following question: if the model were to show a human
the different ways in which it might execute the instruction, can the human identify which way is better more accurately
than the model's own beliefs do? The visualization of how a model might execute the instruction can be thought of as a
mode of explanation.
We used instructions from the unseen validation split of the VLN dataset,20 extracted 30 trajectories using the
approach of Fried et al.,21 and scored the compatibility between each trajectory and the instruction using the
VLN-Bidirectional Encoder Representations from Transformers (VLNBERT) model.22 Note that this compatibility score
forms the model's own belief about which way of executing the instruction is best. For each instruction, we took the
top five ranked trajectories and asked humans to select the best trajectory (out of five) based on how closely it
follows the instruction. Using the human-selected trajectory as the prediction, we evaluated the VLNBERT + human
performance on the nDTW metric,23 which measures how closely the instruction was followed by the trajectory.
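For readers unfamiliar with nDTW, the sketch below shows one way to compute it for paths given as 2D points: the dynamic time warping cost between a reference path R and a candidate path Q is normalized as exp(-DTW(R, Q) / (|R| * d_th)). The Euclidean point distance, the 3.0 threshold, and the helper that picks the best-scoring candidate are assumptions for illustration; navigation benchmarks typically measure distances along the environment graph instead.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths of (x, y) points."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])  # Euclidean (assumption)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths coincide."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


def pick_best(reference, candidates, threshold=3.0):
    """Index of the candidate with the highest nDTW (an oracle-style choice)."""
    return max(range(len(candidates)),
               key=lambda i: ndtw(reference, candidates[i], threshold))


# Hypothetical reference path and two candidate trajectories.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # parallel path, one unit away
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # identical to the reference
]

print([round(ndtw(reference, c), 3) for c in candidates])
print(pick_best(reference, candidates))
```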
FIGURE 5 An example of utilizing explanations to correct a VQA prediction. Although the original VQA confidence of the correct
answer “Yes” is lower, the explanations for “Yes” fit the image or support their answer better, resulting in a higher verification score and a
final correct decision
7.1 | Results
VLNBERT achieved 60.2% on nDTW. The upper-bound Oracle performance that we got by selecting the best trajectory
among five trajectories was 84.6% nDTW. When we evaluated the VLNBERT + human model by considering all users
individually, it got 69.0% nDTW (9.2% increase).
In this study, we observed that the performance of a state-of-the-art VLN agent can be increased by 13% on nDTW
when paired with a human to select the best trajectory based on a visualization of the different ways in which the agent
can execute the instructions.
8 | ONE SHOT IMAGE DETECTION AND COLLABORATION
One area where “explanation as a medium” can be useful is in the domain of one-shot image detection. In this
constrained setting, there is only one instance of a previously unseen target class (eg, a new type of airplane) and
the user and the system, working together, adapt the system to recognize this class in future data. We evaluated
these new detectors by using them to find more instances of the target class in a held back labeled corpus of data.
(Such a corpus is, by definition, not available when the detector is defined—as it would be in a classic, image
retrieval problem.) User explanation is especially useful here as there are no more representatives of the target
class (and therefore no false negatives to identify for the system) but the human user must still impart domain
knowledge to the system.
FIGURE 6 One shot image detection interface. Left column presents the images from the database determined most like the aspect
annotated from the images on the right. The yellow lasso in the left column indicates the region of activation for the aspect
We explored two modes of providing a one shot detection system with additional information from the user. The
first mode augmented a linear classifier trained on the one shot image and allowed the user to pick from five features
chosen by a heuristic. We wanted to see if the user could select a feature such that, if that feature's weight in the
linear classifier were reassigned to be the most negative feature weight, the classifier's F1 score for the new target
class would improve. We found it was possible (Turkers were correct 38% of the time vs random selection at 20%), but
the user's impact on the F1 score was minimal due to the amount of “damage” done to the classifier by Turker errors in
other trials. The overall average effect was a 0.006 change in F1.
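The weight reassignment in the first mode can be sketched as follows. The classifier, feature vectors, and labels are made up for illustration; the only element taken from the description above is the edit itself, which sets the chosen feature's weight to the most negative weight in the classifier and then compares F1 before and after.

```python
def suppress_feature(weights, index):
    """Copy the weights, setting the chosen feature's weight to the most
    negative weight currently present in the classifier."""
    edited = list(weights)
    edited[index] = min(weights)
    return edited


def predict(weights, bias, features):
    """Linear classifier decision: positive score means 'target class'."""
    return sum(w * x for w, x in zip(weights, features)) + bias > 0


def f1_score(weights, bias, examples):
    """F1 over (features, label) pairs for the target class."""
    tp = fp = fn = 0
    for features, label in examples:
        guess = predict(weights, bias, features)
        if guess and label:
            tp += 1
        elif guess and not label:
            fp += 1
        elif label and not guess:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical classifier in which feature 1 is a spurious cue
# (for example, background that co-occurred with the one-shot image).
weights = [1.0, 0.8, -0.5]
bias = -0.4
examples = [
    ([1, 1, 0], True),
    ([1, 0, 0], True),
    ([0, 1, 0], False),  # spurious cue only: a false positive before the edit
    ([0, 0, 1], False),
]

before = f1_score(weights, bias, examples)
after = f1_score(suppress_feature(weights, 1), bias, examples)
print(before, after)
```

In this toy case the user's choice helps; picking a genuinely useful feature instead (index 0 here) would lower F1, which mirrors the "damage" from incorrect selections noted above.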
The second mode we explored created a one shot detector from scratch by allowing the user to paint “aspects” on
their one positive example. An aspect for the target class is defined as some distinguishing feature of that class such as a
distinctive component (eg, a wing tip). Refinement of these aspects was assisted by a user interface (see
Figure 6) where the user was presented with a list of images ranked according to how similar they were to positively
annotated image regions and how different they were from negatively annotated image regions.
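One simple way to produce such a ranking is sketched below. It assumes that each database image and each annotated region has already been reduced to a feature vector, and it scores an image by its mean cosine similarity to the positive regions minus its mean cosine similarity to the negative regions. The scoring rule, the names, and the vectors are assumptions for illustration and are not a description of the interface's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aspect_score(image_vec, positives, negatives):
    """Mean similarity to positive regions minus mean similarity to negative ones."""
    pos = sum(cosine(image_vec, p) for p in positives) / len(positives)
    neg = (sum(cosine(image_vec, q) for q in negatives) / len(negatives)
           if negatives else 0.0)
    return pos - neg


def rank_images(database, positives, negatives):
    """Image names sorted from most to least aspect-like."""
    return sorted(database,
                  key=lambda name: aspect_score(database[name], positives, negatives),
                  reverse=True)


# Hypothetical 2D feature vectors for three database images.
database = {"jet_a": [0.9, 0.1], "bird": [0.1, 0.9], "jet_b": [0.7, 0.3]}
positive_regions = [[1.0, 0.0]]   # e.g. a painted wing tip
negative_regions = [[0.0, 1.0]]   # e.g. a region marked as unlike the target

ranking = rank_images(database, positive_regions, negative_regions)
print(ranking)
```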
The user was able to refine their search by incorporating images from the returned query set, marking either positive or
negative regions. These aspects were then used to build a two-layer network classifier. Work in this direction is still ongoing
with further emphasis on how best to define negative features for annotation to better separate the decision space.
We learned several lessons by exploring “explanation as a medium” for human/ML system collaboration. “Expert”
users who have some familiarity with the underlying system appear to do better than uninitiated users (Turkers).
Giving users more freedom allows for collaboration in unexpected ways. They seem to develop their own imperfect mental
model of how the system works and try to use that knowledge to improve results.
One unintended behavior in the second mode was that users found improvement in the overall classifier by only
refining an aspect with negative images from the query set. By employing user studies, it might be possible for human/
ML system designers to uncover more robust ways of having the pair work together.
9 | DISCUSSION
Our work over the course of the XAI program has led us to posit that there are three functional or ontological charac-
terizations of explanation that provide a progression of sophistication and potential utility (Table 2).
TABLE 2 This table shows the progression of explanations in complexity and power that we discovered over the course of the program.
The columns show how aspects of the explanations change as we evolve their characterization from messages (level 1) to
communicative acts (level 2) to an actively maintained, shared context (level 3)
Level | Purpose | Challenge | Value | Focus
1 | Reveal or display the workings of the decision mechanism | Translate system internal representation and mechanism into a human meaningful, compact, unified form | Enable system debugging; enable trustworthiness assessment; enable mental modeling | The explainer (the AI system)
2 | Allow users of the system to make better choices | Adapt explanations to anticipate the information that the user needs | Build appropriate trust; enable behavior prediction; provide user satisfaction | The “explainee” (typically, the human user)
3 | Allow system and user to collaborate | System would need some theory-of-mind modeling to track common ground | Enable mutual reliance, teamwork, co-teaching; maintains common ground (a) | The interaction
(a) Common ground maintenance may be the main reason that appropriate explanations engender trust in the first place. It is not simply that they assure the
user that the system is working for sensible reasons. Providing appropriate explanation allows users to begin to believe that the system is “taking responsibility”
to be understood. This is a profound, implicit commitment that all human interlocutors make to each other, and which is necessary to facilitate
communication.
Performers in the XAI program began by focusing on level one. Many moved on to level two in order to improve
human/machine task performance. In the last year, some (including us) began to explore level three by considering
how the user could adapt or correct the system's performance by changing its explanations or by adding rules. This deep
tasking, wherein the agent knows its objective and the explanation for that objective, will be critical to enable higher
levels of autonomy.
Our work and the work of others in the XAI program have shown that people can learn something about an ML
system's competence from statically presented explanations and make use of that knowledge. For example, as shown in
the work presented above, salience explanations help people predict competence and contrastive explanations help
people learn discriminative tasks. Our later work suggests that explanation as a medium for interaction shows
promise for enabling human adaptation and correction of ML agents.
ACKNOWLEDGEMENTS
The information provided in this document is derived from an effort sponsored by the Defense Advanced Research Pro-
jects Agency (DARPA) and the Air Force Research Laboratory (AFRL) and awarded to Raytheon BBN Technologies
under Contract Number FA8750-18-C-0004.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the repositories referenced in the paper.
ORCID
William Ferguson https://orcid.org/0000-0001-7491-5041
Jialin Wu https://orcid.org/0000-0003-4684-5212
REFERENCES
1. Goyal Y, Wu Z, Ernst J, Batra D, Parikh D, Lee S. Counterfactual visual explanations. ICML; 2019.
2. Selvaraju RR, Das A, Vedantam R, Cogswell M, Parikh D, Batra D. Grad-CAM: why did you say that? Visual explanations from deep
networks via gradient-based localization. ICCV; 2017.
3. Agrawal A, Batra D, Parikh D. Analyzing the behavior of visual question answering models. EMNLP; 2016.
4. Rohrbach A, Hendricks LA, Burns K, Darrell T, Saenko K. Object hallucination in image captioning. EMNLP; 2018.
5. Goyal Y, Mohapatra A, Parikh D, Batra D. Towards transparent AI systems: interpreting visual question answering models. Paper pres-
ented at: International Conference on Machine Learning (ICML) Workshop on Visualization for Deep Learning; 2016.
6. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-
based localization. ICCV; 2017.
7. Hendricks LA, Burns K, Saenko K, Darrell T, Rohrbach A. Women also snowboard: overcoming bias in captioning models. ECCV; 2018.
8. Agrawal A, Batra D, Parikh D, Kembhavi A. Don't just assume; look and answer: overcoming priors for visual question answering. Paper
presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018: 4971-4980.
9. Selvaraju R, Lee S, Shen Y, et al. Taking a HINT: leveraging explanations to make vision and language models more grounded. Paper
presented at: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019: 2591-2600.
10. Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. Paper pres-
ented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018: 6077-6086.
11. Bau D, Liu S, Wang T, Zhu J, Torralba A. Rewriting a deep generative model. European Conference on Computer Vision. Cham: Springer;
2020:351-369.
12. Kohonen T. Correlation matrix memories. IEEE Trans Comput. 1972;100(4):353-359.
13. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint
arXiv:1710.10196; 2017.
14. Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and improving the image quality of stylegan. Paper presented at:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020: 8110-8119.
15. Aberman K, Liao J, Shi M, Lischinski D, Chen B, Cohen-Or D. Neural best-buddies: sparse cross-domain correspondence. ACM Trans
Graph (TOG). 2018;37(4):1-14.
16. Wu J, Mooney RJ. Improving VQA and its explanations by comparing competing explanations. Paper presented at: Proceedings of the
AAAI Workshop on Explainable Agency in AI; 2021.
17. Tan H, Bansal M. LXMERT: learning cross-modality encoder representations from transformers. EMNLP; 2019.
18. Park DH, Hendricks LA, Akata Z, et al. Multi-modal explanations: justifying decisions and pointing to the evidence. CVPR; 2018.
19. Wu J, Mooney RJ. Faithful multimodal explanation for visual question answering. ACL BlackboxNLP Workshop; 2019.
20. Anderson P, Wu Q, Teney D, et al. Vision-and-language navigation: interpreting visually-grounded navigation instructions in real envi-
ronments. Paper presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018.
21. Fried D, Hu R, Cirik V, et al. Speaker-follower models for vision-and-language navigation. arXiv preprint arXiv:1806.02724; 2018.
22. Majumdar A, Shrivastava A, Lee S, et al. Improving vision-and-language navigation with image-text pairs from the web. Paper presented
at: European Conference on Computer Vision. Springer, Cham; 2020.
23. Ilharco G, Jain V, Ku A, Ie E, Baldridge J. General evaluation for instruction conditioned navigation using dynamic time warping. arXiv
preprint arXiv:1907.05446; 2019.
How to cite this article: Ferguson W, Batra D, Mooney R, et al. Reframing explanation as an interactive
medium: The EQUAS (Explainable QUestion Answering System) project. Applied AI Letters. 2021;2(4):e60.
doi:10.1002/ail2.60