Speech, Signal, Symptom:
Machine Listening and the Remaking of Psychiatric Assessment
by
Beth Michelle Semel
M.A., Anthropology
Brandeis University, 2013
B.A., Writing, Literature, and Publishing
Emerson College, 2010
Submitted to the Program in Science, Technology, and Society
In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in History, Anthropology, and Science, Technology and Society
at the
Massachusetts Institute of Technology
September 2019
© 2019 Beth Semel. All Rights Reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly paper and
electronic copies of this thesis document in whole or in part in any medium now known or
hereafter created.
Signature redacted
Signature of Author:
History, Anthropology, and Science, Technology and Society
August 22, 2019
Signature redacted
Certified by:
Graham M. Jones
Associate Professor of Anthropology
Thesis Supervisor
Signature redacted
Certified by:
Stefan Helmreich
Elting E. Morison Professor of Anthropology
Thesis Committee Member
Signature redacted
Certified by:
Amy Moran-Thomas
Alfred Henry and Jean Morrison Hayes Career Development Assistant Professor of
Anthropology
Thesis Committee Member
Signature redacted
Certified by:
Heather Paxson
William R. Kenan, Jr. Professor of Anthropology
Thesis Committee Member
Accepted by: Signature redacted
Tanalís Padilla
Associate Professor, History
Director of Graduate Studies, History, Anthropology, and STS
Accepted by: Signature redacted
Jennifer S. Light
Professor of Science, Technology, and Society
Professor of Urban Studies and Planning
Department Head, Program in Science, Technology and Society
Speech, Signal, Symptom:
Machine Listening and the Remaking of Psychiatric Assessment
by
Beth Michelle Semel
Submitted to the Program in Science, Technology, and Society on August 31, 2019 in Partial
Fulfillment of the Requirements for the Degree of Doctor of Philosophy in History,
Anthropology, and Science, Technology and Society
ABSTRACT
This multi-sited, ethnographic dissertation follows teams of psychiatric and engineering
professionals collaborating to tackle one of Western psychiatry's longest standing issues: the
subjective nature of mental illness. Situated at three different U.S.-based universities, the teams
are driven by a conviction that conventional methods of psychiatric screening are fallible if not
altogether inaccurate, since they depend upon a mental health care worker's ability to interpret
the semantic content of a patient's speech. Through research studies involving human subjects,
the teams hope to develop more biologically based and resource-efficient screening techniques
that instead analyze paralinguistic, acoustic components of speech (such as pitch, speaking rate,
and breathiness), which they argue are more directly linked to the internal mechanisms that drive
mental illness. By turning to the expertise of computer scientists and engineers, they seek to
build "machine listening" prototypes for psychiatric assessment: technologies that use a
microphone to capture sound and artificial intelligence (AI) to analyze sound.
While their studies are premised on the notion that AI can listen beyond the human by
attending to sounds of speech that have psychopathological significance supposedly set aside
from linguistic meaning and human difference, in order to gather and classify the data necessary
for building their technologies, researchers must rely on the very components of language that
they seek to overcome: its interactional, sociocultural dimensions. I show how the connections
between spoken utterances and inner states that researchers design their systems to make
"autonomously" depend upon a tightly managed but oftentimes hidden infrastructure of human
labor, including the labor of research subjects. The division of labor within the teams replicates
hierarchies of value within mental health care professions, which place diagnosis and treatment
at the top as expert, biomedically and legally ratified forms of judgment, and place the data entry
and triage work of assessment at the bottom, as skilless, para-professional, and mechanized
tasks.
In describing the vexed status and ethics of listening, language, labor, and care in
contemporary U.S. mental health care, the dissertation tells a larger story about the stakes of
framing mental illness as a scientific, bureaucratic problem calling for a technological
intervention.
Thesis supervisor: Graham M. Jones
Title: Associate Professor of Anthropology, Margaret MacVicar Faculty Fellow
Table of Contents
Acknowledgments
Introduction
Chapter 1
Computational Psychiatry's Coded Past
Chapter 2
Talking Heads: Brains, Bodies, and Vocal Biomarkers
Chapter 3
Do Androids Dream of Electric Speech?
Chapter 4
Listening Like a Computer
Conclusion
An Ironic Dream of a Common Language
Acknowledgments
Dissertations are truly collaborative documents, and many people have labored with and
alongside me to bring this particular document into the world. First, I acknowledge the funding
that made this project possible. Then, I acknowledge the friendship.
Research for this dissertation was supported by a Society for Psychological
Anthropology/Robert Lemelson Foundation Fellowship in 2015; a Dissertation Fieldwork Grant from
the Wenner-Gren Foundation in 2016; and a Doctoral Dissertation Research Improvement Grant
from the Cultural Anthropology Division of the National Science Foundation in 2016. Special
thanks to Jeffery Mantz for his help and encouragement throughout my fieldwork years. Writing
for this dissertation was supported by a Weatherhead Fellowship from the School for Advanced
Research in Santa Fe, New Mexico, where I was in residence from 2018 to 2019.
Immeasurable thanks are due to my ethnographic interlocutors, whose trust, friendship,
conversations, and insight form the basis of this dissertation. I would not have been able to
critically read their technologies, and the nature of our collaboration, without their guidance
and teachings. To my game-changers and confidants: thank you for giving me the honor of
uttering your own critiques for you. Thank you for helping me feel at home while also
challenging me to question my surroundings.
During my undergraduate years, Roy Kamada and Murray Schwartz went out of their
way to mentor me and offer up their precious time to review my (at times overly) ambitious
writing. Roy taught me how to write an abstract, and how to love reading in a new way. Murray
introduced me to psychoanalytic theory. It was in meetings with both of them after graduating
college that I dreamt up the idea to pursue anthropology.
I completed an MA in Anthropology at Brandeis University, and several faculty members
deserve thanks for training me and building me up into the scholar and thinker that I am today,
particularly Elizabeth Ferry, Anita Hannig, Janet McIntosh, and Richard Parmentier. The
friendship of Katherine Morely Eramo and Olivia Spaletta likewise played an important role in
my time at Brandeis. Katherine's support and encouragement sustains me still.
To use my friend Danielle's phrase, thank you to everyone who has continually sent me
postcards from the outside world in the years leading up to and during my PhD at MIT: helping
me to get outside of my own head and enjoy the world around me, including and especially their
company. I thank my cohort-mates, Richard Fadok, Clare Kim, Lauren Kapsalakis, Alison
Laurence, and Peter Oviatt, for establishing an atmosphere of collaboration, candor, and
compassion from the start. I feel grateful to have made this journey alongside you all. Beyond
my cohort, at HASTS, I thank Marc Aidinoff, Renée Marie Blackburn, Ashawari Chaudhuri,
Grace Kim, Steve Gonzalez, Shreeharsh Kelkar, Crystal Lee, Jia Hui Lee, Lucas Mueller, Canay
Ozden-Schilling, Tom Ozden-Schilling, Luisa Reis Castro, Elena Sobrino, Mitali Thakor, and
Claire Webb. Elena and Grace's names are worth repeating: both came to my aid during
medical emergencies. Outside of MIT, I have made several colleagues and friends along the way
whose thinking and influence is palpable in these pages. I thank especially Marisa Brandt,
Danielle Judith Carr, Anar Parikh, Nick Seaver, and Luke Stark. In New Mexico, my fellow
fellows and the interns at SAR (scholars, activists, artists, mentors, and friends) made me
laugh, made me feel loved, and helped me to find the spirit to keep writing. I thank all of them
for sharing their ghost stories, jokes, recipes, scholarship, organizing work, and lives with me
during our nine months of residency: John Arroyo, Monika Banach, Gio B'atz', Ixq'anil Banach-B'atz',
William Calvo, Nick Estes, Mayanthi Fernando, Felica Garcia, Frida Garcia, Terran Last
Gun, Samantha Tracy, Melanie Yazzie, and Wilma Yazzie-Estes.
Several friends and loved ones have been by my side since long before I began studying
anthropology. I thank them for helping me to face the world with openness and excitement, and
to imagine other, possible futures for myself and for so many others. These people include Meryl
Bennett, Kayla, Hillary, and Laurie Fortin, Hannah Nyren, Claudia Kretschmer, Melissa Siebert,
Alexandra Tate, Chelsea Thomas, and Tau Zaman. Alexander Kranzusch produced the
illustrations used to describe the technologies discussed in Chapter 3.
The mentorship of Graham Jones at MIT has fueled me throughout the six years of my
PhD. Graham's intellectual creativity, his kindness, his encouragement, and our marathon
meetings and phone calls have kept me afloat and have shaped this project in invaluable ways.
Likewise, I thank my committee for their scholarship, their input, and for their generosity and
openness: Stefan Helmreich, Heather Paxson, and Amy Moran-Thomas. Several other faculty
members at MIT played a significant role throughout the PhD, in terms of finessing my project
and in producing scholarship that informs my own, including Dwai Banerjee, Erica James, and
Robin Wolfe Scheffler. The development of this project has also benefited immensely from
generous, generative conversations (in person or otherwise) with several scholars whose work
has also inspired this dissertation, including Felicity Aulino, Nick Harkness, Matthew Hull,
Alaina Lemon, Michael Lempert, Natasha Schüll, and Jason Throop. I had the honor of briefly
meeting Chuck Goodwin, whose 1994 article, "Professional Vision," changed the course of my
thinking when I read it as an MA student. We spent the day talking and walking together as if we
had known each other for years. His immediate faith in and excitement over my project meant
the world to me. His passing during the writing of this dissertation impacted me deeply.
Much of my dissertation is aimed at honoring work that is vital to the production of
scientific knowledge but often goes unnoticed: namely, the work of administrative laborers. I
would be remiss, then, to miss the opportunity to thank the various administrative workers who
have held things together for me. Thank you to Karen Gardener, an advocate and a friend, for the
many big and small ways in which you helped me complete the PhD. Many thanks as well to
Carolyn Carson in STS, and to Irene Hartford, Barbara Keller, and Amberly Steward in
Anthropology.
I thank my family for their unending patience and their unending care, in both its real and
para-forms. This includes my cat, Millie Semel, for keeping me company during late nights and
early mornings of writing and reading. I thank my twin sister, Sarah, for giving me a first-hand
experience in theorizing resemblance, and her partner, Sam Levine. I thank my older sister,
Hillary, for her artful eye and caring heart, and for pushing me to be the best teacher I can be. To
my parents, Donna and Scott Semel, I truly owe everything. My mother, a former speech
therapist, taught me how to listen. My father, a lawyer, taught me how to make a good argument
and how to love sci-fi. With love and gratitude, I also thank my aunt, Lisa Semel, and her
husband, Jonathan Guthart, along with my cousins: Scott, Mercedes, Amanda, and Brandon
Holtzman, and Eileen, Jake, and Sara Wasserman. My grandmother, Joan Semel, passed away
before I could complete this dissertation. She often bragged that I was going to become the first
doctor in the Semel family. I hope to continue to make her proud.
Last but certainly not least, it is difficult to find the words to adequately thank my
partner, Ryo Morimoto, for nurturing my ideas and nurturing me, for cheering me on when I
needed it the most and when I didn't know I needed it at all. You are my biggest inspiration, my
favorite thinker, and my favorite person.
INTRODUCTION
"'Yes,' said Steamer [...]'we have great plans to use information theory to augment psychiatry.
I'm sure you know that the tone of peoples' voices tells a listener a great deal about their
emotional state. We have recorded some speech from a psychoanalytic interview and by infinite
clipping have been able to remove all the emotional content. By processing what we remove, we
expect to be able to identify those characteristics that carry the emotional information.' [...]
'What are you going to do now?' asked George.
'We're going to more sophisticated processing, but we're still looking for ideas,' said Steamer.
'Do you think I could get a thesis out of this work?'
'Of course! This stuff really strikes people's imagination. We have working arrangements with
several psychiatrists in town, and you could help them in unraveling this business.'
'I would think,' said George, 'that one should know something about the nature of speech before
taking on such a project.'
'Maybe so,' said Steamer, 'but remember you're an engineer, not a phonetician or a linguist.
You could attack the problem from an engineering viewpoint."'
- (David, E.E. Jr. 1962. "Bionics or Electrology? An Introduction to the Sensory
Information Processing Issue." Pp. 74)
It is early in September of 2015, and I am sitting in a chromatic colored conference hall on the
top floor of MIT's Media Lab among rows and rows of folding chairs. One of the room's giant
windows offers a view of the Charles River, and across the water, the tops of skyscrapers glint in
the morning sun as conference attendees file into the room. Students in flip-flops and cargo
shorts share elbow space with technology company executives and start-up employees in
business suits and blazers, mostly from the Boston area but some from as far as South Korea.
We're gathering in this grey and black room, after having picked up our nametags and a bag of
promotional gifts and pamphlets, to listen to the same thing: the opening plenary of the first-ever
Emotion and Artificial Intelligence (AI) Summit, sponsored by a company called Affectiva.
Affectiva was born out of collaboration between a former MIT Media Lab student and
Rosalind Picard, a computer scientist responsible for establishing the field of "affective
computing" who runs a research group of the same name. Affective computing is dedicated to
building computers and algorithmic systems that can interpret and respond to displays of human
emotion (Picard 1995; 1997; 2003). Many consider Professor Picard a pioneer for insisting that
emotion is not opposed to reason and instead plays a key role in the "intelligence" that computer
scientists seek to replicate in technologies meant to aid and assist in human activities. Affectiva
wraps principles of affective computing into the development of software packages that offer, as
their company website states, "insight into unfiltered consumer responses to ads, videos, and TV
programming," and, most recently, responses to the user interfaces in autonomous vehicles. The
cover of Affectiva's promotional pamphlet shows a photograph of a Black woman smiling,
framed by a yellow square to indicate that Affectiva is analyzing her resplendent face and
capturing proof that whatever product she is viewing brings her great joy. Affectiva specializes
in automated image recognition: their software packages rely on a camera to pick up small
movements in people's facial musculature as they watch a commercial, interact with a product,
or sit at the wheel of a self-driving car. An algorithm, "a sequence of computational steps that
transforms the input into an output" (Cormen et al 2009: 5), calculates the statistical
relationship between the movements of the user's face and entries in a database of facial
expressions that a human has labeled with an emotion, drawn from a set list of possible emotions
that another person has assembled. By Affectiva's definition, this means that their products can
autonomously "recognize" human emotions.
According to Affectiva, the goal of the Summit is to explore "how Emotion AI can move
us to deeper connections with technology, with business and with the people we care about."
The summit's mix of commercial, corporate, and academic audience members is a familiar one
for me. I've just returned from my fieldwork with groups of psychiatric and engineering
professionals collaborating to build voice analysis technologies for psychiatric screening, and I
attended similar workshops and conferences during my twelve months of sustained
participant-observation. It wasn't until the opening plenary began that I realized just how close the Summit,
and Affectiva, would come to my fieldwork. The company's founder takes to the stage to
announce the reveal of a project that has been years in the making, one that they would be
demonstrating live for the first time ever: voice analysis. Affectiva hopes to use voice
analysis to strengthen their existing prototypes. At the podium, the head of the company explains
that the voice adds another layer of emotional data, rich with information about a consumer's
response to the world.
I have encountered a variety of other voice analysis and detection systems throughout my
fieldwork. The designers, funders, makers and stewards of these systems build them to recognize
vocally expressed interior states based on how a person sounds rather than the content of what
they say. For the demonstration, the head of the company calls her colleague to the podium, and
he joins her to narrate a story about attempting to cook a turkey for Christmas dinner, their faces
projected on a multitude of screens hung throughout the room so that all in the audience can
witness the technology at work. Their faces are also framed by tiny square outlines like the
woman in the promotional material, but instead of yellow, the squares are pink and blue: blue for
the male speaker, and pink for the head of the company, a woman, apparently indicating that the
software can also detect gender. The story is banal, lighthearted, with a twist: the turkey explodes
in the oven, startling everyone in the house, especially the cook. As he speaks and the head of the
company listens, small script letters appear on the screen next to their faces, emotion words that
more or less coincide with the tragi-comedy arc of the turkey tale: happiness, humor, surprise, a
flash of humiliation, happiness. These adjectives and their immediate appearance as the story
progresses are meant to indicate the prototype at work. Together, the demo gives the impression
of immediacy, that their states are known and displayed in real time, almost as if they are being
directly translated from words to feelings, as if insides have been turned out.
It is a convincing and persuasive demonstration. But like many of the other demos I have
witnessed, it comes without a discussion or explanation of how it all works. What the demo
shows (that Affectiva's new prototype could recognize how the man sounded, tracing the
emotional contours of his voice as he told the story) is the punch line, the self-evident point of
the drama. As Lucy Suchman notes, the demo is a distinct genre of performance in the human-
computer interaction (HCI) sector. "Like other conventional documentary productions," she
writes, "these representations are framed and narrated and instruct the viewer in what to see" or
to hear (2007: 237-238). The demo is one of many rhetorical devices that thread together the
analogy between the computational process underlying automated technologies and human
behavior, supporting the human-likeness of the technology while strategically leaving out the
humans whose judgment, sensing, and choices enabled its functionality. In the process of
building human-like technologies (like software that can detect the emotional texture of a
person's voice, only better, faster, and more accurately than a human ever could), technologists
and their collaborators must articulate and concretize their ideas about what it means to be
human.
From my vantage point in the audience come a series of questions that I have only
learned to ask from the people I have worked among and studied, my ethnographic
interlocutors, after having observed and assisted them with building voice analysis technologies
for psychiatric screening in the context of academic studies, and after having stood in an
exhibition hall alongside them to give strategic performances of our own, showcasing all the
prowess (and none of the pitfalls) of their technological prototypes. What dataset did they use
to train the algorithm, that is, to determine what counts as a happy sounding voice, a surprised
sounding voice, a humiliated sounding voice? Did they build their own corpus, asking paid
volunteers to emote vocally, and then have another paid volunteer listen to and label excerpts of
this emotive speech? Or did they use one of the many pre-existing "emotional" speech corpuses
that have already been labeled, and that usually consist of an actor performing an emotion? Did
their dataset only consist of speakers of American English? Could the software work just as
smoothly and convincingly with a person who did not speak English as their first language?
What exactly about the speakers' bodies and voices put them in the categories of "male" or
"female"? Who made the call as to what counts as "male" or "female" facial features or vocal
qualities? How might the system respond to speakers who do not live inside neat, bounded,
binary boxes of gender, like trans or gender non-conforming people, or anyone else who stands
outside of what MIT critical computer scientist Joy Buolamwini (2016) might call Affectiva's
"coded gaze," beyond the "embedded views [and voices] that are propagated by those who have
the power to code systems"?
By the afternoon, my questions are still unanswered and I'm beginning to grow sleepy. I
try to keep myself alert with a complimentary bar of chocolate laced with espresso beans,
wrapped in sky blue paper featuring a drowsy-faced emoticon that declares AWAKE in block
letters. I want to keep my eyes (and ears) open for a panel I've been anticipating, entitled "The
Future of AI: Ethics, Morality, and the Work Force." For this panel, the moderator presents
panelists with a series of ethical conundrums, asking them to explore how these speculative
fictions relate to problems that AI and computational technologies currently present. The first
ethical scenario takes the form of a trolley problem, a classic thought experiment in ethics that
presents the audience with a choice over whose lives to sacrifice to the path of a runaway trolley
throttling down one of two possible paths. In this story, an autonomous vehicle with a driver
asleep behind the wheel and careening toward a girl chasing a ball takes the place of the trolley.
It is only when the moderator begins reading the second scenario that I snap to attention, sitting
straight up in my folding chair for a story that is both familiar yet strange:
The year is 2024. The country's last 50 remaining truck drivers converge on Washington
to protest the loss of their jobs to robots. They block Connecticut Avenue and they drive
their rigs onto the mall, where they are joined by a small army of unemployed
accountants, nurses, adjunct professors, Wall Street analysts and journalists, all of whom
have been put out of work by AI. The Washington Police Department considers sending
robo-cops or high-pressure hoses to disperse the protesters, but in an uncharacteristic
act of compassion, instead send in a phalanx of mental health clinicians, especially
trained to be sympathetic to people in distress. The clinicians, of course, are robots. AI's
supposed to make as much as 38% of the U.S. workforce obsolete within the next coming
decades. Is this outcome inevitable, and if not, how do we prevent it?
On the surface, the conundrum involves the impending threat of job loss due to advancements in
AI that enable machines to perform tasks that humans used to, like gauging a patient's blood
pressure, turning on a high-pressure hose, or delivering the nightly news.
The panelists' answers both lean into and contest that fear, offering perspectives that
jump across the spectrum of perilous evil and potential good. One panelist argues that human
obsolescence is the inevitable conclusion of a society under capitalism. When the bottom line is the
expansion of production in pursuit of economic growth, he quips, it is only a matter of time
before employers dispose of human labor in favor of more cost-efficient automated labor that
does not require bathroom breaks, cannot become pregnant or injured, and will never ask for
higher wages. On the more optimistic side, another panelist insists that emotional AI can help
democratize access to psychiatric care. Where she grew up in the Middle East, there are more
patients than care providers, and even the most sympathetic and hardworking of nurses become
burnt-out, hardened, and jaded. What if we had nurses or clinicians, she wonders, who oversaw
10 or 20 or 100 mental health robots or avatars or virtual assistants? They could use an interface
to control these human-like technologies from afar to conduct triage work, helping the human
doctors manage their caseload by determining which patients are in direst need of care. The
human nurse, she muses, would only get tapped if the system says it's a big deal; otherwise, the
avatar takes care of the patient.
The ironic denouement of the story is that the clinicians are not human, suggesting a
future in which even therapy, the provision of sympathetic, psychiatric care, can be mimicked
and performed by a human-like machine. The slight gasp from the audience and the subtle smile
of the panelists indicate the story's success. The plot twist plays with the figuration of the
machine as a foil for the human and operates through a time-worn Cartesian binary: humans
possess the spark of spirit, of the psyche, and therefore, the capacity for intersubjectivity,
whereas machines are inert matter, unaware and lacking consciousness. How might it be possible
for a robot to quell a political uprising, soothe the angry masses, or mend the wounded psyches of
people whose professions are no longer valued?
Nevertheless, what stuck out to me about this story, and what set me sitting rigid in my
folding chair, was not the robots, but the absence of a particular set of humans from the
fabulated, dystopian protest. Nurses (administrative healthcare workers) careen down
Connecticut Ave., alongside truck drivers, journalists, and adjuncts, a brigade of professionals
that the McKinsey Global Institute projected in 2017 to be the most at-risk for eventual job
replacement due to automation in the United States (Manyika et al 2017). But if the clinicians
administering sympathetic care are, "of course," robots, then where are the human clinicians? If
the robots have been deployed to act like clinicians, then why are the clinicians not protesting?
As I will show in this dissertation, the answer to the question relates to the demonstration that
came before it: automated speech analysis, especially speech analysis in the context of
psychiatric encounters between patients and care providers.
These two moments at the Summit connect Affectiva, and the broader contemporary
milieu in the U.S. that is replete with machine listening technologies, together with my
informants, and with the technological prototypes, sociotechnical imaginaries (Jasanoff and Kim
2015), and ideas about speech, mind, and self that my interlocutors pursue at one turn and
contest at another. The two moments represent the major topics that I treat in this dissertation:
language, listening, labor and care. My interlocutors' research projects are sites through which
these topics intersect. My aim is to sketch out a theoretical framework for thinking through this
critical nexus.
"DO YOU THINK I COULD GET A THESIS OUT OF THIS WORK?"
I begin with Affectiva both because its approach is so different from those pursued by the groups
with whom I conducted fieldwork, and because of the striking similarities they exhibit.
These similarities and differences help to illustrate the scope of my fieldwork, the stakes of the
research projects it focuses on, and the larger, theoretical themes that studying these projects
ethnographically has led me toward.
Conducted between 2015 and 2017, my fieldwork followed three interlinked,
interdisciplinary research teams of psychiatric and engineering professionals at three different
U.S.-based universities collaborating to develop automated listening technologies for psychiatric
assessment, sometimes referred to as psychiatric screening. This included a twelve-month span
of consecutive fieldwork, during which I spent four months at each of the three universities,
working as a research assistant on the teams, conducting interviews, participating in and
observing activities that spanned the research and development pipeline while also attending
weekly group meetings, courses, conferences, workshops, and symposia with my informants. I
played a hands-on role in, among other things, developing experimental stimuli used in the study
(even co-writing and acting in a film), creating training manuals and leading training sessions for
incoming lab members, and revising grant proposals and article drafts.
Spread across the United States, the teams at East Coast University (ECU), West Coast
University (WCU) and Midwestern University (MWU)¹ are all part of the same network. They
have shared academic pedigrees and individual team members know each other. Some have
trained together under the same supervisors, who are also on the teams, and they run into each
other at conferences if they are not already speaking on the same panel together. All three teams
are working on technologies like cell phone applications and software packages that can be
installed in a variety of user interfaces, including humanoid robots. While each team focuses on a
different diagnostic category (depression at ECU, post-traumatic stress disorder at WCU, and
bipolar disorder at MWU), they share an overarching goal. The teams design their devices in
order to connect the sounds of speech with what they take to be people's inner, psychological
states, states they hope to access by attending only to the acoustic qualities of speech, rather than
its linguistic or semantic meaning. They strive for their technologies to only analyze what are
1 I use pseudonyms throughout the dissertation to refer to institutions, people, and technologies. The use of
pseudonyms is a common practice in anthropology, employed to respect and protect the privacy and anonymity of
their research subjects. This is especially important given the fact that many of my interlocutors expressed critical
opinions and attitudes toward their research projects, even as they worked within them. Anthropologists studying
science and technology have taken up the convention of anonymizing the names of graduate students as opposed to
PIs, given the precarious nature of graduate students' position within the academy, and given that the PIs of the
studies tend to be big names and prime movers in their fields whom it would be pointless to try to anonymize (see
Gusterson 1996; Sundar-Rajan 2006; Roosth 2017). While the PIs with whom I worked will be recognizable to each
other-the small, shared space of a sub-sub-subfield-they are not as recognizable to the general public as the PIs in
some of these other ethnographic studies. For these reasons, I have chosen to keep them anonymous as well.
sometimes called paralinguistic components of speech: e.g., pitch, energy, rate of articulation,
breathiness. Their aim is to reconfigure the conventions of a crucial practice in U.S. mental
health care: psychiatric assessment.
Like the entire enterprise of Western psychiatry itself, psychiatric assessment is
suspended at the level of semantic meaning. There are no blood tests for mental illness, no
thermometers. There are only conversations. Many of my informants argue that the technologies
they are developing to do the sorting work of psychiatric assessment will make the
process more objective by changing the way that speech is interpreted,
transforming speech from personal narratives into neurobiological signs and using AI (artificial
intelligence) to circumvent semantic meaning and overcome all of the subjective (and cultural)
aspects of language. They also argue that these technologies will save money and time, and save mental health
care workers from burning out in emotionally laborious jobs, although quite a few of my other
informants are cynical or skeptical about the success of their endeavors, even as they work
toward this goal.
As noted, unlike Affectiva, the teams are all situated in the academic realm, and they
develop their prototypes and research findings in the context of academic studies. Unlike
Affectiva, the teams must have their research approved by their universities' Institutional Review
Boards (IRBs), organizational bodies that oversee and regulate research conducted with human
subjects in accordance with federal standards established in 1981 and revised in 2019. The
researchers must adhere to the university IRB's ethical protocols and bureaucratic requirements
aimed at ensuring safe and non-coercive informed consent and at protecting the anonymity and
confidentiality of research subjects, while minimizing the harm that participation might incur.2
2 45 CFR Part 46. 1981. (HHS and FDA 1981.) https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html
Because Affectiva sells consumer products rather than health care interventions, they have no
such oversight with which to contend. Under a neoliberal model of consumer choice, it is up to
the user to decide whether or not they agree to Affectiva's Terms of Service, and once they click
submit, the user (and their data) is at Affectiva's whim.
In a scientific, academic study, the goal is not to sell a product or to grow financially. The
goal is to produce knowledge, although knowledge production has a price. Researchers must
seek out grant money and other forms of funding to sustain their work, and, as scholars such as
Scott Vrecko (2010) have observed, trends and changes in funding institutions shape and
transform the path of a team's research and the nature of the facts about mental illness that the
studies ultimately produce. At East Coast University, the team relies on federal funding, seeking
out and securing grants from federal institutes like the National Science Foundation (NSF) or the
National Institutes of Health (NIH), focusing their efforts on basic science research aimed at
contributing to biomedical understandings of mental illness. At West Coast University, in
addition to academic federal institutions, researchers rely on military funding, which is abundant
but inconsistent, and the team must petition every year to have their funding renewed. Their
prototype must have a dual use component-it should serve military and civilian populations
alike. By contrast, the Midwestern University team's primary source of funding is philanthropic.
They appeal directly to individual philanthropists and non-profit organizations to keep their
research going, and as a result, they focus on building technological prototypes with a societal
impact (specifically, improving upon the treatment of mental illness), the success of which can
be articulated in non-technical terms.
My informants' specific focus on mental health care also sets them apart from Affectiva.
Studying mental illness and developing technologies to intervene on one of Western psychiatry's
longest standing issues-the subjective nature of psychiatric pathology-means that my
interlocutors, unlike the employees and technologists of Affectiva, must confront human
suffering, sometimes indirectly, and other times, face-to-face. The people who create and curate
the teams' database (the corpus of audio-recorded speech that forms the basis of the teams'
algorithmic systems) as well as the intended end-users are a vulnerable population: people who
live with and alongside mental illness, either with a formal diagnosis or somewhere at the
"subclinical" level, between the cracks and gaps in America's conventional diagnostic,
nosological infrastructure. Like other citizen-subjects living under conditions in which the
retreat of state-sponsored social services (like health care) forces them to seek out alternate means
through which to access resources to sustain their wellbeing (James 2004; Petryna 2009; Nguyen
2010), the research subjects tend to be vulnerable and disenfranchised in multiple ways.
Alongside their mental health issues, research subjects tend to be unemployed, on disability
leave, veterans, recovering addicts, or people experiencing homelessness-the kinds of people
who have the time to spare during the working hours of the weekdays, who are in need of the
money (or other resources, like access to an internet-enabled smart phone) that they can make
while participating in the study.
Like the feature that Affectiva revealed during the Summit's morning keynote, my
informants are trying to build automated technologies that can assess a person's inner state, with
an emphasis on pathological affective states, not based on what they say, the semantic content of
their utterances, but how they say it-the acoustic, formal properties of their utterances. Across
the three teams, the same basic principles and premises support their research. The teams seek to
treat speech not as a linguistic practice but as a motor activity. They study speech not as a
sociocultural narrative but as a biomechanistic output, as a sound that contains information about
the source that created it: the speaker's brain or at least their psychological state.3 Put differently,
with reference to composer and theorist Michel Chion's three modes of listening (1990: 23-35),
the teams approach psychiatric speech through reduced listening (attending to sound's formal
qualities and characteristics) in order to enable causal listening (attending to a sound in order to
ascertain its source). In so doing, they seek to press the meaning out of speech, circumventing
altogether all of the components of speech that linguistic anthropologists assert make speech
interactional and cultural. The overarching aim of their studies is to capture something pre-
linguistic about the activity of producing oral speech-something that is universal and grounded
in the biological realm, rather than something particular to an individual. The Principal
Investigators (PIs) of the teams argue that changing the way speech in psychiatric settings is
listened to and interpreted-with attention placed on sound rather than meaning-will aid in the
identification of objective indicators of mental health.
At the same time, in order to achieve this scaling feat (moving from the most finite, fine-
grained scale of language to the most universal possible scale of human nature), just like
Affectiva, the teams need to assemble a data set, and they need to classify items in the data set.
These are the prerequisites to enabling an algorithmic system that can calculate the statistical
similarity between a known item in the dataset, and some unknown, novel item. In other words,
in order for an algorithmic system to "recognize" features of speech associated with psychiatric
states, at some point, the creators and stewards of that system need to set parameters and
definitions for how these states sound. In the case of my informants, their dataset is comprised of
excerpts of research subjects' speech. To build and then classify this data set, the teams must
3 As discussed in Chapter 2, their research is genealogically entangled with the history of telephony and telephone
engineering in the United States, which is itself indebted to the development of experimental phonetics and d/Deaf
education, both of which birthed the assumption that speech "could be exhaustively investigated as a purely
mechanical process" (Mills 2010: 38).
make a series of interlinked choices. How will they measure-and define-mental illness, and
the three different diagnostic categories on which they will focus? What kind of speech do they
want to elicit from research subjects, and how will they elicit it? Will they engage the subject in
a conversation, or record a conversation that the subject has with someone else? Once the speech
is elicited and recorded, how will they qualify-or quantify-the speech? Whose job will it be to
determine how the speech sounds?
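The dependence of similarity-based "recognition" on prior human classification can be illustrated with a minimal sketch. It is not a depiction of any team's actual system: the feature values, the labels, and the choice of Euclidean distance as the similarity measure are all hypothetical assumptions made for illustration.

```python
import math

def classify(unknown, labeled_examples):
    """Return the human-assigned label of the known item nearest to `unknown`.

    Similarity is approximated here by Euclidean distance (an assumption);
    a smaller distance is treated as greater similarity.
    """
    nearest_features, nearest_label = min(
        labeled_examples,
        key=lambda example: math.dist(unknown, example[0]),
    )
    return nearest_label

# Hypothetical acoustic features: (mean pitch in Hz, syllables per second).
# The labels are placeholders; in practice a person must decide what they are.
labeled_examples = [
    ((180.0, 2.0), "label_A"),
    ((200.0, 6.0), "label_B"),
]

print(classify((185.0, 2.2), labeled_examples))
```

The function can only ever return a label that a person has already attached to an item in the data set. If the same recordings are relabeled, the same unknown item is "recognized" differently, which is one way of seeing why the choices listed above cannot be delegated to the system itself.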
My dissertation showcases the variety of ways in which each of the teams answers these
questions, and the stakes of their answers with regard to ideas about being human, being mentally
ill, and language that they reify and reproduce. At the same time, I show that regardless of the
many different ways these questions can be answered, the teams all found themselves grappling
with the same, fundamental issue. While their eventual goal is to build a system that circumvents
the semantic notions of speech, in order to build that system, they must engage with the very
sociocultural dimensions of language that they seek to overcome. The connections between
spoken utterances and inner states that their systems perform "autonomously" and automatically,
like the system demonstrated at Affectiva's Summit, depend on a tightly managed infrastructure
of human labor that includes both research subjects and members of the research team.
I follow Ekbia and Nardi (2017) and other scholars of science and technology in asserting
that it is not very productive to think of automation as autonomous, as machines doing things
without human intervention. Instead, it is much more productive and indeed much more accurate
to discuss automation as heteromation-a mixture of human and machine work. Considering
automation as heteromation-as humans doing things with machines, although the humans are
not always easy to find-allows us to investigate automation anthropologically, and to
investigate why these humans are so difficult to find. As Lilly Irani reminds us, "claims about
automation are almost always claims about kinds of people" (2017). Heteromation as an analytic
can guide us in pulling apart the seams that suture together automation with categories of the
human.
To a certain extent, the association between qualities of speech and specific diagnostic
categories is a part of professional psychiatric wisdom. The mental health care practitioners whom
I interviewed as part of my fieldwork - people whose opinions I sought in reflection on my
primary informants' technological aspirations - agreed for the most part that "everyone knows"
depressed people speak more slowly than the average person, while people experiencing mania
or under duress speak more quickly. Indeed, audible changes in vocal qualities (especially
changes in the pacing of voice) make up the diagnostic criteria for several categories solidified in
the Diagnostic and Statistical Manual of Mental Disorders (DSM), until recently American
psychiatry and psychology's authoritative classificatory field guide. Calling upon the tools and
techniques of their engineering colleagues, my informants seek to use AI-enabled techniques of
pattern recognition to distill this wisdom about the connection between sounds and states, and
then to automate it. Their research is motivated by epistemological, public health, and personal
concerns alike. Like the panelist at the Summit, some of my informants believed an automated
triage system could lighten the load of an over-burdened care system in which a handful of
practitioners juggle caseloads that number in the hundreds. Many of them expressed a genuine
desire to help mental health care workers and their patients, oftentimes because they had
suffered, or a friend had taken their own life, or their father had been sick for years, or their
brother had to live within the walls of a psychiatric institution. They wanted to offer a hands-on
and actionable solution to a heavy, structural issue-the inaccessibility of mental health care-
with the tools and the disciplinary angle that they knew.
As the epigraph with which I opened this chapter-a passage of a 1962 article written by
a senior member of the Institute of Radio Engineers (IRE)-testifies, just as the idea that "the
tone of peoples' voices tells a listener a great deal about their emotional state" is not novel,
neither is the notion that the "emotional content" of speech can be separated out, processed, and
distilled into information. In David's article, the author tells the fictional story of George Lance,
a graduate student in electrical engineering on the lookout for a doctoral thesis and a way to
combine his love for electronics with his interest in biology. His laboratory director refers him to
Professor Pseudomorph, the head of the new, interdisciplinary Psycho-Systems Information
Center. Pseudomorph waxes with prescient poetics about "the implications of perceiving
machines, which can pack the learning of millions of millions of lifetimes into only a few hours,"
pining for the day "when all the really important decisions will be made automatically by a
machine with remote sensors to sample the world" (74). The first of his colleagues that
Pseudomorph passes Lance along to is Dr. Steamer, head of the Neuro-psychiatric Information
Group, who tells Lance about his collaboration with psychiatrists. Dr. Steamer's research closely
resembles my interlocutors', and he says something to Lance that many of my interlocutors often
argued: to conduct this research, expertise in language-or even psychiatry-is not a necessary
requirement. With "an engineering point of view" that takes all components of human life to be
governed by the same essential principles that can be described mathematically, psychiatry and
linguistics are merely domains of knowledge that can be read about in a book or an article, and
then concretized in the algorithmic system they build. Disciplinary expertise is a feature on
which to train a system.
David's 1962 article is a parody; the author is displeased with the "hoopla" born through
the "cross-fertilization of engineering and the life sciences" (David 1962: 75). Whether or not the
melding together of psychiatry and engineering indeed produces hoopla, it is as old as the
professionalization of engineering itself,4 and these days, the cross-fertilization is more common
than ever. Collaborations like the ones I study, among psychiatrists, neuroscientists,
psychologists, and engineers and computer scientists, have become increasingly common to the
point that some have suggested they make up a new subfield altogether called Computational
Psychiatry. The subfield now has its own journal, published by MIT. Proponents and
practitioners of Computational Psychiatry integrate the tools and techniques of engineers-such
as machine learning, the great-great grand-kin of the "perceiving machines" that Pseudomorph
describes-to solve the problems of psychiatry-like the lack of biological markers for
diagnosis, which makes it difficult to determine which patients are gravely ill and in need of care
versus those who are less ill and those who might not be ill at all. Some of my informants
position AI-enabled techniques of pattern recognition as a panacea for addressing American
psychiatry's epistemological problems, and its public health problems, in one fell swoop, while
others recognize it to be a temporary, overly optimistic band-aid. Taken together, my informants'
research projects offer a case study in one way of doing Computational Psychiatry, in part by
illustrating the self-reflexivity and heterogeneous attitudes and affects of the various actors
involved.
HETEROMATION AND GENRES OF THE HUMAN
4 A year after the article was published, the IRE merged with the American Institute of Electrical Engineers (AIEE)
to form the Institute of Electrical and Electronics Engineers (IEEE), the self-identified world's largest technical
professional organization. See <https://www.ieee.org/about/ieee-history.html>
The fable of the robot clinicians told at the Summit resonates with ongoing conversations about
the value of human work that are happening in the public sphere. Conversations with my
informants as I watched and worked with them to build an algorithmic system, getting a behind-
the-screen look at the conditions that make the basic functioning of these technologies possible,
challenged my own assumptions about automation (the mechanization of a process once
performed by humans, the supposed removal of human intervention). I decided to open this
introduction with the story about the clinician robots because it points to something that became
a key feature of my ethnography: there is a hierarchy of value within the mental health care
professions, and while much fear revolves around the automation of therapy (a medico-legally
ratified form of psychiatric care) there is less fear-and far less ink spilled-about the
automation of assessment.
If "the clinicians are, of course, robots," what are we to make of the fact that the
clinicians who are humans are not disturbed enough by this to attend the protest? On the one
hand, maybe the clinicians are, like the panelist from the Middle East had imagined, operating
the robots remotely, and will directly treat a protestor only if they are in "true" need of care.
But there is another potential reading of this science fiction: perhaps the human clinicians are not
there, because the kind of work the robot clinicians are doing-the work of psychic triage,
sorting the ill from the well-is not their territory. The clinicians are not out on the frontlines
treating protestors (presumably free of charge) because they are in their offices, sitting by a
patient recumbent on a couch who is paying out of pocket for the therapy because the clinician
does not take insurance.
My dissertation speaks directly to the uneasy position and decreasing value of psychiatric
screening and other administrative tasks within psychiatry, especially with regard to the role of
listening as a key, interactional feature of psychiatric assessment. I show that the devaluing of
psychiatric screening is part of a larger trend toward devaluing gendered, racialized and classed
administrative, service labor-like nurses' assistants, medical technicians, custodial cleaners-in
the context of health care in the United States (Nakano Glenn 1992). Against technical, skillful,
quantifiable work that can only be performed by a credentialed expert after years of training,
psychiatric assessment is positioned as custodial, skilless, as depending on "soft" qualities that
can't be quantified and that are supposedly part and parcel of the basic equipment with which all
humans are born: the capacity to listen empathically. I also show how the very notion of what it
means to be empathic-to listen empathically-is wrapped up in ideas about the relationship
between speech and self, and mind and language, and torqued by ideas about gender, race,
ability, and class.
The division of labor within the teams reinforces the low position of psychiatric
assessment and its attendant tasks within the hierarchy of mental health care professions. Yet this
work is essential to the eventual technological prototypes the teams seek to produce. In order to
create technologies that listen beyond the human, they must rely on humans listening. Thus,
there is a paradox at the center of their efforts that researchers regardless of their disciplinary
training (in either psychiatry or engineering) grappled with: in order to build the algorithmic
infrastructures that would make their technologies possible, researchers had to constantly fall
back on and rely on the language practices-and the linguistic labor-conventional to
psychiatry, the very same practices their technologies were supposed to efface. As I
ethnographically tracked the process of gathering the data and building the infrastructure that is
foundational to this whole process-the process of using AI to automate assessment-the
technologies started to look less and less like a kind of deus ex machina, and I became more
attuned to the sometimes subtle critiques my informants were making of their own projects. By
following the day-to-day practices of the people who make automation possible (and the people
who make it seem automated rather than heteromated) the technologies started to look instead
like a microcosm of long-standing dynamics of power and authority within the psychiatric
professions and mental health care in the U.S.
In this dissertation, tracing the remaking of psychiatric assessment ethnographically will
lead us into fundamental questions about what it means to have language-what it means to be
human. As Lucy Suchman has contended, the machinic components of human-like machines
display "a kind of doubling or mimicry...that works as a powerful disclosing agent for
assumptions about the human" (2007: 229). In Lilly Irani's words, "hierarchies of value have
long overlapped with hierarchies of gender in the technological imagination" (2013: 733). More
often than not, the figure of the human that human-like machines are positioned against is
exclusionary rather than all encompassing. In turn, the figure of the machine-either passive
servant to human desires, or unruly agent threatening to overthrow its creators-falls along
historical, colonial fault lines. Ruha Benjamin (2019), with reference to Sylvia Wynter (2003),
writes that "our very notion of what it means to be human is fragmented by race and other axes
of difference," and although the category of the human operates as a universal moniker, there are
in fact "genres" of the human that include "full human, not-quite-humans, and non-humans"
through which "racial, gendered, and colonial hierarchies are encoded" (31). Trying to pin down
the image after which the human-like machine is made can help us pull apart and decipher these
codes. We can read the "artificial"-the non-human, the mechanized, the inert machine-for
what it says about the skills, value, and expertise bundled together with certain kinds of tasks and
not others.
In this regard, the following questions, posed by Suchman (2007), also motivate this
dissertation: "what figures of the human are materialized in these technologies? What are the
circumstances through which machines can be claimed, or experienced, as human-like? And
what do these claims and encounters tell us about the particular cultural imaginaries that inform
these technoscience initiatives?" (229). These questions are particularly pressing to address in
the context of communication technologies that are supposed to resemble but also improve upon
aspects of communication-technologies that are supposed to listen like humans while also
listening beyond the human. Parsing through the logics and techniques of resemblance and
likeness with regard to language can help us make sense of the semiotic bundling of the visual and
the aural-for instance, "looking like a language, sounding like a race" (Rosa 2019)-to better
understand, among other things, the language ideologies underpinning the gendering and
racialization of both language and listening (see also Eidsheim 2019).
This dissertation draws heavily from feminist and anti-racist STS scholarship that re-
centers into analytic view the materially grounded labor practices that make high-tech and flashy
and "innovative" technologies possible, like computer chips and cell phones. For instance,
scholars like Donna Haraway (1991) and Lisa Nakamura (2009) emphasize that the manual labor
of women-especially women of color-has largely fueled and yet remains marginal to the
massive manufacturing enterprise that enables the tech giants of Silicon Valley. To keep this
marginalized labor in mind is to keep in mind that digital technologies are always the outcome of
digital work, meaning, as Nakamura puts it, "the work of the hand and its digits" (2014: 932),
and to keep in mind that computation is made possible not only through software but through the
alignment of "wet-ware and fleshware" (Philip et al. 2012: 19). These interventions are especially
crucial when it comes to digital media technologies, precisely because these technologies might
otherwise seem so immaterial, with the immediacy of the connections they enable, and with their
codes and clouds and their screens that mediate away and make less available for scrutiny the
bodies and the work that went into producing them. Drawing attention to this otherwise marginal
digital labor and to the fleshware of software dissolves what Astra Taylor (2018) calls "the
ideology of automation, and its attendant myth of human obsolescence."
When we search for the digital work that undergirds the technologies my informants
build, we find what I call linguistic labor: the work of giving the impression that you are
listening empathically and carefully, or the work of strategically encouraging the sharing of
personal details, or the work of listening for suicidal ideation. It is precisely the erasure of this
labor that enables this illusion of machine autonomy, that makes heteromation look like
automation, and that allows the notion of "machine listening" to make sense as a mode of
listening that is distinct, superior to, and set apart from human listening. As my ethnography will
show, in practice, in the effort to make machines listen, the division between human and
machine-including human listening and machine listening-wavers and breaks down.
SPEECH, SIGNAL, SYMPTOM
This dissertation offers a critical science and technology studies (STS) approach to both
psychiatry and the communication sciences in the United States. It shows, processually, how
facts about language are contingent, assembled, and require work to be held steady, including the
very fact that language can be transduced and distilled down into signals. An ethnographic
approach is uniquely capable of locating contingency in the production of scientific facts, while
also avoiding a narrowly technological determinist take-that it is the technologies that
recognize speech and identify connections between speech sounds and interior states, and that it
is the technologies that are capable of replacing human labor tout court. Participant observation,
interviewing, and learning alongside and doing things with my interlocutors, allows me to study
these efforts to remake and remix psychiatric assessment "as a dynamic practice between human
and machines" (Thakor 2018: 9), one that hinges just as much upon gut instincts, structural
inequality, tacit knowledge, and longstanding ideas about the nature of language, as it does on
the nuances of code, mathematical processes, and psychiatric inventories.
With reference to Langdon Winner's (1980) question as to whether or not artifacts have
politics, disability studies and STS scholar Mara Mills (2011) poses a corollary question: "do
signals have politics?" My dissertation explores this question ethnographically, focusing on
people whose central, professional concern revolves around looking for, defining, and
transducing signals, moving them from one medium to another, from the oral, to the auditory, to
the informatics, to the bureaucratic. Like Mills (2011a), I assert that the signal is a material-
semiotic object-an idea and a thing-with its own history and social life, as well as an actor's
category wrapped up in disciplinary-specific concerns and epistemologies.
On the one hand, my informants' use of the term "signal" in their everyday talk and in
their public presentations of their work-the speech signal, behavioral signals, and so on-is a
reference to signal processing. Signal processing is a subfield of engineering concerned with
identifying and extracting information "from the sonic environment for transmission down to a
limited number of channels" (Mills 2011a: 332). Forged through the coalescence and
overlapping histories of cybernetics, D/deaf education, information theory, and the development
of the telephone, the field of signal processing focuses on methods for transforming auditory
phenomena into objects that can be mathematically modeled "over time and through circuits,"
allowing thus for further modification and modulation (Sterne and Rogers 2011: 32). The signals
of signal processing are "electrical 'carriers' of other signs, encoded transmitters of messages
(these codes often obtaining from the quantified information content of the message)" (Mills
2011b: 81). For my interlocutors, many of whom are trained in speech signal processing and
employ its methods in their research projects, the presence of mental illness is one potential
message that "the speech signal" carries. The signal is the smallest kernel of meaning, that which
really matters in a stream of sensorial stuff, the component of the message which must be
preserved in order for the message to remain meaningful. In this way, the signal of signal
processing is wrapped up in questions of value.
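For readers unfamiliar with signal processing, a minimal sketch of what "extracting information" from sound can look like in code may be useful. It assumes a synthetic tone in place of recorded speech and computes two textbook frame-level measures, short-time energy and zero-crossing rate. The frame sizes, the sample rate, and the choice of measures are illustrative assumptions, not my informants' methods.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Cut a sequence of samples into overlapping, fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame (a rough correlate of loudness)."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in which the waveform changes sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

# Hypothetical input: one second of a 200 Hz tone sampled at 8000 Hz,
# standing in for a recording; real speech is far less regular.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
quiet_tone = [0.1 * s for s in tone]

tone_frames = split_into_frames(tone, frame_len=400, hop=200)
quiet_frames = split_into_frames(quiet_tone, frame_len=400, hop=200)

print(short_time_energy(tone_frames[0]), short_time_energy(quiet_frames[0]))
print(zero_crossing_rate(tone_frames[0]))
```

Even in this toy case, someone has to decide how long a frame is, which measures to keep, and which to discard. Those decisions determine what counts as signal and what is left behind as noise.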
On the other hand, rather than seek out a definitive definition of the signal (and a
definitive answer to the question: if signals have politics, then what are their politics?) I seek to
explore this definition ethnographically, investigating the elaboration of the signal's politics in
practice. To ask my informants about signals is to ask them what they care about. What are they
after? Again, what do they value? As Kockelman puts it, "what is noise for you may be signal (or
meaning in place) for me" (2017: 140). Thus, I follow Seaver's (2017) methodological tactics for
studying algorithms (another material-semiotic assemblage that has multiple meanings and
disciplinary, historical legacies) in place, rather than seeking out stable, sterile, and unitary
definitions. In other words, the point of my dissertation is not to propose a theory of what a
signal truly is in general and in the specific instance of my informants. As Seaver asserts,
technical people (just like us!) "do not maintain the definitional hygiene that some critics have
demanded of each other" (2017: 3). Signals mean (and can do) many things (sometimes
contradictory, sometimes overlapping) for my informants. This is especially the case given the
disciplinary differences between engineering, signal processing, and psychiatry and clinical
psychology, the primary fields of expertise that make up my interlocutors' interdisciplinary
teams.
For some of the researchers working on the team, the end goal of their study is to produce
research findings and/or technological prototypes that will remake psychiatric assessment
semiotically, re-tuning the interpretive valence of the encounter between speaking patient and
listening health care worker. Speech analysis technologies shift the terrain of the signal-noise
relationship; semantic content (what the patient says) falls to the background, while paralinguistic
form (how they say it) takes the place of semantic meaning as the sought-after signal; this shifted
terrain torques the normal, hegemonic ideologies privileging the referential function of language
that circulate in spaces of power in the United States, from the law, to the church, to science and
biomedicine.
The trichotomy of speech, signal, symptom forms an indexical chain. Supposedly,
among people experiencing mental illness, vocalized utterances (speech) contain signs of mental
illness that are detached from meaning, that exist within the smallest possible grain of speech as
sound, even below the level of the phoneme. Speech, signal, symptom correspond with nodes
along the research pipeline: from the elicitation and gathering of data (speech), to processing,
discarding, categorizing, and organizing (signal), to analysis and the production of scientific
facts about language and mental illness (symptom). My dissertation aims to examine the
transductive labor that happens along each of these nodes, with attention to the labor it takes to
move the medium of language across multiple media. This includes the labor of research
subjects, whose vocalized utterances, brains, bodies, and memories form the substrate from
which the researchers draw conclusions and attempt to build interventions. Research subjects,
many of whom are actively, mentally ill, produce the "assistive pretext" of my interlocutors'
technologies, the "resourcing of disability within technoscience" (Mills 2010: 39). My
interlocutors' intellectual ancestors-telephone engineers and information theorists-resourced
D/deafness in developing their theories of the signal and building these theories into
communication technologies. Likewise, my informants resource their research subjects'
experiences of mental illness.
Listening is the common thread that cuts through and across these territories. Speech,
signal, symptom are strung together in association with each other through different modes of
listening. The conviction that listening can be a form of medical treatment, ethical engagement,
and empathic care is a key cultural legacy of North American psychiatry. At the same time, my
interlocutors' research is motivated by a sense that listening in the context of psychiatric
encounters has failed the discipline, has failed the family members and loved ones of mentally ill
people, has failed the mentally ill themselves, and has even failed the nation as a whole, inasmuch
as the treatment and management of mental illness is a concern of the state. Their
alternative to the conventional-and fallible-tactics of listening in psychiatric contexts is
machine listening which, like the signal, has a variety of definitions and enactments. Sushant, the
PI of the research team at East Coast University, once remarked to me that machine listening is
"just like human listening." Instead of an ear and a brain, there is a microphone (taking the place
of the human ear) and the computer (taking the place of the human brain). Ideas and enactments
of machine listening pose a related question: what is human listening? How is the machine of
machine listening figured against, and through, ideas about human listening? Thus, exploring the
meaning of machine listening in practice brings up questions of epistemology, ontology, and
ethics. For instance, what are the ethics of attempts to machinically outsource the decision-making
labor of psychiatric assessment-which amounts to decisions about whether or not
someone is deserving of more professional attention and care?
The dissertation explores what it means to bring together communication sciences,
engineering, psychiatry, and, to an extent, social work, and to apply the tools and techniques of
computer scientists (signal processing, big data analytics, AI-enabled techniques of pattern
recognition) to a psychiatric problem. Like the professor of fictional George Lance, my
interlocutors would often reference the unique capacity of their engineering backgrounds to
approach the study of mental illness. But at the same time, to study mental illness "from an
engineering prospective" requires engineers at varying career stages (from the most novice to the
most senior) to contend with things that people in psychiatry typically contend with. They must
face the realities of living with mental illness head-on, confronting human suffering in a way that
at times hits painfully close to home and in a way that makes questions of professional
responsibility, ethics, and care unavoidable. Thus, questions about care-what does it mean to
care? Who should care? Who does caring include, but also exclude, or even harm?-became
crucial, unavoidable features of my ethnography as well.
The arrangement of expert researchers, psychiatric practitioners, and research subjects
together within the confines of an academic study creates situations in which caring for and
about research subjects seems like the right thing to do but is nevertheless institutionally wrong
and disciplinarily incorrect. The purpose of their technological prototypes was to conduct
psychiatric assessment rather than provide psychotherapeutic care or even provide an official,
medical diagnosis. Likewise, researchers who gathered, listened to, and categorized research
subjects' speech were not mental health care professionals-they were incapable of providing
sanctioned, official care. The distinction my interlocutors made, and asserted again and again,
between therapy and assessment, was in many ways about drawing boundaries around what
counts as care, even as they participated in practices that also seemed like care. Indeed, just as
quickly as they would assert that they could not provide care in the context of their study, and
that their technology could not provide care, they would discuss the extent to which participating
in the study afforded subjects the chance to feel cared for by feeling listened to. Day-to-day
research practices were charged with this tension between listening and care, empathy and
responsibility. Without totally absolving my informants'-and my own-complicity in
sometimes ethically hoary practices, the dissertation also suggests that their work points to a
broader "ethical soundscape," to use Charles Hirschkind's term, a milieu in which control,
surveillance, good intentions, and resistance cannot be easily disentangled.
METHODS: RESEARCHERS AS RESEARCH SUBJECTS, FIELDWORK AS HOME-WORK
Before delving into some of the larger, theoretical concerns that motivate the dissertation, and
describing the trends in U.S. psychiatry that motivate the efforts to develop automated speech
analysis technologies for psychiatric assessment, I will review the methods I employed in this
study, including justification for my site selection. Additionally, I discuss the organization of
labor within the teams and my positionality within them as they relate to my ethnography, along
with my own positionality with respect to biomedicine and the health care system in the United
States.
I selected the three university-based lab groups in order to capture differences related to
four variables: the intended use of the technology being developed, the makeup of the research
team, the source of funding involved, and the institutional affiliations and academic careers of
individual team members. The three fieldsites offer opportunities for contrastive comparison
while, taken together, combine to tell a bigger story. Each site is a node in a larger, interrelated
network of engineers collaborating with psychiatric professionals to augment the encounter
between potential patient and mental health care provider in the context of psychiatric
assessment. Because it is concerned with the analysis of interdisciplinary collaboration within
groups, and the comparison of practices, ethics, and ideologies across groups, the dissertation is
both comparative and multi-sited (Hannerz 2003; Marcus 1995).
I use the terms "informants" and "interlocutors" interchangeably throughout the dissertation.
I find "informants" fitting due to its resonances with "information," given their commitments to
models of language that have their origins in information theory, and given engineering and
psychiatry's own entanglements with informatics and computing. Nevertheless, many
anthropologists have adopted the term "interlocutor" to describe the people with whom they have
worked and lived, and from whom they have learned, in an attempt to avoid the associations of espionage and
extraction that "informants" comes with-in other words, in an attempt to work past and reject
anthropology's history of colonial projects of state-making, development, military intervention,
and occupation. "Interlocutor" is likewise a fitting term to describe people who are concerned
with the nuances of communicative interaction in psychiatric encounters, and of attempting to
replicate and simulate them in data collection portions of their studies. "Interlocutor" also
implies an exchange-that we were in conversation with and mutually learned from each other.
My own ethnographic pursuits and the development of voice analysis technologies for
psychiatric assessment require related tactics: establishing rapport, interviewing, recording
speech that circulates far beyond the context of its utterance and is analyzed in ways that the
initial utterer may never have anticipated. We therefore ran into similar ethical quandaries: how
to truly protect the anonymity of our research subjects? While de-identifying data-like the use
of pseudonyms-offers some amount of protection to researchers' privacy, in an interconnected
world (of researchers who all know each other, and in which ubiquitous data gathering is a
feature rather than a bug) how much anonymity could we both really promise?
At the same time, the power dynamics of my encounters with my informants were never a
settled, established matter. PIs of studies at academic institutions, military officials, and so on,
have more power, influence, and far more resources than a graduate student in a social science
field. To keep them anonymous is a means of protecting myself by downplaying my
association with them. Yet many of the graduate students and undergraduates whom I
encountered were not U.S. citizens. I conducted my fieldwork in the middle of the Trump
administration's travel ban on people from Muslim-majority countries, which impacted the lives
and families of many of the people with whom I worked. To keep them anonymous is to avoid
meddling with their careers and with their immigration status.
I did feel that my fieldwork was extractive-just as my interlocutors questioned the
extractive, exploitative nature of their relationship with their own research subjects. I use
"informant" and "interlocutor" together, then, to always keep these uneasy, shaky power
dynamics and unanswerable ethical dilemmas in view. By using them interchangeably I hold the
terms and their various associations always in tension with each other, and as a reflection on the
interplay of extraction, transparency, trust and paranoia that suffused my fieldsites. Just as
anthropology must always contend with its colonial past-and its enactments in the present-to
use both these terms at once is to sit with the uneasiness that my fieldwork, and the researchers'
own projects, trafficked in. I hope it will shed further light on why exactly these things make us
uneasy.
My fieldwork followed the day-to-day practices associated with building and testing
psychiatric speech analysis technologies, tracking how researchers represent and promote their
technologies to media outlets, in grant proposals and journal articles, and at public
demonstrations, conferences, and workshops. After undergoing human subjects research training,
I was added to research teams' institutional review board (IRB) protocols so that I could
participate in and observe, and, when permissible, make audio and video recordings of daily
research activities. Activities ranged from planning and preparation (e.g. making experimental
stimuli to be used in studies); piloting (e.g. pretending to be a research subject); data gathering
(e.g. conducting brain scans, interviewing research subjects); data processing (e.g. sorting,
listening to, and labeling audio recordings); data analysis (calculating agreement between data
labels, building predictive models); and dissemination (e.g. presenting at academic and public
venues). I helped to develop training materials, brainstormed with my interlocutors on how to
revise their grant proposals, and socialized with them, both within the space of their labs and
without, at local bars, restaurants, birthday parties, and goodbye parties. As I discuss below and
in more detail in Chapter 2, a good portion of my fieldwork involved troubleshooting and
maintenance tasks. In addition to participant observation, I conducted person-centered interviews
with key members of each research team. I began transcribing audio and video recordings and
coding my fieldnotes during fieldwork.
According to Summerson Carr "a linguistic anthropological method assumes that culture
and its many institutional forms and formulas manifest in semiotic interaction rather than simply
controlling and containing it" (2010b: 27). In conducting my fieldwork, in order to explore the
linguistic and semiotic ideologies that researchers elaborate in their efforts to build speech
analysis technologies for psychiatric assessment, I focused on researchers' talk and
metalinguistic discourses about listening, language, mental illness, and care in conversation with
me and with each other. When my interlocutors consented, I audio-recorded our day-to-day
activities and conversations in lab meetings, within our individualized offices, or after having
attended events, talks, and conferences together.
I also audio recorded individual interviews if the researchers consented, although quite a
few of them did not consent. This led to several off-record conversations in which the
researchers reflected frankly on their own feelings about what it meant to record their own
research subjects' speech. Their discomfort with having our conversations recorded oftentimes
spoke to their disquiet with the ubiquitous surveillance and data capturing that participation in
their own studies entailed. Their ethnographic refusal 5 formed yet another critique of the very
same research practices that they forwarded and participated in. When they consented to be part of my
ethnographic study, I shifted them to the position of research subject. In this way, my awkward
meta-position within the team-helping them study other people, but also always studying them
studying other people-helped me to heuristically pin down people's ethical limits and beliefs,
which they might not have otherwise voiced. This was not quite studying up, and not even quite
studying sideways. The terrain between us was constantly in flux, and as I struggled to get my
bearings, my interlocutors pushed me to think more self-reflexively-and humbly-about the
ethics of ethnography itself.
The three teams had the same organizational structure, and as a research assistant, I was
embedded on one of the lower rungs within that organizational structure. Thinking reflexively
about my position within the teams helped me to better understand the extent to which the labor
that powered the teams' projects is not only specific to U.S. psychiatry (as I'll discuss in Chapter 1),
but also gendered, and (as I'll discuss in Chapter 3) raced. Across the three teams, at the top of
the hierarchy are the leaders, the PIs. Typically, there is both an engineering PI and a psychology
or psychiatry PI. Underneath the PIs are post-docs, who play a supervising role and delegate
tasks to the people below them: grad students, undergrads, and research assistants like me.
Finally, there are staff members: employees of the university who provide administrative support
to the team. Engineering team members tend to be male, and higher up on the team-most of the
PhD students and post-docs were men. Psychiatry or psychology team members tend to be
women, and lower in the team, such as research assistants and staff. Team members on the
psychiatry side of things also tend to be the "face" of the project-the people who interacted
face-to-face with research subjects the most.
5 See Simpson (2007) and Benjamin (2016) for discussions of refusal in which the power dynamic between
researcher/investigator and researched/investigated is top-down (i.e., either in the context of anthropologists
studying native populations, or in the context of research subjects, patients, and tissue donors agreeing to allow
others to access their individualized, somatic data).
Because I lack training in psychiatry or engineering, as a research assistant, many of the
things I ended up doing were things that higher-ranking team members lacked the resources or
time to do, but needed to get done. This included things like listening to and labeling voice data,
changing the sheets on the mattress that subjects rest on while getting their brains scanned, or co-
writing a script for and acting in a video created for an experiment. My position as a novice
working alongside other less experienced or credentialed researchers and staff allowed me to get
a better understanding of the tasks that were considered menial or busywork (things that anyone
could do). A common thread uniting this busy work is that it is primarily social work, work that
involved soft skills-work that is stereotypically feminized, from tasks that were overtly
domestic (like making the brain scanner bed) to the more subtly feminized, including tasks
revolving around extracting speech data from research subjects and monitoring the content of
their speech, a form of work which I call "linguistic labor" and which I will expand upon
throughout the dissertation.
Due to the custodial and administrative position of these kinds of tasks vis-a-vis other
tasks on the team that more established and more senior members conducted, I often felt like the
research activities which consumed my day from 8am to 5 or 6pm were non-essential or even
tangential to the aims of the teams' projects. I would often think of a scene from Hallam
Stevens's historical ethnography on the encroachment of informatics into biomedical research
(2013). Stevens describes the slick, glossy, and expensive-looking building in which researchers
meet and sit in front of their computers. This is, for all intents and purposes, the public-facing
image of bioinformatics: impressive buildings with glass walls that give the impression that the
science going on inside is both important and accessible, impressive and worthy of sustained
funding. Yet there is another building, also connected to the same bioinformatics research
project, which offers a different picture: this building is drab and industrial, a worn-out
warehouse with flaking paint and small windows. Within the building, Stevens finds assembly
line-style and automated machinery, technicians tending to the machinery, and janitorial staff.
Stevens argues that both of these buildings are a part of and necessary to the larger
research project. I often felt as if my immediate fieldwork was more closely related to
the warehouse than the glossy building: hidden away from view, monotonous and unglamorous.
It was only upon reflecting on my fieldwork, years later, that I began to fully grasp and
internalize Stevens's argument, and realized that my own devaluing of the work my interlocutors
and I performed was playing into the idea that the "real" work of science is mental rather than
physical. A performance studies approach to studying science and technology might refer to
these two buildings and the categories of work they represent as the front stage and the back
stage of science-the back stage, which is less visible, is dedicated to coordinating and managing
the image and performance of the front stage (Hilgartner 2000). However, this kind of analysis
insinuates that there is some true, authentic place where "real" science is happening; it re-
inscribes the less glamorous, custodial kind of labor as merely in service of the more impressive,
ethereal realm of thought, rather than recognizing that both are valid and necessary to the
production of knowledge. While I did not have immediate access to the processes and procedures
of data analysis, the access I did have-to more mundane, domestic work-helped me to better
understand the ways in which the making and doing of science is distributed across thinking and
practice, machinery and bodies, technical expertise and embodied, tacit experience.
Early in my graduate career, a professor at another university once remarked to me that
doing fieldwork "at home" is incredibly isolating. I reflected frequently on the ways in which my
fieldwork was homey, familiar. I am a U.S. citizen, a settler-this place, sometimes referred to
as the United States, is my home. Like many of my informants, I was completing my PhD as I
conducted fieldwork. My fieldsites themselves were located at offices, on the top floors of
hospitals, in classrooms, in libraries, on campus green areas, cafeterias, and so on-spaces I
inhabit as a graduate student, and spaces that feel safe for me as an educated white woman with
class privilege. The boundary between "the field" and "home" was porous. With this being the
case, I also recognize that the notion of doing fieldwork "at home" threatens to reify this
boundary-between the home and the field-which anthropologists since Gupta and Ferguson
(1992) have argued is indebted to anthropology's colonial legacy while it perpetuates the siting
of fieldwork as somewhere radically other, with radically Other subjects. Moreover, as Kamala
Visweswaran argues, searching for the ways in which we feel "at home" while in "the field" lays
the groundwork for a feminist method of conducting fieldwork as home-work (2003). Home, as
she writes, "once interrogated, is a place where we have never been before" (2003: 113). Indeed,
there were uncanny echoes-familiar but strange-that kept my fieldwork and my own life
inseparably close.
That is to say, as someone who lives with a chronic illness and chronic pain that arises for
no good reason and debilitates me when it does, my fieldwork at times felt deeply personal,
sliding from ethnography, to auto-ethnography, to ethnography again. Studying biomedicine
while also having to lean on it and push myself through its tangled systems (with the support of
my family and loved ones, no less) meant that the stopping point of fieldwork was difficult to
place. Fieldwork melted into homework, and the two fused together even more tightly while
writing the dissertation, and becoming more ill. In my own increasingly frequent encounters with
medical specialists, assessments, pain scales, sensors, sometimes hollow attempts at "bedside
manner," and consent forms that I had to review, correct, and then sign, I have developed a
closeness with my informants' research subjects, a wounded affinity. Due to the double bind of
my informants' IRB protocols and my own IRB protocol, I was unable to interview or record
detailed data about the research subjects with whom I interacted, directly and laterally. Still, they
are ghostly present in their absence in my dissertation. The details of their lives and their
experiences being subjects-which I heard and listened to but which I abstain from writing
down-have continually pushed me to keep the more violent dimensions of care in view.
I take my illness and the biomedical zones of authority, surveillance, and uncertainty it
brings me through to be a form of feminist praxis, one that is central to my theorization of
diagnosis, language, and the body. Being ill, in pain, and studying diagnostic systems afforded
me a kind of sixth sense about the bureaucracy of the health care system and hierarchies of clinical
labor that sometimes escaped the hand of language. For instance, moving through my fieldwork,
I would just have a feeling that there was something going on about gender or race, detect a taste
of ableism, even though it took me years to be able to articulate the evidence underlying this
sensation. My own medical experiences have helped me to develop tactics for reading in
between the lines of the cultural myth of biomedicine, namely, that biomedical illness categories
correspond directly and completely with lived experiences, that they name finite things and can
offer finite, tangible, and linear solutions to sickness. They have also taught me to read IRB
protocols as pragmatic documents, as one kind of way of doing ethics, rather than all-encompassing
protective measures. Maya J. Berry, Claudia Chávez Argüelles, Shanya Cordis,
Sarah Ihmoud, and Elizabeth Velásquez Estrada's (2017) poignant, co-authored essay reflects
both my own experiences, and the tactics I have deployed in crafting this ethnography:
"we are not merely conducting research, but are connected to the places where we work
through familial ties, diasporic relationships, and investments in political struggles, all of
which hold us accountable even after our departure. Our relationship to our research thus
subverts the assumption that the field inhabits an/Other time-space, as well as the
masculinist notion that the time-space of the Other is to be instrumentally penetrated and
evacuated. Our entrances and exits do not hinge on geographical border crossings. In a
sense, the field travels with and within our bodies" (Berry et al. 2017: 540).
THERAPEUTIC TALK AND LINGUISTIC LABOR
Linguistic anthropologists have argued that the enterprise of Western psychiatry has been a
privileged site for the enactment of cultural assumptions about the nature of self, mind, language,
and health, reflected most legibly in the verbal practices of "talk therapy" (Carr 2010; Peräkylä
1995, 1998; Wilce 2009). In its most Freudian form, American psychiatric practices of treatment,
diagnosis, and assessment operate under the assumption that absolute, linguistic transparency is
impossible, and that the therapist is uniquely capable of arriving at the patient's secret, occluded
desires through symbolic analysis of speech (Reik 1948, 1964; Vehviläinen 2008). Marsilli-Vargas
(2014) suggests that psychoanalysis constitutes a "genre of listening" that circulates as a
framework for setting an interpretive context and guiding how expert interpreters "tune" their
ears. On the other hand, and under the influence of Carl Rogers's "client-centered" approach,
countervailing tendencies in specifically American psychiatric practices emphasize self-
realization through therapeutic talk, suggesting that a speaker can agentively locate, articulate,
and actualize a true self in therapeutic discourse (Smith 2005; Carr & Smith 2013), sometimes
even at odds with the clinician. Thus, Carr has shown that, at least in the context of American
addiction treatment centers, while clinicians wield a considerable amount of power in the
psychiatric encounter, patients can subvert and strategically flout the interpretive frameworks
clinicians seek to impose (2010b: 23; 2010).
Talk-based psychiatric encounters in the United States reflect connections that linguistic
anthropologists have established between Euro-American language practices and ideologies of
mental transparency (Jones & Schieffelin 2009) that contrast sharply with ideologies of mental
opacity prevalent in Pacific societies (e.g., Rosaldo 1982; Schieffelin 2008; Throop 2010).
Hegemonic Euro-American language ideologies have been found to privilege the denotational
function of language (Silverstein 2012) and to imagine that speech signifies by referring to a
speaker's intentions (Duranti 1993; Keane 1997; Silverstein 1998). Paradoxically, speech
analysis technologies appear to pursue an ideal of transparent inner reference by circumventing
the semantic, referential dimension of language altogether, emphasizing indexical properties that
linguistic anthropologists (e.g., Silverstein 1985) contend are often minimized in spaces of power
in the United States, such as the legal arena (Mertz 2007), in Christian missionary encounters
(Robbins 2008; Keane 2008), and in the sciences (Gordin 2015).6 As Stasch (2008) contends,
claims to linguistic opacity or transparency are always political in nature. The technology the
three teams hope to produce entails a rearrangement or perhaps intensification of the
asymmetrical terms and power dynamics of the patient-clinician encounter, in which the
patient's speech is made more transparent than the health care worker's. The development of the
technologies in the context of research studies also requires that research subjects' speech be
scrutinized in ways that the subjects themselves may have never anticipated. Lower-level
researchers within the teams in particular are tasked with listening to research subjects' speech
intently while also trying to remove, downplay, or strip away the semantic dimensions-the
narrative, personal content of their utterances.
At the same time, while the purpose of their research is to identify markers of mental
illness in the sounds of speech that are wrapped up in biological universalism, linguistic
anthropologists have shown that qualities of the human voice like pitch and intonation, beyond
the denotative function of speech, have a variety of culturally elaborated meanings (Harkness
2013; 2015). The three teams' attempts to develop vocal diagnostic technologies promise to be
particularly rich case studies in this regard, since in the process of designing their technology and
conducting research on it they attribute value to specific vocal qualia. Altogether, my
informants' research projects are potent sites at which language ideologies-ideas about how
language works-are assembled and ratified, even as they are contested and bent and pushed to
their limit.
6 While these ideologies are dominant and linked to institutions of power, they are not the only ideologies in circulation in the
United States. See, for example, Claudia Mitchell-Kernan's paper on "signifying and marking" within African-American
communities of speakers (1972). Shaka McGlotten (2016) also discusses reading and shade as distinctly Black, queer
signifying practices. Reading and throwing shade involve stridently yet subtly insulting one's interlocutor. The sting
emanates from what need not be said about a person. The referential function of speech is poetically torqued and twisted, and
the speaker recruits other paralinguistic cues to make their point. Writes McGlotten, "In Paris Is Burning, Dorian Corey
describes it this way: "Shade is, 'I don't tell you you're ugly, but I don't have to tell you because you know you're
ugly'...[throwing shade] does not require any specific enunciation to deliver an insult; rather, it uses looks, bodily gestures,
and tones to deliver a message" (McGlotten 2016: 265, 279).
Moreover, dominant ideologies play out much more messily in the on-the-ground practice of
psychiatry. Ethnographies of psychiatric diagnosis in context have shown that clinicians tend to
take a pragmatic approach to language in psychiatric encounters. Rather than using diagnosis and
assessment as lights that illuminates the inner truths of a patient's psychosis and corresponds
one-to-on with their symptom expression, Lorna Rhodes (1995) has shown that, especially in
resource-low public health contexts like emergency psychiatric hospitals, clinicians diagnosis
patients with bureaucracy in mind, strategizing on how to move a patient through the health care
system in a way that grants them access to resources they need, whether it be medication,
psychotherapy, or confinement in a ward bed. My dissertation focuses on the development of
psychiatric technologies prior to their distribution in clinical settings. That being said, while I do
not study clinical encounters, I am interested in probing my interlocutors' imaginaries about the
lives and afterlives of their prototypes, and how these ideas about their prototypes' potentials
motivate the very models of language, mind, and human difference that get built into them (see
Taussig, Hoeyer, and Helmreich 2013). Moreover, describing moments in which research
subjects refuse, subvert, and jam the data collection process offers a kind of speculative fiction
of how automated psychiatric assessment might be resisted and/or reformulated to better serve
the patients and people it is designed to interpellate.
Just as psychiatry in the United States reproduces dominant language ideologies, so does
speech signal processing. These two ideologies coalesce in my interlocutors' research,
sometimes in competing, conflicting ways. Speech signal processing forwards a motor theory of
speech, in which speech is one arc of a circuit connecting the brain, the muscles involved in the
production of speech, and the sound of speech itself. The three teams attempt to grasp hold of
this part of the loop-speech-using an assemblage of recording technologies and human labor
in order to follow it to the brain, which they frame as the locus of all human experience, especially
the experience of mental illness. That is to say, the researchers are not setting out to prove that
acoustic features of speech can be read as signs of mental illness, so long as they are listened to
using the right technoscientific mediation. Rather, they are trying to figure out what these
features might be, and how they might be located in clinical encounters so as to render
psychiatric assessment-determining which patients are mentally ill, and which are not-more
efficient. Although they require research subjects' speech to build their data sets, in the context
of the studies, the semantic components of speech are a decoy. They are after something more
fundamental than the what of speech, something more foundational that hardly even looks like
speech at all: sounds that can be described with reference to waveform analysis.
Yet while the research projects turn on the notion that referential semiosis is fallible, and
mental illness cannot be represented linguistically, they must rely on language in their search for
these acoustic, pan-human signs. They must participate in practices of rapport and trust building,
depending on culturally legible ideas about speech and the self. This is where questions of labor
also come into play. Whose job is it to elicit data-speech-from research subjects, and who
works to render speech (sounds) transparent (or trans-sononant)? Should elicitation unfold in the
form of a communicative interaction between two humans? Between a human and a machine?
Between a human and a machine that is, secretly, controlled by a human? Even though
psychiatric assessment-the very genre of interaction they seek to automate-is figured against
diagnosis as less technical, and even though my informants' attempts to automate assessment
might seem to suggest that the work of assessment is less skillful than diagnosis, my fieldwork
revealed that conducting assessment is indeed skillful-albeit undervalued-work. It's skillful
because it requires the performance and display of a certain kind of listening subject position:
an active, empathic listening subject who is attentive to the meaning and emotional impact of a
patient's speech. This attentive listening is displayed through verbal and non-verbal practices
aimed at maintaining social bonds, at sustaining trust and rapport, and at managing the emotional
wellbeing of the speaker.
Displays of active listening that manage the speaker's impression of how the listener is
listening and what they are listening for are a crucial component of what I refer to as linguistic
labor. While much of linguistic anthropology has focused on the production of speech, the
reception of speech is just as viable an object of linguistic anthropological concern (Erlmann
2004; Feld and Brenneis 2004; Hirschkind 2006; Faudree 2012; Feld 2012, 2015). A larger
intervention of my research is to take listening seriously as an ethnographic object, emphasizing
that listening is not the passive uptake of speech but an agentive communicative practice. This
means pointing out that language ideologies always contain within them listening ideologies-
ideas about how speech should be auditorily attended to and ethically attuned toward, especially
with regard to the speaking subject.
Several other forms of labor that linguistic anthropologists have described fall under the
umbrella of what I am referring to as "linguistic labor," which has both semiotic and linguistic
components. For instance, Miyako Inoue (2018) uses the term "verbatim labor" to refer to the
work involved in ensuring the faithful correspondence between word and text, spoken utterance
and graphic (or otherwise) representation, such as the work of stenographers, medical transcriptionists, and
overseas call center operators. 7 Linguistic labor also relates to what Wilf calls, with reference to
Garfinkel's theory of interaction, "interactional homeostasis," or "the idea that participants in an
7 For a historical account of the gendered dimensions of early telephone operators, the "human switches" who
manually connected calls through the removal and insertion of wires, see Kenneth Lipartito's 1994 article, "When
Women Were Switches."
interaction strive to maintain interactional order and compensate for interactional noise and
disorder through negative feedback mechanisms such as 'repair work'" (Wilf 2019: 203). This
includes efforts to ensure that the conversation unfurls "naturally," with a feeling of ease, along
with attempts to build rapport and to avoid interactions that make the interactional partners feel
uncomfortable (or ensuring that they feel so comfortable that they share private, intimate details
about their lives). My interlocutors' projects are fascinating case studies in this regard, since part
of what they strive to do is engineer-craft, fabricate, and sustain-a sense that all interactional
partners are playing a symmetrical role in the encounter. The burden of making a conversational
partner feel comfortable, and of cultivating a sense of mutual affinity, is the interactional duty of the
listener in the case of psychiatric assessment (the mental health care practitioner). In the clinical
encounters that my informants simulate for the purpose of gathering their data, research
personnel must likewise maintain the illusion that they are listening for linguistic content, even
though this is not the interpretive locus of their technologies.
Linguistic labor is a kind of social repair work involving social reproduction: an
interactional practice of custodial maintenance that reproduces the dynamics between active
speaker and passive listener. Linguistic labor can involve an interactional partner enacting an
emotional status-or intersubjective engagement with the semantic content of speech-through
bodily gestures, positive minimal responses, or carefully crafted questions. My ethnography
suggests that this work has both gendered and racialized dimensions. When they engineer
rapport-building conversational agents, or deploy trust-building interactional strategies, my
interlocutors draw on race and gender as resources for tuning the interactional partner's
impression of how their speech is being taken up and interpreted.
THE CLINICIANS, OF COURSE, ARE ROBOTS
Popular discourse surrounding automation and human job loss often posits psychotherapy and
other "caring" professional practices as the hard case against automation, the final stronghold. If
humans leave the tending and mending of the psyche to robots and computers, the story goes,
then this is a sign that humanity has collectively lost its ethico-moral way. The figure of the robot
therapist heralds the end of intimacy, empathy, and thus, the end of authentically "human" care,
because a robot can only provide a cheap parody of "the real thing" (see for example Turkle
2006; Turkle 2018).
On the other hand, historian Elizabeth Wilson (2010) argues that people have long had
therapeutic experiences that are artificial, simulated, and hinge on machinic mediation. She
argues that the psychoanalytic encounter itself is an artificial one. The analyst works with the
patient to simulate the relationships they hold in the outside world. The therapist is an avatar-a
playable character, a stand-in-of authoritative figures in the patient's "real" life. Throughout the
dissertation, and in Chapters 3 and 4 especially, I take up Wilson's invitation to explore the
artificial dimensions of mental health care. I do so by searching for the technological and the
mechanical as distinct features of care rather than its abject doubles. This will help to clarify why
it is that certain caring professions, like social workers and medical technicians, are not as
"professional" as others, like licensed clinical therapists and scientific researchers.
Even if robot mental health care workers are imitations of the real thing, this raises the
question: what exactly is the "real thing" that humans have designed them to imitate? What is the
political economy of "real" (call it human-to-human, or face-to-face) psychotherapeutic
encounters in the United States? Who is in a position to receive "real" mental health care?
Likewise, why exactly would the mental health services that encounters with an automated
system might provide be such poor substitutes? If an automaton can never be a therapist, then
whose care-care that is not quite enough-would an automaton in a mental health care context
stand in for?
Keeping in mind Suchman's arguments about the figure of the machine as a
disclosing agent for the human, my dissertation seeks to explore how intimacy, empathy, and
care have always been artificial-that is, crafted, fabricated, and animated by broader,
intersecting histories and lines of power, rather than sentimental, individually motivated, and
inherently good. The care of mental health care is a form of labor, but there is a hierarchy of
value within the caring professions: at the top of the hierarchy is virtuous, empathic, and expert
work, with mechanical, skilless, drudgery work at the bottom. 8
The organization of labor within the research teams, and the nature of the linguistic labor
their technological prototypes required and performed, reflect and reproduce this hierarchy of
labor. My interlocutors could not legally provide therapy to their research subjects as part of subjects'
participation in the study. Subjects' participation was not meant to be a form of care. Likewise, the
end goal of the study was not to produce a replacement for therapy, but an assistive technology
to determine which patients should be getting therapy. Curiously, however, research subjects
often reported something cathartic and soothing about their participation-something comforting
in feeling listened to-even while they recognized that they were not participating in a ratified
therapy session, and even while they understood that the interaction was machine-mediated (i.e.,
8 Here, I am indebted to Evelyn Nakano Glenn's analogous observations regarding the raced and classed dimensions
of social reproduction in institutionalized service work. Feminist scholars use social reproduction "to refer to the
array of activities and relationships involved in maintaining people both on a daily basis and intergenerationally"
(1992: 1). Nakano Glenn argues that, as the conditions of capitalism move social reproduction outside of the home,
creating the "service sector," white women's ascension to more masculinist modes of production (i.e., gainful
employment) was indebted to and only made possible by Black women and women of color taking up the lower
levels of the ranks, i.e., taking white women's place in the home as domestic care-takers. This division is replicated
in the service economy with white women holding managerial roles over Black women and women of color, who
take up the "dirtier" work (data entry, cleaning, collecting blood samples, etc.).
that their weekly phone calls with a staff member were being recorded for further analysis, that the
animated character interviewing them through a screen was not a human but also not entirely a
computer). For some, participation in the study opened up a small space for healing if not a
momentary suspension of suffering. In other words, though they were not receiving therapy,
there was something therapy-like about the encounter that impacted them (whether positively or
otherwise). After all, given the demographics of the research subject population-veteran,
homeless, disabled, living at or near poverty-participation in the study might have been the
closest thing to (or at least what most closely resembled) the kind of mental health care resources to
which they had access: psychiatric assessment.
Psychiatric assessment involves sorting potential patients into categories: people who are
well or not sick enough to warrant further medical attention, and people who might be showing
signs of psychiatric distress and are therefore in need of diagnosis, which is the official, medico-
legal designation of an illness category. Because diagnostic categories are embroidered into the
U.S. health care system, diagnosis is, in historian Charles Rosenberg's words, a "bureaucratic
passcode" that grants one access to insurance-covered treatment, from medications to
psychotherapies. Diagnosis is therefore a more authoritative and supposedly more technical form
of clinical judgment that requires more training, credentialing, and licensing than conducting
psychiatric assessment-only certain kinds of medical professionals can make a diagnosis.
Nevertheless, diagnosis is not the first gate of entry into the U.S. mental health care
system. Assessment is. Though assessment is a more informal triage process, it is a necessary
precursor to diagnosis, an obligatory point of passage. In this sense, assessment is just as much
about directing people away from mental health services as it is directing people toward them,
filtering out the "high priority" cases from the low ones. Therefore, psychiatric assessment (and
the people who conduct it) plays an important role in resource-low public health settings, like
emergency psychiatric hospitals, and an important role for those who cannot pay out of pocket
for their treatment. If diagnosis is a passcode, then assessment (sometimes called "screening"),
within the hierarchy of clinical labor and medical judgment, is a CAPTCHA.
An acronym for Completely Automated Public Turing Test to Tell Computers and
Humans Apart, a CAPTCHA is a challenge-response test that verifies a user is a human, rather than
an autonomous piece of malicious software, before they can access a screen where they enter in
more sensitive and usually more personal information. CAPTCHAs can take a variety of forms,
but the tasks are designed to be simple (though debates about their accessibility abound). The
user copies down a warped series of letters and numbers, or they must select squares that contain
pictures of a storefront awning from an image overlaid with a quadrant, or they click a box that
says I'm not a robot. Similarly, even before a potential patient can come face to face with a
clinician who punches in the Diagnostic and Statistical Manual code that grants them access to
insurance-covered care, they must submit to another coded game of matching, but one that is
positioned as requiring less expertise on behalf of the administrator: fill out a form, circle a
number between one and nine, provide a short answer to one of their questions. I'm not a robot
becomes I'm not a malingerer.
I understood the technical differences between assessment and diagnosis, and in many
ways, I continued to take these definitions for granted during my fieldwork. My informants
spoke of the difference between assessment and diagnosis in a mundane way, in their
conversations about their career plans to pursue more schooling, or in the life history interviews
we conducted when they described to me their years of interning and training. The difference
between the two also came up often when they attempted to correct public misunderstandings of their
technologies, or misrepresentation in the popular media by journalists. When they presented their
prototypes to the general public, as in a demonstration similar to the one I witnessed at
Affectiva's Summit, they were constantly defending themselves against outraged audience
members who accused them of trying to replace human therapists with machines. Their alibi was
straightforward and made sense to me at the time: they were not trying to automate therapy.
Automating therapy would be impossible. Only a licensed, trained, credentialed human therapist
could-and should-conduct therapy. They were merely trying to automate aspects of
psychiatric screening, the triage process.
In conversations with my informants and during showcases of the technology, I would
provide an alibi of my own: their technologies were essentially computerized psychiatric
inventories, a form of survey based on DSM categories that patients fill out in order to determine
if they are in need of care. It was only upon reviewing my fieldnotes, even after the Affectiva
Summit, that I noticed the erasure both my informants and I had made: psychiatric screening
tools do not assess a patient on their own. A human, as part of their job, psychiatrically assesses
a patient using the tool.
How had I been unable to remember the person on the other side of the assessment
encounter, the person-typically a social worker or a psychiatric nurse-whose job it is to
interpret the patient's responses to the assessment questions or to calculate their score? I had
forgotten that conducting psychiatric assessment is a professional practice because, in the context
of my fieldwork, the people occupying the position and conducting the work that a social worker
or nurse might were typically not professionals. The members of the research team who
interacted directly with research subjects, conducting interviews with subjects, managing their
production of speech, or listening to and qualitatively judging their speech, were typically the
members with the least amount of skill and training. Indeed, these are the kinds of tasks that PIs
often assigned to me when I joined them as a research assistant: work that anyone, regardless of
their credentials (or lack thereof) could conduct. Trying to automate psychiatric assessment
honors the work of people doing psychiatric assessment, recognizing that what they do is
draining and overwhelming. At the same time, to suggest that it is possible-necessary, even-
for a machine to replicate the linguistic labor involved in conducting assessment devalues it and
insinuates that it does not require the type of skilled, tacit knowledge that automated systems are
incapable of capturing.
What is care is a central question in my analysis. But insofar as psychiatric assessment-
and the process of gathering data to automate assessment-is carefully bracketed from treatment,
perhaps an even more central question is: what isn't care? My interlocutors' projects and
people's responses to them-that they can provide cathartic release, but also, that they are
morally unconscionable-help me to parse through the different modes and meanings of care,
"unsettling care" as a stable analytic, troubling the notion that care is always innocent, and
always affectively motivated, and essentially human (Aulino 2012, 2016; Murphy 2015).
In so doing, I call attention to what I call "para-care," practices that are care-like, care-
ful, but cannot be medically or legally ratified as care. Para-care is work that occurs at the
margins and edges of biomedicine writ proper, even while it has (both affirming and harmful)
impacts on its recipients, and even as it seems to closely resemble the official, formalized and
sanctioned care of biomedicine. To closely follow one's IRB protocol and avoid administering
treatment, or to avoid intervening when a research subject expresses suicidal ideation, are both
care-ful practices. Researchers could "take care" with or without "caring for." As described in
Chapters 3 and 4, for instance, some researchers ignored the details of subjects' lives as a means
through which to refuse dehumanizing them. Para-care as an analytic can help better capture and
describe this peripheral work, recuperating informal care-like practices that happen beyond and
outside of the umbrella of credentialing or the bureaucratic structures and strictures of
professionalization.
Thus, I am less concerned about the hypothetical dawning of a techno-dystopic future in
which automated systems "replace" human-led treatment or triage work, although I hope my
dissertation shows the limitations of treating mental health care as a narrowly scientific,
technological problem that can be "hacked," rather than a structural problem that has as much to
do with power and capitalism as it does with facts, numbers, and knowledge. I am much more
concerned with where the line between therapy and therapeutics-technical and mechanical,
human and machine, care and not-care-is drawn. I am much more concerned with who this line
crosses over and erases in the here and now. Para-care helps to illuminate this in-between space,
holding its occupants accountable while also uplifting their work as meaningful. For the
boundary work of what care is and isn't threatens to devalue (and justify the defunding of)
administrative professions within mental health care and threatens to devalue (and dissolve
resources for) the people who live on para-care as their primary form of treatment. By refusing to
take care as "other to technology" (Mol 2008: 5), I attempt to repair the rupture between the two.
BIG DATA AND COMPUTATIONAL PSYCHIATRY
Attempts to use vocal qualities of speech to better understand the neural mechanisms of mental
illness fall under a broader research paradigm, called Computational Psychiatry. As Chapter 1
will discuss, Computational Psychiatry is more connected to prior epistemological eras in
American psychiatry than it might at first appear. It is helpful, however, to briefly review a
dominant narrative about the emergence of Computational Psychiatry with regard to trends in
American psychiatry to move away from the DSM and develop "novel" methodologies for
studying mental illness. In this final section, I give a brief account of one such federally funded
project, the Research Domain Criteria (RDoC), aimed at integrating technics and technologies
from engineering and computer science in order to conduct research that will help psychiatry
move away from DSM. The rise of Computational Psychiatry is, in many ways, a response to the
RDoC project, and to increasing demands for alternative methods for understanding the
connection between pathological brains and pathological states and behaviors.
In September 2015, Thomas Insel announced that he would end his 13-year term as
director of the National Institute of Mental Health (NIMH) and join the Life Sciences unit of
Google, a position he subsequently left in 2017 to begin his own startup. Insel's tenure at NIMH
had been marked by his controversial efforts to unseat the Diagnostic and Statistical Manual of
Mental Disorders (DSM) as the field's paramount reference, the text through which American
psychiatrists are trained to interpret their patients' symptoms or even identify what constitutes a
symptom. DSM is the brick and mortar of a far-reaching "diagnostic infrastructure" in the US
(Lakoff 2005: 256). Research centers and entire journals have been established for the express
purpose of exploring a single diagnostic category (schizophrenia, bipolar disorder, depression,
anxiety, etc.) and these categories, for many patients, have come to define the way they live their
lives and understand themselves.
Nevertheless, like a growing number of researchers, Insel insisted that DSM's categories
are insufficient because they are not based on any kind of biological measures linked to the
underlying mechanisms of psychopathology, which remain poorly understood. In justifying his
jump from the public to private sector, Insel contended that the engineers of Silicon Valley
possess exactly the skills that institutions like NIMH lack: the ability to capture and process
behavioral data at an unprecedented scale, especially data that has never been studied in tandem
with the presence of mental illness. According to Insel, if anything, DSM had fortified the gap
between basic science research and applied research, causing more harm than good in the
process. Subsequent editions of DSM, including the most current edition (DSM-5), more or less
resemble DSM-III in their structure and in their focus on observable symptoms
rather than disease etiology. Editions of the DSM have been published with little to no recourse
to research findings in neuroscience. Many point to the failure of DSM to absorb or reflect
neuroscience findings as a mark of contemporary psychiatry's need for another, paradigm-
shifting overhaul.
For instance, in a February 2014 post to his Director's Blog on the NIMH webpage, in
which he summarizes the new RDoC funding announcements, Insel declared that
"Industry has reduced investments in medications for mental disorders and payers are
raising questions about the quality of evidence for psychosocial treatments. We hope that
this new approach to clinical trials [RDoC] will set us on a course to having the science
base necessary for generating effective new therapeutics and validating those we have
now."
In a co-authored article published the same year, Insel pointed to a number of other studies
indicating that, despite advances made "in modern biology, especially contemporary cognitive,
affective, and social neuroscience," along with advances in neuroimaging technology that
enables the observation of brain activation and electrical brain activity, the American
Psychiatric Association, which publishes the DSM, has consistently been unable to
incorporate these findings into DSM, including in DSM-5 (Insel and Cuthbert 2015:499). They
cite World Health Organization statistics on morbidity caused by mental health disorders-over
800,000 suicides each year globally, most of which were linked to mental illness (2014: 499)-
and suggest that these fatalities could have been avoided if there were more effective treatments
available. The inefficacy of treatments, they argue, is due to the fact that DSM does not describe
biologically based pathologies. There is no guarantee, then, that treatments developed in studies
conducted using DSM categories target any kind of biological mechanism, because there is no
guarantee that people who share the same diagnosis also share some kind of biological likeness.
For these reasons-among many others-Insel concluded that DSM is both an outdated
research tool and an unethical one, insisting that "patients with mental illnesses deserve better"
than what DSM can give them (Insel 2014). RDoC is the latest attempt to make psychiatric research more
objective and anchored in biology. RDoC is not itself a diagnostic nosology. Rather, it is a template
listing domains of research that investigators can use to design and test hypotheses about the
mechanisms of psychopathology, with, as Insel says in a 2012 post to his Director's Blog on the
NIMH website, the "near-term goal" of restructuring and refocusing research away from DSM,
despite the primary role the manual plays across multiple sectors of contemporary life. 9 In a
commentary piece on the difference between RDoC and DSM published in Nature Reviews
Neuroscience, B.J. Casey, developmental psychobiologist, and Francis S. Lee, research
psychiatrist, clarify that the purpose of RDoC is to "facilitate the translation of basic
neuroscience research findings to clinical diagnosis and treatment," although the translation stage
is expected to come much farther down the line (Casey et al 2013:812). Instead of "working
backwards" by trying to describe the neurobiological basis of DSM-dictated diagnostic
categories, "the RDoC approach uses our current understanding of brain-behavior relationships
as the starting point and relates these to clinical phenomenology" (ibid.).
9 http://www.nimh.nih.gov/about/director/2012/research-domain-criteria-rdoc.shtml, accessed on January 12, 2015.
In a 2017 commentary piece in Nature, Insel invites researchers to "join the disruptors of
health science" and leave academia for the technology industry sector, where there are fewer
regulations and greater financial incentives (and resources) to move fast and produce results-
focused interventions. Moreover, there are engineers. Engineers have the expertise for
developing methods to capture and analyze vast quantities of data, data which is inaccessible in
academia, where restrictions regarding privacy and confidentiality place limits on how much (and
what kinds) of data can be gathered and stored. Within his start-up, Insel is a champion of
"digital phenotyping," also known as searching for "digital biomarkers" (Carey 2019; Dagum
2018). Digital phenotyping is one instantiation of Computational Psychiatry. Rather than build
research studies, with an aim of producing an intervention, based on previous studies about the
nature of mental illness, the data-driven approach of Computational Psychiatry turns on
gathering as much data as possible, regardless of the data's relationship to conventional ideas
about mental illness. Hence, Insel and others, including the Midwestern University research
group described in Chapter 4, have turned to mobile phones as a research tool, focusing their
analysis on the data users inadvertently transmit simply by using their phones (see also Brandt
and Stark 2018).
At surface level, Computational Psychiatry and all its various iterations rehearse the
field's enduring biological essentialism and its positivist longings, with a twist. Its advocates
strive to stabilize mental illnesses with an appeal to a mode of objectivity committed to
reproducing the truths of nature passively, with as little intervention from the scientist's hand,
mind, or heart as possible. Champions of Computational Psychiatry like Insel, who warmly
accept its "biotechnical embrace" (DelVecchio Good 2002), seek to achieve mechanical
objectivity through a "big data" approach. In theory, if data is gathered at a high enough
volume-if it's "big" enough-patterns will emerge from the data, and correlations that have
always been there, under our noses (or our ears) will become evident, correlations that might
even cut across the conventional boundaries between diagnostic categories set down in DSM.
Another larger aim of my dissertation is to challenge this notion of patterns emerging "on their
own," a discourse that many people who participate in big data projects are critical of. I show the
gap between the discourse of Computational Psychiatry and how things operate on the ground;
this includes giving voice to research practitioners' heterogeneous beliefs and attitudes toward
their work and its discursive promises.
STRUCTURE OF THE DISSERTATION
Chapter 1, "Computational Psychiatry's Coded Past," historicizes the preconditions and
pre-occupations that foreground the ethnographic case studies to follow. I review, and then read
against, primary and secondary source material narrating the story of American psychiatry's
infamous "paradigm shift" in order to locate connections between previous movements to re-
make psychiatry and the efforts surrounding Computational Psychiatry in the present. Tracing
ideas about the supposed stability of biomedical things back to a pivotal point of change in
Western psychiatry underscores how ideas about the biomedical resemble and are co-constituted
by ideas about the computational. I show how the exchange of metaphors between the
biological and the computational, an exchange that other scholars have observed in the history of
the life sciences, is a key feature of the history of North American psychiatry as well. This
metaphorical exchange continues to rhetorically inform the design and development of
Computational Psychiatry research, and the division of labor within research teams, in the
contemporary moment.
Chapter 2, "Talking Heads: Brains, Bodies, and Vocal Biomarkers," is the first
ethnographic chapter in the dissertation. It follows the interdisciplinary team at East Coast
University, which is situated in a neuroscience department. This team attempts to better
understand the neural mechanisms underlying depression through micro-level features of the
voice, in their words, "using the voice to understand the mind." In theory, a "vocal biomarker" of
depression is a sound that cuts directly to biological processes, so directly that its mere presence
stands in for and is commensurate with a pathological brain state. By focusing on the
maintenance work involved in data collection, especially efforts to discipline the body and
speech of research subjects, I show how the necessary pre-condition for studying speech as a
"natural object" is difficult (if not impossible) to maintain in practice. The search for vocal
biomarkers, a radically im/mediate sign, requires hyper-mediation.
In Chapter 3, "Do Androids Dream of Electric Speech?" I move to West Coast
University, with a team of researchers working to build a Virtual Human Interviewer (VHI)
system supposedly capable of conducting psychiatric assessment in a way that a human never
could: with far more accuracy, and without ever burning out. Fueled by military funding, this
team seeks to build a tool that can address post-traumatic stress disorder (PTSD) among veteran
populations. The virtual human's rapport-building, interactional infrastructure and its real-time
interactions with research subjects are propped up, sustained, and animated by human work, and
not just the classificatory work of labeling data and meta-data. The VHI is also sustained by and
made possible through the stitching together of culturally and historically specific ideologies of
language, mind, interaction, race, and gender, which researchers embed in its infrastructure as
they develop, test, and maintain it.
Chapter 4, "Listening Like a Computer," focuses on a team of researchers attempting to
build a cell phone application that can predict when a person with bipolar disorder will have a
manic episode based on changes in the quality of their speech. In addition to offering a case
study of the infrastructural arrangements, categorizing practices, and labor required to make
digital phenotyping possible, in this chapter I focus on the figure of bipolar disorder as a mood disorder
that causes audible changes in the quality of speech. The engineers and clinical team members
butted up against the limits of what listening can capture from the voice, implicitly challenging
the biological essentialism of the project in their day-to-day dealings with the research subjects'
voice data. Part of what is at stake in the BPU's work is the semantic ambiguity and polysemy
not only of emotional terms like "mania" and "depression," but of the term "listening" itself,
especially with respect to agency, responsibility, and professional codes of ethics.
In the Conclusion, I suggest that the ethico-moral frameworks and conundrums that
characterize the teams' interventions and enactments of listening, language, assessment, and
care are strange but also familiar. They bear an uncanny resemblance to the broader milieu of
mental health care and digital media in the United States. Altogether, the preceding
ethnographic chapters illustrate the work that Euro-American language ideologies are doing for
psychiatry and the mental health care sector in the age of digital reproduction. I conclude by
exploring how the ideologies themselves may (or may not) be shifting in the current moment.
References
Affectiva. <https://www.affectiva.com/> (accessed July 12, 2019).
Aulino, Felicity. 2012. "Senses and Sensibilities: The Practice of Care in Everyday Life in
Northern Thailand." Doctoral dissertation, Harvard University.
Aulino, Felicity. 2016. "Rituals of Care for the Elderly in Northern Thailand: Merit, Morality,
and the Everyday of Long-Term Care." American Ethnologist 43(1):91-102.
Benjamin, Ruha. 2016. "Informed Refusal: Toward a justice-based bioethics." Science,
Technology, and Human Values 41(6): 967-990.
Benjamin, Ruha. 2019. Race After Technology: Abolitionist Tools for the New Jim Code.
Cambridge, UK: Polity Press.
Berry, Maya J., Claudia Chávez Argüelles, Shanya Cordis, Sarah Ihmoud, and Elizabeth Velásquez
Estrada. 2019. "Toward a Fugitive Anthropology: Gender, Race, and Violence in the Field."
Cultural Anthropology 32(4): 537-656.
Brandt, Marisa and Luke Stark. 2018. "Exploring Digital Interventions in Mental Health: A
Roadmap," in Interventions: Communication Research and Practice (International
Communication Association 2017 Theme Book). Adrienne Shaw and D. Travers Scott, eds. Pp.
167-182. Bern: Peter Lang.
Buolamwini, Joy. 2016. "InCoding-In the Beginning." Medium, May 16.
<https://medium.com/mit-media-lab/incoding-in-the-beginning-4e2a5c51a45d#.efx8zxith>
(accessed 13 July, 2019).
Carr, E. Summerson. 2010a. Scripting Addiction: The Politics of Therapeutic Talk and American
Sobriety. Princeton: Princeton University Press.
Carr, E. Summerson. 2010b. "Enactments of Expertise." Annual Review of Anthropology 39:17-32.
Carr, E. Summerson and Yvonne Smith. 2013. "The Poetics of Therapeutic Practice:
Motivational Interviewing and the Powers of Pause." Culture, Medicine and Psychiatry 38:83-
114.
Carey, Benedict. 2019. "California Tests a Digital 'Fire Alarm' for Mental Illness." New York
Times, June 17. <https://www.nytimes.com/2019/06/17/health/mindstrong-mental-health-app.html>
(accessed August 4, 2019).
Chion, Michel. 1990. Audio-Vision: Sound on Screen. Claudia Gorbman, trans. New York:
Columbia University Press.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009.
Introduction to Algorithms. 3rd Edition. Cambridge: MIT Press.
Dagum, Paul. 2018. "Digital biomarkers of cognitive function." NPJ Digital Medicine 1(10).
David, E.E. Jr. "Bionics or Electrology? An Introduction to the Sensory Information Processing
Issue." IRE Transactions on Information Theory 8(2): 74-77.
DelVecchio Good, Mary-Jo. 2002. "The Biotechnical Embrace." Culture, Medicine, and
Psychiatry 25(4): 395-410.
Duranti, Alessandro. 1993. "Truth and Intentionality: Towards an Ethnographic Critique."
Cultural Anthropology 8(2):214-245.
Eidsheim, Nina Sun. 2019. The Race of Sound: Listening, Timbre, and Vocality in African
American Music. Durham: Duke University Press.
Erlmann, Veit, ed. 2004. Hearing Cultures: Essays on Sound, Listening, and Modernity. New
York: Bloomsbury.
Feld, Steven and Donald Brenneis. 2004. "Doing anthropology in sound." American Ethnologist
31(4): 461-467.
Feld, Steven. 2012. Sound and Sentiment: Birds, Weeping, Poetics, and Song in Kaluli
Expression. 30th Anniversary Edition. Durham: Duke University Press.
Feld, Steven. 2015. "Acoustemology." In Keywords in Sound Studies. David Novak and Matt
Sakakeeny, eds. Pp. 12-21. Durham: Duke University Press.
Faudree, Paja. 2012. "Music, Language, and Texts: Sound and Semiotic Ethnography." Annual
Review of Anthropology 41: 519-536.
Gordin, Michael. 2015. Scientific Babel: How Science Was Done Before and After Global
English. Chicago: University of Chicago Press.
Gupta, Akhil and James Ferguson. 1992. "Beyond 'Culture': Space, Identity, and the Politics of
Difference." Cultural Anthropology 7(1): 6-23.
Gusterson, Hugh. 1996. Nuclear Rites: A Weapons Laboratory at the End of the Cold War.
Berkeley: University of California Press.
Ekbia, Hamid R. and Bonnie Nardi. 2017. Heteromation and Other Stories of Computing and
Capitalism. Cambridge: MIT Press.
Hannerz, Ulf. 2003. "Being there...and there...and there! Reflections on multi-sited
ethnography." Ethnography 4(2):201-216.
Haraway, Donna. 1991. Simians, Cyborgs, and Women: The Reinvention of Nature. London:
Routledge.
Harkness, Nicholas. 2013. Songs of Seoul: An Ethnography of Voice and Voicing in Christian
South Korea. Berkeley: University of California Press.
Harkness, Nicholas. 2015. "The Pragmatics of Qualia in Practice." Annual Review of
Anthropology 44:573-89.
Hirschkind, Charles. 2006. The Ethical Soundscape: Cassette Sermons and Islamic Counterpublics.
New York: Columbia University Press.
Inoue, Miyako. 2018. "Word for Word: Verbatim as Political Technologies." Annual Review of
Anthropology 47: 271-32.
Insel, Thomas. 2017. "Join the disruptors of health science." Nature 551: 23-26.
Irani, Lilly. 2017. "'Design Thinking': Defending Silicon Valley at the Apex of Global
Hierarchies of Labor." Catalyst 4(1): 1-9.
James, Erica. 2004. "The Political Economy of 'Trauma' in Haiti in the Democratic Era of
Insecurity." Culture, Medicine, and Psychiatry 28: 127-149.
Jasanoff, Sheila and Sang-Hyun Kim. 2015. Dreamscapes of Modernity: Sociotechnical
Imaginaries and the Fabrication of Power. Chicago: University of Chicago Press.
Jones, Graham M. and Bambi B. Schieffelin. 2009. "Enquoting Voices, Accomplishing Talk: Uses of
Be + Like in Instant Messaging."
Language & Communication 29(1): 77-113.
Keane, Webb. 1997. "Religious Language." Annual Review of Anthropology 26:47-71.
Keane, Webb. 2008. "Others, Other Minds, and Others' Theories of Other Minds: An Afterword
on the Psychology and Politics of Opacity Claims." Anthropological Quarterly 81(2):473-482.
Kockelman, Paul. 2017. The Art of Interpretation in the Age of Computation. Oxford, UK:
Oxford University Press.
Lipartito, Kenneth. 1994. "When Women Were Switches: Technology, Work, and Gender in the
Telephony Industry, 1890-1920." The American Historical Review 99(4): 1075-1111.
Manyika, James, Michael Chui, Mehdi Miremadi, Jacques Bughin, Katy George, Paul Willmott,
and Martin Dewhurst. 2017. "A Future that Works: Automation, Employment, and Productivity."
McKinsey Global Institute Executive Summary. McKinsey & Company.
<https://www.mckinsey.com/~/media/mckinsey/featured%20insights/Digital%20Disruption/Harnessing%20automation%20for%20a%20future%20that%20works/MGI-A-future-that-works-Executive-summary.ashx>
Marcus, George E. 1995. "Ethnography in/of the World System: The Emergence of Multi-Sited
Ethnography." Annual Review of Anthropology 24:95-117.
Marsilli-Vargas, Xochitl. 2014. "Listening genres: The emergence of relevance structures through
the reception of sound." Journal of Pragmatics 69:42-51.
McGlotten, Shaka. 2016. "Black Data." In No Tea, No Shade: New Writings in Black Queer
Studies. E. Patrick Johnson, ed. Pp. 262-286. Durham: Duke University Press.
Mills, Mara. 2010. "Do Signals Have Politics? Describing Abilities in Cochlear Implants." In
The Oxford Handbook of Sound Studies. Trevor Pinch and Karin Bijsterveld, eds. Pp. 320-346.
Oxford, UK: Oxford University Press.
Mills, Mara. 2011a. "Deaf Jam: From Inscription to Reproduction to Information." Social Text
28(1): 35-58.
Mills, Mara. 2011b. "On Disability and Cybernetics: Helen Keller, Norbert Wiener, and the
Hearing Glove." differences 22(2-3): 74-111.
Mitchell-Kernan, Claudia. 1972. "Signifying and marking: Two Afro-American speech acts." In
Directions in Sociolinguistics. John J. Gumperz and Dell Hymes, eds. Pp. 161-179. New York: Holt,
Rinehart and Winston.
Mol, Annemarie. 2008. The Logic of Care: Health and the Problem of Patient Choice. London:
Routledge.
Murphy, Michelle. 2015. "Unsettling Care: Troubling transnational itineraries of care in feminist health
practices." Social Studies of Science 45(5): 717-737.
Nakamura, Lisa. 2009. "Don't Hate the Player, Hate the Game: The Racialization of Labor in World of
Warcraft." Critical Studies in Media Communication 26(2): 128-144.
Nakamura, Lisa. 2014. "Indigenous Circuits: Navajo Women and the Racialization of Early
Electronic Manufacture." American Quarterly 66(4): 919-941.
Nakano Glenn, Evelyn. 1992. "From Servitude to Service Work: Historical Continuities in the
Racial Division of Paid Reproductive Labor." Signs 18(1): 1-43.
Nguyen, Vinh-Kim. 2010. The Republic of Therapy: Triage and Sovereignty in West Africa's
Time of AIDS. Durham: Duke University Press.
Peräkylä, Anssi. 1995. AIDS Counseling. Cambridge, UK: Cambridge University Press.
Petryna, Adriana. 2009. When Experiments Travel: Clinical Trials and the Global Search for
Human Subjects. Princeton: Princeton University Press.
Philip, Kavita, Lilly Irani, and Paul Dourish. 2012. "Postcolonial Computing: A Tactical
Survey." Science, Technology, & Human Values 37(1): 3-29.
Picard, Rosalind. 1995. "Affective Computing." M.I.T. Media Laboratory Perceptual Computing
Section Technical Report 321.
Picard, Rosalind. 1997. Affective Computing. Cambridge: MIT Press.
Picard, Rosalind. 2003. "Affective computing: challenges." International Journal of
Human-Computer Studies 59: 55-64.
Reik, Theodor. 1964. Voices from the Inaudible: The Patients Speak. New York: Farrar, Straus.
Rhodes, Lorna. 1995. Emptying Beds: The Work of an Emergency Psychiatric Unit. Berkeley:
University of California Press.
Roosth, Sophia. 2017. Synthetic: How Life Got Made. Chicago: University of Chicago Press.
Rosa, Jonathan. 2019. Looking Like a Language, Sounding Like a Race: Raciolinguistic
Ideologies and the Learning of Latinidad. Oxford, UK: Oxford University Press.
Rosaldo, Michelle Z. 1982. "The things we do with words: Ilongot speech acts and speech act
theory in philosophy." Language in Society 11(2):203-237.
Schieffelin, Bambi B. 2008. "Speaking Only Your Own Mind: Reflections on Talk, Gossip, and
Intentionality in Bosavi (PNG)." Anthropological Quarterly 81(2):432-442.
Seaver, Nick. 2017. "Algorithms as culture: Some tactics for the ethnography of algorithm
systems." Big Data and Society 1-17.
Seaver, Nick. 2019. "Knowing Algorithms." In digitalSTS: A Field Guide for Science &
Technology Studies. Janet Vertesi and David Ribes, eds. Pp. 412-422. Princeton: Princeton
University Press.
Silverstein, Michael. 1985. "On the pragmatic 'poetry' of prose." In Meaning, Form and Use in
Context. D. Schiffrin, ed. Pp. 181-199. Washington: Georgetown University Press.
Silverstein, Michael. 1998. "The Uses and Utility of Ideology: A Commentary." In Language
Ideologies: Practice and Theory. Bambi B. Schieffelin, Kathryn Woolard, and Paul Kroskrity,
eds. Pp. 123-145. New York: Oxford University Press.
Silverstein, Michael. 2012. "Denotation and the pragmatics of language." In The Cambridge
Handbook of Linguistic Anthropology. N.J. Enfield, Paul Kockelman and Jack Sidnell, eds. Pp.
128-157. Cambridge: Cambridge University Press.
Simpson, Audra. 2007. "On Ethnographic Refusal: Indigeneity, 'Voice,' and Colonial
Citizenship." Junctures 9: 67-80.
Smith, Benjamin. 2005. "Ideologies of the speaking subject in the psychotherapeutic theory and
practice of Carl Rogers." Journal of Linguistic Anthropology 15:258-72.
Stasch, Rupert. 2008. "Knowing Minds is a Matter of Authority: Political Dimensions of Opacity
Statements in Korowai Moral Psychology." Anthropological Quarterly 81(2):443-453.
Stevens, Hallam. 2013. Life Out of Sequence: A Data-Driven History of Bioinformatics.
Chicago: University of Chicago Press.
Suchman, Lucy. 2007. Human-Machine Reconfigurations: Plans and Situated Actions. 2nd
Edition. Cambridge, UK: Cambridge University Press.
Sunder-Rajan, Kaushik. 2006. Biocapital: The Constitution of Postgenomic Life. Durham: Duke
University Press.
Taussig, Karen-Sue, Klaus Hoeyer, and Stefan Helmreich. 2013. "The Anthropology of
Potentiality in Biomedicine: An Introduction to Supplement 7." Current Anthropology 54(S7):
S3-S14.
Taylor, Astra. 2018. "The Automation Charade." Logic 5:
<https://logicmag.io/failure/the-automation-charade> (accessed August 5, 2019).
Thakor, Mitali. 2018. "Digital Apprehension: Policing, Child Porn, and the Algorithmic
Management of Innocence." Catalyst: Feminism, Theory, Technoscience 4(1): 1-16.
Throop, Jason. 2010. Suffering and Sentiment: Exploring the Vicissitudes of Experience and
Pain in Yap. Berkeley: University of California Press.
Turkle, Sherry. 2006. "A Nascent Robotics Culture: New Complicities for Companionship."
AAAI Technical Report Series, July.
Turkle, Sherry. 2018. "There Will Never Be An Age of Artificial Intimacy." The New York
Times, August 11. <https://www.nytimes.com/2018/08/11/opinion/there-will-never-be-an-age-of-artificial-intimacy.html>
(accessed August 4, 2019).
Vehviläinen, Sanna. 2008. "Focus on the patient's action: identifying and managing resistance in
psychoanalytic interaction." In Conversation Analysis and Psychotherapy. Anssi Peräkylä,
Charles Antaki, Sanna Vehviläinen, Ivan Leudar, eds. Pp. 120-38. Cambridge, UK: Cambridge
University Press.
Visweswaran, Kamala. 2003. Fictions of Feminist Ethnography. Minneapolis: University of
Minnesota Press.
Vrecko, Scott. 2010. "Birth of a brain disease: science, the state, and addiction neuropolitics."
History of the Human Sciences 23(52):52-67.
Wilce, James M. 2009. "Medical Discourse." Annual Review of Anthropology 38:119-215.
Wilf, Eitan. 2019. "Separating noise from signal: The ethnomethodological uncanny as aesthetic
pleasure in human-machine interactions in the United States." American Ethnologist 46(2): 202-
213.
Wilson, Elizabeth. 2010. Affect and Artificial Intelligence. Seattle: University of Washington
Press.
Winner, Langdon. 1980. "Do Artefacts Have Politics?" Daedalus 109(1): 121-136.
Wynter, Sylvia. 2003. "Unsettling the coloniality of being/power/truth/freedom: Towards the
human, after man, its overrepresentation: An argument." New Centennial Review 3(3): 257-337.
Chapter 1: Computational Psychiatry's Coded Past
"Perhaps in parasitology, in orthopedics, and in computer technology one can escape from
humanism, but not in psychiatry...it has more in common with the inevitable ambiguity of great
drama than with the DSM-III's quest for algorithms compatible with the cold binary logic of
computer science"
- (Vaillant 1984: 544).
"We used to laugh and kind of say, okay, we'll stop trying to teach the computer to act like
a clinician...we're trying to teach the clinician to apply logical rules, kind of more like a
computer."
- Jean Endicott, DSM-III Task Force member, to Jackie Orr (2006: 245)
By popular and scholarly accounts, and according to the psychiatric researchers and practitioners
I spoke to during my preliminary fieldwork, the publication of the third edition of the Diagnostic
and Statistical Manual of Mental Disorders (hereafter DSM-III) in 1980 marked a significant
turning point in the history of North American psychiatry. As the dominant narrative goes, its
publication both represented and catalyzed a radical break from the old epistemological guard of
psychoanalysis: once DSM-III went out into the world, psychiatry as it was practiced in the U.S.
and exported elsewhere had changed (Spitzer 2001; Sanders 2011). In the subtitle of her book
dedicated to telling the story of DSM-III's creation, medical historian Hannah Decker (2013)
goes so far as to equate the diagnostic manual's third, official revamping with a "conquest of
American psychiatry" (my emphasis). Indeed, the roughly 450-page text, 330 or so pages
longer than its predecessor, initiated dramatic, far-reaching change in the United States. DSM-III
was a powerful document because of how seamlessly it fused with and reinforced the
bureaucratic logics and logistics of biomedicine, which insurance companies and pharmaceutical
manufacturers had increasingly come to dictate. As Jackie Orr puts it, "in the entangled realms of
psychiatry and psychotherapy, medicine, the pharmaceutical industry, the legal system, the
insurance industry, social and self-identity, and popular discourse," DSM-III birthed "a new
order of things" (Orr 2010: 354). This was accomplished, in part, because DSM-III standardized
the language of psychiatry in a way that it had never been before, linguistically aligning the
interaction between clinician and client (Semel 2013) across the triad of the psychiatric
encounter, from assessment/screening, to diagnosis, to monitoring. The symptom criteria in
DSM-III and in subsequent editions structured the questions that clinicians asked patients, and
determined how to interpret the content of patient responses.
Previous iterations of DSM (like DSM-II, published in 1968) grounded the diagnostic
criteria of mental illnesses in psychoanalytic theories of disease causality: thwarted libidos,
overly cathected egos, and so on. One of the crowning achievements and most controversial
moves of the DSM-III task force, fronted by Dr. Robert Spitzer, was to expel psychoanalysis
from the manual as much as possible in favor of defining and grouping illnesses according to
symptoms that cohorts of research subjects seemed to share (APA 1980). The task force wanted
to categorize mental illnesses based on the symptoms that patients expressed and that any and
every clinician, regardless of their theoretical training, could identify (Spitzer and Sheehy 1976).
For Spitzer and his colleagues, this emphasis on symptom phenomenology put their work in
lockstep with Emil Kraepelin, the 19th-century German physician who was both a foundational figure
to North American psychiatry and a foil to Freud (Decker 2007). Kraepelin drew his
classificatory schema from long-term observations of patients' suffering: detailed descriptions of
his patients' hallucinations, glossolalia, the contents of their obsessional thoughts, bodily tics,
attempts at self-harm, and so on. If the goal of Freud's disciples was to identify sublimated
connections between the known and unknown selves of the analysand, the goal of the
Kraepelinians was to scrutinize and catalogue the various behavioral manifestations, grand and
minute, of psychic pathology.
Task force members deemed their neo-Kraepelinian, phenomenological approach an
"atheoretical" one (Feigner 1979; Bayer and Spitzer 1985). Their ideological approach aimed to
realize a genre of objectivity that Daston and Galison (2007) call "truth to nature," featuring
images of what symptoms of mental illnesses looked like in a way that was as faithful to the
biological processes as possible, putatively unmediated by any dogma or theory. This was, at
least, their ideal, the form for which the neo-Kraepelinians strived. The success of their endeavor
may be debated. For instance, the task force's decision to include "psychogenic pain disorder"
and "ego dystonic homosexuality"11 in the manual not only bespeaks the lingering presence of
psychoanalytic etiology but also signals that DSM-III remained a technology for policing social
deviance rather than, as the task force had wished, a magnifying glass for identifying biologically
based pathology. Moreover, as others then and now have pointed out, the task force's
"atheoretical" approach in and of itself posited a theoretical orientation toward making sense of
the world (see Klerman 1977; Rosenberg 2007; Orr 2010). Indeed, task force members drew
from very specific interpretations of how other biomedical practices (like orthopedics and
oncology) conceived of and investigated their objects of study: namely, as stable, discretely
defined, material phenomena that could be extracted from their contexts of occurrence.
11 The task force did indeed remove the category of "sexual orientation disturbance" in 1973, a diagnostic criterion
that overtly pathologized homosexuality. Unlike "sexual orientation disturbance," "ego dystonic homosexuality"
corresponds with the distress one feels upon the realization of one's attraction to same-gendered people, or due to
feelings of shame after participating in same-gendered sexual acts. Spitzer and company's reasoning was that this
category would allow people to seek counseling for these distressing, shameful feelings. Critics, however, argue that
this diagnostic criterion leaves room for the pathologization of non-heterosexuality, while also keeping the door
open for conversion therapy as one method of treatment (i.e., eliminate shame and distress by converting sexual and
romantic desires from same-gendered to different-gendered).
In this chapter, I historicize the preconditions and pre-occupations that foreground the
ethnographic case studies to follow. I do so by reviewing and reading against secondary source
material narrating the story of American psychiatry's infamous "paradigm shift," along with
research articles produced during this time. Tracing ideas about the supposed stability of
biomedical things back to a pivotal point of change in Western psychiatry underscores how task
force members' ideas about the biomedical resemble and are co-constituted by their ideas about
the computational. In other words, the rise of what Jackie Orr calls "biopsychiatry," which
"embraces a medicalized model of mental disorders while claiming a scientific status for
contemporary psychiatric practices of diagnosis and treatment" (2010: 345), coincides with the
introduction of computers and other machines into psychiatry in the U.S. Thus, I show how the
exchange of metaphors between the biological and the computational, an exchange that other
scholars have observed in the history of the life sciences (Fox-Keller 1995; Helmreich 1998;
Hayles 1999; Kay 2000; Erickson et al. 2013), is a key feature of the history of North American
psychiatry as well (see also Martin 2007). This metaphorical exchange continues to rhetorically
inform the design and development of Computational Psychiatry research, and the division of
labor within research teams, in the contemporary moment.
My analysis does not focus on the practice of psychiatry, so I treat neither the
administration of psychiatric care, nor acts of assessment or diagnosis themselves. Instead, I
trace out the ideological strands and rhetorical shifts within U.S. psychiatry, articulating them in
order to illustrate how they continue to motivate research projects that fall under the domain of
Computational Psychiatry, such as my informants' research. My aim in this chapter is two-fold.
First, I seek to clarify how the figure of the machine (especially the computer) operates within
primary and secondary sources surrounding the publication of DSM-III, underscoring the way
that efforts to introduce machines into the diagnostic encounter serve as switching points at
which ideas about the computational and the biological come together, or are co-produced
(Jasanoff 2004). Throughout the 20th century up until the present, psychiatric researchers and
practitioners have recognized that mental illnesses are frustratingly slippery "moving targets" that
"emerge in the encounter between patients' subjective reports and a clinician's interpretive
schemes" (Lakoff 2005: 2), thereby resisting rigid, definitional boundaries and consistency
across person, place, or time. Running in tandem with this frustration have been attempts to
overcome it using the discourse of the machines (and often, real machines) with the notion in
mind that machines have an inherent, essential capacity for locating the biomaterial
underpinnings of mental illnesses and settling them into discrete, specific units. Machine reason
(as disinterested, binary, and less costly) is positioned against human reason (as haphazard,
infinitely varied, and expensive).
Upon closer examination, however, rhetorics and enactments of machine logics rely on
the work of para-professional administrative laborers, from typists, to assistants, to technicians
(see Schaffer 1994). Secondary source material in particular tends to shift focus away from this
supportive, administrative labor and the crucial role these actors played in the computerization of
psychiatry. Recovering such persons and their labor as well as underscoring both their presence
and importance can help to intervene in contemporary debates about the capacity for machines
to "replace" human labor, debates in which my informants often participate. Even in the earlier
years of the history of computing in the U.S., what looks at first like human "replacement" was
human re-placement: instances in which humans are moved to less visible positions in the
production pipeline, or, due to the status of the labor they perform or the jobs they hold,
are themselves assimilated to the figure of the machine.
The second aim of this chapter is to historically situate the so-called paradigm shift that
"Computational Psychiatry" enacts. Popular discourse, such as the language used in a 2017
article in the MIT Technology Review, deems Computational Psychiatry an "emerging science"
facilitating a "quiet revolution" that turns away from the past. As discussed in the Introduction,
Computational Psychiatry involves the use of artificial intelligence-enabled analysis methods to
pin down signs of psychopathology in biological processes, especially neuronal activity.
Researchers and journalists tend to define Computational Psychiatry against Western
psychiatry's traditional hypothesis or theory-driven approach, in part as a way to signal its
novelty (and thus its lack of historicity). In conventional research contexts, researchers use
existent hypotheses or theories about mental illness based on prior scholarship to structure their
research questions. Conversely, with the data-driven approach that is the hallmark of
Computational Psychiatry, researchers apply "theoretically agnostic data-analysis methods from
machine learning (ML) broadly construed (including, but extending, standard statistical
methods)" to structure their research (Huys et al. 2016: 404; my emphasis). I argue that
Computational Psychiatry, rather than marking a clean break from the past, represents a re-
instantiation, retrenchment or even reiteration of the epistemological goals and the infrastructural
requirements of the years leading up to its ascendency in the present, namely, the use of
machines and computational processes to "delete the social" from practices of studying and
identifying mental illness (Leigh Star 1991; Forsythe 1993).
Part of this historicizing work includes challenging the dominant narratives in many
secondary sources of DSM-III's own paradigm shift, which represent the publication of DSM-III as
a totalizing, top-down transformation away from "the inevitable ambiguity of great drama" and
toward empiricism, theoretical neutrality, and objectivity. I do so by highlighting Lempert's
(2019) groundbreaking article describing the models of empiricism and the use of recording
devices in psychiatry in the years prior to DSM-III (although I suggest that Lempert downplays
the important role that administrative workers played in making these projects possible). Freud
initially defined psychoanalysis, and the psychoanalytic therapist, against biomedicine, arguing
that psychoanalysis does not deal with the biophysiological realm; this is the definition of
psychiatry that the neo-Kraepelinians worked to overturn. Nevertheless, as Lempert shows, there
were concerted efforts in the 1930s and 1940s in the U.S. to render psychoanalysis into an
empirically grounded science, precisely through the use of machines to capture the ineffable
presence of the unconscious in speech, isolating a material trace that could help scientists and
practitioners track the efficacy of therapy, thereby rendering it more objective. While the
secretaries, typists, and other "verbatim laborers" (Inoue 2018) situated at the edges of Lempert's
archives may seem to play a neutral role in transforming text from spoken utterance to graphic
trace, I argue that they played an active role in projects of making the unconscious material and
legible.
Despite these efforts, psychoanalysis fell out of favor in the U.S. But it did not fall out of
favor because researchers studying psychoanalysis rejected empirically driven methods tout
court. Rather, the sun set on psychoanalysis in the U.S. because its supporters and practitioners
could not fit the mysteriously coded actions of the unconscious into the actuarial frameworks of
evidence and efficacy that pharmaceutical companies, insurance companies, and regulatory
apparatuses increasingly pushed in the U.S. Contemporary efforts to stabilize mental illness
using computational methods thus satisfy epistemological and bureaucratic longings alike. While
Computational Psychiatry is part of a larger movement in U.S. psychiatry to dispose of the DSM
altogether and develop a novel, biologically based nomenclature, this movement nevertheless
operates within the same, entangled premises that drove the publication of DSM-III and that
psychoanalysis could not satisfy.
Altogether, there is connective tissue between Computational Psychiatry and DSM-III,
spun by the work of people like Spitzer, and supported by assertions about the abilities of
machines to restrain so-called theoretical biases from coloring psychiatric research. The notion
that computational techniques might standardize the order of observational and interpretive
operations in psychiatry-thus rendering psychiatry into an objective science and saving costly
human resources in the process-has much deeper historical roots than the popular discourse
about Computational Psychiatry's novelty suggests. Computational Psychiatry today holds the
same rhetorical promise that DSM once did; its application in the context of research studies
requires research investigators to pursue the same doomed enterprise of trying to shore up the
division between objectivity and subjectivity.
In historicizing Computational Psychiatry, I interrogate the taken-for-granted division
between the machinic and the human, pushing back against the notion that psychiatry was a
humanistic practice prior to DSM-III and has become more technical (and less human) since.
Scholars writing about the role DSM-III played in the history of psychiatry tend to segment the
time leading up to its publication, and the years following it, in dichotomous terms: subjective
and objective, immaterial and material, psychic and organic, personal and general,
psychoanalytic and neo-Kraepelinian, and so on. For instance, Lakoff (2009: 3) phrases the
divide in terms of the difference between recognizing and treating mental illness "through purely
technical means" (on the neo-Kraepelinian side) or by accounting "for the particular life
trajectory of the subject" (on the psychoanalytic side). These binaries are mapped on to the
division between the machine and the human. In my critical reading of secondary sources, I
show how the dominant narrative about the re-making of U.S. psychiatry reifies this divide and
calcifies these binary categories in the process. Foreshadowing a tactic I employ in analyzing my
ethnographic material, this chapter focuses on moments during which the divide between the
computational and the humanistic breaks down and refracts, in which the two (or their
corollaries, the objective and subjective) dance together.
MINDING THE BODY: PSYCHOANALYSIS, MATERIALITY, AND TRANSDUCTIVE
LABOR
From the mid-1960s onward, psychiatrists in the U.S. have tended to answer what Lakoff
(2005:3) says is the field's most fundamental question-do we locate mental illness in the
organism, or in the psyche?-by seeking the biological mechanisms of mental illness at greater
levels of specificity. To situate mental illness in the body, the claim goes, would be to stabilize it
as an object of knowledge, to free it from the specificities of its context-to render it objective.
But what has it meant to locate illness in the psyche? Moreover, how have proponents of
psychoanalysis prior to the 1960s searched for mental illness in the psyche? The project of
making psychiatry into what it is today began as an attempt to turn away from psychoanalytic
models of disease and treatment methods. I give a brief review of how psychoanalytic theory
positioned its object of study-mental illness-against other forms of pathology and "organic"
medicine in order to give a clearer picture of what the neo-Kraepelinians were working against. I
then draw from secondary literature covering efforts to locate evidence of the unconscious using
audio recording devices and transcripts, in order to clarify the claim that psychoanalysis is
opposed to objectivity or materiality. These projects to track therapeutic processes through subtle
signs in the body and the voice anticipate my informants' projects, both in their aspirations,
linguistic ideologies, and in their reliance on transductive labor-in this case, the work of
secretaries and transcriptionists whose job it was to create and annotate transcripts of patients'
speech, transforming audio recordings into orthographic representations of verbal and non-verbal
communication. These projects also suggest a more nuanced framing of the body, objectivity,
and evidence prior to DSM-III than the narrative of the neo-Kraepelinians conveys, while also
suggesting continuity between the pre- and post-DSM-III eras with respect to labor and machines.
In Freud's The Interpretation of Dreams ([1899] 1998), psychoanalysis' foundational
text, Freud analyzes the content of his own dream in order to exemplify that dreams enact the
fulfillment of an inappropriate wish or desire (151) while also diagramming his theory of the
tripartite structure of the self (the id, the ego, the super-ego). As an analyst caught up not only in
crystallizing psychoanalytic theory but also in curing his patients, the central wish of his own
inappropriate dream is that he be "acquitted" (151) of the responsibility of curing his patient,
Irma, whom he had diagnosed with hysteria but had been unable to cure. In both the dream and
waking worlds, Irma continues to suffer from physical symptoms (chest pains, difficulty
breathing) that Freud had hitherto determined as psychological in origin, arising not from some
underlying physiological issue that could be treated by a physician, but from some disturbing and
yet to be reckoned with event in the past that a psychiatrist should treat. Decker (2013) points out
that hysteria was "the disorder that first led Freud to develop psychoanalysis and the theory of
'unconscious' conflict" (203). The perennial anxiety that hysteria causes the analyst-the fact
that it involves both the mind and the body, the psychic and the organic-thus lies at the origins
of psychoanalytic thought.
In his analysis of his own dream, the way in which Freud reasons himself out of being
"responsible for the pains [Irma] still had" (141) reveals something about the tenuous and
delicate boundary between psychiatry and biological medicine in his time. Remarks Freud:
I was alarmed at the idea that I had missed an organic illness. This, as may well be
believed, is a perpetual source of anxiety to a specialist whose practice is almost limited
to neurotic patients and who is in the habit of attributing to hysteria a great number of
symptoms which other physicians treat as organic. On the other hand, a faint doubt crept
into my mind-from where, I could not tell-that my alarm was not entirely genuine. If
Irma's pains had an organic basis...I could not be held responsible for curing them; my
treatment only set out to get rid of hysterical pains. It occurred to me, in fact, that I was
actually wishing that there had been a wrong diagnosis; for, if so, the blame for my lack
of success would also have been gotten rid of (141-142, emphasis original).
Since, according to Freud's theory, hysteria manifests itself and is experienced by the patient as
physiological distress, the analyst would rightly be fraught with anxiety regarding its diagnosis.
Accurate diagnosis of disturbances like hysteria, which appeared to blur the divide between mind
and body, challenged the expertise and diagnostic resources of both psychiatric clinicians and
medical physicians alike. Biological illnesses were the responsibility of physicians, and
physicians could not be expected properly to identify or treat "hysterical," psychological
disturbances. Such illnesses exceeded the limits of a physician's prowess. The dividing line
between mental illness and biological illness relied on the distinction between organic and
psychogenic, and between the physiologically grounded in the body and the psychologically
grounded in experience. The presumed space between these categories was what separated
psychiatric clinicians from other kinds of biomedical physicians.
Rosenberg (2002) notes that the way in which biomedical physicians conceive of disease
is a historical achievement of the 19th century, fomented by the proliferation of imaging
techniques for apprehending and knowing the body's internal processes. In many ways, then, the
dividing line between the psychoanalyst and the physician also revolves around the concept of
"disease specificity," or the notion that diseases "can and should be thought of as entities existing
outside the unique manifestation of illness" in a person (Rosenberg 2002: 237) rather than a fluid
phenomenon that shifts according to environment, relationship, circumstance, or individual life
course. DSM-III task force members hitched together their pursuit of disease specificity with the
pursuit of making psychiatry into a more biomedically oriented field.
Because disease specificity was not a primary concern for Freud and his predecessors,
neither was diagnosis. Under a psychoanalytic paradigm, there is no clear distinction between the
well and the unwell. Psychopathology is sewn into the fabric of being human; to be born and
continue to live is to rupture psychically, and forever pursue repair. In its most classic, Freudian
iteration, the psychoanalytic subject is wracked by the tension between the socially unacceptable,
erotic and violent urges of the id, and the ego and superego's drive to combat, thwart, conceal, or
convert these urges into something more acceptable. The psychodynamic analyst's job is to help
the patient make sense of the myriad ways in which these inner conflicts and forbidden desires
re-substantiate and re-code themselves in one's interpersonal relationships, dreams, slips of the
tongue, and so on. Hewing closely to a person's singular life history, diagnosis under
psychoanalysis resists standardization. The nature of the problem-the diagnosis-differs from
person to person.
At the same time, as psychoanalysis caught fire (and met challenge) in the United States
from the 1920s onward, there were several concerted efforts to grasp hold of the unconscious,
via the body and the voice, in order to demonstrate it existed and that the efficacy of therapy
could be accounted for. Michael Lempert (2019) traces the work of Earl Zinn and Harold
Lasswell, two researchers experimenting with recording psychoanalytic sessions in the early
1930s. While Lempert is primarily focused on mapping out the relationship between these
attempts to "spy on the mind through the aperture of the body" (35) and the flourishing
communication sciences and studies of face-to-face interaction in the U.S., his article provides a
vivid picture of the role researchers hoped machines could play in making the mind's inner
workings more material. Particularly useful is Lempert's suggestion that these early projects to
distill the data of psychoanalytic encounters were driven by a wish to downplay and "bypass the
human in order to let nature speak as truly as possibly" (29), a mode of ethical orientation toward
one's object of study that Daston and Galison (2007) term "mechanical objectivity," or the use of
machinic technologies to downplay and restrain the introduction of the scientist's self into the
pursuit of scientific knowledge. In the experiments of Zinn and Lasswell we find the seeds of
Computational Psychiatry's recurring theme: that machines-here gramophones,11 aided by
wax records and transcripts-are media that can downplay (and obviate) the interference of
human subjectivity and allow the secrets of the mind's inner life to shine through, unadorned. At
the same time, the transductive work of Zinn and Lasswell's secretaries-in transcribing the
audio recordings to be available for analysis-suggests that mechanical objectivity entails not
only the removal of human subjectivity, but also depends upon the labor of actors who are not
firmly placed within the category of the human subject. In other words, the scientific self that is
nobly restrained in pursuit of mechanical objectivity depends upon strict, exclusionary
guidelines for who can count as a scientific subject versus object, guidelines that rest on a liberal
conceptualization of the person as an individual endowed with the spark of intellect and
inalienable rights, such as the right to own property, to participate in liberal democracy, and so
on (Haraway 1997; Herzig 2005).
11 Lasswell and Zinn's use of the gramophone corresponds with and almost perfectly embodies the argument Kittler
makes in Gramophone, Film, Typewriter (1999), which explores the role of media in the remaking of the psyche and
subjectivity: Kittler argues that phonography was a technology for rendering the psyche objective-for creating
"non-subjective" inscriptions of subjectivity.
Zinn was the director of the Committee for the Study of Personality, a New York-based
subcommittee of the Social Science Research Council (SSRC). Lasswell, his contemporary, was
a "psychoanalytically inclined political scientist" at the University of Chicago (Lempert 2019:
33). Avid supporters of psychoanalysis, both men sought to push back against those who
denigrated the paradigm for the "subjectivity of the reported data" (Lempert 2019: 35). Zinn
himself often complained that the psychiatrists and psychoanalysts he encountered showed no
interest in the "scientific validity of their data" about their patients and proposed a conference
dedicated to establishing uniform research methods in psychoanalysis (Lempert 2019: 31). There
was also growing interest within the SSRC to explore psychoanalysis-"a difficult field as yet
virgin to rigorously controlled scientific exploration"-in formalized, experimental situations
(SSRC annual report, quoted in Lempert 2019: 31). In the Midwest, Lasswell was in obsessive
pursuit of "somatic indicators of psychological states that could be measured quantitatively"
(Lempert 2019: 34). He developed elaborate laboratory setups that had patients connected to
bands, sensors, and wires for tracking galvanic skin response, pulse and heart rate, breathing, and
fidgeting limbs as patients underwent analysis. Lasswell had a hunch that these somatic signs
might reveal the latent content of a patient's psyche in a way that denotational speech content
alone could not express. For Lasswell, attending avidly to the body's semiotic output during
analysis could finally provide "evidence of otherwise gauzy, abstract claims about mind-claims
that behaviorists dismissed as backward and unscientific" (Lempert 2019: 34).
Zinn and Lasswell were faced with a dilemma. They wanted to record the entirety of the
session, initially for the purpose of obtaining "verbatim" transcripts.12 However, inserting a
human note-taker into the session was out of the question. Their presence would wrinkle the
dyadic analyst-analysand relationship, sending the crucial process of transference askew.13 That
the analyst themselves take detailed notes was also out of the question. Freud dictated that the
analyst should remain receptive and responsive without attending too closely or consciously to
the analysand's speech, avoiding the risk of mapping their own (subjective) meaning onto the
analysand's free associations.14 How, then, to capture the interaction? Zinn and Lasswell
resolved to abandon "human stenographers and note-takers" altogether and instead "repurposed
wax cylinder dictation machines that had been marketed for business," creating audio recordings
of sessions (29). Zinn partnered with Alexander Graham Bell's Dictaphone Company. He hid the
presence of microphones throughout the session room, embedding at least one microphone in the
head of the couch where the analysand reclined (Lempert 2019: 36).
It was through this unobtrusive, invisible recording (and subsequent transcribing) that
Lasswell and Zinn began to codify and seek out what Lempert calls the "communicative
unconscious": bodily signs and vocal blips that the men interpreted to be the output of the
unconscious, the encoded signals of its response to analysis. In his 1935 article, "Verbal
Reference and Physiological Changes During the Psychoanalytic Interview," Lasswell posited a
12 Verbatim, for Lasswell and Zinn, corresponded with a representation of the what-is-said of speech: the content, or
the denotational substance alone (Lempert 2019: 29). Lempert argues that as Lasswell and Zinn's experiments with
recording sessions progressed, their interpretation of the semiotic potential of the transcripts expanded. Narrow
interest in the content of speech morphed into an interest in the indexical components of the communicative
interactions inscribed in transcripts.
13 As Elizabeth Wilson (2010) points out, the psychoanalytic encounter offers a space of simulation: an opportunity
for the analysand to simulate with the therapist the relationships they have elsewhere in life, so as to better
understand the contours, nuances, and unaddressed tensions of these relationships. This simulated relating-
transference-is therefore vital material for analysis in and of itself, and must not be disrupted by the introduction of
additional parties.
14 Freud recognized that analysts faced a difficult task in keeping track of the innumerable personal details-the
memories, phobias, and life histories-of scores and scores of patients over months, if not years, of analysis. His
technique to avoid over-saturation and to keep the analyst's own memory and therapeutic faculties as attuned as
possible was to avoid recording or detailed note taking. His technique, in his words, "consists simply in not directing
one's notice to anything in particular and in maintaining the same 'evenly-suspended attention'...in the face of all
that one hears...For as soon as one deliberately concentrates his attention to a certain degree, he begins to select from
the material before him...and in making this selection he will be following his [own] expectations or inclinations"
(1912: 110-111).
direct relationship between vocal quality and unconscious content. He found that slowed speech
rate corresponded with increased psychophysiological tension, eventually arguing that "somatic
measures reveal what speech 'means,' clinically speaking" (Lempert 2019: 39; 34). Both
researchers began to realize that there was additional semiotic substance running alongside
speech content that the loosely attentive analyst might not pick up on: false starts, words cut off
before completion, or hitches in the voice that occurred at certain, significant points in analysis.
Sometimes, these signs even contradicted semantic content. A patient might insist that his dream
was not about his father, but upon reviewing the recording, Lasswell and Zinn would find a
tremor in the heart-or the voice-caught by a sensor or transcribed by one of their secretaries
that suggested an alternative interpretation.
In other words, Zinn and Lasswell began to suggest that the recording revealed the
presence of indexical signs in psychoanalytic interactions: signs that bear an existential, causally
contiguous relationship with the objects for which they stand. In this case, the fidgets and sighs,
according to the two researchers, emanated from and expressed the unconscious.15 But these
indexical components did not unfurl from the audio-recorded speech on their own. The legibility
of these signs depended upon the work of the secretaries and typists employed under Zinn and
Lasswell, who inscribed these signs into existence, listening to and then transcribing the speech
that played from the wax records, following the notational conventions that their bosses
prescribed.16
15 To put it differently, the recording devices played a critical role in the "indexicalization" of therapeutic
interactions, or the process by which indexical relations come to be interpretively treated as indexes (Lempert 2019:
25).
16 Archival records indicate that Zinn instructed his typists to mark speech for pauses as well as "kinesic behavior,"
like the lighting of cigarettes and the opening and closing of doors (Lempert 2019: 38).
While the act of transforming speech from spoken word to written trace may seem like a
passive copying, pure mimesis,17 it was through the very rendering and graphic notation of text
that the unconscious became available for analysis. Not just the recording, then, but the
transcription of spoken words to text was pivotal to the researchers' findings. Zinn failed to train
his typists uniformly, and because he re-used the wax cylinders of his Dictaphone, the
inconsistent transcripts left consequential gaps in his research, gaps that I argue speak to the
pivotal role Zinn's administrative team played in his research. Lempert points to an unevenness
in the transcripts, particularly for Zinn, who trained as an analyst so that he could personally
conduct the therapy that his surreptitious microphones recorded. In the transcripts that still exist
from Zinn's experiments, his patient's speech is marked with metacommunicative,
symptomological pauses and quips. But Zinn's own stream of speech, reproduced on the written
page, "seemed suspiciously fluid; pauses were seldom marked-and never in a context that
might reveal something psychological about him" (38). While Lempert interprets this to be a
clerical error, I suggest once again that it underlines just how powerful-and crucial-was the
typists' transcribing work. Zinn and Lasswell's administrative teams may have lacked
psychoanalytic training, but to make these traces legible in their supervisors' speech would have
put them in a position of illuminating the men's own unconscious impulses.
Although it is unclear if Zinn and Lasswell explicitly instructed their typists to keep the
transcripts asymmetrically opaque, this lack of detail nevertheless kept the power asymmetry
between (expert, scientific) employer and (inexpert, administrative) employee in place. Zinn and
Lasswell's projects to uncover the communicative unconscious, then, resemble my informants'
17 See Miyako Inoue (2011) for a historical ethnographic account of the shifting engenderment of stenography work
in Japan, from agentive and creative act (when it was the domain of men's labor) to imitative, passive verbatim
mimesis (when it was the domain of women's labor).
work on two counts. First, we see a surprising continuity between these pre-DSM-III efforts and
the basic logic and language ideologies that drive my informants' research (the existence of a
non-referential series of signs in a patient's speech that, if refracted through the proper machinic
media, can reveal otherwise opaque interior states). Second, we see that the machinic mediation
scientists call upon to render these indexical signs transparent, indelible, and legible requires
transductive labor, despite the fact that the hierarchical organization of labor within the scientific
research team situates this work at the bottom, as non-agentive. The use of machines to let the
voice of either the unconscious, or the body, sing forth, full throated, relies on objectified human
labor; the humans in this loop melt away and meld with the media of the various technologies
(especially recording devices) giving the impression that it is the machines that autonomously
"find" and transduce these indexical signs.
THE GREAT DRAMA OF COLD LOGIC
The death and violence of World War II boosted the status of psychoanalysis as analysts began to
grapple with, study, and theorize the relationship between participating in and witnessing acts of
violence and the experience of trauma (Young 1995). At the same time, dissent against
psychoanalysis was brewing again, this time not only from behaviorists but from budding
researchers and practitioners in training-like Robert Spitzer-who found themselves
disappointed with the epistemological tools handed down in their classrooms and clinical
internships. Spitzer and his colleagues began to articulate these critiques-and envision
alternative models for psychiatry and the DSM-through experiments with computerized
diagnosis.
Robert Spitzer's position at the helm of DSM's "atheoretical" reform was consistent with
his own training and his past experiences. Fourteen years prior to DSM-III's publication, Spitzer
had left the Columbia Psychoanalytic Institute with his degree-barely, by his own reports
(Decker 2013:94)-and with a deep dissatisfaction for psychoanalysis. Spitzer was part of a
growing group of psychoanalysis dissenters at universities and hospitals primarily concentrated
in the northeastern United States. Like Spitzer, these researchers and practitioners rejected
psychoanalysis for the primacy it placed on "wisdom" over "empiricism," or "debate"-
subjective convention that could not be reified in laboratory studies-over "data"-which they
figured as durable, quantifiable evidence, ideally rooted in human biology (Edelman 1969;
Feighner 1979).
Spitzer's attraction to this model of empiricism had as much to do with his personal tastes
and talents as it did with his ideas about the medical sciences. In a 2013 interview, Spitzer
disclosed that he neither enjoyed nor excelled at conducting psychotherapy. "I was always
unsure that I was being helpful," he confessed, "and I was uncomfortable listening and
empathizing...I just didn't know what the hell to do" (Decker 2013:94). While psychotherapy
repelled Spitzer, the diagnostic interview enticed him. During his time at Columbia, he
committed himself to creating uniform interviewing guides, testing and developing apparatuses
that set standardized procedures for assessing a patient's mental state and making a diagnosis.
For example, in the late 1960s, he published guidelines for New York State Department of
Mental Hygiene personnel on how to diagnose patients using DSM-II's freshly published
nomenclature (Spitzer and Wilson 1968).
All in all, it was the standardization of diagnostic procedures-which Spitzer and his
fellow neo-Kraepelinians saw as key to uniting psychiatry with the rest of biomedicine-that
kept Spitzer in the field, rather than the hermeneutics of analysis or the conversational arts. In
lieu of more rigorous psychotherapeutic training, the young Spitzer sought out expertise in
"technical" fields, dreaming up ways to bring these skills to bear on psychiatry. Most notably,
Spitzer took several courses in data processing, general computing, and the coding languages
FORTRAN II and IV at IBM's New York-based Data Processing Division. The courses "opened
up for him the world of algorithms" (Decker 2013:94)-a clean, idealized world of stable
correspondences between inputs and outputs, and clear-cut binaries rather than psychoanalysis's
winding and ever widening spectrum of individualized pathologies. At least, this is the image of
the algorithm that Spitzer chased after and that his forays into melding programming with
psychiatry would reproduce.
Just as DSM-II was published, Spitzer collaborated with like-minded Columbia
psychologist and eventual DSM-III task force member Dr. Jean Endicott and produced the first
of what would be three, interlinked papers in the Archives of General Psychiatry. Uniting his
commitment to standardization and the hope he invested in algorithms, the papers presented
three iterations of a computerized diagnostic program written in FORTRAN for the IBM 7094:
DIAGNO-I (1968), DIAGNO-II (1969) and DIAGNO-III (1974). Together, Spitzer and Endicott
aimed to establish a computer program that a clinician could use to diagnose patients with as
little human decision-making work as possible. In each of their papers, they set up a series of
"Man [sic] versus Computer" (1968:749) experiments testing the DIAGNOs' diagnostic prowess
against technicians and clinicians with varying degrees of experience. In offloading the decision-
making work of diagnosis to a computer, Spitzer and Endicott aimed to show that the various
DIAGNOs had the potential to reduce the time a clinician would need to spend with a patient,
and the money that the patient (or their insurance provider) would need to spend on the clinician.
[Figure: Schematic flow chart for DIAGNO computer program. Source: Arch Gen Psychiat, Vol 18, June 1968.]
Internal decision tree structure of DIAGNO-I.
In a more polarizing move, Spitzer and Endicott argued that computerized diagnosis
could address a problem that cut to the heart of the mounting tension between Freudians and
neo-Kraepelinians: diagnostic reliability. The neo-Kraepelinians often pointed out that, when using
the conventional, psychoanalytically oriented nomenclature, two clinicians could examine the
same patient and come up with completely different evaluations of the patient's psychiatric state
(Edelman 1969). One of my informants from West Coast University, a psychoanalytically
trained therapist from Argentina who was trying to fold engineering perspectives into his
psychotherapeutic practice, once referred to classic psychoanalysis as "the ultimate black box."
A patient might input their symptoms and stories to the receptive therapist, who would then
output a diagnosis or interpretation, but there was no way of knowing or replicating the
procedures a therapist was following to arrive at that output. Likewise, Spitzer and Endicott
attributed the "well-documented unreliability of psychiatric diagnoses" to variability in the order
of "operations by which clinicians use the raw data of observations to make a diagnosis" (1968:
746). Psychoanalysis could not provide clinicians with a universal web of associations between
symptom and disease, or a flowchart dictating that if a patient expresses x, then their diagnosis is
more likely to be y and never z. Through their Man vs. Computer experiments, Spitzer and
Endicott concluded that "this source of unreliability is completely eliminated by the use of a
computer program which will always arrive at the same diagnosis when given the raw data
describing a subject" (1968: 746).
They built DIAGNO-I to implement a "logical decision tree model similar to the
differential diagnostic procedure employed in clinical medicine"-a series of true/false questions
that would follow a different pathway depending on the answer (ibid). The three papers convey
that using DIAGNO requires little prior experience. All a DIAGNO operator need do is input the
patient's gender, age, number of previous hospitalizations, and symptoms - which the operator
should describe using the Psychiatric Status Schedule (PSS), a scale for assessing social role and
mental state (Spitzer and Endicott 1968: 746). Spitzer created-and never published-the PSS
while at Columbia; he intended for clinicians to use the scale much in the same way that DSM-
III would eventually be used. The PSS provided guidelines for how a diagnostician might ask
questions and elicit information about the patient's mental and social role functioning, as well as
guidelines on associations between their answers and DSM-II diagnostic categories. After
entering this data, DIAGNO-I would spit out a diagnosis, using "diagnoses and qualifying
phrases as well as two unofficial diagnoses: not ill and nonspecific illness with mild
symptomology" (Spitzer and Endicott 1968: 747). These two categories would eventually make
their way into DSM-III. Therefore, in addition to a foray into computerized diagnosis, the
DIAGNOs were a testing ground for DSM-III's epistemological finer points.
Despite the 1968 paper's promissory overtones, by the time Spitzer and Endicott were
writing with several other colleagues about DIAGNO-II in 1974, they concluded that the
computerization of diagnosis had reached a stopping point. However, the authors assured readers
that "all constraints on computerized diagnosis are of a partial nature and are inherent neither to
the kinds of information that computers can process, nor in the nature of the algorithms available
to them" (Spitzer et al 1974: 202). Indeed, other psychiatrists saw great promise in DIAGNO in
terms of its diagnostic acuity and its ability to save time and money. Orr notes that in 1975,
DIAGNO-II was "fully operational" and "installed for use at Rockland State Mental Hospital
[in New York], home of Nathan Kline, cyborg psychiatrist and founder of U.S.
psychopharmacology" (Orr 2010: 367). The authors of the 1974 paper contended instead, "the
constraint [on computerized diagnosis] lies...in the traditional diagnostic system itself" (202). To
them, the problem lay with the current state of psychiatry as a whole. It was at this moment that
members of the recently-formed DSM-III task force, like Endicott, decided that the answer to
psychiatry's reliability issue was not to train a computer to reason like a clinician, but to teach
the clinician to reason like a computer by revising psychiatric nomenclature altogether (Orr
2006: 245).
But a closer look at the DIAGNO papers reveals a crucial caveat to Endicott's
characterization. Spitzer and Endicott qualify in the second DIAGNO paper that "a computer
program can...yield a diagnosis, eliminating the costly use of experienced clinicians" so long as
"specifically trained technicians can be used to collect accurate data on subjects" (1969: 12, my
emphasis). In other words, clinicians cannot simulate the logical procedures of a computer alone.
To reason "like a computer," clinicians require human assistance. In order for DIAGNO to do its
job-to diagnose reliably and economically-DIAGNO requires a fleet of technicians trained in
standardized methods of data collection, which, in the case of psychiatric diagnosis, includes the
elicitation of details about a patient's symptoms. Throughout the six years they worked on
DIAGNO, Spitzer and Endicott recognized that "any system that relies on routinely collected
data must instate training and administrative procedures...to ensure high quality data" (Spitzer et
al 1974: 202). In their 1974 paper, the authors even suggest that many of the instances in which
the human clinician and DIAGNO gave conflicting diagnoses were "due to sheer blunders in the
ratings made by the clinical staff," such as data entry errors (ibid). In this way, the DSM's
psychoanalytic nomenclature and the status of psychiatry as a whole were not the only constraints
on computerized diagnosis. The success of computerized diagnosis depended, like any AI
application or machine learning, upon a para-professional labor force of data custodians, and
their capacity to carry out uniform procedures.18
Implicit in this admission is that while computerized diagnosis might scale back the
costly use of clinicians, it scales up the presumably less costly (and, by proxy, less valuable)
labor of technicians gathering the data to be fed into the program. At the germinal moment of
North American psychiatry's empiricism, then, we find another familiar refrain: the more
computational psychiatry becomes, the greater the need for cheap, mechanized labor. Moreover,
in this asterisk about the necessity of a para-professional labor force whose job it is to gather the
"raw" data to be delivered to the person-or machine-responsible for diagnosis, we find the
stirrings of psychiatric screening as a sub-species of psychiatric judgment, a kind of sorting that
is necessary yet inferior to the act of diagnosis. For the work of gathering and inputting patient
18 Scholarship across STS and the history of science has indeed affirmed that this is the case for many scientific
disciplines. The making and doing of science relies on delegated and distributed human labor, whether it be in the
production of maps (Turnbull 2000), the administrative labor that transforms objects found out in the
world into specimens for display in museums (Star and Griesemer 1989), the seemingly flashy and high-tech field
of bioinformatics, in which the "wet work" of laboratory sciences has been scaled up and in-sourced to heteromated
warehouses (Stevens 2013), or neo-colonial configurations of labor in bio-prospecting expeditions to develop
novel medications (Hayden 2004; Soto Laveaga 2009).
data in order to determine whether or not they require the (costly) time and attention of a
clinician is the work of psychiatric assessment. DIAGNO's technicians pre-figure the
administrative position of psychiatric assessment with respect to diagnosis.
Daston (1992) deems the ideological production of a neutral, scientific gaze, in which the
idiosyncrasies of the observing scientist(s)' identities are wiped away from the written page,
"aperspectival objectivity." This view from nowhere-which is, in the case of DSM-III and
Computational Psychiatry, also an ear from nowhere-is achieved through a distributed network
of observers doing "technical" work. In many ways, Spitzer and Endicott attempted to use the
computer to achieve aperspectival objectivity-a mode of interpreting the signs of mental illness
that was supposedly set aside from the clinician's theoretical dispositions-in addition to
mechanical objectivity. Daston notes that aperspectival objectivity was only "imported and
naturalized into the ethos of the natural sciences, as a result of reorganization of scientific
life...when science came to consist in large part of communications that crossed boundaries of
nationality, training, and skill" (600). Daston's theorization of aperspectival objectivity is
instructive, here, because it links together a mode (and an ethic) of observation with the
arrangement of labor that is required to sustain it. Like Zinn's secretaries, the para-professional
technicians feeding DIAGNO its data all suggest that mechanical objectivity-which entails not
just the removal of the idiosyncrasies of perspective, but of the human altogether-requires its
own arrangement of labor. The "mechanic" of mechanical objectivity includes not only non-
human machines, but machine-like human labor, work that is a mechanized feature of a
computer's or recording device's attendant infrastructure.
FILLING A FINANCIAL BOTTOMLESS PIT
Spitzer, Endicott, and the other detractors of psychoanalysis who would eventually form or
become associated with the DSM-III task force all took issue with the lack of diagnostic
reliability that DSM-II and the psychoanalytic conventions of the day offered. A psychoanalytic
paradigm emphasizes individual life history and circumstance and, as a result, does not offer
clear-cut definitions of illness, wellness, or the distinction between the two. Because
psychoanalysis also does not require strict boundaries between and criteria for disease categories,
it cannot provide techniques for grouping patients into homogeneous populations, nor enable an
outside observer to track the impact of therapy on a patient over time (Zinn and Lasswell's
efforts notwithstanding). The problems of diagnostic unreliability are thus tied to other
conundrums beyond the epistemological, and research-focused psychiatrists like Spitzer were not
the only ones who took issue. While Spitzer and his collaborators were publishing increasingly
forceful critiques of psychiatry and the task force-which officially formed in 1974-began to
coalesce, the rest of the country was undergoing a series of interlinked, top-down reforms. These
changes concerned the bureaucratic and administrative management of mental illness as a public
health issue. Task force members and their associates capitalized on these changes by
transforming DSM-III into a "boundary object" (Star and Griesemer 1989).
Boundary objects-like the map of a state-are simultaneously general and specific,
concrete and abstract. Some component of the object must be stabilized or structurally tenacious,
yet the object also must be loose and plastic enough so that it can be made to speak for different
things, or put to different uses, "reconcile[ing] meaning" across different and even sometimes
contradictory views (Star and Griesemer 1989: 388). As a boundary object, DSM-III was a
mediating interface that drew together a diverse constituency of actors, becoming a salient tool
not only for therapists, but for research investigators, insurance providers, and pharmaceutical
companies alike. The task force drew these constituencies together with their revisions,
coordinating the stabilization of the methods of diagnostic procedures with a stabilization of the
methods of conducting research on diagnostic categories and on mental health care interventions.
They rebuilt a manual that fit into the country's fluctuating health care services infrastructure.
In recounting these top-down changes, I want to suggest that the "machinic" in late
twentieth century psychiatry is not limited to the object of the computer itself but also
encompasses ideas about why things ought to be made computable to begin with. During this
time period, psychiatry found itself squeezed by "consumerist demands," with the grip tightening
year after year (Rosner 2005: 135). The pressure came from regulatory bodies, federal funders,
and private insurance companies, all of which were placing increasing emphasis on statistical
calculations as the most valid form of evidence. Measuring the impact of an intervention in a
quantifiable way made it easy to tie questions of efficacy with questions of cost effectiveness,
and concerns for objectivity with concerns for economy-the same coupling that played out
across the DIAGNO papers, and that continues to drive Computational Psychiatry research
today. Statistical analysis became another means through which the ineffable stuff of mind (here,
psychiatric rupture and the movement toward repair) could be made material, defined this time in
terms of the type and price of care. Human subjectivity and its role in the diagnostic process was
framed as an impediment to the production of measurable therapeutic outcomes, and to the
process of rendering people diagnosed using DSM's categories into subject populations
containing individuals who can be equated with one another under the moniker of their
diagnosis.19 In this way, the so-called empiricism of DSM-III and the neo-Kraepelinians was a
19 With reference to Cronon (1992), Lakoff (2005b) refers to the equalizing capacities of DSM's standardized
diagnostic categories as "diagnostic liquidity."
pragmatic one. Diagnostic categories in DSM-III described statistically real entities, even while
the biological underpinnings of mental illness remained an admittedly unanswered question, and
even while task force members spoke openly about the manual's provisionary status. Thus,
against the standard narrative that depicts diagnosis and psychiatry under psychoanalysis as un-
empirical, in contrast to the neo-Kraepelinians' absolute empiricism encapsulated in DSM-III, I suggest
that the neo-Kraepelinians simply pursued a certain form of empiricism that stuck better than the
empiricism of psychoanalysis, due to its association with calculation and its ability to render
people, experiences, and treatment responses numerically commensurate.
The groundwork for the changes that took place during the revision of the DSM was laid
years prior. Throughout the 1940s and 1950s, scientists in the U.S. and Western Europe began to
discover that certain drugs led to positive outcomes in the management of specific symptoms. In
1949, a scientist found that lithium significantly diminished the symptoms of what was then
called manic depression (a diagnostic category discussed in Chapter 4). Next came the discovery
in 1952 that chlorpromazine diminishes psychotic symptoms, and, five years later, that tricyclic
medications diminish depressive symptoms (Lakoff 2005: 7-8). Previously "untreatable"
patients-otherwise relegated to asylums-that responded well to these medications could leave
spaces of confinement and instead undergo long-term outpatient psychotherapy while living in
the general population. Mental health care professionals thus began to recognize the practical
benefits of defining mental illness discretely and not necessarily in terms of what ailed the
patient but in terms of which interventions seemed to lessen the patient's suffering (Lakoff 2005:
7).
The tight coupling of pathological symptoms with intervention along with the primacy of
quantifiable evidence was formally written into law in the early 1960s. In 1962, Congress passed
a landmark revision to Food and Drug Administration (FDA) policy that impacted all biomedical
research in the United States, with far-reaching implications for psychiatry. According to this
legislation, anyone who wanted to develop, promote, or sell a biomedical intervention would
have to test the safety and efficacy of that intervention in a randomized controlled trial (RCT).20
For psychiatric research at the time, this meant that investigators should test both drugs and
psychotherapy alike using the RCT structure. This decision sent "epistemological convulsions"
(Rosner 2005: 136) throughout psychiatry, markedly transforming the way in which researchers
framed mental illness as an object of study and concern, and forcing those who wished for their
interventions to have any kind of profitable life to define diseases according to the treatments
meant to resolve them.
The 1962 legislation had a recursive effect. As Lakoff puts it, any intervention developed
from that point onward "had to embody the system's model of the relationship between illness
and intervention" (Lakoff 2005: 10). More than that, the FDA legislation insinuated that
statistically calculated evidence held a special epistemological place-it was the only proof of an
intervention's efficacy that the FDA would recognize. This primacy placed on statistical
evidence resonated with the neo-Kraepelinians. In the battle of know-how versus numbers, the
1962 legislation was a win for the "data-oriented approach" of Spitzer and his colleagues, and a
20 The purpose of a randomized controlled trial is to reduce "bias" and produce evidence of an
intervention's impact on a targeted subject population that is as objective as possible. In an RCT, research cohorts are typically split
into two groups: one group receives the actual intervention, while the other group receives a placebo. Investigators
utilize statistical randomization in order to determine which subjects end up in which group-hence, the splitting of
subjects is supposedly free of bias and agnostic. Investigators calculate the statistical validity of the placebo versus
the actual intervention's impact on the targeted symptoms. As Vincanne Adams (2013) and Joe Dumit (2012) have
discussed, the insidiousness of RCTs lies in their perceived, unilateral "objectivity." Although the results of an RCT
bear the moniker of objectivity by way of statistical analysis and the randomization of which subjects receive which
intervention, the results of an RCT can still be tweaked in one way or another, as is the case with the
pharmaceutical corporation-sponsored, cholesterol-lowering drug trials that are the subject of Dumit's ethnography.
loss for the experience-based approach of psychoanalysis.21 Yet while the FDA, like the future task
force members, valued an evidence-based approach, it is not the only possible approach, nor
is it a neutral one. As scholars such as Petryna (2009), Dumit (2012) and Adams (2013) have
suggested, a narrow emphasis on statistical evidence in the context of biomedical research
participates in the peeling away of what constitutes "wellness" or "health" from "anything
experiential" (Dumit 2012: 123), i.e., according to the patient's own estimation of their internal
states (though this experiential knowledge may or may not be influenced by technoscientific
discourses and expectations).22
The RCT structure posed a particular challenge for psychiatry at the time, highlighting
that psychiatric nomenclature is not only a tool for clinical treatment but also for conducting
statistically validated research. In an RCT, the efficacy of an intervention must be "measurable in
terms of efficacy across populations of comparable patients" (Lakoff 2005: 11). Investigators
need a way to build research subject cohorts; they require an "instrument of commensuration"
(Schechter 2014: 30) that enables them to group together not just a collection of individuals, but
a cohesive population with an (allegedly) shared, homogeneous trait. In the absence of a
diagnostic system that could achieve this biopolitical feat, research-oriented psychiatrists like
Spitzer began to formulate rating scales and questionnaires, like the PSS that Spitzer built into
DIAGNO-I, Aaron Beck's Depression Inventory (BDI), and the Hamilton Depression Scale
(HAM-D). They designed these inventories-some of which are still used for psychiatric
screening to this day, and some of which we will encounter elsewhere in the dissertation-in
21 The RCT officially became the gold standard of attesting to the efficacy of a psychiatric intervention in 1980, the
same year that DSM-III was published. That year, Bill S-3209 was introduced into Congress, the so-called Efficacy
Bill, which legally formalized the link between RCTs and the calculation of treatment efficacy (Rosner 2005: 136).
22 Dumit (2012) situates the shift from "experience" to clinical-trial produced, statistical "evidence" twenty years
prior to Spitzer and his colleagues' efforts, namely, in the 1940s, coinciding with the emergence of population-based
mass health and the use of statistical data to stake claims against the tobacco industry and the detrimental health
effects of smoking.
order to translate symptomal experiences into numbers, to reify the various states of being
mentally ill. It was during this time that Spitzer, Endicott, and several other researchers
developed the Research Diagnostic Criteria, a matrix for designing multi-step psychiatric
research (a combination of, among other things, laboratory studies, family studies, population-
level studies) that was yet another test-run of DSM-III (Feighner et al. 1972; Spitzer, Endicott, and
Robins 1978). As scholarship on audits and documentation in quantificatory regimes of evidence
has shown (Strathern 2000; Riles 2006), the numbers that RCTs and inventory scores produce
have a real, material existence because of the kinds of work they can accomplish, in this
instance, because of how they can move a patient through-or prevent them from accessing-the
health care system.
The numerical, evidence-based approach of the RCT proved inviting to yet another
constituency concerned with the statistical calculation of costs, benefits, and value: insurance
agencies. Following the FDA's regulatory changes came slow but consequential changes in both
federal and private health care policy, all of which made research funding contingent on adopting
the RCT structure and, after its publication, DSM-III's rigidly defined diagnostic categories. As
Kate Schechter (2014) describes, up until the mid-1960s, most patients paid for their treatment
out of pocket without relying on insurance coverage. Patient reliance on medical insurance to
cover treatment costs increased steadily following World War II, and in the 1960s, insurance
plans finally began to include coverage for mental health care. Third party insurance companies
like Aetna and Blue Cross endorsed the Federal Employees Health Benefits Program, which
"reimbursed psychiatric care dollar for dollar with other medical treatments" (Schechter 2014:
28). Yet as psychiatric care-still predominantly psychoanalytic-became more economically
accessible, third party payers began to take issue with the length and intensity of treatment that
most analysts required their patients to undergo. Psychoanalysis's "qualitative continua" and
"symbolic mechanisms" fit poorly into "an insurance logic that would allow payment for the
treatment of discrete diseases and discrete episodes" (Schechter 2014: 31). In other words,
psychoanalysis was decidedly un-actuarial.
By the 1970s, the private insurance industry was booming, and the cost of coverage for
psychiatric treatment rose dramatically. More and more, insurance providers deemed therapy-
again, synonymous with psychoanalysis at the time-to be a "financial bottomless pit that would
require potentially uncontrollable resources" (Schechter 2014: 18). If, according to
psychoanalysis, to be human is to exist in a state of psychic disrepair, then there is no real end to
analysis-it is an ongoing, perpetual quest for self-knowledge. To insurance providers, the
interminability and air of mystique surrounding psychoanalysis rendered it suspect, causing them
to call its status as a medical intervention altogether into question. For instance, in 1975, the Vice
President of Blue Cross declared,
Compared to other types of services there is less clarity and uniformity of terminology
concerning mental diagnoses, treatment modalities, and types of facilities providing
care... only the patient and the therapist have direct knowledge of what services were
provided and why (quoted in Schechter 2014: 29).
At the time of the Vice President's statement-a year after the DSM-III task force formed-
Aetna had reduced its coverage to twenty outpatient visits (i.e., sessions with a clinician) per
client (Schechter 2014: 29). The pressure to devise a system that could neatly define illness, cure,
and clear boundaries between different diagnostic categories according to the logic of the
marketplace was on. Insurance companies hungered for a paradigm that could, without friction,
translate its therapeutic procedures into dollars.
A parallel story was unfolding at the federal level. President Carter was well aware of the
rising demand for and cost of health insurance coverage, and to meet this growing need, he
sought to develop a national Health Insurance Program. His pursuit of this program transformed
psychiatric research from yet another angle, this time by influencing the type of research that
governmental funding bodies, like the National Institute of Mental Health (NIMH), would
support. At Carter's directive, in the late 1970s Congress tried to establish standardized "criteria
for reimbursement of medical treatments" and likewise declared that psychoanalysis itself was
the major culprit standing in the way of this endeavor, with its failure to answer "practical and
quantifiable questions" about the match between pathology and treatment type (Rosner 2005:
117, 135). Not unlike Harold Zinn, Congress underscored that psychoanalytic researchers and
therapists were not in the business of capturing and cataloguing tangible proof of their
paradigm's efficacy that might be evaluated by an independent party who was not part of the
patient-therapist encounter.
The Carter administration found a key figure for seeing this change through in Gerald
Klerman, the director of the NIMH's parent institute, the Alcohol, Drug, and Mental Health
Administration (ADAMHA). Klerman was a respected researcher and practitioner, and thus
ideally positioned to mediate between policy and research practices. At the vanguard of both
psychopharmacology and depression research, he had developed his own novel, decidedly un-
psychoanalytic paradigm for treating depression called interpersonal therapy (Klerman et al.
1974). Klerman was also a central DSM-III task force consultant, and he was attuned to the rising
critiques gathering around psychoanalysis. He was well aware of psychoanalysis's limitations
when it came to producing metrics for measuring health and wellness that could prove to an
outside observer, in the rhetorical language of numbers, that psychoanalytic therapy was working
and working well. And in his estimation, science, like insurance coverage and the pharmaceutical
industry, was yet another marketplace. According to Klerman, dependent on funding and
therefore driven by the same capitalist forces, psychiatric research's own "invisible hand...has
not been sufficient to meet public health needs" (quoted in Rosner 2005: 135).
Klerman helped to jumpstart and re-route basic science research in psychiatry, uplifting
the ongoing efforts of DSM-III's task force and denigrating psychoanalysis as he went. He
coordinated and oversaw the first-ever collaborative RCT measuring the efficacy of a
medication-imipramine-against a psychotherapy-Aaron Beck's cognitive behavioral therapy
(CBT), which diverged markedly from psychoanalysis (Rosner 2005: 117).23 His study achieved
many things at once: it demonstrated how to conduct an RCT with a psychotherapy within the
FDA's new guidelines, while also demonstrating the efficacy, quantifiability, and RCT-
compatibility of a non-psychoanalytic therapy. More than that, the study used DSM-III
categories to construct inclusion and exclusion criteria for the subject pool. It used these
categories to demonstrate-and calculate-the extent to which research subjects moved out of
these categories, their symptoms diminishing, through the course of therapy. Although never
written into law, the success of the trial Klerman facilitated signaled that DSM-III categories
were yet another new, gold standard for conducting research. According to the researchers with
whom Rosner conducted oral history interviews, grant reviewers at NIMH favored research
structured around the manual's diagnostic categories and symptom criteria after 1980-it
became clear that funding awards depended on the use of DSM (Rosner 2005: 143). Because the
23 As I have argued elsewhere, CBT contains baked within its techniques and its definitions of cure the neoliberal
logic of the market. The goal of CBT is to train the patient to be a cognitive behavioral therapist themselves-cure
coincides with virtuosic performance of analyzing and evaluating evidence that disputes the validity of negative self-
thought, and then adjusting those thoughts to match with this evidence. Beck knowingly developed his paradigm
with the RCT in mind (Rosner 2005; Rosner 2018)-he wanted patients to become "junior scientists" fluent in the
study (and cybernetic adjustment) of their own selves (Rosner n.d.).
NIMH provided the bulk of funding for psychiatric research, investigators who clung to
psychoanalysis after 1980 found themselves at a serious disadvantage.
THE MACHINE-READABLE EMPIRICISM OF DSM-III
As a result of these interconnected changes, by the time DSM-III made its public debut, it had
immense power, and its influence only grew over time. Multiple sectors of contemporary life in
the U.S. would come to gather around and cut across the manual, from the realm of law and
insurance, to conceptualizations of self and social deviance, to medications and the economies
that form around them, to academic journals, conferences, and entire research institutes
dedicated to a singular diagnostic category listed within its pages. Perhaps most consequential of
all, after DSM-III's publication, psychiatric diagnosis came to function as "a key to the repertoire
of passwords that provides access to the institutional software that manages contemporary
medicine" (Rosenberg 2002: 256). Each diagnostic category in DSM since the third volume
coincides with a numerical code, which is itself associated with different tiers and types of
insurance coverage. To be diagnosed after DSM-III is to be coded-literally and figuratively-as
a certain kind of subject, in need of certain state or private resources. As a mechanism of
bureaucratic legibility, diagnosis after DSM-III is what makes citizen-subjects "machine
readable" (Rosenberg 2002: 257) by the actuarial calculus of the health care system, with
implications for which treatments you receive, and for how long you are to receive them before
you must begin paying for them yourself. Once entered into the system, "the patient is
24 Between 1950 and 1977, according to Rosner (2015), "the federal government spent over $55 million on
approximately 530 psychotherapy research grants" (135). By 1977, federal dollars funded upwards of 75% "of all
large-scale psychotherapy outcome research world wide" (ibid).
necessarily objectified and recreated into a structure of linked pathological concepts and
institutional power" (Rosenberg 2002: 257), a member of a category of humans who all
supposedly share some likeness. It is the DSM's material, infrastructural connections to all these
other sectors of life that fuel its tenacity and sustain its influence.
With all this being the case, it might be fair to say that the DSM-III achieved a
"conquest" of American psychiatry, as Decker contends. But its success and influence did not
follow from the manual's capacity to make psychiatric diagnosis, once and for all, into an
atheoretical process. Instead, its creators fit the manual into a historically specific definition of
what counts as a fact given the political economic backdrop at the time: that which can be
statistically validated and verified by a third party external to the interaction between patient and
clinician. If anything, DSM-III's totalizing capture of psychiatry in the U.S. had more to do with
its situatedness-the way in which its authors (the task force members) recognized and pursued
an intimate fit between what they sought to achieve (uniformly agreed upon definitions of what
mental illness looks like) and the contours of the world as it changed around them.
That is to say, the "empiricism" of DSM-III was not arrived at through close scrutiny
of-or attempts to pin down-the body's organic processes. DSM-III did not resolve the
question of what (or where) mental illnesses are-in the organism or in the psyche-although it
did lay the foundation for increasingly biological framings of mental illness through the primacy
it placed on disease specificity. What DSM-III's tenacity and success showcase is the extent to
which definitions of the empirical-along with the biomedical-coincide with what David Pye
(1968) calls the "workmanship of certainty." Although Pye made use of the workmanship of
certainty primarily to discuss industrialized mass production, I expand on his discussion by
pointing out how the "exactly predetermined" and therefore "certain" (1968: 341) object it
produces also resembles the stabilized, predictable, and replicable outcomes that diagnosis and
research under the umbrella of DSM-III are supposed to produce. Psychiatry's ascension to the
category of biomedicine has less to do with its pursuit of biological processes and material, and
more to do with its ability to standardize its object of study-mental illness-through uniform
methodologies and decision-making processes.25
Hence, Orr (2006) notes that the manual's empiricism is "a strange and elusive" one that
"exhibits curious symptoms of epistemological dizziness and ontological trembling" (240).
While task force members were committed to establishing a common language for describing
symptom manifestation, they never claimed that these criteria were associated with biologically
existent entities. On the one hand, DSM-III was folded into the shifting terrain of the health care
sector and federally funded research programs, with lasting, material impacts for those who live
with and alongside mental illness. On the other hand, to the researchers and practitioners serving
on the DSM-III task force, the 1980 edition was considered a productive starting point, a
placeholder, a temporary fix to tide the discipline over while basic science researchers continued
to crack away at the pressing, unresolved matter of disease etiology. Spitzer himself attested that
all of the categories in DSM-III are "hypotheses to be tested," invitations for further research
rather than finalized conclusions (quoted in Orr 2006: 241). In the context of clinical use, task
force members intended for clinicians to conduct diagnosis by identifying the best fit between
the symptoms the patient presented and the diagnostic criteria listed in the manual. Psychiatry's
new nosology provided a system of close-enough approximations and prototypes rather than
ideal types (Cantor et al. 1980).
25 Pye contrasts the workmanship of certainty with the "workmanship of risk," which coincides with the artisanal
and the handmade. If the workmanship of certainty produces the same, uniform product every time, under the
workmanship of risk, "the quality of the result is not predetermined, but depends on the judgment, dexterity and care
which the maker exercises as he works" (Pye 1968: 344).
Following DSM-III's publication, task force members established an American
Psychiatric Association committee to oversee the testing of DSM's diagnostic categories and
its subsequent editions. Thus, figures like Spitzer, Endicott, and Klerman embraced and
advertised that the manual was "standardizing but also dynamic" (Lakoff 2005: 13), underlining
"rather than obscure[ing] the probabilistic nature of diagnostic categories" (Cantor et al. 1980,
quoted in Orr 2006: 241). Reflecting on his team's work years later, Spitzer paints a picture of a
group of humble and reflexive researchers who shrank from the moniker of "empiricism" for
which they had gained their reputation: "I think we knew that we often...were making up these
criteria because they seemed really appropriate and useful. But there are very few instances
where the actual choice was empirical. And most people don't appreciate that, but that is the
fact" (quoted in Orr 2006: 240; my emphasis).
Thus, we find a divergence from the common tale of DSM's empirically driven conquest of
American psychiatry. Spitzer and the task force were not so confident about the empirics of their
empiricism. Empiricism was more of an ideal, a carrot leading them forward, rather than what
the revision itself ended up achieving. There is a distinction, then, between the tactics used to
revise the manual, and the manual's relationship to other medical fields. As Orr argues,
The notion of [diagnostic] validity starts to float free of any measure of an actual
correlation between the name and the thing (correlations made, for example, in medicine
via the evidence of lesions, bacteria, fractured bones, blocked arteries); instead, validity is
increasingly linked to predictive power, the ability to name not the thing but its future
path. From an objective measure of realness to a pragmatic measure of predictive utility,
the validity of psychiatric diagnoses becomes abstracted from any reality principle at
precisely the moment the diagnostic classification system turns insistently empirical (Orr
2006:242).
While diagnosis after DSM-III may have left room for a shivering, unstable kind of empiricism,
the manual's diagnoses have consequences for those whom it encodes, who live under the weight
of its titles, or must wear their membership to the populations it describes like an albatross. It is
rather the definition of empiricism that transforms during this time period, which DSM-III
reproduces. The machine-readable empiricism that the manual ratifies and laminates is also
linked with power, because it is linked with funding, and because of its association with
uniformity, the stable "world of algorithms" that Spitzer, Endicott, and others chased after. With the
advent of "biopsychiatry," the "biological" is a vanishing category, a floating signifier standing
in for that which is stable, codable, certain, and uniform.
Hence, critics like Vaillant denigrate DSM-III and its authors for attempting to make
psychiatry like orthopedics and like computer science-there is a linkage, a family resemblance,
between the stability, certainty, and uniformity these fields appear capable of achieving. That is,
there is an association between disease specificity and disease computability. Rosenberg's use of
software metaphors to discuss what DSM-III achieved highlights precisely the point I am trying to
make. In DSM-III, the computational/actuarial acts as a placeholder for the biological. It
rhetorically functions as the biological-the biologically real, for now, according to task force
members, until future task force members can better pin down the reality of psychopathology,
perhaps when better science has come along. This is where Computational Psychiatry, and my
informants, enter the scene, leading us into the ethnographic present. In their estimation, better
science has not come along. It is their job to bring it to fruition, to fully reform and reformat
psychiatry once and for all.
CONCLUSION: DSM IS DEAD! LONG LIVE DSM!
In May 2013, weeks before the APA was to publish the fifth edition of DSM (DSM-5), Thomas
Insel announced in the NIMH's "Director's Blog" that the institute would be "re-orienting
research away from DSM categories" toward "research projects that look across current
categories...to begin to develop a better system." For Insel and many others at NIMH and
beyond, "better" meant a classificatory system that categorizes mental illnesses according to
their mechanisms of biopathology, with an emphasis on understanding the kind of neural
circuitry that leads to the cognitive and behavioral symptoms of mental illness. Following Insel's
post, NIMH took a number of steps to ensure that researchers who seek NIMH funding would
move away from using DSM in their studies and instead implement NIMH's own, novel matrix
for designing research hypotheses, called the Research Domain Criteria, or RDoC (Insel and
Gotay 2014:745). Most notably, NIMH enforced an adherence to RDoC through a number of
funding announcements 26, clarifying that NIMH will be favoring research that posits
mechanisms of action over research that uses DSM categories to formulate subject populations
or that investigates DSM-specific diagnostic categories (Insel 2013; Insel 2014).
NIMH graphic illustrating the logic of RDoC: subtypes of disorders or undiscovered disorders due to some shared
biopathological mechanism may occur in populations of people that would typically be grouped in separate
populations according to DSM's traditional diagnostic categories. Insel, Cuthbert and others have proposed that it
would be more viable (especially in terms of treatment development) to group mental illnesses according to these
shared biological features (biotype 1, biotype 2, biotype n). RDoC is a matrix for developing research to arrive at
these biological features without recourse to DSM categories and the "artificial" boundaries they create by, for
example, conducting studies with people who experience cognitive control or sensorimotor reactivity differences,
rather than conducting studies with people who share the diagnosis of "schizophrenia."
26 These claims (that NIMH will not fund research that uses DSM categories or that is aimed at investigating DSM
diagnostic criteria) have been tempered and scaled back since Insel left NIMH for Google Life Sciences in 2017.
Insel and others reasoned that DSM was an outdated tool that had fortified the barrier
between basic and applied research, causing more harm than good and standing in the way of
developing efficacious treatments. To this day, DSM-driven diagnosis cannot guide a clinician in
identifying (if such a thing does indeed exist) the essential, biological core of mental illnesses.
When researchers use DSM to recruit research participants for studies, there is no assurance that
the participants share any kind of biological likeness that might be associated with the disorder.
They share, if anything at all, similarly interpreted pathological behaviors and symptom
expression, or score similarly on psychological inventories like the BDI. If there is no way to
identify the existence of shared bio-etiological traits among patient groups, then the biological
validity of the conclusions this research produces (like conclusions about the efficacy of an
intervention) is indeterminate and shaky. Insel and others in support of NIMH's position have
therefore asserted that DSM is a prime culprit in America's mental health crisis. They argue that
the RDoC project can deliver American psychiatry's long-term desiderata, since it is supposed to
lay the groundwork for the royal road to biopathological mechanisms. The foundational changes
the DSM-III task force members made-putting together a manual that focuses on
phenomenology and reliability while eschewing models of disease etiology-are finally on the
chopping block. Investigators demand that the manual's ontological trembling be held steady and
that its pragmatic empiricism be taken to task once and for all.
NIMH's rejection of DSM and the public unveiling of RDoC caused a stir, and yet, the
waves it sent out across U.S. psychiatry rippled in a familiar pattern. Though DSM-5 and its
immediate predecessors contain no trace of psychoanalysis, innovators and disruptors like Insel
now name the DSM itself as the culprit of psychiatry's epistemological and public health
shortcomings. DSM holds the position psychoanalysis once did-the unresolved bugaboo
preventing psychiatry from achieving its medico-scientific status-and suffers the same
critiques. For instance, Insel and former RDoC project director Bruce Cuthbert underscored in a
2015 Nature article that even though "clinicians rightly pride themselves on their well-honed
observational skills...diagnosis in psychiatry [with DSM], in contrast to most medicine, remains
restricted to subjective symptoms and observable signs" (Insel and Cuthbert 2015:499). The
NIMH's RDoC funding announcements hark back to the rise of RCTs in the 1960s and 1970s, and the
NIMH's eschewing of psychoanalysis. A popular technique in Computational Psychiatry
research is to gather together as large a pool of research subjects as possible who all share
some broadly construed symptom-cognitive processing, auditory hallucination 27-that does not
draw from the language or structure of DSM. The overarching goal of this big data approach is to
strip bias away from research and achieve pure, unmediated, theoretical agnosticism-
buzzwords familiar to the DSM-III task force. And the RDoC approach's focus on broadly
construed symptom phenomenology rather than "theoretically specific," conventionally
recognized diagnostic categories and symptom criteria resonates with the DSM-III task force
members' initial logic for revising DSM.
In many ways, then, Insel and Cuthbert-the primary supporters of RDoC at its
inception-are attempting to finish what the DSM-III task force members started. Their stance
toward psychiatric nomenclature-and their proposed tactics to disrupt the field-may at first
blush seem different. However, I hope to have shown through this journey into psychiatry's
27 During preliminary fieldwork, several research investigators at West Coast-based universities described this as their
best guess for what RDoC-compatible research might look like, since there were no clear guidelines or easily
accessible examples of successfully funded research. With Insel's exit from NIMH, the fervor and mystery
surrounding RDoC have receded, and the promises of its potential to disrupt the old guard of mental health care
research have lost their steam.
coded past that investigators across these two moments in the field's history are fixated on
similar issues. What remains consistent across these two projects-the third revision of DSM and
the rise of Computational Psychiatry, an instantiation of the RDoC project-is the battle between
objectivity and subjectivity and efforts to extricate human judgment and the idiosyncrasies of
theoretical training in psychiatric medicine, with recourse to the machinic and the computational,
figured as foils of the human. Many of the tools developed during this time period-from the
diagnostic inventories, to the social life that the diagnostic categories themselves have taken on
since DSM-III's publication in 1980-were palpably and materially present during my
fieldwork, even in the time of RDoC, and even in the middle of interdisciplinary research
endeavors that fall under the umbrella of Computational Psychiatry.
Although the goalpost for measuring what constitutes adequate empiricism in psychiatric
medicine continues to shift (from that which can be calculated and made statistically evident to
that which is anchored in the body's material substances and evidence of its mechanisms), by
tracking this shift, we can observe the humanistic and the computational being defined
dialectically, against and in tandem with each other. Techniques that can provide "unmediated"
access to the body, or that actors interpret to be capable of translating otherwise ineffable,
internal experiences into reified calculations, occupy the space of the machinic or the
computational. Techniques that fail to produce certain or concretized proof of their efficacy, or
that are anchored in supposedly private, hidden or difficult to access individuated experiences,
occupy the space of the humanistic. The tensions between these two categories, and how they are
worked out and negotiated in the context of psychiatric research, are fertile grounds for exploring
hierarchies of value and their naturalization and refraction into realms of life beyond the
psychiatric and even medico-scientific. These binaries replicate and map onto binaries of gender
(masculine/feminine), which are themselves refracted through the distinction between research-related
tasks that require expertise (the spark of ingenuity and intellect) and tasks that can be completed
through supposedly innately human, automatic, inborn capacities (work that can be performed
"automatically" because it requires no skills).
The symmetry between these two time periods and the productive tension between the
computational and the humanistic play out most legibly in Spitzer and Endicott's early
experiments with computerized diagnosis, and I turn to their papers to conclude. In the final
DIAGNO paper, Spitzer, Endicott and company contend that the computerization of diagnosis is
an achievable goal because "any feature that is capable of explicit verbalization can be precoded"
(Spitzer et al. 1974: 202). In other words, so long as a phenomenon can be articulated-described
verbally or otherwise, recorded, transduced into a more durable form-it can be operationalized.
Homing in on this passage, Orr argues that Spitzer and Endicott's DIAGNO studies are
emblematic of the entire enterprise of DSM-III, which achieved much more than the creation of a
conventionalized, common language with which to describe mental illness. Instead, she asserts
that the task force members were participating in a broader project of refashioning "mental
disorders into patterns of information" (Orr 2006: 244), a project she refers to, following Donna
Haraway (1985), as an "informatics of domination." 28 Orr thus argues that DSM-III laid the
foundations for making the automation of psychiatric judgment both culturally desirable, and
28 Orr calls this an "informatics of diagnosis" (2010: 356), explicitly building on Donna Haraway's (1985) notion of
an "informatics of domination," a concept that Haraway developed to make sense of cybernetics via questions of
power. The larger goal of Orr's essay is to make sense of "cybernetics as a technology of social governance" (Orr
2010: 356), and she argues that the remaking of DSM-III-and psychiatry writ large-after the image of the
computer is a key instantiation of cybernetics as a "governmentality of mentality" (2010: 357). In this way, Orr
argues that DSM-III's algorithmic-style diagnostic reasoning is not just about the advent of biopsychiatry (by way of
disease specificity) but also symptomatic of a larger trend toward an "automated, informatics control of human
mentality" (ibid.).
technically plausible by reformatting the language of U.S. psychiatry in the image of the
computer.
Orr's assertion that DSM-III refashioned mental illnesses into patterns of information
resonates into the present day, even while the existence of Computational Psychiatry itself
bespeaks the shortcomings of DSM-III's refashioning work. That is to say, language remains a
persistent problem in the context of U.S. psychiatry, from the era of Freudian psychoanalysis, to
the neo-Kraepelinians, to renegades and disruptors like Thomas Insel. In the contemporary
moment as in the time of Spitzer and Endicott, we encounter the same, persistent language
ideology: the notion that language is attached to and has the capacity to represent interior states
(like mental illness), yet the speaker can agentively control, modulate, warp, and jam this
connection. The truth of the matter of mental illness, while available through language, is
difficult to decipher, and the speaking subject in psychiatric encounters (whether psychoanalytic
or evidence-based) is fallible and unreliable, perhaps even more so than the listener. Like
projects to pin down the communicative unconscious and the computerization of diagnosis, the
computerization of psychiatric assessment is just as much about downplaying the agency of the
patient as speaking subject, as it is about downplaying the agency of the observing scientist (and
the dehumanization of the administrative labor force working under the scientist).
On this note, I'd like to suggest that the impacts of DSM-III's publication are not quite as
totalizing as Orr claims, in part by (as I have done throughout this chapter) pointing to the
administrative work that props up efforts to fold computerization into psychiatric judgment.
While Orr marks the publication of DSM-III as part of an epochal shift in efforts to govern and
control the terms of what it means to be human, I have shown that the impulse to turn to
machines in order to circumvent the subjective realm of human judgment-especially when it
comes to locating the signs of psychic states through language-can be found years before DSM-III's
publication, with Zinn and Lasswell-as well as years later, with the advent of Computational
Psychiatry. Spitzer and Endicott's assertion-that anything that can be verbalized can be
coded-represents a version of reality in which the potential patient speaks, and the computer
codes their speech as it exits their mouth, leaving out the humans who elicit the speech, record it
or store it somewhere, and manually code it before entering it into a computational system.
Computerization is not a linear process, just as it is not a singular, individualized one. It takes
work, a network of technicians and verbatim laborers, and, as my ethnographic chapters will
show, encounters bumps and roadblocks that are sometimes insurmountable.
Altogether, turning to recent historical and ethnographic accounts of computing, labor,
gender, and race can help to articulate a labor history of psychiatric research, which dulls the
shine of Computational Psychiatry's newness, resisting interpretations that dehistoricize its
tactics and technologies (Irani 2015, 2019; Amrute 2016; Hicks 2017; Rankin 2018). These
histories of computing, labor, and psychiatry are braided together, and form the backdrop of the
three ethnographic chapters that follow this one. The "logic" of computer science might feel
cold, but the labor is always human-this is likewise the case for psychiatry. Early adopters of
computational techniques in psychiatry and contemporary participants in Computational
Psychiatry privilege the figure of the machine for its capacity to reach beyond the human without
accounting for the humans who make computerized interventions possible, including research
subjects. In other words, the humanistic/computational dichotomy papers over the humans who
conduct mechanized, "unskilled" labor, operating from within the slot of the machine.
Thus, as psychiatry becomes more technical, it begins to look more computational from a
labor standpoint as well. In this way, the chapter foregrounds the distinction made by both
Computational Psychiatry and my informants between psychiatric screening as unskilled,
mechanized labor, and diagnosis (and psychotherapy) as expert, human labor.
References
Adams, Vincanne. 2013. "Evidence-Based Global Public Health: Subjects, Profits, Erasures." In
When People Come First: Critical Studies in Global Health. Pp. 54-90. Princeton: Princeton
University Press.
American Psychiatric Association. 1968. Diagnostic and Statistical Manual of Mental Disorders,
Second Edition (DSM-II). Washington: American Psychiatric Publishing.
American Psychiatric Association. 1980. Diagnostic and Statistical Manual of Mental Disorders,
Third Edition (DSM-III). Washington: American Psychiatric Publishing.
American Psychiatric Association. 2013. Diagnostic and Statistical Manual of Mental Disorders,
Fifth Edition (DSM-5). Washington: American Psychiatric Publishing.
Amrute, Sareeta. 2016. Encoding Race, Encoding Class: Indian IT Workers in Berlin. Durham:
Duke University Press.
Bayer, Ronald, and Robert L. Spitzer. 1985. "Neurosis, Psychodynamics, and DSM-III: A
History of the Controversy." Archives of General Psychiatry 42(2): 187-96.
Casey, B.J., Nick Craddock, Bruce N. Cuthbert, Steven E. Hyman, Francis S. Lee and Kerry J.
Ressler. 2013. "DSM-5 and RDoC: progress in psychiatry research?" Nature Reviews
Neuroscience 14:810-14.
Cantor, Nancy Smith, Edward E. Smith, Rita D. French, and Juan Mezzich. 1980. "Psychiatric
diagnosis as prototype characterization." Journal of Abnormal Psychology 89(2): 181-193.
Cronon, William. 1992. Nature's Metropolis: Chicago and the Great West. New York: W.W.
Norton.
Daston, Lorraine. 1992. "Objectivity and the Escape from Perspective." Social Studies of Science
22(4): 597-618.
Daston, Lorraine and Peter Galison. 2007. Objectivity. Cambridge, MA: Zone Books.
Decker, Hannah. 2007. "How Kraepelinian was Kraepelin? How Kraepelinian are the
neo-Kraepelinians? - from Emil Kraepelin to DSM-III." History of Psychiatry 18(3): 337-60.
Decker, Hannah S. 2013. The Making of DSM-III: A Diagnostic Manual's Conquest of American
Psychiatry. New York: Oxford University Press.
Dumit, Joseph. 2012. Drugs for Life: How Pharmaceutical Companies Define our Health.
Durham: Duke University Press.
Edelman, Robert I. 1969. "Intra-therapist Diagnostic Reliability." Journal of Clinical Psychology
25(4): 394-96.
Erickson, Paul, Judy L. Klein, Lorraine Daston, Rebecca Lemov, Thomas Sturm, and Michael D.
Gordin. 2013. How Reason Almost Lost its Mind: The Strange Career of Cold War Rationality.
Chicago: University of Chicago Press.
Feighner, John P. 1979. "Nosology: A Voice for a Systematic Data-Oriented Approach."
American Journal of Psychiatry 136(9): 1173-4.
Feighner, John P. and Eli Robins, Samuel Guze, Robert A. Woodruff, George Winokur, Rodrigo
Munoz. 1972. "Diagnostic Criteria for Use in Psychiatric Research." Archives of General
Psychiatry 26(1): 57-63.
Fox-Keller, Evelyn. 1995. Refiguring Life: Metaphors of Twentieth-century Biology. New York:
Columbia University Press.
Forsythe, Diana. 1993. "Engineering Knowledge: The Construction of Knowledge in Artificial
Intelligence." Social Studies of Science 23(3): 445-477.
Freud, Sigmund. 1958[1912]. "Recommendations to Physicians Practising Psycho-Analysis." In
The Standard Edition of the Complete Works of Sigmund Freud, Volume XII (1911-1913): The
Case of Schreber, Papers on Technique and Other Works. Trans. James Strachey. Pp. 109-120.
London: Hogarth Press and the Institute of Psycho-Analysis.
Freud, Sigmund. 1998. The Interpretation of Dreams. James Strachey, trans. New York: Avon
Books.
Hayden, Cori. 2004. When Nature Goes Public: The Making and Unmaking of Bioprospecting in
Mexico. Princeton: Princeton University Press.
Hayles, Katherine. 1999. How We Became Post-Human: Virtual Bodies in Cybernetics,
Literature, and Informatics. Chicago: University of Chicago Press.
Helmreich, Stefan. 2000. Silicon Second Nature: Culturing Artificial Life in a Digital World.
Compton: University of California Press.
Herzig, Rebecca. 1995. Suffering for Science: Reason and Sacrifice in Modern America.
Piscataway: Rutgers University Press.
Hicks, Marie. 2017. Programmed Inequality: How Britain Discarded Women Technologists and
Lost its Edge in Computing. Cambridge, MA: MIT Press.
Huys, Quentin J.M., Tiago V. Maia, and Michael J. Frank. 2016. "Computational psychiatry as a
bridge from neuroscience to clinical applications." Nature Neuroscience 19(3): 404-413.
Inoue, Miyako. 2011. "Stenography and Ventriloquism in Late Nineteenth Century Japan."
Language & Communication 31(3): 181-190.
Inoue, Miyako. 2018. "Word for Word: Verbatim as Political Technologies." Annual Review of
Anthropology 47:217-32.
Insel, Thomas. 2012. "Research Domain Criteria-RDoC." National Institute of Mental Health.
Director's Blog, March 6. http://www.nimh.nih.gov/about/director/2012/research-domain-criteria-rdoc.shtml,
accessed January 12, 2015.
Insel, Thomas. 2013. "Transforming Diagnosis." National Institute of Mental Health. Director's
Blog, April 29. http://www.nimh.nih.gov/about/director/2013/transforming-diagnosis.shtml,
accessed January 3, 2015.
Insel, Thomas. 2014. "A New Approach to Clinical Trials." National Institute of Mental Health.
Director's Blog, February 27. http://www.nimh.nih.gov/about/director/2014/a-new-approach-to-clinical-trials.shtml,
accessed January 3, 2015.
Insel, Thomas and Bruce N. Cuthbert. 2013. "Research Domain Criteria (RDoC): Toward a New
Classification Framework for Research on Mental Disorders." American Journal of Psychiatry
167(7): 748-751.
Insel, Thomas and Gotay 2014. "National Institute of Mental Health Clinical Trials: New
Opportunities, New Expectations." JAMA Psychiatry 71(7):745-56.
Irani, Lilly. 2015. "The cultural work of microwork." New Media and Society 17(5): 720-739.
Irani, Lilly. 2019. Chasing Innovation: Making Entrepreneurial Citizens in Modern India.
Princeton: Princeton University Press.
Kay, Lily. 2000. Who Wrote the Book of Life? A History of the Genetic Code. Stanford: Stanford
University Press.
Kittler, Friedrich. 1999. Gramophone, Film, Typewriter. Geoffrey Winthrop-Young and Michael
Wutz, trans. Stanford: Stanford University Press.
Kraepelin, Emil. 1921[1919]. Manic Depressive Insanity and Paranoia. R. Mary Barclay, trans.
Edinburgh: E.S. Livingstone.
Lasswell, Harold D. 1935. "Verbal references and physiological changes during the
psychoanalytic interview: a preliminary communication." Psychoanalytic Review 22: 10-24.
Lempert, Michael. 2019. "Fine-Grained Analysis: Talk Therapy, Media, and the Microscopic
Science of the Face-to-Face." Isis 110(1): 24-47.
Lakoff, Andrew. 2005. Pharmaceutical Reason: Knowledge and Value in Global Psychiatry.
Cambridge: Cambridge University Press.
Lakoff, Andrew. 2005b. "Diagnostic Liquidity: Mental Illness and the Global Trade in DNA."
Theory and Society 34(1): 63-92.
Martin, Emily. 2007. Bipolar Expeditions: Mania and Depression in American Culture.
Princeton: Princeton University Press.
Petryna, Adriana. 2009. When Experiments Travel: Clinical Trials and the Global Search for
Human Subjects. Princeton: Princeton University Press.
Pye, David. 2010[1968]. "The Nature and Art of Workmanship." In The Craft Reader. Glenn
Adamson, ed. Pp. 341-53. Oxford: Berg.
Orr, Jackie. 2006. Panic Diaries: A Genealogy of Panic Disorder. Durham: Duke University
Press.
Orr, Jackie. 2010. "Biopsychiatry and the Informatics of Diagnosis." In Biomedicalization:
Technoscience, Health, and Illness in the U.S. Adele E. Clarke et al., eds. Pp. 353-379. Durham:
Duke University Press.
Rankin, Joy. 2018. A People's History of Computing in the United States. Cambridge, MA:
Harvard University Press.
Riles, Annelise. 2006. Documents: Artifacts of Modern Knowledge. Ann Arbor: University of
Michigan Press.
Robins, Eli and Samuel B. Guze. 1970. "Establishment of Diagnostic Validity in Psychiatric
Illness: Its Application to Schizophrenia." American Journal of Psychiatry 126(7): 983-87.
Rosenberg, Charles. 2002. "The Tyranny of Diagnosis: Specific Entities and Individual
Experience." Milbank Quarterly 80: 237-60.
Rosner, Rachael I. n.d. In Beck's Basement: Aaron T Beck and the Cognitive Revolution in
American Psychotherapy. Unpublished manuscript.
Rosner, Rachael I. 2018. "Manualizing psychotherapy: Aaron T. Beck and the origins of
Cognitive Therapy of Depression." European Journal of Psychotherapy and Counselling 20(1):
25-47.
Rosner, Rachael I. 2005. "Psychotherapy Research and the National Institute of Mental Health,
1948-1980." In Psychology and the National Institute of Mental Health: A Historical Analysis of
Science, Practice, and Policy. Wade E. Pickren and Stanley F. Schneider, eds. Pp. 113-150.
District of Columbia: American Psychological Association.
Sanders, James L. 2011. "A Distinct Language and a Historic Pendulum: The Evolution of the
Diagnostic and Statistical Manual of Mental Disorders." Archives of Psychiatric Nursing 25(6):
394-403.
Schaffer, Simon. 1994. "Babbage's Intelligence: Calculating Engines and the Factory System."
Critical Inquiry 21(1): 203-22.
Schechter, Kate. 2014. Illusions of a Future: Psychoanalysis and the Biopolitics of Desire.
Durham: Duke University Press.
Semel, Beth. 2014. "Tracking the self, installing expertise: Cognitive-Behavioral Therapy and
the auto-regulating subject." Paper presented at the Annual Meeting of the Society for Social
Studies of Science in conjunction with Sociedad Latinoamericana de Estudios Sociales de la
Ciencia y la Tecnologia, August 21, Buenos Aires, Argentina.
Semel, Beth. 2013. Culture all the way down? Interpreting "Culture" and Imagining
Competence in a Cross-Cultural Psychology Class. Master's Thesis, Brandeis University.
Soto Laveaga, Gabriela. 2009. Jungle Laboratories: Mexican Peasants, National Projects, and
the Making of the Pill. Durham: Duke University Press.
Spitzer, Robert L. and Jean Endicott, Eli Robins. 1978. "Research Diagnostic Criteria: Rationale
and Reliability." Archives of General Psychiatry 35(6): 773-82.
Spitzer, Robert L. and Michael Sheehy. 1976. "DSM III: A Classification System in
Development." Psychiatric Annals 6(9): 102-9.
Spitzer, Robert L., Jean Endicott, Jacob Cohen, and Joseph Fleiss. 1974. "Constraints on the
validity of computer diagnosis." Archives of General Psychiatry 31(2): 197-203.
Spitzer, Robert L. and Paul T. Wilson. 1968. "An Introduction to the American Psychiatric
Association's New Diagnostic Nomenclature for New York State Department of Mental Hygiene
Personnel." Psychiatric Quarterly 42(3): 487-503.
Spitzer, Robert L. 2001. "Values and Assumptions in the Development of DSM-III and
DSM-III-R: An Insider's Perspective and a Belated Response to Sadler, Hulgus, and Agich's 'On
Values in Recent American Psychiatric Classification.'" Journal of Nervous and Mental Disease
189(6): 351-9.
Star, Susan Leigh, and James R. Griesemer. 1989. "Institutional Ecology, 'Translations' and
Boundary Objects: Amateurs and Professionals in Berkeley's Museum of Vertebrate Zoology,
1907-39." Social Studies of Science 19(3): 387-420.
Star, Susan Leigh. 1991. "The Sociology of the Invisible: The Primacy of Work in the Writing of
Anselm Strauss." In Social Organization and Social Process: Essays in Honor of Anselm
Strauss. D. Maines, ed. Pp. 265-283. New York: Aldine De Gruyter.
Stevens, Hallam. 2013. Life Out of Sequence: A Data-Driven History of Bioinformatics.
Chicago: University of Chicago Press.
Strathern, Marilyn, ed. 2000. Audit Cultures: Anthropological studies in accountability, ethics
and the academy. London: Routledge.
Turnbull, David. 2000. Masons, Tricksters and Cartographers. London: Routledge.
Wilson, Elizabeth. 2010. Affect and Artificial Intelligence. Seattle: University of Washington
Press.
Young, Allan. 1995. The Harmony of Illusions: Inventing Post-Traumatic Stress Disorder.
Princeton: Princeton University Press.
Chapter 2: Talking Heads: Brains, Bodies, and Vocal Biomarkers
"Some day we shall know how to validate the saying of the old physician which is on the title-page
of this book: 'From him who has eyes to see and ears to hear no mortal can hide his secret;
he whose lips are silent chatters with his fingertips and betrays himself through all his pores'"
(Lasswell 1930: 239)
Imagine yourself thoroughly packed into a narrow, white tube that surrounds your whole body
as you lie horizontal on a thin pallet, tucked in with a white blanket, your face covered by what
appears to be a white motorcycle helmet with goggles that wrap around the back of your head to
your neck. Yellowing squares of mattress foam along each of your ears and a folded up
pillowcase at the base of the helmet hold your head firmly in place. Your ears are plugged with
expensive, noise canceling ear buds, and along the length of your legs and torso run a bundle of
wires connected to devices that rest on your chest: two boxes with red, green, blue, and yellow
buttons on them that you will eventually be directed to press, and an oblong ball that you've
been directed to squeeze in case something goes wrong. You are inside a functional magnetic
resonance imaging (fMRI) machine, about to begin your first brain scan.
I narrate this second-person ethnographic conceit in my head in an effort to remain calm
as I settle into the scanner's constricted passageway. I imagine what kind of descriptive turns of
phrase it would take to pull my readers into this cold, tiny space with me, in part as a way to
remind myself that I will not be in here forever. Even though I have watched and assisted in
more scans than I can keep track of over the last three months, this is my first time in the belly of
the beast. And even though it never arrives, I am anticipating the sudden onset of claustrophobia
that I have heard grips some people once they are inside the scanner-research subjects, medical
patients, my twin sister-even if they have never before feared small spaces.
My move to the ethnographic ur-voice-a remixing of Malinowski's own mythical
fabrication of what it is like to conduct fieldwork-is not unique. Anthropologists before me
who have studied cognitive neuroscience labs in the United States and Europe have written
similar passages (see especially Joyce 2008; Langlitz 2012) attempting to capture this strange yet
biomedically mundane experience while underlining that technoscience is just as
valid an object of ethnographic investigation as, for instance, the gardening practices of the
Trobrianders. Moreover, I am not crafting this naive subject position-for whom everything
is strange-for my own benefit alone, to soothe my anxious nerves. I am trying to help my
informants, a group of researchers working within a cognitive neuroscience lab at East Coast
University (ECU), test-run their experimental set-up, giving them feedback from the first-person
perspective of a research subject entering the scanner for the first time and knowing nothing
about the ins and outs of their research project, its longer history in relation to the history of U.S.
psychiatry, or how its scope has shifted over time.
As a research assistant on the team with no technical or academic training in
neuroscience, this is one of the few research responsibilities that I can actually assist with. Years
prior to my scanning debut, I had met with the head of the entire cognitive neuroscience lab to
learn more about his ongoing work with the lab's lead research scientist, Sushant, to predict
which patients diagnosed with social anxiety disorder would respond well to cognitive
behavioral therapy based on fMRI scans. It was in this meeting that I learned of Sushant's
collective of researchers and their ambitious project. They were looking for "vocal biomarkers"
of depression: micro-level, acoustic features of speech that might be indicators of the presence or
onset of depression and that might also be helpful in shedding light on the neurobiological
underpinnings of depression.
A hybrid of "biological" and "marker," biomarkers are a "broad subcategory of medical
signs-that is, objective indications of medical states observed from outside the patient-which
can be measured accurately and reproducibly" (Strimbu and Tavel 2010: 463). An ideal-typic
biomarker would be a gene that codes for an enzyme, indicating the action of some disease
mechanism when found in a sample of blood. For those invested in Computational Psychiatry
and in building a diagnostic system that is anchored in biology and moves beyond DSM,
biomarkers are foundational to translating mental illnesses into decontextualized diseases,
offering an entryway into what would be (in the context of Euro-American conceptualizations of
the self and body) otherwise interior, private, and inaccessible phenomena.
The first time I heard the term "vocal biomarkers," it sounded like a riddle, especially
given the normative, Euro-American language ideologies of linguistic transparency that circulate
in mental health care institutions. As Summerson Carr (2010) describes, mental health
counselors in the U.S. tend to frame mental illness as a kind of semiotic detritus that clogs up the
channel between spoken utterances and a speaker's inner self. Undergoing treatment and
achieving psychological health corresponds with clearing up this passageway, enabling "honest"
and "authentic" talk: speech that is referentially transparent and in direct correspondence with
the speaker's intentions. The notion of a vocal biomarker presses tension into this model, for it
insinuates that a speaker has no control over the expression of interior states, regardless of how
they modulate their speech-self relationship (i.e., regardless of their intention to speak
"honestly"). In theory, vocal biomarkers flow freely in streams of speech irrespective of a
speaker's intentions to express or conceal them. If they can be pinned down and identified, they
promise transparent, unmediated access to interior states. At the same time, vocal biomarkers
will remain opaque and inaccessible to speakers and listeners alike absent the proper
technological re-mediation.
I wondered: what media ideologies (Gershon 2010)-or on-the-ground ideas about the
medium of fMRI, audio recording, and the capacity of these techniques to capture something
about brains and speech-supported the research, rendering the study both socioculturally
desirable and technically plausible? If the concept of "vocal biomarkers" insinuates that mental
illness has telltale sounds, then how might these sounds be made audible to the researchers, or to
people with the power to move patients through the health care system? What are the stakes of
non-psychiatric personnel using the tools of psychiatry-like the psychological inventories
developed during the 1970s and 1980s-to define mental illness, categorizing humans (and their
brains and speech sounds) as either depressed or not depressed? In exchange for allowing me to
conduct participant-observation alongside them, taking on the role of their "meta-scientist" (as
Sushant, the team's Principal Investigator (PI), called me when introducing me to others), the team
had to be sure that I was "pulling my own weight." I had to contribute to the research in some
useful manner. This includes playing the role of a research subject whenever the team requests.
By doing so, I am helping them pin down where errors in their data entered into the workflow
and determine if they should augment or amend the directions they give research subjects.
The ear buds I'm wearing are designed to protect research subjects' ears from what will
be the painfully loud sound of the scanner, which will reach heights of 125 decibels once the
scan begins, the sonic equivalent of popping a balloon in front of your ear. The ear buds also
carry into research subjects' ears the gentle voice of Victor, an even-keeled and often
fashionably dressed third-year PhD student on the team. Playing the role of himself as
researcher, Victor instructs me to lie as still as possible, explaining that the first scan will soon
begin and will last between ten and fifteen minutes.
Though I cannot see him from my position inside the tube, I know that he watches me
through the soundproofed window of an adjacent room, called the "control room," surrounded by
stacks of blank CDs, errant pens and paper clips, a broken analog clock, three desktop
computers, and three laptops. This is the vantage point from which I typically observe the scans,
keeping my eyes on the research subject's feet sticking out of the scanner. If I move at all inside
the scanner and Victor can see, he must take note of the time and nature of my movements on a
Google form that he keeps open on one of the laptops for the duration of the experiment (which
will last three hours at minimum). Bodily movements during an fMRI scan subtly change the
position of the research subject's head in the motorcycle helmet apparatus. This "blurs" the
images of the subject's brain activity that the scan is designed to represent, rendering the data
unusable for later interpretation. Thus, it is crucial to monitor the subject's body, and so Victor
has assistance from Santiago, the team's technical assistant who has recently completed his B.A.
in biology but is a programmer and gamer at heart. When I help with scans, I work with Santiago
while Victor remains in his office busy with tasks that the team consider to be higher priority,
like data analysis, reserved for team members who have more technical expertise and have been
working on the team for longer.
I try to relax and remain still, closing my eyes as I'm bathed in the rhythmic grinding and
pounding and clanking of the scanner, much louder than how I usually hear it from the control
room. The sounds-which the ear buds slightly diminish but which I can feel through the
scanner bed-recall a combination of fire alarms going off, concrete being jackhammered, metal
being sawed in half, and my teeth being drilled at the dentist. This uncanny cacophony has
inspired artists like Arnold Dreyblatt to make an entire album of "fMRI music," with each track
featuring a scan of a different part of Dreyblatt's body-flesh as both medium and message
(Zipoyrn 2013).
Although I keep my body as still as possible, I know from the introductory-level
neuroscience courses I have been muddling through as part of my fieldwork that on a molecular
level, I'm really quite busy. The giant magnet embedded in the scanner's tube, sitting somewhere
above my head, creates a powerful, static magnetic field that is sixty thousand times the strength
of the Earth's own magnetic field. As a red rug on the threshold dividing the scanner room from
the control room warns, this magnet is always on. When I am lying still in the scanner, the magnet
tilts certain molecules in my brain that exhibit a physical property called "spin," creating a net
magnetization among them and lining all of the molecules up in the same direction. A pulse of
radio frequency is sent through the magnet tube, causing it to wrench in place-the source of the
familiar, menacing sounds that encircle me-and once again disturbing the position of my
brain's molecules. As the pulse of radio frequency creates a secondary, temporary and much less
powerful magnetic field, gradually, the molecules realign with the original magnetic field. Once
they realign (i.e., as a magnetic "current" moves), my brain releases a second electrical current
that can be measured externally. The job of the helmet, called the "coil," is to pick up and track
this second current.
This is why even the smallest of movements of a subject's head from the cradle of the
coil has such dire consequences for the resulting data. Hence, the foam squares lining my ears
are both to protect them from damage and to ensure that I remain still. The way the molecules in
my brain realign with the first magnetic field (from the scanner itself) tells the monitoring
scientists-like Victor, Santiago, and the rest of the team-something about the brains of
members of the experimental cohort: the brains of people who identify as having depression and
who meet the study's other inclusion criteria, versus those who do not identify as having
depression.
Victor's voice returns to tell me that the first scan is complete. He and Santiago now have
a dynamic model of my brain "at rest" that they will use as a baseline to track my "active"
neuronal responses to stimuli soon to be presented to me. The ensuing scans will be "functional,"
capitalizing on the difference between the magnetic properties of oxygenated and deoxygenated
blood to track the ebb and flow of blood29 to different regions of my brain. In theory, the
concentration of blood in a brain region means that the region is active, so Victor and his
colleagues have designed stimuli that supposedly animate the areas of my brain that coordinate
and control the production of speech, the bodily activity that they hypothesize is impacted by the
onset of depression.
Fixed atop the coil is a mirror tilted camera obscura-style and aimed at a projector screen
that Victor explains will display directions for the first of seven tasks I will complete. The ceiling
of the tube is a mere three-and-a-half inches away from the end of my nose but the positioning of
the mirror gives the illusion that the opening of the tube, which lies about two feet horizontally
beyond the crown of my head, sits above me. The projector screen, corresponding with the
screen of a laptop that Santiago controls, displays a blank, black slide. Via the mirror, I gaze up
at the slide through the opening of the tube as if in an observatory, looking up at a night sky.
Instead of stars guiding me, I am met with single-word prompts, repeating in random order: slow,
rapid, normal. These are the speeds at which I am supposed to produce the sounds pa-ta-ka. The
"rate word" appears as the scanner bleeps, and when a green cross appears under the rate word,
29 Specifically, it measures the BOLD (Blood Oxygenation Level Dependent) signal.
the scanner falls silent-this is my cue to speak. I know from talking with Victor, who, like
Sushant, is trained in phonetics30, that the repetition of pa-ta-ka is a fairly standard task in the
speech-language pathology world. Producing the /p/, /t/ and /k/ sounds spans the full spectrum of
possible tongue and lip positions for consonant sounds in Standard American English, and so
scientists who study speech consider it to be a good measure for testing a person's ability to
coordinate the muscles associated with speech (also known as the articulators) at varying speeds.
For the uninitiated, I conjecture, this task must proceed like the performance of a Dadaist poem:
Paaa-taa-kaaa-paaa-taaa-kaaa-paaa-taaa-kaaa.
Patakapatakapatakapataka.
Pa ta ka pa ta ka pa ta ka.
After having spent months on the other side of the sound-proofed window, I also know
that, from time to time as I speak, Victor, Santiago, and whoever else is in the control room
listen in on me conducting the task, though they do not tell me when they are listening, or
"checking in" as they call it. Just as they must closely watch the research subject's body in the
scanner to make note of any movements the subject makes and then ask them not to move again, team
members must check in on the subjects as they speak to confirm that they are following
directions. Are they speaking loudly enough so that the microphone will pick up their voices, and
so their speech will be properly audio recorded for later analysis? Are they indeed saying "pa ta
ka," and not some other collection of sounds? Does the subject's interpretation of the rate words
(slow, rapid, normal) align with what the team agrees is slow, rapid, or normal speech? And if
the subject is doing something "wrong," how should team members like Victor correct them?
I can picture the process easily: with the push of a button on a speaker, Santiago fills the
control room with the sounds of my voice saying pa ta ka, captured by a small microphone
30 A sub-field of linguistics focused on the study and classification of speech sounds.
dangling just above my chin. The microphone had been fed through the perforated bottom of a
coffee cup taped to the side of the head coil. After switching off the speaker, careful to time it
with the scanner pulses so that they do not transmit the blaring sound of the scanner into the
control room, team members praise the research subject among themselves for a job well done.
They recognize that the tasks are confusing, that the directions are ornate, and that errors are
common. This is why they have research subjects run through each of the seven verbal tasks two
times. If the subject is making what they deem to be an error, they might brainstorm on how to
prompt the subject to conduct the task differently for the second round. Or, despite their best
efforts to be courteous to research subjects, they might start laughing at them.
I admit that I have laughed at research subjects, because sometimes, they make funny
sounds, and it can be very boring inside the sunless control room, where researchers grow sleepy
as the hours it takes to conduct the scan crawl on and on, and as subject after subject cycles in
and out of the scanner while team members begin to lose track of the time of day. Pretending to
be a research subject-"piloting" a scan-puts team members in a position to be laughed at by
their peers. Victor and Santiago may be having a chuckle at my pa ta ka's, and we have chuckled
at theirs. Being the subject of laughter becomes an experience that researcher and participant
share. This can make researchers think twice about snickering at the sound of other people's
voices.
Nevertheless, most research subjects-who are recruited through either Craigslist or local
and university-wide listservs-have never been members of a cognitive neuroscience lab or any
other kind of research lab, and so the scene in the control room is even more foreign and out-of-
reach than the meaning of the sounds and the speech they are instructed to utter inside the
scanner. And, as the hours advance, this research subject pilot finds it harder and harder to
imagine being anywhere else but inside the scanner. Being in here makes me feel extremely-
almost mystically-present. The scanner bed vibrates with each beep and clang, and I
concentrate on the noises, synchronizing my breathing with them. They resonate through my
chest, making me feel hollow inside. The ear buds muffle the sound of my own voice, and so it
seems as if my speech is not my own-as if it originates from somewhere outside my head, as if
I am barely making any audible noise at all. And yet, lying motionless in the scanner bed makes
the physical phenomena of speaking-the focus of the team's study-feel all the more
pronounced: the opening and closing of my lips, the warmth of my breath, the rise and fall of my
sternum, the tip of my tongue tapping the ridge of flesh behind my teeth, the up and down dip of
my larynx...
The ethnographer's brain "at rest."
USING THE VOICE TO UNDERSTAND THE MIND
ECU is the home base not only for Victor, Santiago, Sushant, and the two other team members,
but also for informants across my other fieldsites. PIs from Midwestern University and West
Coast University had passed through ECU as they advanced in their careers, collaborating-and
even training-with Ted, the second PI of the vocal biomarker group alongside Sushant. Thus,
despite the fact that the teams in the Midwest and on the West Coast were less concerned with
studying the brain or even explicitly studying human biology, when conducting research
alongside them and in their own meetings and presentations, I sensed the presence of the ECU
team's logic and methods. The ECU team's approach and the theories they were committed to
made up the epistemological backbone of the other teams' respective studies. To better
understand Sushant and Ted's vocal biomarker study is therefore to grasp something
fundamental about the two other projects this dissertation follows, particularly in terms of how
they all conceptualize the connection between acoustic qualities of speech and interior states, and
the means through which to grasp hold of these connections.
While the other teams were intent on producing a technological prototype that could aid
in psychiatric screening by detecting mental illness in acoustic qualities of the voice, the ECU
team's ambitions were humbler. Their goal was to publish papers responding to a pair of
interlinked research questions. Are there connections between qualities of the voice and changes
in the brain that occur in tandem with the disease state identified in DSM-IV as "major
depressive disorder" (MDD), colloquially known as "depression"? Which vocal qualities suggest
the presence or onset of depression? The ECU team's project itself is part of a longer legacy of
multidisciplinary inquiry into the relationship between depression and speech sounds, a corpus of
research that spans across neuroscience, psychiatry, and psychology (Greden, Albala, and
Smokler 1981; Godfrey and Knight 1984; Breznitz 1992; Flint et al. 1993; Alpert, Pouget, and
Silva 2001; Cannizzaro et al. 2004) as well as communication, computer science, and
engineering (Darby and Hollien 1977; Hollien 1980; Darby, Simmons, and Berger 1984; France
et al. 2000; Moore et al. 2003; Ozdas et al. 2004; Low et al. 2010; Cummins et al. 2011; Goechke
2011; Schuller et al. 2013; Cummins et al. 2015).31 This research has established the basic premise
on which the ECU team's study rests, a premise that echoed, however faintly, in the research
questions of the other teams: the sounds of speech, like all other sounds, have formal, physical
properties. Researchers can mathematically analyze and then reverse engineer these formal
features in order to learn more about the nature of the source that produced them: the
coordination of the articulators, which specific regions of the brain (the cerebellum, the cerebral
cortex, and the basal ganglia) control.32 According to Sushant, to study speech as a motor control
issue is to study speech at its most basic-and therefore universal-level, for all (neurotypical)
speakers, regardless of the language they are speaking and the affective or sociocultural intent of
their utterances, use the same neuronal pathways to control the production of speech sounds.
Following the precedents and conventions set by the researchers before them, the ECU
team aims to use the sounds of speech to explore the brain. Ralph, an advanced graduate student
on the team, would say in his elevator pitch of the study that they were "using the voice to
understand the mind." Rather than attend to speech in terms of the meaning of what the speaker
says, using the voice to understand the mind entails treating speech as a neurobiologically
indexical sign that conveys something about what a depressed person's brain does.
Overshadowing this whole endeavor is the hegemonic figure of "brainhood" or "cerebral
31 For an in-depth review of this literature, see Cummins et al. (2015).
32 Guenther's Neural Control of Speech (2016) offers one of the most comprehensive and unified treatises
describing this approach to studying speech.
subjectivity," the notion that all human behavior and experience is governed by and can be
distilled down to brain activity (Vidal 2009; Rose and Abi-Rached 2013; Vidal and Ortega
2017).
Despite the universalist underpinnings of the research questions and the scholarly legacy
on which they rest, and despite the vocal biomarker team's pursuit of speech at its supposedly
bedrock foundation (the brain), in their everyday work, researchers enacted and
described to me a much more complicated relationship with both the scale of their study and the
claims they had designed their study to produce. On the one hand, Ralph's catchphrase-using
the voice to understand the mind-gives the impression that he and his colleagues believed that
vocal qualities have always already been linked to changes in the brain, and these qualities are
simply waiting to be found and described. On the other hand, Ralph and his colleagues were
reflexive about, and aware of, the complexities, complications, and limitations that shaped the facts
they could produce about the brain, the voice, and depression.
My informants' day-to-day lives revolved around attending to the "experimental system"
of their study, or the "local, technical, instrumental, social, and epistemic" aspects of their
experimental set-up (Rheinberger 1997: 238). The fMRI scanner, computer programs,
microphones, headphones, buttons, wires and audio recording software the researchers depended
on were unreliable and malfunctioned in frustrating, unpredictable ways. More than that, using
the voice to understand the mind required tinkering with what Jill Morawski (2015) calls "the
experimenter-subject system." Human research subjects and experimenters are in a social
relationship, and the quality of their data-and by extension, the entirety of their study-depended
on the management of this relationship, on ordering the body and the speech of the
subject. Institutional contexts shape the experimenter-subject system, like the fact that ECU
could pay research subjects but Sushant and his team could neither diagnose them nor treat them.
But because of the study's emphasis on the speech system, the relationship between
experimenter and subject revolved most closely around complex meta-linguistic interactions.
The vocal biomarker team aimed to build up an experimental context and stimuli that
kept the particular individuality of the research subject and the sociocultural dimensions of
language and speaking at bay, treating the research subject as a body that makes sound-as a
medium transmitting brain and speech data. At the same time, speaking through the microphone
in the control room and into the ear buds lodged in the subject's ear, the researcher conducting
the scan had to constantly describe, clarify, and demonstrate for the subject a set of highly
specific definitions of qualities like "pitch," "volume," and "speed." In turn, the research subject
must produce speech sounds and sentences that align with the team's definitions of these
qualities. This places team members in a contradictory position. The overarching goal of the
study was to peel away meaning and culture to arrive at the fleshy, fundamental form of speech,
but team members must rely on-and constantly contend with-these very same components of
language in order to collect the data they needed.
Researchers had to wrangle the whole assemblage of technologies, bodies, and sounds all
in order to achieve a category of speech referred to in the scholarship of their colleagues and
predecessors as "natural speech." In cognitive neuroscience and speech studies, the difference
between "natural" and "unnatural" speech pivots on the degree to which researchers manipulate
research subjects' bodies and affect. Speech is experimentally "unnatural," for example, if the
researcher uses a device to perturb the research subject's lip or hold their jaw in place while
asking them to produce speech. Speech is also "unnatural" if the researcher requests that the
subjects vocally evoke a specific emotion, asking the subject to speak angrily, speak happily, or
sadly. Relatively speaking, then, the vocal biomarker team studied "natural" speech. Their goal
was to ensure that the subject produced speech as they "actually" would, as if they were speaking
in any other situation out in the world, outside of the scanner, outside of the lab. The notion of
natural speech complements the media ideology of the vocal biomarker (which grants immediate,
transparent access to the body through language). To facilitate experimentally natural speech is
to encourage a reflexive, passive reaction rather than an active performance.
In this way, although natural and unnatural speech are actors' categories with context-specific
definitions, my informants' struggles with maintaining the naturalness of speech offer an
intervention into linguistics as a scientific enterprise. For the vocal biomarker team, the
distinction between "natural" and "unnatural," like that between the "biological" and the "social,"
was always threatening to dissolve or collapse. Researchers had to work to keep them separate. Their
struggles to maintain the experimental and experimenter-subject system trouble the "construction
of language as a natural object" within linguistics, and the positioning of linguistics against
sociolinguistics that treats the sociocultural components of language as variants on a norm rather
than essential features (Eckert 2003: 393).
A BIG, BEAUTIFUL BUILDING
ECU's Neuroscience Department is located in a big, beautiful building that was built with a
chunk of funds from a massive philanthropic gift. The building's sleek and impassive exterior-a
mixture of marble and floor-to-ceiling green glass-contrasts with the abandoned alleyways
around which the building was constructed. The atrium at the entrance of the building, which
offers green lounge chairs and round tables for students to gather at in between their classes,
draws one's eyes upward six stories to glass-paneled ceilings. Even during the frigid east coast
winter, the glass concentrates sunlight onto the tables below, and by midday students move to the
corners of the atrium to avoid overheating and to keep the glare off their laptop screens.
At the start of my fieldwork, I spent most of my time in this place. I would sit at one of
the tables in between assisting with scans and attending introductory neuroscience and acoustic
phonetics courses, lectures, and workshops, watching the ebb and flow of professors, students,
researchers, and tourists, listening to impassioned ping pong and foosball matches going on up
on the third floor, or to the swell of voices whenever a lab offered free ice cream and cookies on
the fifth floor. I kept my laptop open but would peer over its edge, ever hopeful to catch the
eyes of someone I knew passing through the atrium so that I could strike up a conversation and
try to learn more about what was happening to the brain and audio data I helped to gather with
Santiago in the control room a floor below.
When the semester began, I had requested access to a desk in one of the offices that team
members shared up on the fourth floor of this building, hoping to get closer to and learn more
about what I imagined was the real action: data analysis, which I also imagined taking place on
individual team members' computers. While the lab's administrative assistant had agreed to find
an empty desk for me, it took a month with no communication until she assigned me a space. I
would later learn that space in the lab offices was limited and that the allocation of desks to lab
members was a charged and sensitive subject matter wrapped up in issues of fairness and
efficiency. Desk access was a marker of belonging and authority. It signaled a move inward,
away from the social periphery of the group. While Sushant was in support of my presence, I did
not fit the demographic bill of a research assistant, a position typically meted out to hardworking or
recently minted undergraduates with expertise in programming or psychological
screening, or at least a degree in neuroscience. In arranging for me to have a desk, the admin had
to wait for a lab member to graduate or move to a different university for space to free up. Or
else, she would have to shift a current team member into a new space that was less close to their
core collaborators, sometimes even two floors away. Giving me a desk meant foreclosing on
another team member's potential opportunity to have a desk. In this banal calculus, status, rank,
and perceived value of a researcher to the rest of the lab were laid bare.
Despite the weeks of anticipation and the social significance of receiving a desk, when
my desk request came through, it did not bring the sense of satisfaction I had hoped for.
Daydreaming over my laptop down in the atrium, I had anticipated that access to a desk would
reveal the secrets of what it means to look for vocal biomarkers, the very work of parsing
through the brain pictures and the research subjects' audio recorded speech to find...something.
And yet, the space assigned to me felt ever more adjacent to some main action that remained
concealed from me still. My office was far from Sushant, Ralph, and Victor, who all sat together.
The only team member in my office was Santiago, with whom I spent most of my time anyway,
conducting brain scans. I only saw the other team members during our weekly meetings, if they
came to check on Santiago and me in the control room or if we needed their assistance, during
events and conferences, or in passing at the water cooler and around the ECU campus.
In the days following my desk success, my mind would wander again while in the control
room with Santiago. Sushant had agreed to, and even encouraged, my participant observation,
and I finally had someplace to sit on the fourth floor, but I felt as if my attempt to gain access
to what really mattered in the search for vocal biomarkers had failed. I wondered if I would ever
be able to observe and assist in some activity that felt more exciting, more complex, less
monotonous, and that would demonstrate for me how the team transformed the brain and audio
data into research findings about the voice and depression.
The daily activities I performed surrounded data collection, which required interacting
with research subjects. In addition to piloting, I would observe the process of determining
whether or not a subject was eligible to participate in the study, and observe Santiago running
through the team's informed consent protocol with subjects who met the team's criteria. I would
also do things that felt custodial, bordering on the domestic. I would cover the mattress foams
with fresh, sterile, blue hairnets. I would dress the scanner mattress with a clean bed sheet for
each research subject. I would help research subjects steady themselves onto the mattress,
handing them the earbuds, the button boxes and emergency squeeze ball and arranging the wires
across their abdomen and chest. I would ask them if they would like a blanket or a support pillow
for their legs and clip the coil helmet in place above their brow. My fieldnotes contain pages and
pages describing this sequence, which I began to shorthand as "tucking the subject into the bed."
After tucking them in, I would join Santiago to troubleshoot a seemingly endless cascade of
problems: with the various scanning computers, with the study's microphones and speakers, and
with research subjects failing to respond to or understand Santiago's directions. At the end of
each scan, I would place the used sheets, often damp with the subject's sweat, in a plastic
laundry basket, and discard the hairnets in a plastic trash bin. All of these things struck me as
uninteresting, nonessential busy work. After all, how essential could they be to the research
project, if anyone (someone with no skills, training, or experience, like me) could do them?
Writing on the challenges of conducting ethnographic research with the makers and
stewards of technological systems, Seaver (2017) asserts that access is "not a precondition for all
ethnographic knowledge" or a "perimeter around legitimate fieldwork" (7). Instead, the
ethnographer has much insight to gain by attending to access itself "as a kind of texture" (ibid.).
This scavenging method involves triangulating what is known with what remains unknowable
and out of reach while reading gaps in information and doors that stay closed not as empty
spaces or barriers but as ports of ethnographic inquiry. My understanding of the nature of the
work available to me, and of the vocal biomarker study as a whole, slowly shifted the more I
took the perspective that my position at the lower end of the research pipeline afforded. My own
initial low regard of this daily domestic labor was itself a kind of meta-commentary, one that
reified the hierarchy that places data analysis at the top (as highly skilled, specialized, esoteric,
and valuable) and places interactions with research subjects and custodial, domestic-oriented
labor at the bottom (as ordinary, banal, skill-less grunt-work that is tangential to the production
of scientific knowledge). I participated in shunting it to the margins precisely by not taking it
seriously enough as a zone of ethnographic significance.
In so doing, I had caught myself buying into only one side of science's Janus face (Latour
1998): the side proclaiming science to be a singular, linear journey fixated on the pursuit of facts
alone, rather than, as feminist science and technology studies scholars have shown, a messy,
heterogeneous affair that is distributed across bodies, materials, and spaces, and one that is
riddled with problems in need of constant tinkering and tending (Haraway 1988; Barad 1999).
Mattern (2018) points out that attention to maintenance work is itself an act of maintenance,
amplifying otherwise ignored, subaltern voices while exploring these spaces, practices and
practitioners as potent sites and agents of ideological distillation that go unnoticed because they
appear so benign, so passive. To take the administrative, banal, and sweaty, dirty work of
scientific research seriously (work that falls under the murky and capacious rubric of care) is to
consider how practices of repair and error correction are not ancillary but vital to the production
of scientific knowledge (Martin, Myers, and Viseu 2015: 628).
The organization of space within the lab maps out social and academic hierarchies:
whose expertise is deemed pivotal to the research, which components of research are the most
crucial versus tangential. Who does and does not get a space on the fourth floor of the big
beautiful building reifies labor relations within the lab. The less skilled and permanent your
position, the more likely you are to participate in maintenance work and interact directly with
research subjects, and the more difficult it is to access desk space close to your collaborators. This
reinforces the notion that fine-tuning, troubleshooting, and interfacing with research subjects are
a peripheral form of labor, at the edges of groundbreaking scientific action. But for the vocal
biomarker study, the work involved in managing the body and language of the research subject
was central to the project of ensuring the speech they collected was "natural" enough, a
necessary precondition to gathering (and later analyzing) any data at all. This is the work that
enabled the whole study to hang together, infrastructurally and epistemologically.
THE NON-NEUROSCIENTISTS
When I was given permission to join the rest of the group floors above the atrium and handed a
set of keys to an office, I developed a new routine. At the start of each day, upon entering the
building, I would either turn right, walking across the lobby and then down a set of stairs to the
Imaging Center in the basement, or turn left, up a flight of four stairs and through a winding
hallway to my office. Many people in the vocal biomarker group's lab conduct research on
childhood autism and dyslexia, so I'd often encounter children along my morning route. The
office I came to share was one of many branching off from the lab's main waiting room area,
where the young research subjects and their guardians would sit on a leather couch reading
books or playing with the lab's collection of puzzles and action figures on a mahogany coffee
table.
Along with Santiago, I shared my office with a visiting neurologist from Argentina and
Rebecca, a graduate student affiliated with the lab but not working on the vocal biomarker
project. I was the only woman on the vocal biomarkers team, and I was happy to have Rebecca's
company and kindness in the office. She came from a humanities background and taught English
as a second language outside the U.S. for a number of years before having what she told me was
a "conversion" experience that led her to pursue a PhD in neuroscience. Rebecca tried her best to
explain to me the charts and statistical analyses featured in the research articles assigned in my
audited courses. She invited me to attend the monthly journal club that Victor ran called SMALL
(Speech Motor Auditory Language Learning), which became a resource for making sense of how
Victor and the team positioned their research against or in tandem with the labs producing the
articles we read.
Not unlike Rebecca, the members of the vocal biomarker team had made their way to the
ECU's Neuroscience Department through indirect, interdisciplinary paths. Despite their
departmental affiliation, and despite the fact that during the time of my fieldwork they were
either taking or teaching courses on neuroscience and brain imaging, Sushant, Ted, Victor,
Ralph, and Santiago did not identify as neuroscientists. In the weekly meetings that they held
separately from the wider bi-monthly lab meetings, they liked to say that they were merely
people who "happen to hang out with neuroscientists."
Sushant began his academic career studying computer science in Southeast Asia before
moving to the east coast of the United States to pursue graduate training in neuroscience, with a
focus on building computational models to better understand speech motor control. For his post-
doctoral training at ECU, he deepened his studies in what he calls "speech communication"
while pursuing additional projects exploring potential biomarkers for mental health treatment
outcomes, the subject of my initial meeting with the head of the lab, during which I learned of
Sushant's vocal biomarkers project. Having ascended to the position of lead scientist of the lab,
Sushant now teaches ECU's only course on speech communication and acoustic phonetics,
which he allowed me to audit.
Acting as a PI for the team alongside Sushant is Ted, who sports a salt-and-pepper
mustache and a South Shore accent not unlike the one I used to have growing up. Ted is the head
of a lab in an institute affiliated with ECU located in a more rural part of the state and would
travel to ECU once a week to attend meetings. With a combined three degrees from ECU, he is a
close colleague of Sushant's PhD supervisor and the author of a major textbook in speech signal
processing. When I attended an international signal processing conference with him and the rest
of the team in California the summer before the semester began, everyone seemed to know Ted.
Other attendees silently approached him to shake his hand while he sat in the audience of several
talks. Ted trained team members at my other fieldsites; many of them had sought out a degree or
training at ECU specifically to work with him.
Ralph is a lanky, sarcastic, and thoughtful advanced graduate student whom Sushant and
Ted both supervise. Working on the project provided a forum for integrating his two main
research interests: speech production and cognitive and emotional states. Victor once remarked
to me that, to Ralph, every conceivable human or physical phenomenon could be distilled into a
signal. Ralph himself told me that he believed all components of existence (human life, the
formation of the universe, neuronal mechanisms) were governed by the same basic, universal
laws of physics, which themselves could be distilled and described through the equally universal,
transcendent language of mathematics. Hence speaking, as well, was a physical process that
could be apprehended through recourse to mathematical processes, operationalized in the form of
an algorithm.
Three years behind Ted in the PhD program is Victor, a California transplant with an
approachable demeanor and a gift for putting research subjects at ease, perhaps due to his
background in speech pathology. For his undergraduate and master's training, he combined
neuroscience and linguistics with an emphasis on phonetics and speech production and
perception, specializing in stuttered speech. Team members like Victor, with his more "clinical"
or "applied" training, were more likely to have experience working not just with human test
subjects, but with humans as clients or patients.
Santiago is the youngest member of the group, a first-generation American whose family
emigrated from Central America to the States when he was a toddler. He had just completed his
BS in neurobiology at a nearby university, and the lab tech job offered a means through which to
gain greater research and management experience. He hoped this might position him to pursue a
career as a programmer in the biomedical sector. As a lab tech, he did the bulk of the hands-on
work necessary for the research. It was primarily his job to respond to research subject
recruitment emails, and to review the team's human research subject protocol with potential
research subjects to help them decide whether to officially participate in the
research (a process the team called "consenting," since the form the subject signed at the end if
they agreed to participate is called an "informed consent form"). Finally, Santiago's main job
was to scan research subjects.
The group saw their non-traditional status as an asset rather than a hindrance to their
research. No one on the team had any commitment to the causal models of psychiatry, or even
biomedicine. They were less interested in causality and more interested in gathering data in large
enough volumes that would enable them to pick out patterns that might otherwise go unnoticed.
According to Sushant, the group could "hack a solution" to the lack of robust, biologically based
diagnostic markers through an "engineering approach." An engineering or "computational"
approach implies a commitment to agnosticism, a willingness to approach or interpret a problem
in a radically unexpected manner. For instance, Sushant explained that he was open to believing
that the number of times a person touches their nose might be a biomarker for mental illness, a
sign that is linked to and the byproduct of a psychopathological process, despite the fact that
nose touching bears no conventionalized resemblance to any behavior that has anything to do
with mental health. Hence, Sushant's phrase that they were going to "hack" a solution. Hacking
in engineering parlance implies figuring out how a technical system works so that "it can be
made to perform in previously unintended and unforeseen ways" (Jones, Semel, and Le 2015:
324). Sushant and his team wanted to observe, modify, and rearrange the standards of research
on "neuropsychiatric disorders" (a word they used interchangeably with "mental illness")
working outside of the traditional boxes and categories of the DSM, studying sounds of the voice
and the brain rather than a patient's description of their symptoms.
Walking through the professional pathways that led researchers to the vocal biomarker
group provides a backdrop for making sense of their study's intervention, in terms of the models
of language, body, and mind that their work was committed to and reified. While team members
come from backgrounds in linguistics, communication science, mathematics, and engineering, no
one on the team had training in psychiatry or experience in mental health care professions. They
had never treated or conducted long-term observation of people diagnosed with the disease
category they studied: depression. The team did have a psychiatry consultant, affiliated with a
local teaching hospital and available for assistance and commentary. This individual helped the
team develop their study's inclusion and exclusion criteria and advised them in selecting the
screening tools they used to determine who was eligible to participate as a research subject. At
the same time, if spatial proximity to the rest of the team is an indicator of a team member's
value, the consultant's expertise was additive, rather than central. The consultant was never
physically present in the offices. Taking this into consideration, in the following section, I
explore what it means for people with no background in psychiatry to define and conceptualize
"depression" as a coherent disease-state, describing the team's procedures for evaluating a
potential research subject's eligibility to participate in the study.
DEPRESSED?
Research subjects seeking involvement in the study must first pass through a screening
procedure. Under the supervision of a researcher, they fill out psychiatric inventories designed to
determine how closely they approximate criteria for different diagnoses in the DSM. The
collection of scores they produce determines their eligibility. Many of these inventories, like the
Hamilton Depression Inventory (HAMD) and the Beck Depression Inventory (BDI), were
developed during the time period discussed in Chapter 1, at a moment of sweeping reform across
American psychiatry aimed at overhauling and "scientizing" the discipline by rendering mental
illnesses into more stable, bounded, and quantifiable objects. Researchers initially developed
inventories to achieve this feat of stabilization, and the vocal biomarker team uses them in a
similar way.
But the ethnographic record has shown that, although institutions and individuals in
positions of power have used DSM and its interlinked inventories as if they were classificatory
field guides, these tools create groups and kinds of individuals rather than map onto groups and
kinds that already exist. DSM-directed screening and diagnosis construct likeness, rather than
identify an essential, a priori likeness (see especially Hacking 1986; Young 1995; Luhrmann
2001; Lakoff 2006; Conrad 2007; Metzl 2010). Their use as neutral classificatory apparatuses
erases the particularities of the populations and milieu from which they were developed, which
challenge the universality of their application. For instance, while HAMD is a tool that clinicians
use with many populations, the inventory was developed based on studies of cohorts of mostly
male and entirely white research subjects living in psychiatric hospitals in the 1960s and 1970s
(Williams 2001; Worboys 2012).
The vocal biomarker team also treats inventory scores as indicative of the subject's
psychological state: either they are depressed enough to participate in the study, or they are not
depressed enough, or they fall into a diagnostic category that fits the study's exclusion criteria.
As I will show in this section, the vocal biomarker team's use of these inventories to screen and
consent their research subjects constitutes a "trial of qualification" (Callon et al. 2002). By
determining what counts as depression in the context of their own study, even as they are
humbly honest about their lack of expertise in psychiatry, the team defines the boundary of the
diagnostic category. The consenting procedure ratifies the qualities and criteria of being
"depressed" established in DSM, even though their study takes place beyond the bounds of a
typical clinical interaction, and even while their ultimate goal is to develop methods for
diagnosis without recourse to conventional, DSM-derived descriptions.
When I was first added on to the team's IRB protocol in the capacity of a research
assistant in the summer of 2015, Ralph was excited at the prospect of outsourcing to me all of the
tasks that consumed most of his time yet required the least training and skills. One such task was
the consenting of research subjects: reviewing the team's ethical protocol for conducting
research with human subjects, ensuring that the participant understood the risks and benefits of
participating in the study, and administering and scoring a stack of psychological inventories.
Ralph insisted that I sit in on and observe him consenting research subjects as part of my
apprenticeship, so that I could eventually practice by pretending to consent him and then move
on to consenting actual research subjects on my own. This end stage never arrived. By fall 2015,
Sushant had hired Santiago, and this job fell under his purview.
Not long after I had agreed to my apprenticeship, I followed Ralph up to the fifth floor
and waited with him for the day's potential research subject in a long and narrow office that was
locked from the outside. When the subject, a white woman with square glasses, arrived, Ralph
let her into the office and then into an attached room, beckoning me to join them after she agreed
to let me observe, "for training purposes." The room that the team was using to evaluate subjects
was smaller than the main office and much more cramped, with no windows. Ralph and the
research subject were huddled around a meager table that hit Ralph's legs at his bent knees. The
walls of this room were painted white, but someone had cut out shapes (stars, spirals, circles,
and diamonds of varying sizes) from pastel-colored construction paper and taped them to the
walls, maybe to make the room less stark and a little more inviting for younger research subjects.
Ralph gestured to a metal chair positioned behind him. I sat down and found that Ralph's
shoulder obscured the research subject's face and the forms she would be filling out from my
view. Despite the room's crowdedness and the hot summer day outside, the air conditioning gave
me goose bumps. The paper shapes shuddered in the artificial breeze.
Unlike many universities that conduct biomedical research, ECU has no medical school,
so there was no pool of research subjects to draw from who had a clinical diagnosis of
depression or who were being treated for depression. Instead, the team opened the recruitment
pool up to anyone who self-identified as having depression (or, for the controls, anyone who
identified as not having any major psychological illnesses). They sent recruitment emails through
ECU's Neuroscience Department research subject list-serv or posted announcements on job
opportunity websites like Craigslist. Subjects were compensated in cash both for their
participation in the research and for the time it took (usually an hour) to be evaluated and
consented, so participation was marketed as a form of short-term employment. In addition to
these methods, everyone on the team, myself included, carried a stack of flyers in our bags, to
post around the ECU campus and the surrounding areas. The flyer was succinct, inquiring in
bold black script, "DEPRESSED?" alongside brief information about the study ("using the
voice to understand the mind"), a note that it was a paid opportunity, and the team's contact email.
Their recruitment tactics left it up to the potential research subject to self-identify as having
depression or not, banking on the circulation of "depression" as a legible, psychopathological
state (see Martin 2007). Only when the research subject had made it as far as this woman with
the glasses (contacting the team, scheduling a time to come in, traveling to ECU) did the
difference between self-identified "depression" and DSM-identified "major depressive disorder"
(MDD) begin to matter.
First, Ralph walked the woman through the study's consent form, reading it aloud to her
from his own copy as she followed along on hers, pausing to ask if she had any questions (she
had none) before moving on to each section. After she initialed and signed in all the right places,
Ralph opened up a laptop computer and used Audacity, an open-source audio-recording software
program, to record her reading "The Grandfather Passage" out loud, a public domain text that is
frequently used to gather a speech sample. Like the pa-ta-ka exercise conducted from within the
scanner, the Grandfather Passage was written by a speech pathologist to contain almost all of the
phonemes of American English.33 The team collected this audio-recorded speech without a clear
plan on what they wanted to do with it or what significance it might hold for the rest of their
study, though there was frequent talk of comparing subjects' in-scanner speech with their
outside-the-scanner speech.
After the Grandfather Passage came the Mini-Cog,34 a test typically used to assess for
dementia and Alzheimer's that the research team was using to evaluate the cognitive processing
abilities of the research subject. Next came the psychological inventories: the Beck Anxiety
Inventory (BAI); the Beck Depression Inventory-II (BDI-II); the Bipolar Self-Test Mood Swings
Questionnaire (MSQ); the Yale University PRIME Screening Test for psychosis; the SAGE
Scales; the Quick Inventory of Depressive Symptomatology (QIDS-SR); Patient Health
Questionnaire version 9 (PHQ-9); Generalized Anxiety Disorder version 7 (GAD-7); the
Screening for Obsessive-Compulsive Disorder; and the Snaith-Hamilton Pleasure Scale
(SHAPS).
33 The team used two other passages: the Rainbow Passage and the Caterpillar Passage. All three passages varied in
length and reading level. Researchers interchanged the passages assigned to subjects but typically went with the
Grandfather Passage because of its duration and its mid-level difficulty.
34 For the Mini-Cog, sometimes referred to as "the clock test," the researcher first asks the subject to repeat five
words, then asks them to draw the hands of a clock at 10 past 12 on a paper with a large circle on it. After they finish
drawing the clock, the researcher asks the subject to repeat back the five words. The subject is evaluated on their
ability to remember the words and on their ability to draw the clock.
Inventories are a kind of survey and can be grouped into two categories: clinician-rated
(which are filled out according to a clinician's interpretations of the semantic content of the
patient's answers) or patient self-rated (which the patient fills out themselves according to their
own assessment). Following the trend of their recruitment strategies, the team relied on self-rated
inventories. They left it up to the research subjects to evaluate their own symptoms, interior
states, ability to experience joy or feel a lack of motivation to pursue things that bring them
pleasure, and so on. Recall that the diagnostic criteria and diagnostic categories of DSM are
likewise based on outwardly observable behaviors and self-reported symptoms, rather than the
causal mechanisms that drive or lead to mental illness. In both the DSM and in the context of the
vocal biomarker study, then, the agency of the research subject still plays a central role, although
the premise of vocal biomarkers subverts the agency of speaking subjects and severs the
connection between evaluation and expression.
The woman slid the inventories back to Ralph across the table, one by one, as she filled
them out. When she handed him her completed BDI-II, Ralph glanced over it and solemnly slid a
different sheet of paper back in her direction, a form titled "Community Resources for
Psychological Treatment." This document (the only one in the stack that the vocal biomarker
team had created on their own) listed contact information for local and national suicide hotlines,
sliding-scale community mental health centers, and hospitals with emergency psychiatric units.
Ralph bowed his head and whispered, "I'm sorry." I was familiar with the structure of the BDI-
II, and so although I could not see the woman's response I knew that this meant the woman had
given a score higher than zero for question 9, "Suicidal Thoughts and Wishes," thereby
indicating that she had suicidal ideations or urges. No one on the research team is a medical
doctor. Ralph was in no position to officially diagnose the woman. It was prohibited to make a
medical referral, or to notify a medical professional of the woman's responses to this question.
To do so would be to potentially incur legal liability on behalf of the team and ECU, and the
university's institutional review board required the team to make it exceedingly clear that they
could not diagnose or provide medical care for research subjects.
After all of the inventories were complete, Ralph guided the woman down to the lobby on
the first floor to the administrative office, where she was to receive payment for her hour spent
with us. I watched her and Ralph exit and was filled with a sudden somberness. I remembered
that there were people, and lives, and suffering at the front end of the data pipeline, that the
research participants were not only repositories of data or the hosts of speech-making brains, but
also human beings who were potentially in unspeakable pain. Their participation in the research
could not provide them with much aside from the knowledge that they might, if they qualified
for the study, give back to society in some meaningful way by contributing a few hours of their
bodies, thoughts, and sounds to science.
I wondered if Ralph felt this same weight, or if he had grown accustomed to it or simply
couldn't afford to be held back by it. Evaluating subjects through the administration of
psychiatric screening inventories without being able to help them was an uneasy but necessary
step toward building the study's experimental cohort. In the absence of a channel for providing a
legally and/or medically sanctioned intervention, the team had put together the Community
Resources list for the sake of the research subjects. This was a care-ful practice: careful not to
cross the line into the legally unacceptable territory of "care." Providing the paper was not a ratified
form of psychotherapy or treatment. Still, it was care-like: a list directing her toward treatment,
an action attentive to the anguish the woman had expressed through a number. A small gesture
(a recognition, a bow of the head) but a gesture nonetheless.
Ralph returned to the office with the empty cubicles. I sat without speaking, listening to
him type in and tabulate her scores while listlessly staring at my own laptop. I thought about
what it would be like to see her again, if I would soon be tucking her into the scanner bed,
offering her a blanket. Ralph cut the silence to announce that she did not qualify for the study.
According to her scores, she showed signs of psychosis and of cognitive deficits, two of the
study's exclusion criteria. She would also have been excluded if she was under 18 years old, if
she showed signs of obsessive-compulsive disorder or bipolar disorder, if she had
hypothyroidism, or if she failed to score at least 14 on the BDI-II. This would indicate that she was
not depressed "enough."
In my earliest meetings with Sushant, he identified the vocal biomarker group's research
as RDoC-worthy because of its commitment to exploring markers of pathology with a biological
anchor, "using the voice to understand the mind" rather than, for example, using the voice to
understand depression, or studying slurred speech as a symptom of MDD. RDoC encourages
researchers to group together research subjects who might not have been included in the same
cohort, with the assumption that DSM categories arbitrarily and perhaps even incorrectly group
together people who share no actual biological likeness and in a way that might make
biologically significant and biologically based findings impossible. But even as Sushant and
others asserted that their research contributed to efforts to move beyond the social conventions of
biomedical research on mental illness and identify biologically anchored signs suggesting the
presence of mental illness, they still had to make use of tools that categorize subjects according
to traditional criteria. Conventional psychiatric inventories determined who would make up the
experimental cohort, and in this way, they shaped the findings that the team would eventually
produce.
From a theoretical standpoint, the vocal biomarker group treated MDD as a brain
disorder, a pathology that impacts neuronal circuitry and functioning. But when it came to
gathering together a group of research subjects to study, they treated depression as something
existing at the level of self-reflexive interpretation, caught up in culturally-specific ideas about
behaviors and states of being that are either normal or pathological. The team was well aware of
this contradiction. They recognized that to try to empty "depression" of its cultural significance
by, for example, using a different word in their recruitment strategies, would be to risk missing
out on research subjects. While Sushant appreciated and supported efforts to move away from
DSM like the RDoC project, he also recognized that DSM was tied up with and embedded in the
bureaucratic infrastructure of America's health care system. In our many conversations about the
feasibility of RDoC, Sushant would often note that a complete rejection of DSM would require
revamping the mechanisms through which health insurance companies cover the cost of patient
care. And finally, Sushant knew that subjects were hard to come by. They had to be depressed
enough to meet the team's inclusion criteria, but not so depressed that they were unable to rouse
themselves from their houses, travel to ECU, and then lie confined in the scanner for 3 hours.
Even when they made it to the scanner, much of the data captured during the scan was unusable.
The subject had fidgeted, or they spoke too softly, or Santiago and I had installed the microphone
incorrectly, and so on. Epistemological hopes aside, at the end of the day, Sushant told me, "we
have to start somewhere." The vocal biomarker team's dilemma-their desire to move beyond
DSM yet DSM's necessary role in their study-speaks to the bumpy road that ambitious projects
like RDoC must encounter, and to the tenacity of the bureaucratic and sociocultural
infrastructure built around DSM. The team could support the aims of RDoC but needed to
contend with the social life-and institutional power-of diagnostic categories. Moreover, in
assembling their experimental subject population, researchers were given a first-hand encounter
with the lives at stake in the interlinked epistemological and bureaucratic mental health care
reform efforts like RDoC. In the mundane task of handing out and scoring the low-tech, pen-and-
paper inventories of an era of American psychiatry in its twilight hours, researchers like Ralph
were constantly reminded that data points are people, and that innovation sometimes intersects
with human suffering.
SOUNDS FUNNY
If a research subject qualifies for the study, they schedule a time with Santiago to return to ECU,
this time to the Imaging Center in the basement, for their brain scan. I revisit the process of
scanning from the other side of the control-room window, exploring scanning as a social activity.
I recount a particularly disastrous-and hilarious-scan in order to demonstrate just how many
factors can interrupt the process of gathering speech and brain data. One of the most persistent
challenges is the embodied individuality of the research subject.
Assisting and overseeing brain scans is a rite of passage for researchers within the vocal
biomarker group's lab. It is also a kind of ritual enactment that separates experimenter from
experimental subject, distinguishing the mentally ill from the (relatively) mentally well.
Researchers who sat in the control room and orchestrated the scans could often also act as
controls for the study, as members of the unmarked, non-depressed category. To qualify as a
control for the vocal biomarker study, researchers cannot have a neuropsychiatric disorder,
especially a known diagnosis of MDD. Team members disclose their mental health status to each
other through their status as a control, sometimes coyly, sometimes bluntly. One of the first
questions Ralph asked me when I tried to join the team was whether or not I am mentally ill. I
wondered if the reverse ever occurred-if a researcher would serve as a research subject in light
of their diagnosis. This felt too taboo to ever ask out loud, like it would threaten the neat divide
between object of scientific inquiry and agent of scientific study.
For safety's sake, two researchers must be present at every scan. It is their dual
responsibility to double-check each other and stick to protocol, ensuring that the research subject
has removed all ferrous metal from their body before going into the scan room. Since the scanner
contains a giant, powerful magnet, ferrous metal outside or inside of the body poses a serious
safety hazard. In the ominous words of the resident fMRI safety officer, if you don't look
thoroughly enough for metal on the subject's body, the magnet will find it for you. Participants
with non-removable metal like aneurysm clips or tattoos that contain iron (a once-common
ingredient in red body pigment) are not "MRI compatible," and cannot be scanned. The two-
body requirement was also meant to protect the safety of researchers. Before this rule was put in
place, there were several instances in which a research subject assaulted a researcher.
A two-tiered hierarchy of scanning responsibilities dictates how much a researcher can
interact with research subjects, dependent on a researcher's technical knowledge of brain
imaging and computer programming. Each tier is associated with a different colored laminated
badge: yellow or green. The badges hang from lanyards that researchers must bring with them to
every scan, and the back of the badges is printed with the telephone numbers of the safety
officer, university facilities, and other helpful emergency contacts. "Yellow badge" is the lower
level, conferred upon a researcher after they have taken and passed an afternoon-long MRI safety
course. A yellow badge grants a researcher the ability to assist with scans or to at least be
present in the control room during a scan. A step up from the yellow badge is the "green badge,"
the higher level, secured after additional training and another test that covers broader topics,
including image reconstruction software and human research subject protocol. Green badges
oversee the scan, re-consent research subjects and read the task directions to them during the
scan. Only a green badge can control the computer that controls the scanner. There has to be at
least one green badge in the room for every scan, and if no green badges are available, the scan is
canceled. Two yellow badges cannot conduct a scan alone.
With my yellow badge, I accompanied and assisted Santiago (a green badge). By "yellow
badging" every scan-being the second body in the room with a green badge-I freed up the
time of other researchers. Yellow badging is highly undesirable work. All that you can do in the
scanning room is check your email, eat, chat, maybe watch video clips or check social media,
and hope that nothing goes wrong. The fits and starts of a scan made it difficult to engage in any
other activity that requires sustained attention, like reading course materials or jotting down
fieldnotes. If and when there was a problem with the scan, Santiago and I would try to resolve
the issue as quickly as possible. Otherwise, my tasks were limited to tucking the subject in and
cleaning up, along with basic data entry: typing in the research subject's anonymized ID number
before each task and pressing a single key on a laptop to initiate the task program.
While conducting brain scans played a central role in the study, it was also an enormous
source of error. Conversations on how to identify and address errors around brain imaging or
sound recording dominated the group's regular meetings. Speech recorded outside
the scanner during the consent and evaluation (when subjects read the Grandfather Passage)
tended to be clearer and to require less pre-processing than the speech uttered inside the
scanner. A handful of times, Santiago and I accidentally recorded our idle chitchat inside the
control room rather than the subject's speech from inside of the scanner. Or, I failed to save a
recording, or he forgot to check in on a subject executing a task or remind them to remain still.
Every now and then we would get a cross email from Ralph, informing us that we had sent him a
silent audio file.
The act of speaking itself could be a source of error. Most of the tasks required subjects
to read sentences out loud, repeat nonsense words that other lab members had created, or sustain
vowel sounds and consonant pairs, usually for no more than 3-5 seconds. Despite the short
length of oral speech each task required, the location of the articulators relative to the brain in the
skull meant that participants subtly moved their heads as they spoke, which blurred
(or created artifacts in) the final reconstructed brain image, ultimately lowering the accuracy of
the team's findings. In their pursuit of "natural" speech, the researchers tried to create an environment
in which the subject could hear the sound of their own voice during the scan.35 The team
programmed the scanner to pause for a few seconds during the time that the subject was
executing the verbal task, hoping as well to eliminate sounds of the scanner from the audio file
and ensure that the subject's voice was recorded as clearly as possible. But the sparse paradigm
lowered the sampling rate of the brain images captured, resulting in a lower-resolution image.
After around four scans together, Santiago and I fell into a rhythm. We began to
anticipate each other's reactions to the long and illustrious list of things that could go wrong.
Sometimes, when we were especially unlucky, a number of problems would arise at once. On
one such day, Santiago was in an uncharacteristically sour mood. He hadn't had time to grab
lunch before our noon scan, the second one of the day, and he only had a power bar to get him
through the next three hours. Because I was only a yellow badge, he couldn't leave me alone
with the subject in the scanner to retrieve more food. Santiago was also in a bad mood because
35 According to speech and communication science, being able to listen to one's own speech plays a fundamental
role in speech production. It enables the speaker to constantly adjust and re-adjust the sounds that they produce in a
cybernetic feed-forward loop, a phenomenon that was the subject of Victor's dissertation.
we had been met by a number of setbacks, before and after the subject-a towering man over six
feet tall-had arrived. As had been the case for the past two months, the microphone was giving
us trouble. The microphone that offered the best sound quality and that had been
expressly designed for the study was at its manufacturer's for another round of
expensive repairs, and the second-tier replacement microphone was not working, either. After
running the mike wire into the scanner room through the copper panel in the control room
underneath the desks holding the three computers (to ensure that the mike would not conduct
additional radio frequency into the scanner room) I had sat on the scanner bed in the same
position the research subject would assume, while Santiago stood watching the Audacity
interface and listening for the sound of my voice through the speaker with the soundproofed door
closed. The time we could've spent buying lunch and bringing it back down into the basement
dwindled away.
Audacity was picking up the sound of my voice. The purple waveform representing the
words I spoke into the microphone in the scanning room, "test test, seven six five four three two
one," showed up like a long, fuzzy caterpillar inching across Santiago's screen. But only silence
came through the speakers, which Santiago tapped over and over again while giving me the
signal to speak. In the end, we resorted to the third-tier microphone, the electrostatic mike, a
piece of equipment that Sushant had built while he was a graduate student in the lab. The mike
was reliable but, as Santiago griped, had shitty sound quality. It did a poor job of capturing the
participant's voice as they completed the speaking tasks, which meant that Ralph would not be
happy with the audio we captured.
When the participant arrived, we went through the steps of preparing him for the scan.
After Santiago re-consented him in the waiting room area, demonstrating the spoken tasks to him
and asking him to change out of his clothes and into a pair of scrubs, Santiago stood with him at
the door to the scanning room and swept a handheld metal detector over his body. He joked that
we were making sure that the subject was "ok to be let into the club," pretending to be a bouncer
scanning the subject's body for weapons, and pretending that participating in the scan might
offer the same kind of fun and excitement as a nightclub. After Santiago and I patted down our
shirt and pants pockets to ensure we had no metal items on our bodies, we led the participant into
the scanner room. I helped him settle onto the mattress, handing him the buttons and wires,
showing him how to squeeze the soft foam ends of the ear buds so that they could fit snug into
his ear canal and then expand. When Santiago ran back to the control room to test if the
participant could hear him through the ear buds, the subject told us he heard nothing. Santiago
could not hide the frustration on his face.
For the next thirty minutes, we took turns hustling back and forth between the scanner
room and the control room, prompting the research subject to speak while the other one of us
listened from the other side. We gave up and called the safety officer, who asked Santiago, half-
serious, "what did you screw up this time?" Santiago, who liked to assure me that he was not
superstitious, also liked to talk about things being cursed: we must be cursed, this participant was
cursed, the expensive microphone was cursed, the scanner reconstruction computer that crashed
while I fetched the electrostatic mike was cursed. Today, he told the safety officer, was a cursed
day. By the time I had retrieved the electrostatic mike, tested to be sure it worked, let the
participant out for a bathroom break, and tucked him back in, Santiago was even more hungry
and on edge, waiting for the next curse to reveal itself. It manifested in a comedy rather than a
catastrophe.
Three out of seven tasks in, it was evident that the microphone was working but the
subject's baritone voice was too quiet. The waveform in Audacity was thin and neat, which
meant his voice would sound faint and indistinct on the audio file. As we prepared for the next
task, Santiago once again prompted him, speaking through the small mike that rested on the
desk, "just make sure that you're speaking nice and loud so that the mike can pick you up, ok?"
"Ok," said the participant, tired and lackluster. Santiago and I had been in the middle of
discussing, of all things, the Jewish tradition of bat mitzvah, a custom he was curious about but
unfamiliar with. We would pause and resume our conversation according to when he had to talk
to and check in on the participant. I had grown accustomed to the stop-and-go cadence of our
talk; we could pause and take up the same topic again after stretches of interacting with the
subject through the microphone. Santiago once again pressed the button to turn the mike on, the
sign for me to stop talking. He began reading the directions for the next task that Victor had
written:
In this task, you will say the vowels that appear on the screen in different pitches. The
vowels are ahh [/a/], ee [/i/], and ooo [/u/]. When you see the vowels on the screen, you
will also see a pitch word, either high, normal, or low! When the text turns green and you
see a green triangle appear on the screen, begin saying the vowel with the pitch indicated
on the screen. Please hold the vowel for a few seconds until the triangle disappears and
the next instructions come on the screen. Does that sound ok?
The subject, blandly, said yes. Santiago switched off the mike and asked me if boys and girls
became bar or bat mitzvah at different ages. I entered in the subject's ID number, pressed the
space bar to initiate the program that controlled the tasks and displayed the PowerPoint slides to
the participant, and explained to Santiago that I had become bat mitzvah, with my sister, around
age 12 or 13-I could not remember. The scanner pulsed around 10 times in rapid succession
before pausing. Santiago asked me if my twin sister and I had our bat mitzvah celebration at the
same time and pressed the speaker button to check in on the subject. I was going to answer him,
but the subject produced such an arresting sound that I stopped in my tracks. "Ee!!!" he cried
out, in a short, high-pitched burst. Santiago quickly switched off the speaker to cut off the
blaring of the scanner and we erupted into wordless, breathless laughter. As we laughed, he
pressed the button again in time with the scanner pause. "Ahh!" said the participant, with his
normal pitch and subdued volume, as if he had just taken a sip from a refreshing drink. We
continued laughing and wheezing, slapping our knees, still unable to talk as Santiago once again
turned on the speaker. "Oo" said the participant, in a high-pitched voice like something had
startled him. I tried to regain my composure and Santiago remarked, between gasps of air, "I
don't even want to hear his ah again."
On the one hand, the participant had exceeded our expectations, and was excelling at the
task in terms of pitch modulation, especially given his otherwise deep, dull voice. On the other
hand, he was executing the task incorrectly, failing to sustain the vowel sound for long enough,
and producing instead a too-short burst of sound. The combination of wide pitch variation and
short burst of sound surprised us. "I haven't heard that before," said Santiago. "It was a good
little twist." It was so unlike any interpretation of the task we had heard out of the 20 or so
subjects we had scanned in total at that time. It was also unexpected because of our expectations
of how the subject would sound-his voice otherwise devoid of emotion-and given our initial
read of him combined with his gender presentation-a large statured, cis-gendered, manly man.
Even if the subjects stayed entirely still for the entirety of what Santiago called a
"ridiculous long" scan, there was no guarantee that they would execute the task in the way that
the team wanted. For the two tasks that required pitch modulation, many subjects tended to only
raise or lower the volume of their voice. It did not help that the prompts flashing on the screen
only said "high," "normal," and "low." The spelling of one of the prompts for the pitch modulation tasks
also confused research subjects. One of the sounds they were supposed to sustain was /u/ as in
"shoe," but the prompt spelled this sound "ohh." As a result, many research subjects produced
the sound /o/, as in "show." Santiago himself had made this error when piloting the scan for
Victor and when acting as a normal control. No one corrected him. By the time Santiago and I
realized how subjects were interpreting the prompt, Santiago worried that correcting the slide to
ensure participants produced /u/ instead of /o/ might introduce some unanticipated, unwanted
variation into the dataset.
Every now and then, subjects continued to make errors even after Santiago gently pointed
the error out and demonstrated once again the desired way to execute the task. Some subjects, I
considered, may have done this on purpose. Maybe they were pointedly refusing to perform the
task according to Santiago's directions and committing themselves to making whatever sounds
they chose in the scanner. This might be a means through which subjects could take control of
the scan and subvert their position as a body-as scientific material-in the team's experimental
system. They could collect their payment at the end of the scan while leaving the team with
unusable data, suffering no consequences for their recalcitrance other than a stiff neck and a few
hours of supine discomfort.
For the vocal biomarker team, conducting a scan amounts to a negotiation, a push-and-
pull that ends in compromise. There was a constant slippage between the team's ideal-typic
conceptualization of the vocal qualities that they wanted the subject to reproduce, and the
subject's subjective, embodied articulation of that quality (Chumley and Harkness 2013). The
tasks were supposed to be devoid of linguistic meaning. The team had designed the tasks in order
to activate regions of the subject's brain associated with speech motor control, and they hoped
the tasks would escape entanglement with the subject's own ideas about what they meant. But
the funny-sounding man confronted Santiago and me with the excess of his interpretation, and our
own interpretation of him.
After all, the subject's novel execution of the task was not the only thing that made the
episode so hilarious and so compelling despite its brevity. It was funny because of the
incongruence between what we read as the subject's identity and how his voice sounded. A
large, baritone-voiced man emitting a high-pitched sound brings normative expectations about
masculinity and language rushing into the room (to flip them on their head). Santiago's job in
this instance, as with all other scans, was to redirect the subject's verbal performance in an effort
to standardize it, re-rendering the research participant into a body, a sounding object rather than a
speaking subject. But the subject, however inadvertently, reinserted his personhood, resisting
being formatted for the sake of the study. His funny sounds reminded us that he was not just a
body but a person, and that even when speech is just a sustained vowel sound, it is still up for
sociocultural elaboration.
HOW TO DO THINGS WITH WORDS
Where do researchers' theoretical models of and technical protocols for studying speech hail
from? I had initially assumed that the researchers would take a Saussurean structuralist approach
to studying human speech communication. This is in part because linguistic anthropological
scholarship since the early 1970s, including texts that survey the history of the sub-field qua the
history of linguistics in North America (Duranti 2003; Mithun 2004), situates Saussurean models
of language at the normative, hegemonic pole against which language ideologies prevalent
elsewhere are compared (Silverstein 1998; Duranti 2004). Saussurean linguistics, in this
literature, is foundational to patently "Euro-American" language ideologies or even "language
science" writ large (Silverstein 1979). I had thus expected to encounter Saussure in some form or
another while working alongside this group of non-neuroscientists studying speech.
In Saussurean linguistics, "language" or langue is the invariant, conventionalized
correspondence between sound-image and meaning. According to Saussure, the role of the
linguist is to study the ordered and systematic structures of this relationship, rather than study
"the mechanical, voluntary, accidental, and variable realizations of speech," or parole (Caton
1987:225). For Saussure, the primary purpose of communication is propositional-humans use
language to referentially map out reality (Caton 1987:231). In the United States, Bloomfield and
Chomsky further solidified the dominance of this model by insisting that linguistic "competence"
was the proper target of study over linguistic "performance" (Hymes 1964; Hymes [1973] 2001).
They rendered the Saussurean model of language even more cognitivist by replacing structuralist
systems of categorization with "grammar," and claiming that the potential for all humans to
acquire language-i.e., the potential to learn how syntactically and semantically to cut up the
world-originates from an innate, biological apparatus, the universal grammar apparatus. The
innovative move of linguistic anthropologists in the late 1960s and early 1970s was to de-center
langue and linguistic competence and spotlight the importance-if not the dominance-of
practice, culture, and variation in language. For Dell Hymes (1964) and subsequent students of
the "ethnography of speaking," this meant arguing that speaking and communicative practices-
the domain of Saussurean parole-are in fact cultural activities, and part of the "social fact" (to
use Saussure's words) to be studied by social scientists. For Michael Silverstein and his students,
this meant arguing that semantics and pragmatics "do not form an opposition" showing instead,
through ethnographic examples, that "semantics is a narrow domain of pragmatics"-that even
the seemingly durable rules of grammar can be shaped, warped, patterned, and influenced by
cultural practices, institutions, power, and sociopolitical concerns (Harkness 2017:478).
Curiously, like the linguistic anthropologists writing against Saussure and Chomsky, my
informants were also focused on the acts and processes of speaking, but in a way that did not
center on speech as a sociocultural activity. In the vocal biomarker group's weekly meetings, no
one ever discussed meaning or talked about language as a system of signs. If anything, meaning
was a problem to avoid, like in the creation of nonsense words used in one of the verbal tasks.
The words had to be believable as lexical items in American English (recognizable as a word
rather than a sound) but unrelated to existing words that might hold some kind of emotional
resonance or stir the thought or memory of research subjects in an unintended way.
Rather than talk about meaning, the vocal biomarker team talked about the muscles and
networks in the brain responsible for speech. For instance, during a meeting in July of 2015,
Ralph took to one of the floor-to-ceiling white boards on the walls of the empty classroom we
were congregated in to draw out a vast diagram of the potential neurobiological source of pitch
control. He sketched a cross-sectional outline of a human head, complete with a simple drawing
of the brain, the teeth, tongue, oral cavity, pharynx and larynx. He pointed to the diagram's neck
to indicate the location of a muscle in the larynx (the cricothyroid muscle), which is responsible
for tightening the vocal cords and controlling the flow of air expelled from the lungs during
speech. With electric excitement in his voice, he told us that the cricothyroid muscle "is
innervated by a nerve. If you follow that nerve further you hit the neural cortex, and the
connection can get you all the way up to the limbic system," a system that depressive conditions
have been theorized to impact. Ralph was convinced that this connection suggested that slight
changes in the pitch of speech might be the outcome of changes in the limbic system, which
might be due to depression.
At the time, I was carried along and fully convinced by Ralph's line and the route it took
us through. Looking over my notes in my office days later, however, I was perplexed. The logic
seemed sound: if the brain and the process of producing speech are connected, then the sonic
contours of spoken utterances can be connected back to the brain's activity. Yet it was all still so
strange, so alien to me. Where was language-and signification-in Ralph's line, and in my
informants' conceptualization of speech overall?
I decided to show the team an iconic figure from the opening pages of Saussure's Course
in General Linguistics as a kind of projective test. The image I selected displayed two heads
facing each other, one of their mouths opened and the other's closed, with a dotted and solid line
looped between them. Saussure calls this interaction the "speech circuit," consisting of (A)
phonation (orally producing speech) and (B) audition (listening to and processing the meaning of
speech). On occasions when I found myself alone with Ralph, Victor, and Sushant, I presented
the diagram and asked them what it was showing, what they thought of it, and if they would
make any changes to it.
The three men found the diagram more or less unproblematic, even satisfactory, although
they were unable to identify its origin. I showed it to Victor once as we sat in one of the plush
chairs in the lobby, avoiding the high afternoon sun and debating on whether we should attend a
lecture about olfaction in rats that we were already late to. Victor chuckled at the image and
responded in an ironic tone, as if the answer were obvious, "ooooh, that looks like human speech
communication!" Upon further prompting, he described what he saw:
two people are interacting with each other and it looks like-it's showing they are
producing oral sounds, at each other, and then there are other lines that hit both their ear
and their brain; probably the brain is involved in both the oral and the aural aspects of it,
so the hearing and the production side. Yeah. But yeah I [think] it's from like Denes and
Pinson because that's a book, Speech Acts, that has like these, you know, schematized
pictures of like, what speech communication looks like. I think that's perfectly good, like
is, as simply as it needs to be, it's like, two people speaking to each other, if one person
uses their brain to produce something, the other person is using their brain to interpret
what they heard.
To my surprise, his explanation was similar to Saussure's 36 with a few caveats, although Victor
never mentioned his name. Later in our conversation, Victor told me that he would add an
additional, "side" loop to the circuit connecting the speaker's own mouth with their own ear, to
indicate that the speaker listens to their own speech and adjusts it accordingly. With this added
feedback loop, the diagram now began to resemble one found in a book that Victor, Ralph,
Sushant, and Ted had all suggested I read in order to learn more about speech communication,
the same book that Sushant assigned as supplementary reading in his speech communication
course: The Speech Chain: The Physics and Biology of Spoken Language, written by Peter B.
Denes and Elliot N. Pinson and published through Bell Laboratories in 1963. I suspect that The
Speech Chain is the book Victor had actually been referring to in our conversation, though he
seemed to misremember the title of the book as Speech Acts. Victor's (potential)
36 "Suppose that the opening of the circuit is in A's brain, where mental facts (concepts) are associated
with the representations of the linguistic sounds (sound images) that are used for their expression. A given
concept unlocks a corresponding sound-image in the brain; this purely psychological phenomenon is
followed in turn by a physiological process: the brain transmits an impulse corresponding to the image to
the organs used in producing sounds. Then the sound waves travel from the mouth of A to the ear of B: a
purely physical process. Next, the circuit continues in B, but the order is reversed: from the ear to the
brain, the physiological transmission of the sound-image, in the brain, the psychological association of the
image with the corresponding concept. If B then speaks, the new act will follow - from his brain to A's -
exactly the same course as the first act and pass through the same successive phases, which I shall
diagram as follows" (1966[1959]: 11-12).
misremembering is evocative because it melds the pragmatic work of speaking (speech acts)
with the mechanical activity of embedding meaning in and producing a material, physical effect
(speech sounds). It collapses the social doing of speech into the biomechanical making of speech.
THE SPEECH CHAIN
With its origins in Bell Laboratories, The Speech Chain points to another branch in the
history of the scientific study of language in the United States, one running parallel to the
trajectory of Saussurean semiology and Chomskian linguistics that linguistic anthropologists
have narrated. My informants' disregard for semantic meaning and their primary concern with
the physical properties of speech as sound align their work with that of early telephone engineers,
whose efforts both overlapped with and diverged from Saussurean semiology. As Saussure was penning and
preaching his theory of the sign in the early half of the 20th century, in the U.S., information
theory, psychophysics, and industrialization coalesced in the technology of the telephone,
birthing what Mara Mills (2011) calls "the industrial conception of language," or the notion of
speech as "a material good and sellable commodity" (77). Telephone engineers applied Claude
Shannon's information theory in an effort to translate the smallest possible intelligible unit of
speech, the phone, into an electrical signal in order to move that signal across a channel from
sender to receiver with as little interference as possible.37 Concatenated with the development of
Cold War weaponry and cryptography, in the making and proliferation of the telephone, the
name of the game was to maximize intelligibility while minimizing cost. So persuasive was the
industrial conception of language that even Saussure's semiology bears its mark. Mills notes that
Saussure's theory of the sign, emblemized in the diagram I showed Victor and his colleagues, is
"seemingly modeled on a telephone call"-a sender and receiver, in a closed, dyadic circuit,
send communication via "impulses" along invisible but still present wires (Mills 2011: 79).
Saussure indeed referred to the processes of communication as "the speech circuit." But as
Timothy Lenoir observes, these early telephone engineers homed in on a key component of
language that structuralist semiotics following Saussure ignores: the notion that "language itself is
not a pure sign, it is also a thing...tied to voice, to bitmaps on a screen, to materiality" (1994:
122). In other words, the notion that language has a material existence, a texture in addition to a
meaning.
Not unlike the Saussurean speech circuit, in The Speech Chain, speech links "the
speaker's brain to the listener's brain," emphasizing the role of the brain as the ultimate
processor (Denes and Pinson 1963: 5). At either end of the speech chain is the talking brain and
the listening brain, intermediated by the articulatory organs and the ear, displayed in anatomical
detail resembling Ralph's simplified drawing. But gone are the threads and wires of
"communication"-or perhaps signification-tying the two conversational partners together in
Saussure's model. Instead, there are sound waves, rippling out of the speaker's mouth and
through the ears toward the brains of both the conversational partner and the speaker themselves.
37 Although, as Mills notes, the embodied subjectivity of d/Deaf individuals played a crucial role in the field of
cybernetics and the object and associated infrastructure of the telephone alike, engineers founded their models and
technologies on the basis of a normative, exclusionary speaking and listening subject. The vocal biomarker team's
entire study likewise is premised on an audist model of speech.
This diagram contains Victor's addition: the cybernetic loop connecting the speaker's own
speech with their ear, indicating their constant, real-time adjustment of their own voice as they
listen to themselves speak, ensuring the transmission of information is as efficient as possible.
The Saussurean notion of "sound-image" also departs drastically from how my
informants considered and studied speech as sound, a departure that The Speech Chain took part
in as well. The Saussurean sign is made up of the sound pattern (signifier) and the concept
(signified). According to Saussure, "the sound pattern is not actually a sound; for a sound is
something physical. A sound pattern is the hearer's psychological impression of a sound, as
given to him by evidence of his own senses" (66). In Saussurean linguistics, the sounds of speech
exist in a singular, individualistic, phenomenological impression, and have no independent
existence beyond the listener's sensory capturing of speech's occurrence. On the other hand, for
the vocal biomarker team, speech sounds are always "actually sounds." For them, there is
nothing ontologically phonemic about speech sounds. Speech sounds exist definitively in a
common reality that all (hearing, neurotypical38) humans have access to, because speech sounds
are governed by the same properties of physics that govern all materials in general and all waves
in particular, from the wild and shaky waves of a shhhh to the potent waves of radiofrequency
used in fMRI. Although when prompted, Sushant cited "acoustic phonetics" as the discipline
from which their model of speech arose, we might also call their model of speech spectral
phonetics. A spectrogram, like the team's study, brings sound into representation in a way that is
agnostic to the differences between the fleshy, biological realm and the mechanical realm of
electric impulses.
38 In the speech chain model, Saussurean psychological perception is replaced with brain and biology-based
perception, anchored in an ableist model of the body insinuating that all humans have more or less the same brain,
with the same uniform faculties for apprehending and processing sound in the same, standardized way.
In The Speech Chain, the core of the interactive, communicative act is spectral and
biological, and orchestrated by the brain independently of the intent to create or encode
"meaning" or social action. Only the brief second chapter of The Speech Chain is dedicated to
"linguistic organization," covering the phoneme, the syllable, the word, sentences, the
grammatical and semantic rules of linguistic organization, stress, and intonation. The opening
paragraphs of Chapter 2 explain that the "linguistic level" of speech contains the "message" of
speech, which the speaker conveys by choosing "the right words and sentences to express what
he wants to say. The information then goes through a series of transformations into physiological
and acoustic forms" (10). In a familiar Saussurean division, the speech sounds are the vessel for
the planned, intentional meaning of speech, and the units of language-symbols-"stand for
objects around us and for familiar concepts and ideas" (ibid).
Nevertheless, Denes and Pinson advise, "throughout the rest of this book, we will
concern ourselves with relating events on the physiological and acoustic levels with events on
the linguistic level" (ibid). The hierarchy of the "levels" of study in this elementary text is clear.
Eight out of the nine chapters in The Speech Chain cover physics and biology, spanning
topics like the anatomy of vocal organs, neurons, nerve impulses, the peripheral and central
nervous systems, the spectra of speech waves, the formants of English vowels, acoustic cues for
speech recognition, and advances in the neurophysiology of speech. Moreover, the fact that the
linguistic level of speech is set apart from sections on physiology and physics implies that
language cannot be reduced by or pulled apart using these tools.
Beyond the hierarchical organization of The Speech Chain and its thematic focus on
biology and physics, why did my informants not care about the "linguistic level"? I brought this
up with Sushant after talking with him about the Saussure diagram. He explained that speech is a
"biomechanical output," and while it has language components, the motor control components
are the most elementary of all. In fact, understanding speech production at the level of motor
control-as he put it, "understanding how the biomechanics of the system are used to produce
things"-is a necessary prerequisite for understanding "how that symbolic mapping [of
language] is translated into this continuous acoustic wave form." He was willing to concede that
the biomechanistic output of speech, and its neural coordinates, does indeed have "language
components," because different languages require producing different sounds, but he assured me
that these differences were superficial.
I probed him on this further. What about variation within spoken English, like
occurrences of vocal fry, which is produced by augmenting the flow of air from the lungs to the
oral cavity using the larynx? What about upspeak, another feature of some languages that comes
and goes historically? Or what about tonal languages, like Mandarin, which also require different
control of the larynx than a non-tonal language like English? Wouldn't that produce a drastically
different "biomechanical output" and rely on different parts of the brain? Sushant stuck to his
guns:
S: you may have differences in the coordination. We haven't delved into Mandarin or
other tonal languages in terms of depression but at least, in the Romanic or Germanic
languages that we have looked at, control is very similar, but at the basic level I would
think control even in Mandarin is somewhat similar: you have mechanisms, yes there's
some specialties and that might influence how these processes behave, and that's one of
the reasons that we feel we can extract information from voice rather than focusing on the
language.
B: I think the key point that I was missing when I would try to answer those kinds of
questions for myself [about tonal languages] was that you guys are focused on
coordination, that's what you care about as far as what's going on in the brain.
S: Correct-because that is one aspect we feel is mostly language agnostic. I can't tell you
that it's completely language agnostic. But we feel that the basic mechanisms by which
you produce sound are in some ways common. Now. There are sensitivities and
specificities in each language [...] the repertoire of phonemes in a given language is going
to vary, and that might influence how somebody controls those pieces. But on average
across a long utterance [...] these phonemes and formants are reflecting the shape of the
mouth and the state of your larynx and your breathing all in one go. And so, to us, that's a
more fundamental thing that, independent of language constructs, one needs to control, so
that's where we focused on.39
The vocal biomarker group pursues parole and the mechanisms that produce the sounds of
speech for the same reason that Saussure left those things behind: because the act of coordinating
the muscles necessary to produce the sounds of oral communication is the "embryo of speech," a
realm that is so basic, so fundamentally human because it is so fundamentally biological, that it
is beyond the "social fact" of communication. Focusing on the coarsest domain of the
communicative act also ensures that their findings will be "language agnostic" and, therefore,
edge toward the panhuman.
To summarize Sushant's response to the conundrum of tonal and non-tonal languages:
the neural pathways that drive the mechanisms for producing sounds will be the same across
humans, regardless of the nature of the specific sound being produced. The underlying
assumption is that human language speakers all share a common, fundamental feature: a brain,
which has the same features and functions more or less the same from person to person. Sushant
and his colleagues are looking for, in his words, the most fundamental "basic iota of information
that will offer insight" about depression. That is, so long as depression is conceptualized as a
brain disorder.
Their theoretical framework for studying speech, and pursuing vocal biomarkers,
combines two universalist frameworks. On the one hand, it offers up language universalism: all
39 Victor had a similar answer: he explained that a "recent study looked at native Portuguese and native English
speakers and found like basically no differences between any of the speech network even though you could point out
some qualitative differences between the languages, like, speech is speech and you're using more or less all the
same characteristics in order to produce it in the brain at least. Maybe in some languages you might lean a little
more one way or another like the complexity might be in the variety of speech sounds versus grammatical structure
so maybe you would get some slight brain differences either in structure or in activation but, you know, on the
whole, I would say that you're almost certainly gonna have the same activation patterns across languages."
communities of speakers share a commonality, because they all use the same cognitive faculties
to execute the task of producing oral speech regardless of the language they are speaking
(Enfield 2012; Evans & Levinson 2009).40 On the other hand, it offers up biological
universalism: all communities of speakers in their subject population (depressed and not
depressed speakers of English) share the same brain, aside from those experiencing depression,
whose brains will be slightly, subtly different in a distinguishing way. Like eight out of the nine
chapters of The Speech Chain, the linguistic level of oral communication is quite simply beyond
the scope of the vocal biomarker group's theoretical, experimental focus.
At the same time, that is not the only reason they are seeking out the embryonic, pan-human
level of communication. Sushant conceded that to conduct a study attentive to the linguistic level
would mean his team had to deal with the particularities and nuances of human difference. Such
a project would require far more material resources (money, research personnel, research
subjects, fMRI machines, and time) than they have access to. Sushant was clear about this:
it's possible that some of these cognitive states affect the linguistic component more than
the basic components, it's just that, that's a fairly complex and comprehensive
project...If we were to bring in specific languages, we might need a person per language
on the team, to help us do things. [We don't] have access to those kinds of resources.[...]
We're not saying that language is not important, it's just outside the scope of our current
approach.
Studying basic components of speech communication means that there are fewer variables to
control, that exclusion criteria for research subjects can tend toward the broad, and that subject
recruitment, consent, data gathering, analysis, and publication can be carried out by a rather
small and plucky team of five researchers (plus an eager and inexperienced anthropologist-
40 Victor had some interesting theories about "deaf speech." He mused that non-congenitally deaf people who
communicate through oral speech might be using more or less the same motor activity networks as non-deaf people;
the only difference is that they use somatosensory cues (the position of the tongue in the mouth) as their feedback
mechanism.
research assistant). Economic resources shape the making and doing of science just as much as
theoretical models and disciplinary convention. This adds another layer of significance to the
industrial conception of speech. Scientifically pursuing speech at a granular scale, as a spectral
emanation operating at the level of language and biological universals, is a more cost-efficient
option. Attending to speech at a greater level of complexity beyond its smallest iota-taking into
consideration culture, history, difference, nuance, and meaning-would be an expensive
undertaking.
CONCLUSION: NOISY SCIENCE AND NATURAL SPEECH
Together, Sushant, Ted, Victor, Ralph and Santiago pursue speech not as a code of signification,
but as a sonic output of the brain's inner workings. In their efforts to use the voice to understand
the mind, they run into another, troublesome sonic entity: noise. In the context of information
theory-the technical undergirding of communication technologies and the vocal biomarker
team's spectral phonetics-noise is "the byproduct of technological reproduction that interfered
with the reception of a message (i.e., static on a radio transmission, distortion over a loudspeaker,
or hiss on a magnetic tape)" (Novak 2015: 128). As an unavoidable feature of technologically
mediated sound, noise is not exactly a sound in its own right. It is more accurately a
"metadiscourse of sound and its social interpretation" (Novak 2015: 126). Noise is defined in
negative tension with and against the meaningful, the significant, and the valuable. Examining
noise in place can indeed tell us something about noise's counterpart, its cybernetic twin: the
signal, that which is sought after and the focus of attention. Noise is what stands in the
background, to be disregarded and discarded. But even at the edges of attention, noise never
disappears. Like the semiotic spillage of the funny sounding man, the woman with the glasses
who Ralph could not care for, or other moments in which the research subject's subjectivity (the
nuances of their bodies and their voices) interrupted their transformation from people into data,
noise is an excess, a leftover that gets in the way.
The vocal biomarker study is noisy business. The blaring, menacing wrenching of the
magnet in the fMRI machine drowns out and distorts the subject's speech. If the research subject
shifts slightly in the scanner, even to move their articulators to produce sounds and sentences
according to Santiago's direction, they introduce artifacts into the fMRI image. The microphones
fail, the software fails, Santiago and I record the wrong kind of speech (our own), the subject
(willfully?) misinterprets Santiago's directions and flubs the task. The vocal biomarker team was
looking for the biological underneath the social, but they recognized that their methods and tools
and techniques were incapable of offering a one-to-one correspondence with the truth. There was
always going to be some kind of interruption, some form of interference. On the one hand, noise
is a problem to manage. On the other hand, noise is a testimony to the limits of abstraction and
reduction. Just as their efforts to hack mental health care research still contain an unshakable
grain of the very classificatory system they seek to disrupt, their pursuit of the universal and
bedrock foundations of both mental illness and language contains a signal-jamming grain of the
particular, the subjective, the irreducible, the different.
Laboratory technicians like Santiago are tasked with managing the noisy excess of
research subjects' bodies and voices-the never-ending and menial work of noise cancelation.
The goal of the study was to pry apart sound from semantic meaning, biology from the body.
The models of speech that the team adheres to and that guide their investigation torque the
relationship that a linguistic anthropologist might anticipate between the body and speech. This
is because their models essentialize the body-a normative, audist body-as a site of truth,
taking the body out of the social by attempting to disentangle speech, and sound, from the social.
To search for vocal biomarkers of depression, they must materialize speech through the body of
the research subject only to disembody it again, enacting what Schafer calls "schizophonia"
(Schafer 1969). They strive to split the sounds of speech from their source, removing them from
the particularities of their contexts, only to argue that they are universally true signs that have been in
the body all along, waiting to be found. But in order to achieve this feat of abstraction, the team
runs into the very same components of language that they seek to overcome and pull away from
the act of speaking: context, semantic meaning, variation and difference.
Close cousin to the pair of signal and noise are mediation and immediation. Mediation,
Mazzarella argues, is "the ambiguous foundation of all social life," involving the multitude of
"conceptual, technical, and linguistic practices by which the actually irreducible particularities of
our experience are...reduced...rendered provisionally commensurable and thus recognizable and
communicable in general terms" (2006: 476). Eisenlohr notes that, paradoxically, media are the
most successful when they disappear, when the fact of mediation melts away, giving the
impression of im-mediacy, "drawing attention away from their own materiality and technicality
in order to redirect attention to what is being mediated" (2011: 44). In theory, a vocal biomarker
of depression is a sound that cuts directly to biological processes, so directly that its mere
presence stands in for and is commensurate with a pathological brain state. A vocal biomarker
suggests an immediate-im-mediated-connection between voice and mind. This paradox-
convincing mediation draws attention away from the very fact of mediation-is a source of
power, fueling what Mazzarella calls "the politics of immediation" (2006).
One instantiation of this power emanates from the language sciences and their hegemonic
commitment to language universals. Language universalism implies that all human languages
have some irreducible core, a durable center that culture, history, and politics can never touch or
budge, and that can be arrived at once these other, superficial layers are melted down. In this
chapter, I hope to have emphasized what an STS approach to the scientific study of language can
achieve: a demonstration of how facts about language, especially facts about the biological basis
of language, are mediated, remediated, made and assembled. My attempt to de-naturalize the
figure of "natural speech" in the scientific study of language resembles Penny Eckert's (2003)
critique of the figure of the "authentic speaker" in sociolinguistic research. Authenticity, like
naturalness, is aligned with proximity to the ingrained and invariable core, while inauthenticity,
like unnaturalness, is "tainted by the social" (392). The authentic speaker conveys vernacular
realness, while the inauthentic speaker conducts a conscious, intentional performance. Likewise,
"natural speech" implies a downplaying of the speaker's agency and intentionality. But as I have
shown, speech has to be made natural. It is someone's job to ensure that context, culture, the
speaker's own interpretations, are all kept at bay. This is why the work of troubleshooting,
fixing, and tucking in, while hierarchically nominal, is in fact epistemologically central. This is
the very work of rendering the research subject into a transparent medium. If this work is done
well, it too will fall to the edges of attention, slip into transparency, enabling mediation to
shimmer away into immediation yet again.
The advent of Computational Psychiatry suggests that the burden of defining and
communicating the signs of mental illness will shift away from the patient. Biological truth will
emanate from the sufferer's body, and the knowledge of their own suffering will be external to
their sense of self. This is another reason why it is so crucial to emphasize that mediation and
noise are key features of the search for vocal biomarkers. The notion of a vocal biomarker
threatens to naturalize the body as the only site of mental illness, and further threatens to
extricate definitions of health and wellbeing from the patient, "utterly decoupled from anything
experiential" (Dumit 2012: 123). It re-creates the asymmetrical power relations that feminist
critiques of psychoanalysis have attempted to intervene on, suggesting a scenario in which
mental illness is a mysterious code that only a technical expert can unwind and demystify
through the operation of machines whose inner workings remain out of reach. In this way,
Computational Psychiatry's most morally pressing concern rests not with machines that threaten
to replace humans, but with the de-humanization of patients as mere automatons emitting
neurobiologically significant exhaust.
References
Alpert, Murray, Enrique R. Pouget, and Raul R. Silva. 2001. "Reflections of depression in
acoustic measures of the patient's speech." Journal of Affective Disorders 66(1): 59-69.
Barad, Karen. 1999. "Agential realism: Feminist interventions in understanding scientific
practices." In The Science Studies Reader, ed. Mario Biagioli. Pp. 1-11. New York: Routledge.
Breznitz, Zvia. 1992. "Verbal Indicators of Depression." The Journal of General Psychology
119(4): 351-636.
Callon, Michel and C. Meadel and V. Rabeharisoa. 2012. "The Economy of Qualities." Economy
and Society 21(2): 194-217.
Cannizzaro, Michael, Brian Harel, Nicole Reilly, Philip Chappell, and Peter J. Snyder. 2004.
"Vocal acoustical measurement of the severity of major depression." Brain and Cognition 56(1):
30-35.
Carr, E. Summerson. 2010. Scripting Addiction: The Politics of Therapeutic Talk and American
Sobriety. Princeton: Princeton University Press.
Caton, Steven C. 1987. "Contributions of Roman Jakobson." Annual Review of Anthropology 16:
223-260.
Conrad, Peter. 2007. The Medicalization of Society: On the Transformation of Human
Conditions into Treatable Disorders. Baltimore: Johns Hopkins University Press.
Chumley, Lily Hope and Nicholas Harkness. 2013. "Introduction: QUALIA." Anthropological
Theory 13(1/2): 3-11.
Cummins, N., J. Epps, M. Breakspear, and R. Goecke. 2011. "An investigation of depressed
speech detection: features and normalization." Proceedings of Interspeech, ISCA, Florence,
Italy, pp. 2997-3000.
Cummins, Nicholas, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and
Thomas F. Quatieri. 2015. "A review of depression and suicide risk assessment using speech
analysis." Speech Communication 71: 10-49.
Cummins, Nicholas, Vidhyasaharan Sethu, Julien Epps, Sebastian Schnieder, and Jarek
Krajewski. 2015. "Analysis of acoustic space variability in speech affected by depression."
Speech Communication 75: 27-49.
Darby, John K. and H. Hollien. 1977. "Vocal and Speech Patterns of Depressive Patients." Folia
Phoniatrica et Logopaedica 29(4): 279-291.
Darby, John K., Nina Simmons, and Philip A. Berger. 1984. "Speech and voice parameters of
depression: A pilot study." Journal of Communication Disorders 17(2): 75-85.
Denes, Peter B. and Elliot N. Pinson. 1963. The Speech Chain: The Physics and Biology of
Spoken Language. Ann Arbor: Bell Telephone Laboratories.
Dumit, Joe. 2012. Drugs for Life: How Pharmaceutical Companies Define Our Health. Durham:
Duke University Press.
Duranti, Alessandro. 2003. "Language as Culture in U.S. Anthropology." Current Anthropology
44(3):324-347.
Duranti, Alessandro. 2004. "Agency in Language." In A Companion to Linguistic Anthropology.
Alessandro Duranti, ed. Pp. 451-473. Malden: Blackwell Publishing.
Eckert, Penny. 2003. "Sociolinguistics and authenticity: an elephant in the room." Journal of
Sociolinguistics 7(3): 392-431.
Enfield, N.J. 2012. "Language, culture, and mind: trends and standards in the latest pendulum
swing." Journal of the Royal Anthropological Institute 19:155-169.
Evans, Nicholas and Stephen C. Levinson. 2009. "With diversity in mind: Freeing the language
sciences from Universal Grammar." Behavioral and Brain Sciences 32(5): 472-492.
Flint, Alistair J., Sandra E. Black, Irene Campbell-Taylor, Gillian F. Gailey, and Carey Levinton.
1993. "Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in
major depression." Journal of Psychiatric Research 27(3): 309-319.
France, D.J., R.G. Shiavi, S. Silverman, M. Silverman, and M. Wilkes. 2000. "Acoustical
properties of speech as indicators of depression and suicidal risk." IEEE Transactions on
Biomedical Engineering 47(7): 829-837.
Gershon, Ilana. 2010. "Media Ideologies: An Introduction." Journal of Linguistic Anthropology
20(2): 283-293.
Greden, J.F., A.A. Albala, and I.A. Smokler. 1981. "Speech pause time: a marker of
psychomotor retardation among endogenous depressives." Biological Psychiatry 16: 851-859.
Guenther, Frank. 2016. Neural Control of Speech. Cambridge: MIT Press.
Godfrey, Hamish P.D. and Robert G. Knight. 1984. "The Validity of Actometer and Speech
Activity Measures in the Assessment of Depressed Patients." The British Journal of Psychiatry
145(2): 159-163.
Hacking, Ian. 1986. "Making up people." In Reconstructing Individualism: Autonomy,
Individuality, and the Self in Western Thought, ed. T Heller, M Sosna, D Wellbery, pp. 222-36.
Stanford: Stanford University Press.
Haraway, Donna. 1988. "Situated Knowledges: the Science Question in Feminism and the
Privilege of Partial Perspective." Feminist Studies 14(3): 575-599.
Harkness, Nicholas. 2017. "Glossolalia and cacophony in South Korea: Cultural semiosis at the
limits of language." American Ethnologist 44(3): 476-489.
Hollien, Harry. 1980. "Vocal Indicators of Psychological Stress." Forensic Psychology and
Psychiatry 347(1): 47-71.
Hymes, Dell. 1964. "Introduction: Toward Ethnographies of Communication." American
Anthropologist 66(6):1-34.
Hymes, Dell. [1972] 2001. "On Communicative Competence." In Linguistic Anthropology: A
Reader. 2001. Alessandro Duranti, ed. Pp. 53-73. Malden: Blackwell Publishers.
Jones, Graham, Beth Semel, and Audrey Le. 2015. "'There's no rules. It's hackathon.':
Negotiating Commitment in a Context of Volatile Sociality." Journal of Linguistic Anthropology
25(3): 322-345.
Joyce, Kelly A. 2008. Magnetic Appeal: MRI and the myth of transparency. Ithaca: Cornell
University Press.
Lakoff, Andrew. 2006. Pharmaceutical Reason: Knowledge and Value in Global Psychiatry.
Cambridge: Cambridge University Press.
Langlitz, Nicolas. 2010. "The persistence of the subjective in neuropsychopharmacology:
observations of contemporary hallucinogen research." History of the Human Sciences 23(1): 37-57.
Langlitz, Nicolas. 2012. Neuropsychedelia: The Revival of Hallucinogen Research since the
Decade of the Brain. Berkeley: University of California Press.
Lasswell, Harold D. 1930. Psychopathology and Politics. Chicago: University of Chicago Press.
Latour, Bruno. 1998. Science in Action: How to Follow Scientists and Engineers Through
Society. Cambridge: Harvard University Press.
Lenoir, Timothy. 1994. "Was the Last Turn the Right Turn? The Semiotic Turn and A. J.
Greimas." Configurations 2: 119-36.
Low, Lu Shih Alex, Namunu C. Maddage, Margaret Lech, Lisa Sheeber, and Nicholas Allen.
2010. "Influence of acoustic low-level descriptors in the detection of clinical depression in
adolescents." IEEE International Conference on Acoustics, Speech and Signal Processing,
Dallas, TX. https://ieeexplore.ieee.org/document/5495018
Luhrmann, Tanya M. 2001. Of Two Minds: An Anthropologist Looks at American Psychiatry.
New York: Vintage Books.
Mattern, Shannon. 2018. "Maintenance and Care." Places Journal, November. Accessed 9 Jan
2019. < https://placesjournal.org/article/maintenance-and-care/?cn-reloaded=#0>
Martin, Aryn, Natasha Myers, and Ana Viseu. 2015. "The politics of care in technoscience."
Social Studies of Science 45(5): 625-641.
Martin, Emily. 2007. Bipolar Expeditions: Mania and Depression in American Culture.
Princeton: Princeton University Press.
Mazzarella, William. 2006. "Internet X-Ray: E-Governance, Transparency, and the Politics of
Immediation in India." Public Culture 18(3): 473-505.
Metzl, Jonathan. 2010. The Protest Psychosis: How Schizophrenia Became a Black Disease.
Boston: Beacon Press.
Mills, Mara. 2011. "On Disability and Cybernetics: Helen Keller, Norbert Wiener, and the
Hearing Glove." differences 22(2-3): 74-111.
Mithun, Marianne. 2004. "The Value of Linguistic Diversity." In A Companion to Linguistic
Anthropology. Alessandro Duranti, ed. Pp. 121-140. Malden: Blackwell Publishing.
Moore, E., M. Clements, J. Peifer, and L. Weisser. 2003. "Analysis of prosodic variation in
speech for clinical depression." In Proceedings of the 25th Annual International Conference of
the IEEE Engineering in Medicine and Biology Society, pp. 2925-2928.
Morrison, Hazel, Shannon McBriar, Hilary Powell, Jesse Proudfoot, Steven Stanley, Des
Fitzgerald, and Felicity Callard. "What is a Psychological Task? The Operational Pliability of
'Task' in Psychological Laboratory Experimentation." Engaging Science, Technology, and
Society 5: 61-85.
Morawski, Jill. 2015. "Epistemological Dizziness in the Psychological Laboratory: Lively
Subjects, Anxious Experimenters, and Experimental Relations, 1950-1970." Isis 106(3): 567-579.
Novak, David. 2015. "Noise." In Keywords in Sound. David Novak and Matt Sakakeeny, eds.
Pp. 125-138. Durham: Duke University Press.
Ozdas, A., R.G. Shiavi, S.E. Silverman, M.K. Silverman, and D.M. Wilkes. 2004. "Investigation
of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal
risk." IEEE Transactions on Biomedical Engineering 51(9): 1530-1540.
Rheinberger, Hans-Jörg. 1997. Toward a History of Epistemic Things: Synthesizing Proteins in
the Test Tube. Stanford: Stanford University Press.
Rose, Nikolas and Joelle M. Abi-Rached. 2013. Neuro: The New Brain Sciences and the
Management of the Mind. Princeton: Princeton University Press.
Saussure, Ferdinand de. 1966[1959]. Course in General Linguistics. Wade Baskin, trans. Charles
Bally and Albert Sechehaye, eds. New York: McGraw-Hill Book Company.
Schafer, R. Murray. 1969. The New Soundscape: A Handbook for the Modern Music Teacher.
Ontario: Berandol Music Limited.
Schuller, Björn, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian
Müller, and Shrikanth Narayanan. 2013. "Paralinguistics in speech and language: State-of-the-art
and the challenge." Computer Speech and Language 27(1): 4-39.
Seaver, Nick. 2017. "Algorithms as culture: Some tactics for the ethnography of algorithm
systems." Big Data and Society 1-17.
Silverstein, Michael. 1979. "Language Structure and Linguistic Ideology." In The Elements: A
Parasession on Linguistic Units and Levels. Paul R. Clyne, William F. Hanks, and Carol L.
Hofbauer, eds. Pp. 193-247. Chicago: University of Chicago Press.
Silverstein, Michael. 1998. "The Uses and Utility of Ideology: A Commentary." In Language
Ideologies: Practice and Theory. Bambi B. Schieffelin, Kathryn Woolard, and Paul Kroskrity,
eds. Pp. 123-145. New York: Oxford University Press.
Strimbu, Kyle and Jorge A. Tavel. 2010. "What are biomarkers?" Current Opinion in HIV and
AIDS 5(5): 463-466.
Vidal, Fernando. 2009. "Brainhood, anthropological figure of modernity." History of the Human
Sciences 22(1): 5-36.
Vidal, Fernando and Francisco Ortega. 2017. Being Brains: Making the Cerebral Subject. New
York: Fordham University Press.
Williams, Janet B.W. 2001. "Standardizing the Hamilton Depression Rating Scale: past, present,
and future." European Archives of Psychiatry and Clinical Neuroscience 251 (Suppl. 2): II/6-II/12.
Worboys, Michael. 2012. "The Hamilton Rating Scale for Depression: The making of a 'gold
standard' and the unmaking of a chronic illness, 1960-1980." Chronic Illness 9(3): 202-219.
Ziporyn, Evan. 2013. "Visiting Artist Arnold Dreyblatt's Magnetic Resonances." March 19,
Center for Arts, Science & Technology at MIT, Accessed May 21, 2019.
<https://arts.mit.edu/arnold-dreyblatts-magnetic-resonances/>
Chapter 3: Do Androids Dream of Electric Speech?
The small beam of white light shone steadily into the left eye of Rachael Rosen, and against her
cheek the wire-mesh disk adhered. She seemed calm. Seated where he could catch the readings
on the two gauges of the Voigt-Kampff testing apparatus, Rick Deckard said, "I'm going to
outline a number of social situations. You are to express your reaction to each as quickly as
possible. You will be timed, of course." "And of course," Rachael said distantly, "my verbal
responses won't count. It's solely the eye-muscle and capillary reaction that you'll use as indices.
But I'll answer; I want to go through this and -. " She broke off. "Go ahead, Mr. Deckard."
- (Dick, Philip K. 1968. Do Androids Dream of Electric Sheep? New York: Random
House. P. 46)
In the first week of my fieldwork at the West Coast University (WCU) Research Institute, an
engineer I call Klaus asks me to meet him in his large, sunlit office a few feet away from the
cubicle that the Institute's head administrator has secured for me. My corner cubicle, #304-c, sits
at the periphery of a uniform cubicle sea, and it hews close to a wall of offices and conference
rooms named after key figures in the history of American computing. If I slide open the short
plastic door of #304-c, I can see who is coming and going from the Grace Hopper conference
room directly in front of me. And if the door of the room is open, I can see all the way out the
curved glass windows to a busy four-lane highway lined with skinny sidewalks and traffic
signals, and to the hazy sky dotted with seagulls, palm trees, distant hills, and billboards
advertising headphones and television shows. As recently as seven years ago, most of this area was
undeveloped marshlands, some place local school children visited on bird-watching fieldtrips.
Now, it has been reborn into an up-and-coming technology hub that bursts forth with multi-
million-dollar condos, upscale grocery stores and yoga studios, and sleek industrial parks,
including the one in which the Research Institute sits.
Like many other things at the Institute-the Game Room, the Meditation Room, the
Kitchen, the Lounge, the Theatre, the Research Subject Consenting Room-cube #304-c is
clean, ergonomic, artfully constructed, and lacking in warmth. It is immersed in a quiet that feels
eerie given that one side of the building faces the traffic-congested street, and given the over 300
cubicles on the floor, most of which are filled throughout the day with visiting researchers or
post-docs, graduate students, and undergrads who are shuttled in from the main WCU campus 20
miles away. When my surrounding cube-mates do begin talking, faces unseen from within their
little plastic boxes as they make idle chat about lunch plans or the news, they fall silent if I try to
join in, then resume their conversation as if I had said nothing at all. Three days after my
arrival, the day Klaus calls me into his office, I overhear people whose voices I can already
recognize suspiciously mulling over my presence from a few cubes away: who is the girl in
Jackie's old cube? What exactly is she here for? What does she want from us? What are we
allowed to tell her?
I've come to the University to work as a research assistant and learn more about a
technology that a federal agency contracted Klaus and his colleagues in the psychology
department to build, a technology I call the Virtual Human Interviewer (VHI). The program
officer of the federal defense agency funding this project wanted members of the engineering and
psychology departments to collaborate to create an intervention for the high incidence of veteran
and soldier suicide and the under-reporting of mental health issues. The agency commissioned
Klaus and his colleagues to build a system that could tirelessly and systematically identify the
nonverbal signals of post-traumatic stress disorder (PTSD) that soldiers inadvertently convey,
inspired by the evolutionary psychology theory of "honest signals".41 Unlike conventional modes
of listening in psychiatric assessment, in which a mental health care worker attends primarily to
the content of a person's speech as they answer a set of interview questions about their general
psychiatric state, the VHI is incapable of analyzing semantic content. It attends only to the
form-the sonic contours-of speech.
The VHI has two components. First, there is the software, which Klaus and the
engineering team built and which I call VirtuSense. Second, there is the user interface, which the
psychology team built. To be interviewed by the VHI, subjects are hooked up to a microphone
and sit in front of a large screen and a small web cam. On the screen is an animated character: an
adult woman with olive skin and dark brown hair, whom I refer to as Abby.42 Abby appears to ask
the research subjects a series of interview questions based on a combination of psychiatric
assessment scales for PTSD and depression. As you speak to Abby, VirtuSense processes the
audio-visual input that the microphone and webcam capture. VirtuSense analyzes this input and
then calculates a score for the assessment scales. This software also enables Abby to provide
real-time, non-verbal feedback, in response to the paralinguistic signs that you display to the
system's sensors. If you smile, Abby smiles. If you lean forward, she does too. She nods as you
answer the interview questions, prompting you with positive minimal responses like, "hmm,"
"ok," or open-ended follow-ups in response to one-word answers like, "can you tell me more
about that?" According to the psychology team, this interactive animation is meant to illustrate
that Abby is listening, all in order to establish a sense of rapport and encourage the research
subject to keep talking, producing as much speech data as possible for VirtuSense to calculate a
robust assessment score.
41 Klaus eventually revealed to me that the program officer drew direct inspiration from a now out-of-print book by
Alexander "Sandy" Pentland called Honest Signals: How They Shape Our World (2008). Pentland is a computer
scientist, often heralded as the grandfather of wearable technologies and one of the most cited authors in computer
science. He directs the Connection Science and Human Dynamics labs at the MIT Media Lab. This particular book
draws from theories in evolutionary psychology to posit the existence of an unconscious social signaling system that
runs alongside language, and that was developed as a precursor to spoken language and is still used to this day by all
communicating humans. Pentland suggests that wearable sensor devices can be used to capture, interpret, and
operationalize these signals (i.e., using them to better understand and get a leg up on business negotiations,
interpersonal relationships, etc.).
42 While this is a pseudonym, it resembles the name that the team had given to the system: a proper noun, gendered
female.
As soon as I take my seat in Klaus's office, he tells me that he has set up this meeting so
that I can get to know the VHI data as soon as possible. Klaus is trained in machine learning and
speech signal processing. As the bookshelves in his office testify, although he has no clinical
training, he reads up on psychiatry and psychology often, paying special attention to diagnostic
inventories and shifting trends in diagnostic criteria. He is a blond Austrian man whose square, stern
face hides a laid-back, laissez-faire attitude. He embodies the stereotypical hacker demeanor that
anthropologists like Christopher Kelty (2008) and Gabriella Coleman (2014) have observed: a
tendency to subtly, ironically, subvert convention and authority while working from inside it.
Much to my relief, Klaus begins writing up a list of people involved in the VHI project, walking
me through who they are, where they sit, and what they do. I do not mention the icy reception I
got from my cube mates, but Klaus makes it clear that people will be more forthcoming about
speaking with me if I introduce myself as his student. I soon learn that he is well liked and well
respected across the Institute. With an official appointment in WCU's engineering department,
he is the PI of the Institute's Multi-Modal Analysis team (the engineering team). He oversees
several graduate students, post-docs, and a few lucky and talented undergrads, all of whom he
gathers together to meet once a week in Grace Hopper.
Klaus tells me that he has set up this meeting so that I can "get to know" the VHI data as
soon as possible. He often advises his students to review items in the data set before they begin
their work-this allows them to develop a system that is informed by the data's nuances and
textures. I'd naively been expecting us to examine lines of code together, so I'm startled when he
pulls up an archived video of an Abby interview with a research subject, a veteran. This is when
I realize that the 500 or so video-recorded interviews are the VHI data. Or, to be more precise,
the research subjects' speech is the data, the fundamental building blocks of the whole VHI
system. Like many of WCU's research subjects, this subject is a veteran, recruited from either
Craigslist or the local Department of Veterans Affairs (VA). Klaus begins the video, and Abby,
the user interface, introduces herself: "Hi," says Abby, in a Standard American English voice,
"I'm Abby. Thanks for coming in today. I was created to talk to people in a safe and secure
environment. I'm not a therapist, but I'm here to learn about people, and I'd love to learn about
you." This phrase, "I'm not a therapist," is key. It is meant to indicate that interactions with the
VHI do not constitute professional medical care.
The veteran says yes, and the interview is on its way. The interview starts off with Abby
asking light-hearted questions about the veteran's favorite place to travel. Gradually, the
questions creep into darker territory, like, "what's a memory you wish you could erase from your
mind?" Although the vet indicated on a form that all research subjects fill out before their
interview that she had no upsetting dreams about her past, the vet describes flashbacks of a near-
death experience from her deployment and confesses to Abby, "I've had every fucking dream
there is to have." The psychology researchers take this misalignment between what the vet put on
paper and what she says to Abby to be proof of the system's success. They reason that Abby
strikes a sweet spot in the uncanny valley: subjects reveal more to her than they would to a
human, because she is clearly not a human. At the same time, they reason that her interactive,
responsive yet nonintrusive feedback component makes the assessment process feel more like a
"natural," dyadic conversation.
As the interview progresses, the questions grow more probing, and the contents of the
veteran's answers grow more graphic, to the point that I'm uncomfortable to be listening
alongside Klaus. But Klaus is not listening to the content. He wants me to guess if the
"computer" (VirtuSense) assessed the veteran as showing signs of PTSD, depression, or neither.
He tries to direct my ears to the kinds of things that the software is supposed to pick up. "Listen
to the breathiness of her voice," he urges me, "or how she slurs her words a little."
I guess that she is showing signs of depression, but Klaus tells me that I'm incorrect. He
plays the interview again, but I still can't hear the breathiness or the slurs. And what I also can't
see is that there are other people present in the video-people whom I wouldn't learn about until
much later on in my fieldwork: two younger, female members of the psychology team I call
Nava and Taylor, who had watched and listened to the veteran's interview from another room,
monitoring the content of her speech for any mentions of suicide or homicide in a way that
VirtuSense couldn't, because the system is not designed to analyze content and cannot catch the
semantic nuances of suicidal speech-it cannot even identify individual words. For legal liability
reasons, for the sake of the wellbeing of the research subjects, and because the VHI did such a
good job of not attending to speech content-despite Abby's animation suggesting otherwise-
there always had to be humans in the loop.
Klaus shows me several more videos, following this same procedure: he plays the video
and then asks me to guess VirtuSense's assessment. The research subjects' responses to the
assessment questions-tales of assault, estrangement, and violence-continue to disturb me, but
Klaus's attention remains fixated elsewhere. I remark that the videos are tough to take in.
Queuing up another video, Klaus says, "that's why we need virtual humans." If this is what he
wanted me to "get to know" about the data, then his message is a contradictory one. It honors the
work of people doing psychiatric assessment-presumably, the professional figure after which
Abby was modeled, one who primarily plays a listener or facilitator role in a conversational
interaction aimed at gathering information about the other conversational partner. Klaus
recognizes that the job of these professional actors is demanding and draining, in part because
taking up the potentially disturbing content of a would-be patient's speech while remaining as
calm, receptive, and understanding as possible is emotionally exhausting work. At the same time,
to suggest-as Klaus does and as his colleagues did, in building the VHI in the first place-that
it is possible for a machine to do this listening instead of a person inadvertently devalues that
labor, implying (intentionally or not) that it does not require the type of skilled, tacit knowledge
that automated systems are incapable of capturing.
This first encounter with Klaus and the VHI brings to mind the Voight-Kampff test
depicted in the 1982 film, Blade Runner, which was inspired by Philip K. Dick's 1968 novel Do
Androids Dream of Electric Sheep? In a post-apocalyptic future where androids and humans co-exist
and are indistinguishable from one another, the Voight-Kampff test is supposed to help
bounty hunters sort out the humans from the machines. Like the VHI, when the suspected
android answers a series of interview questions, the Voight-Kampff apparatus focuses in on
minute, unconscious bodily reflexes that reveal the speaker's inner state irrespective of the
content of their answers. It's telling that the 2017 sequel to Blade Runner re-imagines the
Voight-Kampff as a test for PTSD that seeks out signals of emotional trauma in the voice. This
speaks to the cultural pervasiveness of the idea that emotions and psychic pain are contained in
the voice, are unconsciously expressed, and can be made knowable by listening, but must be
listened to in a certain way with the aid of technological intervention.
Despite these familiar resonances with the Voight-Kampff test, my meeting with Klaus
and his attempt to get me to guess the VHI assessment was not so much a test as it was a
demonstration: a demonstration that I am human, and that there are limits to what I can hear in
mental illness. He provided a simulation of what the software listened for by showing how out-
of-reach these signs were to me. This was also an enactment of why the VHI is necessary: I was
focused on the content-indeed, driven to distraction by the veteran's words-while Klaus's
software could focus on things that were adjacent to the content.
But in addition to demonstrating the power and necessity of the system, Klaus had also
performed a sleight of hand, a kind of disappearing trick. His demonstration left out the role of
Nava and Taylor, who monitored the subjects from a hidden room, whose responsibility it was to
listen to the content of the speech that the system ignored. This is not to suggest that Klaus was
trying to hide these women from me or that he didn't want to reveal their presence, although the
larger purpose of the study was to convince research subjects that Abby was all machine and that
there were no humans listening to them, because the team wanted to investigate if people are
more emotionally open when they think they are talking to a computer. Perhaps Klaus didn't
bring up Nava and Taylor to me because they weren't a part of his definition of the software
system, VirtuSense, which was his primary research interest. He wasn't very interested in Abby,
the user interface. Nava and Taylor weren't his students. They weren't on his team-they were
on the psychology team.
As members of the psychology team with few academic or professional credentials, Nava
and Taylor were also responsible for explaining the study to research subjects, securing their
consent, and then debriefing them after their interview. In the very early stages of the research, it
was their job to interview research subjects face-to-face and then transcribe and code their
interviews, outlining the basic interactional infrastructure that would be built into Abby. The
team then selected Nava to be the "voice" of Abby, and she spent many hours in a recording
booth reciting the lines of speech that Abby now speaks. In later stages, as the psychology team
was trying to figure out what Abby's animation should look like (what her "active listening" and
rapport-building body language look like) these two young women actually controlled the bodily
movements and the timing of Abby's questions as research subjects interacted with the VHI.
Neither of these young women had extensive clinical training-Nava was still an undergraduate
at the time of the study. Still, they played a fundamental role in making sure that the rest of the
team got the data that they needed, by managing the comfort of the research subjects (the data
source) and by managing the extraction of data (answers to the interview questions).
Klaus's conjuring trick connects with a pervasive feature of automated systems that
scholars in anthropology and science and technology studies (STS) refer to, following Hamid
Ekbia and Bonnie Nardi (2017), as heteromation. According to Ekbia and Nardi, it's not very
productive to think of automation as machines doing things autonomously, with no human
intervention. Instead, it's more productive and actually much more accurate to think about
automation as a mixture of human and machine work-in other words, heteromation. By
bringing the humans who play a fundamental role in automated systems back into view, humans
like Nava and Taylor, the concept of heteromation gives us an anthropological grip on studying
automation as a cultural process rather than one that is set aside from culture. It also opens up the
space to explore why humans like Nava and Taylor are so hard to find in representations of
autonomous systems, so that we can dig into the politics of their invisibility.
ANIMATING ASSESSMENT
In this chapter, I tack back and forth between the ethnographic present and the history of the
VHI's development (which is wrapped up in the history of the Institute itself) as retold to me by
the researchers who worked most closely together on the project. I compare these
narratives with institutional documents and publicly available materials (press coverage, the
Institute's website, etc.). The VHI has yet to come to full, clinical fruition. It has been more or
less shelved since data collection ended in 2015, and it will not be put to use in its desired
clinical contexts, like the local VA, anytime soon. For these reasons, rather than looking at its
reception in the popular press or among mental health care professionals alone, this chapter
attempts to trace out the hopes, imaginaries, and legitimizing rhetoric that drove the dreaming
up, development, and testing of the VHI, and that continue to animate it.
I draw from archived videos of interactions between research subjects and the VHI and
my own interactions with the system and with other human-computer assemblages. I also draw
from my experience collaborating with researchers to carry out an (ultimately failed) comparative
study, in which we attempted to run the VHI sensory processor through a life-size, humanoid
robot. Getting to know the system through its frustrating, puzzling failures and shortcomings
helped me to understand the friction between how the virtual human is designed to appear to
research subjects, and how the perceptual system takes up and interprets human speech, bringing
into greater relief the tension between the multiple modes of listening and participation
frameworks that the system entails and partakes in. This firsthand experience with the two
components of the system (the virtual human and the sensory processor) butted up against and
fell short of the hopeful and promissory representations of the system-what it supposedly does
and how it works-in grant proposals, drafted and revised articles for publication, promotional
videos, and conversations with the press and the general public at symposia and during monthly
open house tours of the Institute.
Analytics like heteromation underscore that the image of the autonomous machine operating
with little to no human intervention is a cultural myth, reinforced in the U.S. and the U.K. by
popular and conventional histories of computing, which struck from the historical record the
oftentimes gendered labor and laborers that made the development of contemporary computers
possible (Daston 1994; Light 1999; Chun 2011; Hicks 2017). If computers and other machines
appear to be acting on their own accord and with their own agency, this illusion is achieved
through the erasure of the humans who maintain and mediate the interaction between machines
and the people that use them, "removing some people out of the loop so that others [i.e., end
users] may feel close to the machine" and are given the impression that the machine is
completely autonomous, and that their interaction is unmediated (Irani 2013:733). For my
informants, "closeness with the machine"-the VHI-is the interactional goal; the user interface
in particular has been designed with the hopes that users (the research subjects) will come to trust
it and as a result, emote openly in front of the system's various sensors. This "closeness" itself
turns on the illusion that the VHI is entirely machinic, with no human intervention, even while it
depends on the downplaying of humans like Nava and Taylor, whose vigilance and attention
keep the interaction socially meaningful as well as psychiatrically safe.
I show how researchers use the virtual human to craft and stage an interaction that
capitalizes on a language ideology dominant in American psychiatry that privileges the
referential function of language, all in order to provide the system's sensory processor with data
that is analyzed in a way that conflicts with that ideology. My interlocutors contrasted the
sensory processor's "machine listening" against what they called "human listening," "listening as
a human," or listening that had "the human touch," the kind of listening that is wrapped up with
Euro-American conceptualizations of empathy that the virtual human pantomimes, and also the
kind of listening that must go on behind the scenes to ensure the system's proper functioning.
Using the VHI and its related components-especially the user interface, Abby-my informants
attempt to encourage trust, rapport, and intimacy, engineering feelings of closeness between the
research subject and the technology to encourage emotional expressiveness. Analyzing the
design and development process lays bare how ideas about the relationship between language
and self that circulate in contemporary psychiatric encounters in the U.S. depend upon a model
of the self as an individualized, authentic core, interior to the person and inaccessible to the
public. Building rapport and enabling trust amounts to cajoling a person to allow an interlocutor
to access this private, secreted core.
The trope of animation is analytically useful in piecing apart the VHI, and not only
because the VHI interface is an animated character on a screen. Animation is a useful trope
especially in regard to the questions of interaction, affect, and labor that are ethnographically
central to this dissertation. The literature on animation expands upon Goffman's (1974; 1981)
key texts on participation framework, taking seriously his invitation to move beyond a
performance model of expression and self and pursuing more nuanced analyses of agency,
intentionality, and technology in linguistic interactions (Gershon 2015; Manning 2018). Goffman
pointed out that an interaction never really just involves two people-the speaker and the
hearer-but instead involves multiple parties, or multiple participants with different levels of
engagement (see also Goodwin and Goodwin [2004]). There are ratified and non-ratified
addressees (intended and unintended recipients of speech), the principal (the person or parties
whose viewpoints motivate the speaker's talk), the author (the person who composes the form
and content of the speaker's utterances) and, finally, the animator, "the talking machine, the body
engaged in acoustic activity" (Goffman 1981: 144). The notion of the animator and its attendant
action, "animation," implies that the participant producing oral speech may not always mean
what they say, challenging the relationship between speech and intentionality that many
anthropologists have found to be a key feature of Christian, Euro-American models of language
and mind (Throop and Murphy 2002; Robbins 2004; Desjarlais and Throop 2011; Duranti 2014)
and which I argue are central to psychiatric interactions in the U.S. That Goffman refers to the
animator as a talking machine points to the potential utility for "animation" to illuminate
situations, like my ethnography, in which non-human machines mediate, modulate, and intercede
upon human speech.
Moreover, animation challenges the dramaturgical, performance model of interaction, in
which interactional actors "play" social "roles" that do not necessarily align with their true
selves, which remain intact and can always be returned to. "Performance" maintains the
connection between language and authenticity, insinuating that the model of the self that is
conventional to U.S. psychiatry is a feature of all interactions, and all interactional partners.
Ethnographically exploring the VHI and its distributed assemblage of animators and actors
illustrates how authentic, intimate encounters are made, and how the felicity of these encounters
(along with the psychological health of the speaker) is maintained. The concept of animation also
helps me to flesh out the distinction between what I call "linguistic labor" and "emotional labor"
(Russell Hochschild 2012). Russell Hochschild developed this term to refer to professions that
involve displays of positive affect, like wait staff and flight attendants, even if these emotions are
inconsistent with what the person is actually feeling. The "labor" of emotional labor comes from
the misalignment between how the person feels, and what they show to be their feelings; this
reiterates the front-stage, back-stage dramaturgical setup of performance theory, in which there is
a true self to be found all along. Linguistic labor decenters the role of emotions and focuses
instead on communicative strategies which craft the impression that speech in an interaction is
being taken up in one way or another-for example, that listening to a person's story elicits
sympathy from the listener and affective investment in the speaker.
Additionally, animation confronts us with questions of resemblance and similitude. As
Silvio (2010) writes, animation encompasses "a range of technologies and skills that are used to
create the 'illusion of life' in the guise of puppets, dolls, and masks" (426). The life-likeness,
liveliness, and like-ness of Abby-the illusion that the interface is alive and the extent to which
interacting with the system resembles a socially legible form of interaction-is distributed
across multiple people, and depends on the concealed labor of Nava and Taylor (Suchman and
Stacey 2012). Keeping this in mind, I parse apart what it means to make Abby a "virtual human"
by leaning into virtuality's connotations of almost but not quite-of seeming like the real thing in
an incomplete way. Specifically, I focus on the psychology team's arguments about how they
used Abby to manage the flow of research subjects' speech and to shape subjects' impressions of
how their speech was being listened to. Engineers relied on research subjects' highly emotional
speech in order to build their software-their speech was the team's data, and therefore
foundational to the connection-making work that the software is supposed to do. The engineers
thus recognized how important it was for the psychology team to develop a user interface that
could elicit this data in a standardized way, and in a format that would be socially legible to
research subjects. As researchers put it, they needed the interaction to feel familiar to research
subjects: to feel like a communicative interaction with an interlocutor with whom they wanted to
share their answers. There is a politics to this "familiarity," this interactional likeness.
Researchers' ideas about what might make Abby familiar to research subjects articulate broader
expectations regarding what kind of human listens to you in the thoughtful, empathic way that
Abby is supposed to imitate.
They also articulate the value of this kind of human. Some researchers like Klaus have
specific research interests invested in the VHI, but the broader, shared goal of the team is to
market the VHI as a public health tool: a technology for streamlining psychiatric assessment.
While diagnosis is the medico-legal designation of an illness, psychiatric assessment is a more
informal triage process. Assessment involves sorting potential patients into categories: people
who might be showing signs of psychic distress and are therefore in need of medical diagnosis
(which would grant them access to insurance-covered treatment) and people who are not sick.
My informants argued that a tool like the VHI, which they were developing to do this sorting
work on behalf of humans, would save money and time, and would save people from burning out in
emotionally laborious jobs. Anthropological and STS scholarship on automation has also pointed
out that in order to automate human labor, that labor has to first be conceptualized as mechanical
and unskilled (Irani 2015; Hicks 2017; Eubanks 2018; Taylor 2018; Ticona and Mateescu 2018).
Therefore, in order to understand what it means to automate psychiatric assessment, we have to
understand how the organizational hierarchy within the teams conceptualizes the skills and the
work associated with psychiatric assessment, especially psychiatric listening, as mechanical
labor, and how this hierarchy replicates hierarchies of clinical labor within mental health care in
the U.S. To ask who Abby is supposed to resemble, which kind of professional she is supposed
to listen like, is to dig deeper into the political economic implications of the VHI and hierarchies
of value within U.S. mental health care, illustrating how "claims about automation are frequently
also claims about kinds of people" (Irani 2018).
WHAT DO I KNOW? WHO SHOULD KNOW? HAVE I TOLD THEM?
How did the Institute-and the VHI-come to be? In this section, I summarize the history of the
Institute and of the VHI project, followed by three examples (the fishbowl, the stolen sign, a lie
by omission) that illustrate how an ethos of paranoia, opacity, and illusion-which is reflected
in and refracted through the design and logic of the VHI itself-is made concrete in the
Institute's material, informational, and social infrastructures. The system was not as smooth and
seamless as it appeared in videos I watched with Klaus, or in any of the promotional videos
available on the Institute's YouTube page. Gaps in the narrative about how it worked were
gradually revealed and then slowly filled in, not just by what people told me in interviews
or in private conversations, but by what they failed to tell me: by a collection of secrets, silences,
and lapses in information.
The Institute was developed in the mid-1990s through a partnership with WCU and
several military and defense organizations. These organizations were looking to draw upon
advances in special effects and computer graphics coming out of the film industry and the
cutting-edge computer science research of nearby Silicon Valley to create training, simulation,
and medical interventions for civilian and military populations. For instance, the Institute
specializes in building immersive, augmented and virtual reality environments used for both
exposure therapy for veterans with PTSD and also for resiliency training for soon-to-be-deployed
soldiers. While technically a satellite campus of WCU, the Institute bureaucratically exists
beyond the university. It is not totally beholden to the university's governance and regulations,
and human subjects research conducted through the Institute must pass through the WCU
Institutional Review Board (IRB) and various military IRBs.
Within the Institute, researchers, post-docs, and interns are partitioned into different labs.
By the time I had arrived there was a sense of competition among the labs-they all had to
vie, separately, for funding. It had not always been that way. Not unlike the building of the Mars
rover project that Vertesi (2012) describes, the VHI was a kind of totem, a point of convergence,
and labs across the Institute gathered around the common goal of developing, designing,
building, and testing it. The VHI team in its entirety included Klaus and his Multi-Modal
Analysis lab (consisting of researchers trained in engineering and computer science) and the Virtual
Human lab (generally for researchers trained in social and organizational psychology and
interested in human-computer interactions), fronted by co-PIs Allan and Valerie, both of whom
were trained in psychology. The Art Department provided additional support, responsible for
Abby's physical appearance, along with the Natural Language Processing lab, which was
responsible for VirtuSense's speech recognition properties, and the Special Effects lab, which
was responsible for developing the VHI's capabilities for tracking gesture, posture, and head
position. There were also a series of employees, interns and WCU undergraduates like Nava and
Taylor, whose paid and unpaid labor-things like voice acting for the virtual human, piloting the
intervention, taking part in face-to-face interviews with research subjects, recruiting research
subjects, acting in promotional videos-helped to make the whole project possible.
Out of all the labs collaborating to build the VHI, the NLP lab was the most bitter about the project. Because the
VHI is not meant to be able to parse and analyze semantic content of speech, the system's NLP capabilities are by
design quite poor and not very advanced.
At the time of my arrival, the VHI was more or less shelved and in a state of disarray.
There was scant opportunity for cross-lab collaboration to be found. Everyone's objectives no
longer aligned. Any issues that the VHI had were left unaddressed until such a time as someone
would be able to secure funding to continue working on the project. This had taken me by
surprise at first. Based on promotional materials that the Institute was producing and recent
interviews about the VHI with the press that I had followed closely, I was under the impression
that the Institute was still actively using the VHI in research studies. Moreover, none of the
people with whom I had spoken in the process of gaining access and preparing for my fieldwork
had mentioned that the VHI was no longer in use. This was the first taste I got of the
unpredictable paths and channels through which information about the VHI circulated, and the
degree to which demos and other public-facing materials rhetorically reinforce one interpretation
of the VHI's functionality and efficacy while redirecting attention away from others.
The industrial park in which the Institute's building now sits contains manicured gardens
with cacti, succulents, bright and flamboyant wildflowers, glittering, artificial streams, two
miniature soccer fields, and a hatch shell for outdoor summertime concerts. The Institute moved
to its current location, away from its former, ocean-side and much smaller and humbler environs,
around the early 2000s, when the area was just beginning to be developed. The Institute's move
also coincided with one of its primary federal defense funders setting up its West Coast
headquarters in a building connected to the Institute's main building by a causeway that offered
an outdoor seating area with benches, tables, and potted plants. By the time data collection for
the VHI project ended in 2015, there was a well-established material and metaphorical pathway
between the Institute and the security and defense sector. Many researchers, especially post-docs,
left the Institute for jobs at the military unit (like Jackie, whose cube I had taken over) and would
still join their old co-workers and cube mates for lunch on the causeway. As this connection
between the military and the Institute concretized in the wake of the Institute's move, the
atmosphere and demographics of the Institute shifted. Without much explanation, several
employees lost their jobs, especially many of the women in leadership positions. In the shadow
of these unexpected firings, the place became less open, and more charged with paranoia. Some
researchers murmured to me, at the tail end of happy hours outside of the Institute's chilly
interiors, that their formerly progressive-feeling workplace had become an old boys' club.
When they felt comfortable talking about it with me, researchers expressed ambivalence
and frank cynicism about the Institute's close ties to the military and its changing atmosphere. At
least two researchers, who were non-U.S. citizens and were disturbed by the casual and
historically deep connections between technologists, computing, and the military in the U.S.,
pointed out that the American flags flanking the Institute's entrance only showed up once the
military moved in next door. Over lunch one day at a taco stand near the WCU campus, I timidly
explained to Hillary (a research affiliate working under Allan and Valerie) and Zach (a PhD
student supervised by Allan and specializing in robotics) the phrase that other scholars had used
to describe close ties between technologists and the defense sector: "the military-industrial-entertainment
complex." They both shook their heads bashfully and Hillary said, with a chuckle,
"yep, sounds about right."
4 Julian Bleecker's (2004) term, "the military industrial light and magic complex," may have more accurately
captured the Institute's particular blend of Hollywood swagger and special effects technology with illusion,
spectacle, and military money.
I was initially unsettled by how physically present the military was at the Institute. It was
not unusual to encounter uniformed army or navy personnel lounging and joking at one of the
restaurant-style booths in the kitchen. One morning, I found myself making small talk by the
coffee machine with a woman who designed "intuitive to use" weapons. Her products were
much easier to operate than the coffee machine, she griped and boasted. I had also
mistakenly assumed that the Institute's military ties meant that funding was plentiful and stable. In
fact, the Institute was a precarious place to work. Researchers and employees, even the security
guards and people working in H.R., frequently came and went as they graduated or as they
sought employment elsewhere. Because of the Institute's unusual relationship to WCU,
academics employed as head researchers or PIs could not seek out tenure either at the Institute or
at WCU, although they often taught classes and advised students. They had to continually apply
for external grants to fund their research and the non-graduate student researchers who
worked for them.45
Once a year, PIs had to hold days-long marathon meetings with military officials, arguing
in favor of sustained funding for their job, their research, and their lab. These meetings would
always take place in a large conference room, nicknamed the "fishbowl," on the second floor
located in front of the elevators. The wall of the room that faced the elevators was entirely made
of glass and completely transparent, offering a clear view into whatever was going on inside the
room. However, in the event of a particularly important meeting, the glass wall was covered by a
veil of running water that could be turned on or off, making it difficult to distinguish who was in
the room, revealing only the blurry outline of their figures. At first, I found the gentle sound of
the water pleasant and calming, but I realized that it also had the effect of making it impossible
not only to see but also to hear whatever was going on in the room. I came to take the running
water as a sign that the meeting in the fishbowl might have serious consequences for my
informants and the sanctity of their research projects and their jobs. And I came to see the
fishbowl as a metaphor for the Institute, and the play and performance of secrecy and
transparency that characterized it. In the fishbowl, serious and consequential matters were
discussed "out in the open," technically public and available to all and yet concealed, the facts of
the matter distorted and obscured. Things were not as they seemed, and the distinction between
what was private and taboo and what was already known and common coin to all was unclear.
[Figure: the sign, reading "WHAT DO I KNOW? WHO SHOULD KNOW? HAVE I TOLD THEM?"]
Another illustration of the Institute's ethos of and predilection for concealment, and of
the shaky distinction between secret and matter of fact, is the case of the stolen sign. The sign
had hung from the ceiling, in between the second-floor elevators and the fishbowl. In red, bold
block script, it asked, WHAT DO I KNOW? WHO SHOULD KNOW? HAVE I TOLD THEM?
Next to this message were three circles meant to represent three people, all of whom were
connected by three arrows, forming an interlocking loop. Hillary and I both interpreted the sign
to be encouraging a citizen watch campaign along the lines of "if you see something, say
something," suggesting that researchers help keep each other in check and be on the lookout for
suspicious behavior, seeming to imply that the Institute had security issues.
I did not notice that the sign had gone missing until I received an Institute-wide email
about it, sent by a military liaison who rarely ventured below his office on the fourth floor. In the
email, the liaison claimed responsibility for making and hanging up the sign. He had used similar
signs previously, in a variety of contexts in groups and organizations of varying sizes (from 30 to
400 to 42,000 people) to help keep the flow of information running smoothly. Communication
and openness, he reminded everyone, were key to the "organizational health" and mission of the
Institute. The signs were meant to help everyone recall that they might know something that
could have benefited others, and to circulate that information as liberally as possible. The sign
was not meant to belittle the expertise of researchers, but rather to prevent bottlenecking or the
siloing of information, which everyone recognized to be endemic to the Institute. For example,
many of the people who gained expertise in operating the VHI for public demonstration purposes
no longer worked at the Institute or else claimed they had forgotten everything they had learned.
When Hillary was called upon to demo the VHI at a public WCU symposium, she eventually
turned to me for help. Because I had spent so many hours, so many days, trying to piece together
an institutional history of the project through interviews, toiling through the Institute's online
archives, conference proceedings, research papers, and press releases, I had a clearer sense than
anyone else there of the VHI's backstory.
So the sign was not a citizen policing campaign, but a proactive, bureaucratic, and
cybernetic call to disciplined information sharing for the benefit of the Institute as a whole. The
week following the sign's reported absence, Klaus, Hillary, a project manager (PM) for another
psychology PI, and I took ourselves out to lunch at a restaurant located a 10-minute drive from
the Institute. Halfway through our meal, the PM revealed that he knew who stole the sign. Klaus
and Hillary, delighted, begged him to confess, but the PM would not yield to their pleas. He was
not going to reveal the thief's identity to anyone, he told us. He would simply ensure the sign's
silent return and would do his best to ensure the thief did not lose face over such a trivial prank.
This led to a rowdy joke from Klaus, directed at the PM: "WHAT DOES HE KNOW? He knows
who stole the sign. WHO SHOULD KNOW? The guy who made the sign. HAS HE TOLD
THEM? Fuck no!"
The stealing of the sign, the protection of the thief's identity, and Klaus's joke are a
means of protest and resistance against the heightened feeling of suspicion and an asymmetrical
transparency at the Institute. Why should researchers be transparent with each other, sharing
what they know with someone who might not know it, when they were all competing with each
other for funding, and when the logic driving decisions about their jobs-like the prompt and
mysterious firing of many employees-would never be as transparent? This will-to-not-know
and refusal to keep information circulating was a form of self-preservation, a way to protect
one's self and one's peers from reprisal by the men running the Institute and the control they had
over the allocation of resources. Researchers at the Institute practiced self-preservation through
small acts of refusing to let information be known to as many people as possible or simply
thumbing their noses at this sentiment. This practice did not just concern office gossip, like who
had stolen the sign. It also involved refusing to circulate information about the VHI's own
shortcomings, which might have broad implications for not only individual researchers but for
the sanctity and reputation of the Institute as a whole.
Because the VHI was so charismatic, it attracted frequent and sustained attention from
the popular press. The VHI was the Institute's prized prototype. Some even called it the mascot
of the entire Institute, but it had significant shortcomings. There was a misalignment between
what researchers hoped it could one day do (operate on its own without an architecture of human
support, accurately provide assessment scores, respond in a socially appropriate way to a
person's tales of distress) and what it could accomplish and execute in its current, shelved form.
For instance, promotional videos of the system and videos recorded for public press coverage
give the impression that the VHI is incredibly socially adept. Interactions captured in these
promotional videos seem smooth-there are no gaps or awkward pauses, Abby nods her head at
all of the right places, and Abby apologizes if the subject's speech overlaps with Abby's. But
only after having watched many videos of research subjects' interactions with the VHI and
parsing through the archive to find earlier versions of Abby's script did I realize that all of the
publicly available versions of the video feature older versions of the system: the version of the
system in which Taylor and Nava puppeteered Abby's bodily movements and the timing of the
questions. This stage of the system's development is referred to as the Wizard of Oz-the WoZ
or wizarding stage-standard terminology for human-computer interaction (HCI) studies.46
Researchers use the WoZ phase to figure out which components of the system work best.
Interactional data gathered from the WoZ stage, like the optimum time Abby should pause
before answering a question, get built into the final, automated version of the system. But the
automated version of the VHI was far less smooth than the WoZ system. Klaus, Taylor, Nava,
and others who were involved in moving the VHI from the WoZ phase to the automated phase
conceded that talking with the automated system was awkward: there would be too-long pauses
between the research subjects' speech and Abby's questions, for example.
46 This term is a reference to the wizard of L. Frank Baum's The Wonderful Wizard of Oz. Characters spend the book
on a quest to find the wizard, promised to be capable of magically solving their woes. When they finally arrive, they
realize that the wizard possesses no magic; the form they encounter as "the wizard" is actually an automaton,
controlled by a regular, run-of-the-mill human man, who conceals himself behind a curtain. Likewise, in the WoZ
experimental paradigm, the research participant should experience the technology being tested as totally
autonomous. Meanwhile, researchers keep the humans who animate the supposedly autonomous technology's
interactions hidden from the user.
Discussing the VHI with outsiders-including non-Institute affiliates, like visiting
anthropologists-required not only choosing one's words carefully, but also choosing when to
let misinterpretations or misunderstandings go uncorrected. I experienced this firsthand when I
found myself on the receiving end not of a lie per se, but of a failure to disclose. The admission
of an omission was let loose during an interview with Nava, one of the two young women who
controlled the VHI and helped to develop its interactional infrastructure. As I will explore in
more detail, Nava was the voice actor for the virtual human; Abby speaks in her voice. She
conducted face-to-face interviews with research subjects, interviewed research subjects using the
virtual human in the WoZ, and then monitored subjects' interactions with the VHI in the
automated phase. She had been an undergraduate at the time and worked on the project until data
collection ended in 2015, subsequently leaving the Institute to pursue a doctorate in neuroscience
elsewhere in the state. I had been asking her to speak in more detail about her involvement in
facilitating face-to-face interviews and then tagging that data for analysis, when she casually let
slip her understanding that VirtuSense cannot map sentiment to paralinguistic, non-verbal
features of speech in real-time, as Abby's conversational partner is speaking:
"it's [VirtuSense] not saying like this line was delivered in this tone, it can't map to that
degree...all they're [the researchers] doing is looking more on a global scale overall, this
is the sort of like inflection that the user was displaying [...] the truth is it's not it's not
looking at the tone of that specific statement, there's no tool I know that can do that."
In other words, the emotional, psychic tenor of a speaker's voice is calculated at the end of a
subject's conversation with Abby, once the interaction has come to a close. This analysis does
not unfurl as the conversation progresses, which was the impression that I had had. I had clearly
expressed this assumption to others at the Institute, including in private, one-on-one interviews.
By the time I interviewed Nava, I had been at the Institute for roughly two and a half months and
was about half-way through conducting interviews with researchers pulled from the list Klaus
had made me. But none of them had clarified or pointed out my misunderstanding. What's more,
this information contradicted what Klaus and Valerie had told me about Abby's responsive
capacities. They had conveyed to me that Abby is able to respond with socially appropriate body
language because, supposedly, VirtuSense tracks the emotional tenor of conversation in real
time. Confused, taken aback and not sure who to believe, I told Nava that I had not realized this
to be the case, expressing shock that no one had ever explained it to me that way. Nava was not
shocked at all to hear me say this, though. On the contrary, and speaking as a veteran of the
Institute, she told me this kind of thing was par for the course. "To be honest," she began,
I feel like they leave a lot of things open ended because they want you to interpret it in a
beneficial light which is what will happen, you're gonna interpret it the way that you
want to and they're not gonna correct you [...] when you ask the right questions it'll
come out for sure...or even just looking at the dials very carefully.
Nava's comments ring true about the system itself: Abby, the system's interface and her listening
"body language," co-fabricate a not entirely honest depiction of thoughtful listening. Abby's
designers cash in on the familiarity of her attentive body language in order to encourage a
specific (mis)interpretation of how the system is listening to ensure that research subjects speak
(and emote) as much as possible so that VirtuSense gets the data necessary for its analysis. But I
also like to think of Nava's comments, her insistence that researchers purposefully leave things
open ended and prone to whatever suggestive (mis)interpretation I might take up, alongside
Klaus's insistence that I get to know the VHI data because it would help me know what kinds of
questions to ask. Perhaps this was Klaus's subtle way of suggesting-without directly saying
so-that I look closely at the dials, as Nava put it: that I "learn how to ask" (Briggs 1984), learn
to take what was given to me and examine it critically, pursuing the gaps in people's
explanations because the truth would not be articulated in a straightforward way.
Faced with inquisitive outsiders like myself, researchers at the Institute created their own
fishbowl, playing with the transparency and opacity of facts and fiction, through strategies of
deferral, misdirection, and concealment. The polyvocality of letting things go unsaid-of
refusing to reject an outsider's misinterpretations and instead allowing them to flourish-is
refracted through the VHI and its virtual human interface. Just as the team relied on outsiders to
fill in the blanks on their own as to what the VHI was doing-how and what it was listening
for-as I will discuss, the design of the interface offers up a space of projection and fantasy, an
openness as to what Abby's animated bodily movements mean, and as to who Abby is supposed to
be. And, just as the VHI calls into question what it means to listen in psychiatric contexts,
conducting fieldwork in a space like WCU's Research Institute challenged my own assumptions
about what it means to listen in ethnographic contexts. People did not always say what they
meant, and not necessarily due to a desire to conceal information from me or to keep something
secret. Before every interview I conducted, after presenting researchers with my consent form,
they would ask me, "am I allowed to sign this?" And in the interview itself, likewise, they would
ask, "am I allowed to say this?" These were rhetorical questions, of course, since as an outsider I
had no sense of the limits of what they could or could not reveal. Nevertheless, the questions
were telling: it was clear that researchers themselves did not fully understand, or trust, the limits
of what could and could not be known. Fieldwork in such spaces requires a pursuit of thin
listening, following Jackson's (2013) and Benjamin's (2019) conceptualization of "thin
description" as an antidote to Geertzian thick description. Like thin description, thin listening is
a method of humility, a method of attending to surfaces "such as screens and skin," key features
of interfaces like Abby (Benjamin 2019: 45). Thin listening implies that there is no absolute
knowledge to be acquired, no god's ear trick in which all ethnographic data will be revealed
evenly and completely, not only for epistemological reasons, but out of respect for other people's
(like research subjects') boundaries.
GHOST STORIES
In this section, I follow up on my initial encounter with the VHI in Klaus's office, focusing this
time on Klaus's team and the component of the technology they built: VirtuSense, the system's
software. I compare the psychology team's visions for the VHI's application, and the way in
which they envision users interacting with the user interface, with the ways in which the
engineering team interacted with the research subjects' data. I describe the different ways in
which Klaus's students confronted the research subjects' speech, as opposed to Nava and Taylor.
While the two women encountered and interacted with research subjects face-to-face, or through
the interface of the virtual human, the engineering team members approached the research
subjects' interviews with the VHI as data-auditory and visual data-that can be reduced to its
formal qualities. Thus, I explore the motivation behind Klaus's reduced listening-a mode of
attending to the acoustic components of speech alone. The approach that the people on Klaus's
team take toward the data is not a cold, detached mode of listening that denies the individual
personhood of the research subject, but rather a professional mode of interpretation, which we
might call, following Charles Goodwin (1994) and Thomas Rice (2010), "professional listening."
Unlike the psychology team at the Institute, as the PI directing the engineering team,
Klaus is not interested in the nuances of how or why people come to trust Abby. As far as he is
concerned, Abby (the interface) is useful because it ensures standardization, since Abby asks the
same questions in the same way regardless of any external factors. He is much more concerned
with VirtuSense, the system's software he helped to build, and the pursuit of what he calls the
"vocal thoughtmarkers" of psychological distress: signs that suggest the presence of either
depression or post-traumatic stress disorder (PTSD) or (in his work outside of the VHI project)
signs that suggest that the speaker might commit suicide. Klaus used the term "thoughtmarker"
to put into words what his research centers on, and to describe how it aligns with but also departs
from Ted and Sushant's research at ECU. Klaus uses "thoughtmarker" rather than "biomarker"
because he does not ask after or look into human biology, although other researchers do use this
term to describe markers of neurocognitive processes (Just et al. 2014; Rea 2014). His goal is to
use artificial intelligence techniques of pattern recognition to identify connections between
human behavior (with an emphasis on acoustic features of spoken utterances) and standard
diagnostic criteria, more or less black-boxing the brain. Unlike Sushant's team, Klaus takes an
almost behaviorist approach to studying thoughtmarkers. He is concerned with automating the
connection between inputs (the psychopathological processes of mental illness) and outputs (the
sounds of speech) but not necessarily with understanding the nature or the causal mechanics of that
connection.
4 Not only do Klaus and Ted know each other, but Klaus organized a special session at a major, international
speech signal processing conference on vocal biomarkers of neuropsychiatric disorder and invited Ted (plus a PI
from my third fieldsite) to speak. Ralph ended up presenting the group's research at the session; Ted sat in the
audience with me, in the row in front of me.
In public presentations, promotional videos, interviews with the press, and when guiding
visitors through tours of the Institute, researchers on the psychology team took pains to declare
that Abby was not a therapist, that encounters with her produced an assessment rather than
diagnosis, and that Abby was absolutely incapable of conducting psychotherapy. Allan and Valerie,
the Virtual Human lab's co-PIs, and researchers throughout the Institute were keen to emphasize
that the VHI is an assessment tool. The purpose of the technology is not to make a diagnosis. Nor
did Institute researchers wish to build a tool that could stand alone and in place of professional
clinical judgment. Instead, and following the dictates of their military funder, they wanted to
build an assistive tool that could help a mental health practitioner make a diagnosis, providing
them with additional insight alongside whatever expertise they brought to the clinical encounter,
helping them to determine the extent to which the patient was in need of care. Sometimes, Allan
and Valerie would introduce Abby as a triage technology, painting a scenario in which Abby was
the first "person" a potential patient would interact with, a gatekeeper determining whether the
patient would see a human professional (if they were in dire need of care) or be sent home.
At the same time, there was less of a clear and straightforward story of who Abby was supposed
to be, and how she came to look and move, and "listen" the way she does, a mystery I discuss in
more detail elsewhere in the chapter. Researchers gave even less of a straightforward story about
the research subjects-who they were, and the kinds of things they spoke about with Abby. This
was partly because very few researchers, I found, actually interacted directly with either research
subjects or with their data (the recorded interviews).
It was roughly three weeks after my meeting with Klaus, at his birthday party, that I
finally had the chance to speak in depth with two of his students-Edward, an undergraduate,
and Alok, an advanced PhD student-who were some of the few engineering students who had
worked on building VirtuSense and were still at the Institute. Their cubicles were positioned far
from mine and I rarely saw them milling around the Institute's lounge or kitchen. My only
extended interactions with them were during the weekly, thirty-minute check-in meetings that
Klaus held for all of his students in Grace Hopper. Almost all of the meeting time was spent
discussing the finer points of the machine learning side projects in which Klaus's students were
involved, and I would occupy myself by taking detailed, verbatim notes (as if that would make
the material less opaque) and nodding along (like Abby-as if I followed), laughing when
everyone else did though the jokes made no sense to me. Apparently, I was not alone in my
confusion. One of my cubemates-a French post-doc working under Klaus-once ushered me
into his cube and asked me in hushed tones how I, a non-initiate of machine learning, managed
to follow the conversations in the check-in meetings. He wanted some tips, since he confessed
that even he struggled to understand what was being discussed.
Klaus's birthday party-he was approaching his mid-30s-was held at a hip local pizza
place with 1970s-style wood paneled interiors and a DJ spinning vinyl records of pre-disco funk.
It was a relaxed and friendly affair. Researchers who were normally brusque and distant were
bubbly and talkative, sharing pitchers of beer, laughing loudly and playing pool. Everyone in
attendance gathered around a single table to present Klaus with a cake and sing him happy
birthday. He threw back his head in his typical, uproarious laughter when he saw that someone
had scrawled HAPPY BIRTHDAY DR. MULTI-MODAL in loopy cursive on the cake's
surface. I took advantage of the frenzied moment to pull up a chair between Edward and Alok,
who were sharing a can of Pepsi (Edward, a junior WCU engineering student, was not yet 21).
With the raw and intense VHI video I had reviewed with Klaus still on my mind, I was
hoping that the two students would speak candidly with me about their experience working with
the data. After all, Klaus had marked their names down on the list of people with whom I should
speak, saying that Edward in particular was responsible for pre-processing the audio and video
data. I had no real understanding of what this work entailed, but I assumed that if either of the
students had processed the videos, then they must have, at some point, listened to them,
especially since they were working for Klaus, who was so concerned with the relationship
between acoustic features of speech and mental states.
My curiosity fired Edward up right away, and he responded viscerally to my question
about his experience working with the data. "Oh, what a nightmare, we could tell you ghost
stories about the VHI data it was so scary," he warned with exaggerated seriousness, his eyes
widening from behind his frameless glasses. "If you want to hear about the data it'll be a ghost
story! We have to sit around in a circle on the floor and turn off all the lights and I'll put a
flashlight under my chin like oooOOOo and we'll make like, s'mores." Alok, and other students
surrounding us who had started to eavesdrop, snickered. Once again not in on the joke, I asked
him to tell me more-why was the data scary? Playing the mature older brother, Alok interceded
before Edward could say more: "He means it was a nightmare because the dataset was such a
mess and required a lot of pre-processing." The French post-doc, who also had been listening in,
stepped in to further disambiguate: "Edward's still young so he's never worked with a real
dataset before," he said, more of an aside to me than a comment to the group,
he's used to working with data for class that's been all cleaned up for you already. The
VHI dataset is ok, messy but ok, more like a normal data set. It's just normal. There are
issues with the audio not aligning with the video, or sometimes there's no audio, or
sometimes you can't see Abby or the audio is not very clear and there's lots of noise, so
you can't make assumptions about the state the data is in before you begin extracting
features from it.
The dataset was "normal" because it contained irregularities, the outcome of unanticipated
malfunctions, things like a research subject putting the headset on backward and covering up the
microphone with the hood of their sweatshirt. Examples of "messy" data also include data with
poor audio quality, or instances when the video and audio were out of synch. Poorly synched
audio and video made it difficult to cut up portions of the video into analyzable chunks (in a
single chunk, the audio and video might not correspond). This was particularly troublesome due
to the goal of analysis: correlate the visual (facial expressions, gestures) with the auditory
(acoustic features of speech). So before Edward could work through the VHI data, it needed to
be "cleaned up." He had to first determine which data was usable, although in his naivete and
zealousness he had started working on the data before realizing some of the data might not be
usable. Edward affirmed that the biggest lesson he learned from the nightmarish ordeal with the
VHI data is that it is important to review and understand the state of the data before attempting to
work with it. This echoed Klaus's justification for inviting me into his office to get to know the
VHI data. Edward assured me that, in the future, he would follow Klaus's advice and avoid
making assumptions about the state of a dataset by reviewing it thoroughly first. Even more
confused, I decided to ask Edward outright: "when you worked with the data, when you
reviewed it, did you listen to it? Because some of the stuff people say is really, really, messed up,
and I'm wondering what you did about that." The atmosphere shifted, and I was met with blank
stares from Alok, Edward, and the rest.
I was not sure what I had said wrong, or what the blankness of everyone's faces meant. It
was only later, after having talked with Klaus about his students' abrupt and mysterious reaction
to my question, that things made sense: Edward, Alok and the others probably had not watched
(or rather, listened to) enough videos for long enough or enough times to fully grasp or
internalize their contents. Even when Edward had played the video, he did not absorb the audio
or visual content-he was focused on analyzing the videos for characteristics that would get in
the way of later analysis, like a lack of audio-visual alignment.
Klaus conceded that even he has not listened to all of the videos in the dataset (there are
close to 500 videos). Over the past five years or so, he has only reviewed maybe 50 or 60 of
them total. He explained the steps of processing to me, to help me better understand why his
students might not be familiar with the dataset's narrative contents:
[we will] sporadically like listen to a minute or two and then a few [videos] we watch in
their entirety but just to make sure that the virtual human doesn't do weird stuff. What we
often [do]- and like my students rely on-is, we do basic feature extraction methods that
you might call machine listening, where we basically do signal processing and that kind
of extracts features or characteristics of the signal and then we basically analyze with
respect to the statistical validity or the statistical differences and then also, double check
that the measures that we extract are also like helping a classifier to identify differences
in uh people's behaviors and identify if a person is depressed or not.
Their listening was cursory because its purpose was to assess the formal properties of the
videos and thereby separate unusable data from usable data.
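To make the steps in Klaus's description more concrete, the following is a minimal sketch of what "basic feature extraction" followed by a check for "statistical differences" can look like. It is an illustration only, not the team's code or VirtuSense itself: the two features (frame energy and zero-crossing rate), the function names, and the use of Welch's t statistic are all assumptions chosen for simplicity, and the final classifier step that Klaus mentions is left out.

```python
import math
import statistics


def frame_features(samples, frame_len=400):
    """Return (mean RMS energy, mean zero-crossing rate) over fixed-length frames.

    Hypothetical stand-ins for the acoustic features discussed in the text.
    Raises statistics.StatisticsError if the signal is shorter than one frame.
    """
    rms_values, zcr_values = [], []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square amplitude: a rough measure of loudness.
        rms_values.append(math.sqrt(sum(x * x for x in frame) / frame_len))
        # Fraction of adjacent sample pairs where the sign flips.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        zcr_values.append(crossings / (frame_len - 1))
    return statistics.mean(rms_values), statistics.mean(zcr_values)


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(var_a + var_b)


def synthetic_tone(freq_hz, amplitude, n=1600, rate=16000):
    """Generate a sine tone as a placeholder for a recorded speech signal."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


if __name__ == "__main__":
    # Two made-up "groups" of recordings that differ only in loudness.
    quiet = [frame_features(synthetic_tone(200, a))[0] for a in (0.10, 0.12, 0.14)]
    loud = [frame_features(synthetic_tone(200, a))[0] for a in (0.50, 0.55, 0.60)]
    print("Welch t for RMS energy:", welch_t(quiet, loud))
```

Nothing in this sketch attends to what is said in a recording. The computation runs over sample values alone, which is the sense in which such listening separates form from content.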
If Abby's responses were lagging, if there were extended pauses between Abby's
questions and a subject's response, if her audio was out of synch with her body movements, if
the participant was unable to hear Abby's questions, the data could not be used. Klaus and his
students cared first and foremost about the quality of the signal. Their aim was to find statistical
validity. For example, how did qualities of the acoustic signal, the frequency of consonants and
vowel sounds, line up with or deviate from the phonetic norms of Standard American English, or
the acoustic qualities associated with the sound /ba/ versus /pa/? After isolating these qualities,
the final step would be to train a classifier to recognize them in a stream of speech. It was part of
their job, then, to detach form from content. They were unfamiliar with how disturbing the
videos were because focusing on that was beyond the task at hand. As far as they were
concerned, processing the videos meant being able to weed out signal from noise, and this did
not involve internalizing the videos as narrative testimony.
On the one hand, it might be tempting to read this as a sign of negligence, or a willful
following of professional codes of practice and interpretation at the expense of emotional
attachment or investment in the research subject's upsetting, confessional illness narratives-a
case of computational detachment. One thinks, for instance, of the Rodney King trial in the early
1990s. Though white supremacy was the animating logic of the officers' acquittal, as Goodwin
(1994) describes, the legal defense team stripped the video footage of its racist
motivation by slowing down the beating of King, breaking the violence into disconnected,
formal events, reframing the beating as the expert practice of "de-escalation" to make the point
that the officers were simply doing their job (see also Feuerherd [2018]).
On the other hand, it was indeed outside of the task at hand for team members like
Edward to listen to and absorb the videos' contents, not least because, as an up-and-coming engineer,
he lacked the proper training that would prepare him for this difficult work. Health care
practitioners who must listen to and analyze traumatizing, disturbing stories from their patients
as part of their jobs receive extensive training on distancing themselves and attending to the
secondary trauma that patients' stories might ignite. But as I will discuss in Chapter 4, mental
health care professionals-like people conducting psychiatric screening-also listen strategically
and selectively with the intent of filling out a psychiatric inventory, although they may perform
intersubjective sharing as a means of establishing trust (a tactic that Valerie and Allan tried to
build into the VHI's interface). Maybe, for Klaus's team, to absorb the content would amount to
a violation of the subject's privacy. For instance, because I had been added to the VHI study's
IRB protocol, I was technically research personnel, and the subjects had consented to allowing
research personnel to access and analyze their recorded conversations with Abby. Nevertheless,
though I had the subject's consent, watching the videos felt voyeuristic-had the research
subjects really understood that anyone, so long as they had the team's consent and filled out the
proper paperwork, could access their files? Paradoxically, Klaus's and his team members' failure
to absorb the videos-their thin listening, the fact that they forgot what the videos contained
aside from how much pre-processing they required-respected the boundaries of the research
subject's privacy, affirming the gravity and the intimacy of the things the research subjects
shared with Abby.
I'M DOING ALRIGHT...I GUESS
When I had watched the videos in Klaus's office and then later on my own, the progression of
the interaction between research subjects and Abby always impressed me. Almost without fail,
subjects would go from being reserved, stiff and unsure, to relaxed, their responses growing in
detail and length, becoming more reflective, more involved in retelling their own stories. What
were the mechanics of this trick? How does Abby transform from being more or less a
standardized, pen-and-paper psychiatric inventory to a human-like interlocutor with whom
strangers are willing to share their most private stories, like the memories they wish they could
erase from their minds?
In this section, I attempt an answer to this question, focusing this time on the design and
development of Abby, the interface. I parse out what the VHI discloses about culturally specific
conceptualizations of empathy, which I contend are wrapped up in ideologies about the ability of
speech to convey the contents of a speaker's self. Thus, I stitch together the language ideologies
in operation in the VHI system with models of the self as a container for private and otherwise
secret, concealed information that exists in an indivisible, unique "core" at the center of every
person (Rosaldo 1984; Lutz and White 1986; Lutz and Abu-Lughod 1990). In particular, I zoom
in on the gap between the listening that Abby performs with her non-verbal responses to a
speaker's utterances and bodily expressions, and the limited and reductive "machine listening"
that VirtuSense is capable of.
As noted, the VHI is arrested in its development stage. Part of the reason why the system
cannot be deployed in clinical contexts, outside of a controlled research study conducted at the
Institute or off-site under the supervision of Institute researchers, is because of the system's
limitations when it comes to analyzing semantic content. The VHI is designed for the reception
of speech, and so Abby, the user interface, does not say much. Researchers constantly reminded
me and would recite during tours of the Institute held regularly for the general public that Abby
is strictly a "listening agent," emphasizing the system's receptive passivity. VirtuSense has poor
natural language processing abilities. The only verbal responses that the decision tree in Abby's
programming code allows for, aside from the questions in the assessment scales, are open-ended
follow-ups ("can you say more about that?"), all to get the user to speak more and speak
continuously. The system's passivity brings to mind ELIZA, the chat-bot creation of MIT
computer scientist Joseph Weizenbaum designed to answer interlocutors' questions in the form
of a Rogerian psychotherapist employing the "echoing" technique by simply reiterating,
verbatim, the text that the interlocutor had typed in the form of a question. Despite the supposed
passivity of this echoing, as was the case with Abby, users felt great catharsis in chatting with
ELIZA, and described their interactions with the system to be therapeutically efficacious (see
Wilson 2010).
As Valerie explained, Abby's main job is to "evoke emotion" and encourage users to
"open up" so that the user produces as much data as possible for the software to analyze. A
handout given to research subjects scaffolds the interpretation that Abby understands and is
thoughtfully attentive to the narrative content of their inner selves: Abby meets your smile with a
smile of her own because "the software tries to take your feelings into account." Abby provides
scaffolding as well: "I'm here to learn about people," she says, "and I'd love to learn about you."
In other words, Abby performs one mode of listening (she is built to look and sound as if she is
attentively listening to the semantic content of a speaker's verbal utterances), all in order to
enable VirtuSense's mode of listening, which is agnostic to the semantic substance of your
speech, a mode of listening for sound features that Klaus and his students identified to be salient
markers of psychic distress.
I was only able to understand the full extent to which VirtuSense is incapable of
attending to semantic content by witnessing the system from the inside out, in the course of a
failed experiment involving a humanoid robot that I worked on alongside Hillary and Zach. The
experiment had been Valerie's idea-she wanted to make use of a robot that a Japanese
researcher had lent us while he was visiting to present at a public symposium Allan had
organized on human-robot interactions. The experiment seemed doomed from the start. Hillary
and I frequently had to call Klaus down to the WCU campus to help us figure out what was
going wrong with VirtuSense. What's more, Hillary had predicted the experiment's inevitable
failure even before the robot had arrived at the Institute. She knew that VirtuSense was
incompatible with Windows 10, but Allan and his project manager insisted on only securing a
computer for the study that ran on Windows 10. Hillary, Zach, and I nevertheless went through
the motions of putting it together.
According to Valerie, the purpose of the experiment was to determine whether the VHI
interface indeed had an impact on people's willingness to trust and disclose personal information
in the course of the system's assessment interview. She wanted to compare people's interactions
with a virtual character (Abby) to an embodied, real life character (the android), exploring if
interacting face-to-face with an embodied, human-like form as opposed to interacting with a
screen would impact people's feelings of trust and rapport. To achieve this, we would have to
figure out a way to run VirtuSense through the android, so that the audio-visual data captured by
the webcam and microphone could be analyzed.
Both Abby and the android were designed to have similarly gendered bodies. Many
researchers told me that they are both supposed to look like women because the team wanted
subjects to experience the user interface as non-intrusive and non-aggressive. Nevertheless, the
experiment with the android had a significant variable that Hillary and I tried to avoid bringing
up: the android was designed to resemble a Japanese woman, and Abby was not. In an effort to
direct subjects' attention away from this discrepancy, Allan and Valerie had suggested that we
fix up the android to resemble Abby as much as possible. Allan assigned Hillary the task of
procuring an outfit for the robot that would match Abby's. He asked Hillary and me to fix and
re-fix the android's hair to resemble Abby's as well. Shopping for the android and styling its hair
took time. This was yet another instantiation of lower level research personnel conducting
gendered, domestic labor-Hillary and I held an administrative position in relation to both the psychology
and engineering teams. In this instance, the labor took a very blatant form of social reproduction:
our job was to recreate gendered presentations of hair and dress on a piece of machinery, all in
pursuit of making the robot look convincingly like a virtual woman (Abby), whom the Art team
had designed to look convincingly like a human woman. It was our labor that animated the robot's
gender, and reaffirmed Abby's gendering.48
48 For a discussion of the historical resonances of human-like robots in Japan, and the use of robots to reify and
reproduce notions of gender, kinship, and the family, see Robertson (2017).
The subject pool would consist of undergraduates from a WCU Introduction to
Psychology course. Some would be interviewed by the robot, and some would be interviewed by
the virtual human, with the same questions asked every time. Hillary, Zach and I joked that we
knew the outcome of the study before it got started. We didn't need the study, and the thousands
of dollars it took to assemble it, to prove that the android (with its oversized hands and
yellowing, corpse-like skin) was terribly creepy.
Hillary styles the robot's hair for a public showcase of the study, while other researchers attend to the
computers in the background.
We set up shop in a cloistered set of offices, set along the perimeter of a cavernous, high
ceilinged reading room in WCU's gothic style library. In one office, we propped two camcorders
on tripods, one camera trained on the robot and the other trained on a seat in front of the robot
where research subjects would sit. Zach and Hillary carefully taped a Microsoft Kinect to the
wall over the robot's shoulder. In the other room, we set up two computer monitors to view the
video feeds, and a third for operating VirtuSense, which was housed in a USB thumb drive that
Klaus and Hillary called the 10K dongle (the amount of funding set aside for developing
VirtuSense alone).
The offices were dusty and smelled of mold, and the bundle of wires that connected the
android, the android's computer, its speakers, and the camcorders, all of which ran into the other
smaller office, prevented us from completely closing the door that separated the two. A large
compressor powered the android, and when it was turned on, it kicked up hot dusty air and
produced a sound that made it impossible to think and that the half-closed door only amplified.
We had to shout to be heard over the compressor and were constantly worried it would set
something in the old offices on fire. Two weeks in, people occupying the surrounding offices
were banging on the office door to complain about the compressor's sound on a regular basis.
The experiment start date was deferred nearly a month before the project was abandoned
altogether because we could not get VirtuSense to operate through the android's body, just as
Hillary had predicted. But its failure opened up for me an otherwise unavailable view of
VirtuSense, exposing its inner workings. For instance, when we were trying to determine if the
VHI would recognize a participant's speech, one of us answered the robot as it asked the VHI
assessment interview questions, while the others remained in the smaller office with the
computer, watching the Dialogue Manager, which displayed the words that VirtuSense's natural
language processor picked up. When Hillary responded to the question "What are some things
you really like about living here?", the Dialogue Manager showed us that the NLP interpreted
her response as "an to in the uh or go to" (which is not the answer she had given).
VirtuSense was not only incapable of picking up the semantic content of spoken
utterances-it was also agnostic to even the source of speech, and given the right conditions,
would recognize the speech its own system produced as if it was the speech of a human research
subject. This was not too far from the truth-after all, the system used a human voice (the voice
of Nava). It was designed to treat any form of human speech, regardless of the source, as data.
We discovered this after a series of frustrating and frightening days, during which the robot
would begin repeating the VHI standard assessment questions, pause mid-word, and then
apologize for interrupting ("oh, I'm sorry, please go on") even though Hillary, Zach, and I had
not said anything. It took a frantic phone call to Klaus to figure out the cause of this demonic
display: we had turned the speakers up too high, and VirtuSense was capturing and processing its
own speech as if it was the speech of a human interlocutor. The system was essentially
interrupting itself.
Display of the Dialogue Manager (visualizing the system's NLP) showing the VHI interrupting itself, i.e.,
VirtuSense misrecognizing Abby's speech as the speech of a user.
Witnessing this ghastly malfunction illustrated the breadth of the gap between the mode
of interpreting language that Abby's body performs, and the mode of analyzing language that
VirtuSense is designed to execute. Abby's interactive, mirroring body language-the affirming
nods that guide a speaker on, the probing follow-ups, the smiles that "take your feelings into
account"-altogether performs a mode of interpreting speech that circulates in U.S. mental
health care, linked to what E. Summerson Carr (2010) calls "the ideology of inner reference." In
this ideological framework, speech's primary function is referential, and is directly tied to and
therefore expresses a speaker's authentic and otherwise interior self. In this way, listening to
speech provides a pathway to the intersubjective knowing of another's self and is central to
Euro-American conceptualizations of empathy. The ideology of inner reference entails a
listening ideology and a listening ethics-a way that speech should be interpretively and
sensorially attended to that matches with what speech is doing, and how it works. VirtuSense, by
contrast, effaces the ideology of inner reference, listening not to you but for your speech's
sounds. In turn, Abby's animation reinforces or rather exploits the ideology of inner reference
and its adherents, with an embodied performance of empathic listening, anticipating that the
vulnerable research subjects will recognize and participate in it, encouraging them to produce
"illness narratives" that are meaningful to them but are meaningful to VirtuSense in a radically
different way.
The ideology of inner reference has vaguely psychoanalytic undertones. It implies that
the self is otherwise interior and hidden from the world that the speaker inhabits. This self
also has a depth to it and is cushioned by layers that wrap around and shield its more private,
indivisible core. Valerie and Allan on the psychology team expressly designed Abby according
to a psychological theory that reinforces this model. Psychologists Altman and Taylor (1973)
developed Social Penetration Theory (SPT), otherwise known as the onion theory of interpersonal
communication, in order to describe the formation of intimate relationships. When Hillary and
I were asked to salvage the robot study, Valerie gave us an assignment: produce a series of
staged videos, shot to seem as if the virtual human and the robot were interviewing me. In one video,
my character's responses should be "mundane," and in the other video, they should be
"intimate." We were to make four videos in total: one intimate, one mundane, with the robot
interviewer, and one intimate, one mundane, with the virtual human interviewer. Finally, she asked
us to send the videos out to Amazon's crowdsourcing "microwork" platform, Mechanical Turk
(AMT), for AMT workers to view and rate.49
Hillary and I wrote a first draft of the script modeled after the research subject population
but as we worked, we realized we did not fully understand what Valerie was seeking. We told
the story of a woman (my character) who had grown up in and out of foster care and as a result
did not have many friends as a child. She was now estranged from her family, after a rough patch
of substance abuse in late adolescence. For the "mundane" version of the script, the answers my
character gave were short and terse. For instance, for the question, "how are you doing today?"
the mundane character responded, "I'm doing alright." In the "intimate" version of the script, the
character expanded on her terse responses, revealing more about how the questions made her
feel. For instance, she would respond, "I'm doing alright...I guess," pausing heavily and casting
her eyes down.
Valerie rejected the draft, asking us to "make the intimate one more intimate." Hillary
asked her to be clearer. What, precisely, did she mean by "intimate" and "mundane"? Valerie's
49 Employers (anyone from university researchers, like my informants, to tech start-ups) send out tasks to be
completed through AMT, and AMT workers or "turkers" execute these typically menial tasks (such as clicking through
hundreds or thousands of images and identifying images that contain cheetahs versus domestic cats) for below
minimum wage. Irani (2013) has observed that the AMT platform-the website through which employers request
jobs-keeps turkers hidden from employers and from the people who benefit from their labor, such as internet users
making Google image searches of cheetahs. Through this "redistribution of tedium" (Irani 2013:729) AMT's
infrastructure helps sustain the illusion that the "innovation economy" of Silicon Valley runs on creativity rather than
drudgery.
response was prompt: she included in her email the "intimacy measures" they had used when
designing the questions to be asked during the VHI assessment, and the measures were based on
SPT.
"I'm doing alright...I guess": video still of the ethnographer performing the "intimate" script in a mock-assessment
interview with Android Abby.
According to SPT, the outer layers of the self are superficial and not terribly important or
unique to a person. The closer you get to the core, the more private, unique, and individual the
layers become. In interpersonal relationships, people build intimacy and trust by transmitting the
contents of increasingly deeper layers through speech, getting closer and closer to the self.
Between two humans, each of whom possesses an individual self, this exchange is mutual. But
when it comes to interactions with non-human entities-virtual humans like Abby-the goal is
to encourage disclosure of the contents of these deep layers in the absence of reciprocated
disclosure.
If the goal of interactions with the virtual human is disclosure, then how does the team
engineer the desire to disclose in a situation in which a non-human agent is only "here to learn
about people" and would "love to learn about you" rather than share anything about themselves?
The psychology team pursued rapport through the design of an agent that users would find
"familiar": both in terms of the interaction the agent engages the user in, and in terms of the
agent's embodiment. In the following section, I describe how Institute team members utilize race
and class as flexible resources for achieving rapport.
PAY NO ATTENTION TO THE WOMEN BEHIND THE CURTAIN
Researchers' ideas about what might make Abby familiar to research subjects articulate broader
expectations regarding what kind of human listens to you in the thoughtful, empathic way that
Abby is supposed to imitate. Here, I expand on Carr's argument about the relationship between
the ideology of inner reference and the interpretive practices of U.S. mental health care by
underlining that language ideologies are not only ideological-they are embodied and enfleshed.
They not only structure and are structured through expectations for how speech is listened to.
They are also wrapped up in and reproduce expectations for which kinds of people listen in that
way, especially in terms of gender, race, and class.
When I asked members of the psychology team why they made Abby look like a woman,
everyone agreed: they wanted to ensure subjects felt like a non-aggressive and understanding
agent was listening to them. There was less agreement when it came to Abby's race. Some
insisted that Abby was racioethnically ambiguous by design, because this allows research
subjects to project their own identity onto her and identify with her as a result. They would cite
Abby's voice as evidence for their claim. The psychology team had unanimously selected Nava
to be the voice of Abby. I asked her why they picked her, and she guessed that it was because she
had "no accent." Although Nava is American-born Iranian, she told me, "The team thought I
sounded like I could be from anywhere." Once disarticulated from her body, the team imagined
that Nava's voice could shed its specificity and become a resource for transforming Abby into a
racioethnically blank, projective screen. Note, however, that the "unmarked," accentless, and
disembodied voice is not necessarily a neutral voice-it is the white voice. This aligns with what
Reed and Philips observe in their 2013 article on realism in performance capture technologies:
whiteness tends to operate for the team who developed the Abby interface as "transparent
universality." Together, Abby's body and Nava's voice formed a mirror through which
research subjects could see, hear, and recognize something about their selves. Nava told me that
this seemed to work with research subjects. Oftentimes, during the debriefing period, a number of
subjects of varying races and ethnicities thanked Nava and Taylor for giving them a chance to
talk with a doctor that actually looked like them, meaning, a doctor that shared their racioethnic
identity.
Yet while some researchers argued that Abby could be anyone, others argued that they
designed Abby to have an embodied specificity-to be both racioethnically and professionally
marked. Specifically, they told me that Abby was fashioned after Googled images of "Latina
social worker." If you examine Abby's programming, you find that someone gave her a Latina
surname. The morphing together of these Googled women-the chain of assumptions and
associations, the linkage of skins and screens, race and visuality, that they form-brings to mind
Haraway's analysis of a figure she calls SimEve. SimEve is the name Haraway uses for the
image on the cover of Time magazine's special fall 1993 issue on immigration, which shows the
smiling face of a racioethnically ambiguous woman meant to represent the "New Face of
America," the impacts of multi-racioethnic marriage (259). But Haraway asks, what does this
sterile, computer mediated coupling that produced SimEve dry up and hide away, in terms of
colonial histories of violence and resistance? Likewise, we may ask of Abby, what does her
automated "relating" and resembling cover up? What are the implications of using Googled
images to design Abby after a Latina social worker, especially given that Google has been
shown, through its search protocols, to sediment racist and misogynistic associations (Noble
2018) rather than producing neutral, value-free pairings between words and images?
Nevertheless, the researchers who told me the story of Abby's techno-pastiche origins
wouldn't cite Abby's skin color or the Latinidad inscribed in her metaphorical DNA as evidence
for her being a Latina social worker. Instead, they would cite the interactional framework of the
interview itself, while also referencing the socioeconomic status of research subjects and the
VHI's target population. Being interviewed by Abby was supposed to feel like being interviewed
by a Latina social worker, because the local VA was in a predominantly Latinx neighborhood,
and assessment with a social worker in a public health setting like the VA (rather than diagnosis
with a physician or treatment with a therapist in private practice) was probably the only kind of
mental health care resource that the research subjects had access to. If the gendering of Abby
signals expectations about women as good, passive listeners, then the racing of Abby signals
expectations about what kind of woman is most likely to fill this passive listening role in
administrative mental health contexts, along with expectations about the socioeconomic status of
the people who interface with care workers like Abby.
There is a hierarchy of value that maps on to the distinction between therapist and social
worker and the sociocultural capital that separates the two jobs (time set aside for extensive
schooling, money for frequent and costly licensing and credentialing, etc.). This distinction is
evident in the different degrees of medical judgment that social workers vs. therapists are
licensed to make. The premise of the VHI as a tool and Abby as not a therapist re-inscribes these
hierarchies of clinical labor, which value the work of diagnosis and treatment as "real" medical
practices, while instrumentalizing (and dehumanizing) the work of assessment. Making the
Virtual Human Interviewer familiar and "real enough" means making sure that interactions with
it are not-quite-professional. It means rendering the listeners in the system invisible-rendering
the traces of the Latina social workers, and Nava and her accentless voice, virtually human.
Abby's designers call upon race, gender, and ethnicity as a form of flexible capital
(Nakamura 2014: 933): malleable, pliant, capable of shifting depending on the rhetorical needs.
Through the body and the interactional habitus of Abby, the association between race, gender,
and a form of passive and administrative professional listening is reinforced and reiterated.
Abby's animation depends on the sewing together of these traits, while also naturalizing and
reproducing them as coupled together. The notion that Abby's listening habitus-the system's
active listening body language-can be automated bears further analysis. Abby's racioethnic and
gender presentation, along with the signs that the system's animation expresses to indicate
that the system is attentive to the semantic content and narrative contours of a speaker's answers
to the assessment questions: together these comprise the system's rapport-building capacities.
These material-semiotic flourishes-an understanding head-nod, a familiar-looking, passive,
administrative listener-give the feeling of intimacy, closeness, and proximity, a sense that the
interface is tuned in to the innermost regions of the speaker's self. Together, they give the
impression that the expression of empathy is an automatic, ingrained response-part of the fabric
of being human but also, at the same time, a reflex that can be formally reproduced in a non-
human machine, that requires no expertise, and that is not truly work.
REACHING OUT
Before the untimely beginning-and eventual end-of the android study, I worked alongside
Hillary at a public exhibition on human-robot interactions, helping to showcase the android and
recruit potential research subjects for the study. While the demonstration Klaus put on for me in
his office at the start of my fieldwork was to show me the limits of my own human listening, for
this demonstration, Hillary and I had once again to downplay the role that human mediation
would play in the robot study, and that it had played in the development of the VHI's various
components.
Hillary and I were responsible for running the exhibition and explaining the study to
anyone who passed by our table. There was a rush of people in the exhibition hall and Hillary
and I spent roughly three hours fielding their questions and comments. While other exhibitors sat
behind their table with their prototypes and technologies displayed on the surface of the table,
Hillary and I sat the android down in a single chair, placing it in the position that a human
exhibitor would occupy, while placing the other chair in front of the table, inviting people to sit
in front of the android and gaze upon it. We displayed a promotional video of the virtual human
on a projector screen above our table, so that we could explain the relationship between the
android study and the VHI. People would sit in the chair and wave their hands in front of the
robot's eyes, asking us, "Can she see me? Can she hear me?" Or they would ignore Hillary and me
altogether and respond to the VHI interview questions that played from a speaker sitting on a
windowsill behind the android's shoulder. No, we would explain, the android is not receptive to
audio or visual data at the moment-there was no video camera or microphone set up to capture
this data, and even if there were, we had not yet enabled VirtuSense so the data would not be
processed.
One of the most difficult and frequent comments came from people who found the whole
premise of the VHI alarming. These people accused Hillary and me of trying to build a "robot
therapist," or trying to "replace humans." Hillary dealt with this kind of accusation whenever
giving public demos of the VHI. She would explain that the system was still in development and
not ready for actual clinical use and would emphasize that the point was to conduct assessment.
The VHI couldn't provide therapy, she would say, let alone replace a human therapist, and it
couldn't even make a diagnosis-only a licensed, trained, professional human could diagnose
another human.
Keep in mind, as well, that Abby is not a therapist because some members of the research
team had designed her specifically to look (and interact) like a Latina social worker. There is a
hierarchy of value that maps on to the distinction between therapist and social worker and the
sociocultural capital that separates the two jobs, and while the caring and service professions
within U.S. health care are gendered, they are also, as Evelyn Nakano Glenn (1992) points out,
racially stratified. The premise of the VHI as a tool and Abby as not a therapist re-inscribes these
hierarchies of clinical labor, which value the work of diagnosis and treatment as "real" medical
practices, while instrumentalizing (and dehumanizing) the work of assessment, placing it at the
margins of biomedicine and figuring it as unskillful. It's telling that people's anxieties and
disgust upon witnessing the VHI revolved around the automation of therapy, and that Hillary's
alibi-they were only trying to automate assessment-seemed to put people at ease. Hillary's
alibi resembles a description of the VHI that Taylor once gave me:
When you go to the doctor you're going to see a nurse first, she's going to draw your
blood and get all your baselines. So [the VHI assessment] is your objective measures. It's
getting your tone of voice, your measurements [...] and it's giving a numerical output,
which would then tell a doctor [...] they're showing all different signs...[that] may
indicate that they're [...] maybe showing signs of PTSD or depression, and then...an
actual human can make a diagnosis.
According to Taylor, Abby is like a nurse because the system only provides initial indicators of
how a patient's doing before an "actual human" doctor steps in with a truly medical call. But
Taylor's analogy does not quite fit. It suggests that the nurse's embodied presence, holding the
needle, is unnecessary. To make her comparison work, you have to treat the needle, the nurse,
and the analysis of the blood as one, ignoring all that goes into finding a vein, inserting the
needle, and making sure the patient stays still.
Taylor's comparison is also a humble one, because it downplays the difficulty of her and
Nava's own work. Interacting with Abby was not exactly like getting blood drawn, because
drawing up and listening to the content of research subjects' personal stories is a charged process
that can be traumatic for listeners and re-traumatizing for speakers, especially if the system
malfunctioned, like the time Abby responded "that's great!" after a research subject described
the passing of his wife. Taylor and Nava were actually present in many of the videos I watched
with Klaus, and in promotional videos of the Virtual Human Interviewer system. In fact, in all of
the system's promotional videos, Taylor and Nava were controlling Abby. They did this for the
second, WoZ phase of the technology's development, during which Taylor and Nava
puppeteered Abby's actions from another room, producing the interactional data that was later
coded and used to build the framework for the fully automated version. And in all videos, WoZ
or not, Taylor and Nava had monitored the interactions from another room, through VirtuSense's
cameras and microphones, listening alongside the system in a way that honored the ideology of
inner reference. Neither of them had clinical experience, but it was their job to keep an ear on the
interactions in case a subject disclosed suicidal or homicidal intentions (which VirtuSense's poor
natural language processing could not pick up).
This remote listening took a heavy emotional toll on both young women, precisely
because they had to attend so closely to the words of the research subjects' speech, and because
they could not let the subjects know that they had been listening. Doing so would break the
illusion of Abby's total non-humanness and therefore disrupt the experiment - their labor had to
remain invisible. As Winnie Poster (2019) describes, in the context of the outsourced labor of
call centers, operators interact with customers through increasingly computerized interfaces
meant to hide the location and racioethnic identity of the operator-for instance, through a
variety of pre-recorded audio samples that the operator plays in response to the caller's questions
or qualms. The operators engage in what Poster calls "cyborg identity management": they
perform their humanness to assure the caller that they are talking to a human, that the interaction
is not automated and anonymous but personal. Nava and Taylor perform their own kind of
cyborg identity management-"covering up how much of the technology they are using to
mediate the conversation" (Poster 2019: 259)-but in the opposite direction. The goal is to
conceal the human mediation in the interaction through techniques that downplay their proximity
to the research subject, a difficult feat considering the nature of the conversations between Abby
and the research subjects that Nava and Taylor witnessed. As Taylor described to me in an
interview,
being removed from the situation and listening to someone talk about things that are very
difficult, for them [...] you could feel it in the room or [Nava and I] would out loud we
would sigh or we would say oh my god [...] and the whole point of Abby is to be a
listener, she's not supposed to be a therapist, she's not supposed to be there to give tons
of feedback [...] But, on the flip side of that as a human it is difficult to hear someone
kind of go through something and you almost want to reach out and give someone a
tissue or reach out and hold their hand, but that's not really the intent behind any of that
anyways, so. I had to remind myself of that.
Part of their cyborg identity management required them to distance themselves from the research
subject, to avoid the urge to discuss the conversation they had either witnessed or taken part in
while the research subject spoke with Abby. Even a gesture as simple and as powerful as
handing a subject a tissue would give away the young women's position as interactional
mediator. During the debriefing period, Nava and Taylor had to figure out strategies for, as
Taylor put it, "reaching out" to repair any psychological damages Abby may have done, all
without revealing that they had observed the entire interaction, since this would have broken the
experimental paradigm of Abby's total non-humanness. In debriefing, they would open up a
space for the subjects to share their pain independently, asking them open ended questions about
how the conversation went. On several occasions, they sat and spoke with the subjects for hours
after the interview had come to a close.
Nevertheless, while the two young women conceded that interactions with the VHI could
be painful or even harmful, they also told me positive stories about working with the VHI. They
said that many subjects found the whole interaction therapeutic. That is, even though Abby was
not a therapist and the VHI could not conduct therapy, Nava, Taylor, other researchers, and
subjects themselves reported that there was something beneficial, cathartic perhaps, about "just
being listened to": being listened to by Abby, and then by Nava and Taylor. Moreover, it is not
as if all research subjects acted and interacted genuinely with Abby. At times, they flipped the
script of the conversation through recalcitrance and refusal, not unlike the "non-compliant"
research subjects at ECU who refused to conduct the tasks in the fMRI machine according to
Santiago's directions.
Other research subjects seemed to take pleasure in Abby's artificiality. One man
repeatedly asked her out on a date, not because he thought that Abby was a real woman, but
because he seemed to get a kick out of pretending that she was real-that she could exist off the
screen and enjoy a ride in the man's convertible with him. As Wilf (2019) describes, sometimes,
when interacting with robots and other machinic, human-like forms, humans take pleasure in
their awareness that the machines are not actually humans, that what they are witnessing is a
strategically crafted, mediated performance. Diving into and widening the gap between the
virtual and the actual can be a form of play. In this regard, the interface can offer the momentary
suspension of reality, and respite from a world that has forgotten and even discarded people like
the study's research subjects-disabled, unemployed, homeless, veteran. The interface, the
screen, the animated virtual human, is not only a calibration of habitus and machine, automation
and affect. It is also a powerful, capacious space for fantasy and projection, a realm not only of
illusion and misdirection, but of possibility (Helmreich 1998).
CONCLUSION: VIRTUALITY'S ECHOES
The virtual human's animation and the software (which typically remains concealed from users
and others who interact with the system) would seem to suggest that the VHI conducts two
modes of listening, which map on to the hierarchy of clinical judgment that separates diagnosis
and treatment from screening. The VirtuSense mode of listening to the sounds of speech beyond
semantic meaning coincides with diagnosis, a technical skill, a form of medical judgment that
requires more training and credentialing to be able to conduct. The virtual human mode of
listening-listening to and silently, non-verbally responding to the narrative contours and
affective texture of speech-coincides with assessment, an automatic, reflexive practice (or so
the story goes) that requires less training to be able to conduct. Nevertheless, both kinds of
listening are necessary to the whole enterprise, the whole process of what the VHI is supposed to
be doing: connecting speech sounds with interior states. VirtuSense may be able, in theory, to pin
down sounds that circulate beyond the reach of the human sensorium. But in order to have any
material for analysis, the system requires data. Abby, the interface, can also, in theory, do
something that humans cannot: tirelessly listen to stories that are tangled together with emotion,
without requiring any breaks or time to recover. But in practice, the line between the technical
and mechanical, or treatment and assessment, is blurred.
The presence of Nava and Taylor-hidden in the room and embedded in the body and
voice of Abby-also complicates this neat divide. In troubling the binaries, they also show us
something about psychiatric screening: it is both humanistic and technical, requiring honed skills
as well as the capacity to be emotionally present and compassionate, which is itself a skill.
Reading the cracks and troubles in the system tells us something about the nature of language
and interaction in psychiatric encounters: while the expression of empathy is an important skill,
it is not necessarily an authentic expression, nor is it affectively motivated alone. The
interlocking ideologies of language (the ideology of inner reference) and self (Social Penetration
Theory) form the basis of empathy, intimacy and rapport in interactions with the VHI, and
watching this coordinated activity work (and fail) allows us to get an empirical grip on the
otherwise phenomenological realm of intersubjectivity. Studying the VHI in (inter)action
illuminates the crucial role that linguistic practices play in affective states that otherwise seem
immaterial and ephemeral.
The pressing together of race and gender to form the voice and the body of the virtual
human is supposed to render its animated, passive, affectively invested listening all the more
convincing. Here, I am not trying to say that the VHI's creators and authors, like Klaus and
Valerie, and its various distributed participants, like Taylor and Nava, set out to create a racist
and sexist technology, or are personally responsible for the ways in which their technology
reiterates the bundling together of qualities with types. The figure of the non-human, human-like
machine as a feminized, raced servant that passively supports its users (or else threatens to
overthrow its masters) hails from the much larger, Euro-American legacies of computing and
colonialism (Suchman 2007; Philip, Irani, and Dourish 2012). In Irani's (2015: 733) words,
"hierarchies of value have long overlapped within hierarchies of gender in the historical
imagination" surrounding "artificial life," which is notoriously gendered, with male artificial life
as monstrous (like the Golem and Frankenstein's creation, who often seek to kill their
fathers/creators), while female artificial life forms appear as lovers or mothers (or witches) to the
(usually) men who create them (Helmreich 1998). Think also of Pygmalion's Galatea, inert
matter built in the image of her creator's desires. Think also of Blade Runner's Rachael Rosen, who,
in the film, is an innocent adolescent unaware of her status as an android but, in the original
source text, is a cunning and manipulative seducer, who exploits the unheimlich empathy that
Deckard the bounty hunter extends towards androids in a way that causes him to question his own
fundamental assumptions about the humanness of empathy.
Think also of ELIZA, the chatbot therapist who reproduces a pantomimed version of
Rogerian "echoing," a psychotherapeutic technique Rogers developed. In Rogerian
psychotherapy, the therapist attempts to render their own selves as transparent as possible-they
are to be a mirror reflecting back the client's thoughts and problems in a different context so that
the client can process them. While echoing is a complex technique achieved through verbal
strategies of removing the self when reiterating the client's talk (Carr and Smith 2013), Joseph
Weizenbaum parodied this technique with the ELIZA program, which repeats almost word-for-
word what the interactional partner has typed. Scholars and computer scientists have critiqued
people's enjoyment of the chatbot, citing something pathological about feeling soothed by a non-
human entity inertly performing passive mimesis (Turkle 2007). But these readings do not
account for the complicated nature of repetition and resemblance, including the unstable
correspondence between original and imitation, like between Nava's voice, Abby's code and the
subjects who project their identity on to her, and the Googled Latina social workers. As Inoue
writes, "once the supposedly inert subject of the verbatim copy is recovered, a little universe of
determinate and far from unmotivated subject positions, contextual framings, and mechanical
and technical effects come into view" (2018: 218). Instead, with ELIZA as with Abby, following
Spivak (1993) writing on another kind of Echo,50 I suggest that we "seize on the glimpse of
difference" between the copy and the original, Abby and the various people refracted and
captured through her (Inoue 2018: 218).
Virtuality exists somewhere in this space between the copy and the original. It is not quite
a faithful rendering, a direct miming or a mirror. The "virtual" has connotations of almost, but
not quite, an "as if" that is never final yet never fully independent of the thing it approaches
(Boellstorff 2015). Nava and Taylor listen to the research subject's speech as if they are Abby;
50 Inoue invokes Spivak's (1993) critique of Freud's On Narcissism and the tale of Narcissus and Echo from Ovid's
Metamorphoses to discuss the disjuncture between copy and original as a space for agency and subversion. Spivak
argues that Freud's interpretation "ignores the structure of gender in the relationship between Narcissus and Echo"
while also exploring "her own ethics of speaking for subaltern women" who are "figured as Echo...women who do
not speak and only respond to, and thus repeat forces that structure them" (Inoue 2018: 223). Spivak's intervention
is to explore lapses in the correspondence between original utterance and its repetition, and how these spaces of
difference "afford an intricate ethical position that prevents subaltern agency from being a knowable subjectivity"
(ibid).
they also listen as if they are mental health care workers, despite their lack of training. And Abby
listens as if she is a Latina social worker, who also listens almost as if she is a therapist,
although she is not.
Virtuality has other connotations as well: connotations of virtue, which I have attempted
to encapsulate through the pseudonymous moniker, VirtuSense. Here, it is productive to think
alongside Lisa Nakamura's (2019) invitation to scrutinize the "virtue" of virtual reality (VR)
documentary-style films that are supposed to invoke feelings of empathy for the people and
places the experience allows viewers to feel close to. In her talk, "Virtual Reality and the Feeling
of Virtue: Women of Color Narrators, Enforced Hospitality, and the Leveraging of Empathy,"
Nakamura explores the use of women of color narrators in VR films that promise to put audience
members immediately and directly "in the shoes" of someone in a refugee camp, of a migrant
laborer, of a person walking down the street experiencing racist micro-aggressions, and so on.
VR is meant to capture and mimic the perspective of someone occupying this oppressed,
subaltern subject position, "immersing" the viewer in an otherwise distant experience (with the
assumption that the viewer does not occupy any of the identities depicted in the film as radically
alien, out of reach, and other). The illusion of proximity depends in part upon narrators or
"guides" in the film-primarily women of color-who explain the scenes, provide scaffolding,
and treat the audience member as a friend or confidant.
According to Nakamura, the immersive aspect of the film-the fact that the technology
ports the viewer to an otherly world, "virtually," "as if" they are there-"enables a fantasy of
virtuous empathy" (2019). The feeling of proximity, closeness, the "almost but not quite" effect
of virtuality is a decoy, a stand-in, for structural change or political action. It gives the
impression that empathy, as a feeling, is a proxy for justice, and that racism, sexism, and other
forms of violence, are (like empathy) feelings rather than structures. If an ethnographic study of
the Virtual Human Interviewer has shown us anything, it is that proximity and closeness, like
"immersion," are the outcome of hyper-mediated practices (Helmreich 2007) rather than the
inevitable, automatic pretext of conversations about psychic suffering. Like the guides in these
VR films, Abby the non-intrusive interviewer guides research subjects on a journey inward. Any
closeness or trust that the subjects feel toward Abby may have been engineered, but the closeness
that Taylor and Nava feel toward the subjects was very real.
References
Altman, Irwin and Dalmas Taylor. 1973. Social Penetration: The Development of
Interpersonal Relationships. New York: Holt.
Baum, L. Frank. 1900. The Wonderful Wizard of Oz. Chicago: George M. Hill Company.
Benjamin, Ruha. 2019. Race After Technology: Abolitionist Tools for the New Jim Code.
Cambridge, UK: Polity Press.
Bleecker, Julian. 2004. "The Reality Effect of Technoscience." PhD diss. University of
California Santa Cruz.
Briggs, Charles. 1984. "Learning How to Ask: Native Metacommunicative Competence and the
Incompetence of Fieldworkers." Language in Society 13(1): 1-28.
Carr, E. Summerson. 2011. Scripting Addiction: The Politics of Therapeutic Talk and American
Sobriety. Princeton: Princeton University Press.
Carr, E. Summerson and Yvonne Smith. 2013. "The Poetics of Therapeutic Practice:
Motivational Interviewing and the Powers of Pause." Culture, Medicine and Psychiatry 38:83-114.
Chun, Wendy. 2000. Programmed Visions: Software and Memory. Cambridge, MA: MIT Press.
Coleman, Gabriella. 2014. Hacker, Hoaxer, Whistleblower, Spy: The Many Faces of Anonymous.
New York: Verso.
Daston, Lorraine. 1994. "Enlightenment Calculations." Critical Inquiry 21(1): 182-202.
Desjarlais, Robert and Jason Throop. 2011. "Phenomenological Approaches in Anthropology."
Annual Review of Anthropology 40:87-102.
Dick, Philip K. 1968. Do Androids Dream of Electric Sheep? New York: Random House.
Duranti, Alessandro. 2014. The Anthropology of Intentions: Language in a World of Others.
Cambridge, UK: Cambridge University Press.
Ekbia, Hamid and Bonnie Nardi. 2017. Heteromation, and Other Stories of Computing and
Capitalism. Cambridge, MA: MIT Press.
Eubanks, Virginia. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor. New York: St. Martin's Press.
Feuerherd, Peter. 2018. "Why Didn't the Rodney King Video Lead to a Conviction?" JSTOR
Daily, February 28. <https://daily.jstor.org/why-rodney-king-video-conviction/> (accessed July
20, 2019).
Gershon, Ilana. 2015. "What Do We Talk about When We Talk About Animation." Social
Media + Society.
Goffman, Erving. 1974. Frame analysis: An essay on the organization of experience. New York:
Harper and Row.
Goffman, Erving. 1981. Forms of talk. Oxford: Blackwell.
Goodwin, Charles. 1994. "Professional Vision." American Anthropologist 96(3): 606-633.
Goodwin, Charles and Marjorie Harness Goodwin. 2004. "Participation." In A Companion to
Linguistic Anthropology. Alessandro Duranti, ed. Pp. 222-244. Malden: Basil Blackwell.
Haraway, Donna J. 1997. Modest_Witness@Second_Millennium.FemaleMan©_Meets_OncoMouse™.
New York: Routledge.
Helmreich, Stefan. 2007. "An anthropologist underwater: Immersive soundscapes, submarine
cyborgs, and transductive ethnography." American Ethnologist 34(4): 621-641.
Hicks, Marie. 2017. Programmed Inequality: How Britain Discarded its Women Technologists
and Lost its Edge in Computing. Cambridge, MA: MIT Press.
Inoue, Miyako. 2018. "Word for Word: Verbatim as Political Technologies." Annual Review of
Anthropology 47:217-32.
Jackson, John Jr. 2013. Thin Description: Ethnography and the African Hebrew Israelites of
Jerusalem. Cambridge, MA: Harvard University Press.
Just, Marcel Adam, Vladimir L. Cherkassky, Augusto Buchweitz, Timothy A. Keller, and Tom
M. Mitchell. 2014. "Identifying Autism from Neural Representations of Social Interactions:
Neurocognitive Markers of Autism." PLOS One 9(12): 1-22.
Irani, Lilly. 2015. "The Cultural Work of Microwork." New Media and Society 17(5): 720-739.
Irani, Lilly. 2018. "'Design Thinking': Defending Silicon Valley at the Apex of Global Labor
Hierarchies." Catalyst: Feminism, Theory, Technoscience. 4(1): 1-19.
Kelty, Christopher M. 2008. Two Bits: The Cultural Significance of Free Software. Durham:
Duke University Press.
Light, Jennifer. 1999. "When computers were women." Technology and Culture 40(2): 455-483.
Lutz, Catherine and G.M. White. 1986. "The Anthropology of Emotions." Annual Review of
Anthropology 15: 405-436.
Lutz, Catherine and Lila Abu-Lughod, eds. 1990. Language and the Politics of Emotion.
Cambridge, UK: Cambridge University Press.
Manning, Paul. 2018. "Animating virtual worlds: Emergence and ecological animation in
Ryzom's living world of Atys." First Monday 23(6-4).
<https://firstmonday.org/ojs/index.php/fm/article/view/8127/7414> (accessed July 19, 2019).
Nakamura, Lisa. 2014. "Indigenous Circuits: Navajo Women and the Racialization of
Early Electronic Manufacture." American Quarterly 66(4): 919-941.
Nakamura, Lisa. 2019. "Virtual Reality and the Feeling of Virtue: Women of Color Narrators,
Enforced Hospitality, and the Leveraging of Empathy." Lecture, Princeton University Thinking
Cinema Series, Princeton, NJ, March 3.
Nakano Glenn, Evelyn. 1992. "From Servitude to Service Work: Historical Continuities in the
Racial Division of Paid Reproductive Labor." Signs 18(1): 1-43.
Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism.
New York: New York University Press.
Pentland, Alex. 2008. Honest Signals: How They Shape Our World. Cambridge: MIT Press.
Philips, Amanda and Alison Reed. 2013. "Additive race: colorblind discourses of realism in
performance capture technologies." Digital Creativity 24(2): 130-144.
Philip, Kavita, Lilly Irani, and Paul Dourish. 2012. "Postcolonial Computing: A Tactical
Survey." Science, Technology, & Human Values 37(1): 3-29.
Poster, Winifred R. 2019. "Sound Bites, Sentiments, and Accents: Digitizing Communicative
Labor in the Era of Global Outsourcing." In digitalSTS: A Field Guide for Science & Technology
Studies. Janet Vertesi and David Ribes, eds. Pp. 240-262. Princeton: Princeton University Press.
Rea, Shilo. 2014. "Carnegie Mellon Researchers Discover Brain Representations of Social
Thoughts Accurately Predict Autism Diagnosis." Carnegie Mellon University News, December 02.
<https://www.cmu.edu/news/stories/archives/2014/december/december2_thoughtmarkersautism.html>
(accessed July 20, 2019).
Rice, Thomas. 2010. "Learning to Listen: Auscultation and the transmission of auditory
knowledges." The Journal of the Royal Anthropological Institute 16: S41-S61.
Robertson, Jennifer. 2017. Robo sapiens japanicus: Robots, Gender, Family and the Japanese
Nation. Berkeley: University of California Press.
Robbins, Joel. 2004. Becoming Sinners: Christianity and Torment in a Papua New Guinea
Society. Berkeley: University of California Press.
Rosaldo, Michelle Zimbalist. 1984. "Toward an anthropology of self and feeling." In Culture
Theory: Essays on Mind, Self, and Emotion. R. Shweder and R. LeVine, eds. pp. 137-157.
Cambridge, UK: Cambridge University Press.
Russell Hochschild, Arlie. 2012. The Managed Heart: Commercialization of Human Feeling. 3rd
Edition. Compton: University of California Press.
Scott, Ridley. 1982. Blade Runner. Film. Burbank: Warner Brothers.
Silvio, Teri. 2010. "Animation: The New Performance?" Journal of Linguistic Anthropology
20(2): 422-38.
Spivak, Gayatri Chakravorty. 1993. "Echo (nymphe)." New Literary History 24(1): 17-43.
Stacey, Jackie, and Lucy Suchman. 2012. "Animation and Automation: The Liveliness and
Labours of Bodies and Machines." Body and Society 18(1): 1-46.
Suchman, Lucy. 2007. Human-Machine Reconfigurations: Plans and Situated Actions. 2d
Edition. Cambridge, UK: Cambridge University Press.
Taylor, Astra. 2018. "The Automation Charade." Logic, August 1.
<https://logicmag.io/failure/the-automation-charade/> (accessed July 19, 2019).
Ticona, Julia and Alexandra Mateescu. 2018. "Trusted Strangers: Carework platforms' cultural
entrepreneurship in the on-demand economy." New Media and Society 20(11): 4384-4404.
Throop, Jason and Keith Murphy. 2002. "Bourdieu and phenomenology: A critical assessment."
Anthropological Theory 2(2): 185-207.
Turkle, Sherry. 2007. "Authenticity in the age of digital companions." Interaction Studies 8(3):
501-517.
Vertesi, Janet. 2012. "Seeing like a Rover: Visualization, embodiment, and interaction on the
Mars Exploration Rover Mission." Social Studies of Science 42(3): 393-41.
Villeneuve, Denis. 2017. Blade Runner 2049. Film. Burbank: Warner Brothers.
Wilf, Eitan. 2019. "Separating Noise from Signal: The ethnomethodological uncanny as aesthetic
pleasure in human-machine interaction in the United States." American Ethnologist 46(2): 202-213.
Wilson, Elizabeth. 2010. Affect and Artificial Intelligence. Seattle: University of Washington
Press.
Chapter 4: Listening Like a Computer
"Auditory hallucinations frequently appear only in the night-time, or at least much more then.
They seem, as a rule, not to possess complete sensory directness. They are voices "as in a
dream," "from the underworld," "voices in the air, which come from God," more rarely
gramophone or telephone voices, wireless telegraphy."
- (Emil Kraepelin, Manic-Depressive Insanity and Paranoia 2002 [1921])
Every year, the Bipolar Research Unit of Midwestern University's (MWU) Depression Center
holds Bright Nights, a community forum event on bipolar disorder and the Unit's current
research projects. Bright Nights serves a multitude of functions. It indeed provides the general
public a chance to hear from and present questions to the Bipolar Research Unit's (BPU) PI, the
head of the clinical staff, the head of the BPU's stem cell research team, and two patient-research
subject panelists. But because the BPU runs almost exclusively on philanthropic donations, it is
also a fundraising event, meant to pull on the heartstrings of audience members, instilling hope
in the groundbreaking potential of BPU's research and encouraging financial support. More than
that, it is a recruitment event, intended to inspire potential subjects to lend their bodies and
voices for the greater good of locating biological markers to help predict, intervene on, and
improve the wellbeing of others who have been diagnosed with bipolar disorder.
This year's Bright Nights was convening in the meeting hall of a country club in a rural
town with a population not unlike the majority of the Bright Nights attendees: white, elderly, and
middle- to upper-class. The country club was opulent bordering on musty, clinging to its grandeur
though the place had long passed its prime. The fabric of the tufted armchairs and sofettes,
tablecloths, and voluminous drapes was old and worn but velvet, in shades of wine and
aubergine. Exaggerated crystal chandeliers dripped from ceilings crested with ornate crown
molding. The gold paint was flaking off of the large, framed mirrors that hung above solid
mahogany end tables, upon which sat brass candelabras and wide, Grecian vases containing
artificial flowers: birds of paradise, orchids, lilies, roses. Although, in an ironic twist, the hall
was dimly lit, if the lighting had in fact been brighter, I would have expected to find a film of
dust settled over everything.
The country club was a good forty-minute drive away from the even more rural, bucolic
setting of the BPU headquarters in the Depression Center, which is attached to a public geriatric
hospital and sits on a plot of land set off the highway among flower beds and fields of tall grass.
I had shared a car with two of my informants, Adele and Cheryl, both white women in their
fifties and members of the clinical team of the BPU. With Adele, the research manager of
BPU's clinical team, at the wheel and the early summer sun setting slow in pink and orange
streaks, the women told me stories about their BPU patients and the passing scenery, pointing
out shuttered factories where relatives and parents used to work, places they had visited as
children and places they now visited with their grandchildren. Their talk of elementary school
fieldtrips and memories of their mothers made me feel at ease and as though I had known them
for years, even though I had only arrived a few weeks prior to begin my ethnographic study of
the BPU's efforts to develop a mobile phone application that can predict the onset of a
pathological episode through the analysis of acoustic features of speech.
As a Bright Nights volunteer, it is my job to stand opposite Adele at the country club
hall's entrance and welcome everyone who enters, handing them a golf pencil and a blank index
card. In what I hope is a warm and cordial tone, I instruct them to write down on the card any
questions they might have during the panelists' talks, because the cards will be collected and
distributed to the PI, who will address the questions to the audience. Some attendees already
have their hands full with plastic glasses of complimentary wine or powdered lemonade, and
others are plagued with arthritis, so I am asked to tuck the cards and tiny pencils into purses, the
pockets of blazers, or atop plastic plates already filled with cheese and crackers. Others ask for
extra cards, anticipating many questions.
Cheryl, Adele, and other volunteers recognize some of the audience members. Quite a
few are either long-time research subjects or family members of the subjects that they have come
to know over the past years, since the BPU is conducting one of the largest, longest running
longitudinal studies of bipolar disorder in the United States. It is also not uncommon for subjects
and researchers to share a mutual friend from childhood; perhaps they even attended MWU
together. Among the rush of people entering the room, the BPU PI (Principal Investigator)
approaches me with a patient of his who will be speaking on the panel: a lean man with optic
white hair pulled into a ponytail, both ears pierced several times with slim, silver hoops. Dressed
in a navy-blue suit and tie, he is significantly taller than I am, with etched lines in his sunburnt
face, dark eyebrows and a spray of freckles across the bridge of his narrow nose. This is Jacob,
one of the earliest research subjects of the longitudinal study and also enrolled in the cell phone
study.
After a quick introduction, the PI steps away. Jacob immediately shifts to stand very
close to me as if we are engaged in a conspiratorial conversation, so close that I am nervous. He
begins touching, palpitating, the crook of my elbow as he speaks, and since my arms are
crossed in front of my body this puts his hand in the dangerous vicinity of my chest. His voice is
gravelly but fast, electrified and hypnotic, one word carrying over and falling into the next like
overlapping waves in a storm. He wants to know what an anthropologist is doing here studying
the cell phone project, and I try to give him as inconsequential and as positive of an explanation
as possible while still remaining true to at least a portion of my research interests. I say
that I am interested in the collaboration between clinical psychiatric people and computer science
people, and that their coming together to use technology to try and intervene on the subjective
nature of diagnosis is like the meeting of two different "cultures."
He praises the project and the project's PI with extravagance: they are doing the best
thing, they are far ahead of everyone, this is phenomenal work and phenomenal that I am
studying it and so great to be a part of it. He does not want me to forget the human dimension of
things. Curious to hear his reaction, I tell him that, at the other sites I studied, people expressed
fear and anxiety over the prospect of using technology to address mental health issues, because
they were concerned this meant technology might replace the jobs of clinical professionals. He
scoffs at these straw men I have presented to him. People are scared or skeptical, he explains,
because there is a lack of education. They are not thinking of the human lives at the other end, or
that cell phones can in fact save people's lives. By way of example, he launches into his own
story about how his iPhone saved his life, years ago, when he first attempted suicide during his
first manic episode.
He had been an incredibly successful computer programmer. He had a beautiful wife and
two beautiful children, and it was Mother's Day. He got up in the middle of this beautiful
brunch, and left, driving and driving and driving until he ran out of gas. "And then," Jacob says,
leaning in even closer and gripping my elbow near my left breast, "I did an overdose." He
climbed atop a fountain and took a bunch of pills, a handful - or, he had taken the handful of
pills and then later found himself at the top of this fountain. He was hanging over the water when
his phone dropped from his pants pocket onto the ground below. Moments later,
unconsciousness overtook him and he too fell, into the water. Then, his wife called his phone,
and a stranger who was passing by on an evening walk noticed it ringing and ownerless on the
pavement. The stranger picked up the phone and spoke to Jacob's wife, who directed the stranger
to look around the area for Jacob, and sure enough the stranger discovered him floating face
down in the basin of the fountain. Jacob was rushed to the hospital and miraculously revived. If
his phone had not rung, he tells me, he would have died. Without his phone, someone eventually
would have found his dead body in the fountain and would have to tell his wife and children that
he senselessly took his own life. At a moment when he could no longer speak for himself, the
phone provided a conduit through which his wife could alert a passing stranger that Jacob was in
distress. His iPhone "saved his life" because of the vital connection it afforded and mediated,
enabling his wife to reach out, to find him, despite his attempt to disconnect from the world
altogether.
His story finished, Jacob moves on to questioning me about where I am living and how
long I will be in town, reassuring me that he will help me out with whatever I need, that he will
contribute to my research in whatever way possible, that I can share his story. Finally, the PI,
who must have been standing watch and listening in, intervenes, wedging his own body
between Jacob's and mine, letting me know that Jacob will be stopping by the Depression Center
soon for their appointment together and that he could arrange a time for us to talk further. Jacob
turns away to mingle with other guests, and the PI replaces him, also leaning in close to me but
this time to tell me something that truly should be private. It surprises him (and this, he says, is
part of what makes bipolar disorder so fascinating as an object of study) that some patients can
be so high functioning and be doing so well, while others cannot save their own lives like Jacob
did (though in Jacob's narration, he did not save his own life; his iPhone did, along with his
wife and a stranger).
Months after this initial, hectic encounter with Jacob at Bright Nights, I encounter him
once again, though not in the form that I had anticipated. I am sitting in a stale, windowless
office in the Depression Center with two other silent researchers, bathed in twitching fluorescent
light and hunched over the keyboard of an old desktop computer. Through top-of-the-line
sound-canceling headphones, I am listening to randomly ordered, 3- to 30-second-long excerpts
of the cell phone study's research subjects' weekly phone assessments with a BPU clinician. I
can only hear the research subjects' voices. The engineers on the team have ensured that the
mobile phone application used to gather data for the study does not record the voice of research
subjects' conversational partners, transforming the excerpted phone calls from dialogues to
monologues. My job, along with the two undergraduate engineering researchers in the office
with me, is to "annotate" the data, to rate the emotional tenor and "feel" of these excerpts. With
these ratings, the undergraduates and I will help produce the metadata necessary for building an
algorithm that BPU researchers hope will help to predict when someone with bipolar disorder
will have a pathological episode based on minute changes in the sounds of their speech.
Suddenly, a familiar, gravelly voice fills my ears. It is Jacob, without a doubt, but all the
electricity is gone. Instead, as I click through the excerpts, trying to assign numerical values to
the sound of his voice rather than the content, I hear a man who is deflated and hopeless. His
speech, more of a series of sighs than words, is listless and indistinct, barely audible though I
have the volume turned as high as it can go and I am mashing the powerful headphones against
my ears.
I feel sorry to have been intimidated by a man who could become so meek, so utterly sad.
And at the same time, I feel guilty that I know it is Jacob. The dataset that the other researchers
and I are listening to and annotating was gathered three years prior, long before Jacob met me
and long before he told me, and invited me to share, his story about the life-saving iPhone.
Although I am only supposed to be listening to the acoustic quality of research subjects' voices
and to dis-attend to the content, this task becomes all the more difficult (fruitless, even) when I
cannot help but match the voice with a face, a hand gripping my elbow, and a tale of attempted
suicide.
As was the case at my other two fieldsites, my inclusion on the cell phone study's IRB
protocol ratifies and ethically validates my participant role in the interaction between Jacob and
the clinician. Indeed, when he signed the project's consent form years ago, Jacob agreed to allow
a researcher to later listen to his phone calls with the clinician. I am an "unaddressed recipient"
(Goffman 1981:133) of the interaction between Jacob and the clinician, a sanctioned
eavesdropper with a crucial caveat. While the task at hand (listening to and annotating the excerpts)
grants the other researchers and me broad access to the details of Jacob's day-to-day life, given
away in his responses to the assessment calls, the lead engineers on the team urge us, over and
over, to avoid letting these details enter our attention as much as possible, to not listen to the
details at all. Glancing at my own iPhone that rests near the keyboard, I try to imagine what it
must have been like to answer the call from Jacob's wife, and the connection between Jacob and
the stranger that his stranded phone made possible. What to make of the connection (or, if the
engineering team's job is to forget the details, the disconnection) between Jacob, all the other
research subjects, the other researchers, and me, that building the cell phone study's predictive
algorithmic infrastructure affords?
PHONES THAT SAVE LIVES
For a growing number of researchers of mental illness like the BPU's PI, the cell phone
represents a promising, untapped research tool. Cell phones can be repositories of data, the
volume of which offers up a level of specificity of analysis that would be difficult if not
altogether impossible to achieve through traditional surveys, inventory tools, or through face-to-
face interactions between researchers and research subjects. With cell phones, users passively
transmit multi-modal data captured by the phone's various sensors or actively enter in data as a
byproduct of using the phone as they normally would. They scan their fingerprints and their
faces in order to enable personalized security features, to be sure that only they can unlock their
phones. Calls made with the phone are time-stamped, indicating the duration and time of day and
date the call was made. Walking around with a phone in hand or in pocket produces gyroscopic
data, and GPS coordinates if the user has enabled location tracking. Even the quality of fingertip
touches to the screen can be a form of data, as researchers seeking new methods to track,
diagnose, and understand the progression of Parkinson's disease have attested (Zhan et al. 2018).
Thus, while the data that passes through phones might seem meaningless, unrelated to
mental health or unrelated to diagnostic symptom criteria set down in the DSM, researchers like
the BPU's PI argue that systematically recording and analyzing cell phone usage data can reveal
habits and behavioral patterns that have never before been correlated with the incidence of
mental illness (see Onnela and Rauch 2016). In their eyes, cell phones have the potential to allow
researchers to track down novel indicators of mental illness that previously have gone unnoticed
or that professionals never considered to be meaningful signs at all. Researchers refer to this
method of data capture and the diagnostic precision it promises as "digital phenotyping," or, "the
moment-by-moment quantification of the individual-level human phenotype in situ using data
from personal digital devices" (Torous et al. 2015; Insel 2017).51 Proponents of digital
phenotyping position the cell phone as uniquely capable of rendering the otherwise intractable
minutiae of everyday behavior into calculable, traceable material. The use of the word
"phenotype" signals the belief that proper (accurate, expansive) description of mental illnesses'
manifestation holds the key to pinning down something like a genotype, and therefore the
fundamental, biological mechanisms driving mental illness.
The BPU's cell phone study is an instantiation of the promissory pledge of digital
phenotyping. The logic driving the project is that the mysteries of mental illness can be unlocked
by pursuing connections between observable behavioral signs and internal states that have never
been studied in tandem, specifically, the pathological mood episodes associated with bipolar
disorder and the acoustic contours of the human voice. However, as I have argued elsewhere in
the dissertation, researchers do not merely locate or stumble upon the correlations and
correspondences that data-driven mental health research, like digital phenotyping, produces.
Researchers constitute and work to hold steady these connections in the very process of pursuing
them. When engineers collaborate with psychiatric professionals to transform cell phones into
research tools, they must make choices: what counts as potential data and therefore what aspects
of human behavior should be tracked (or ignored), how to track that data, where and in what
form the data should be stored, and, most importantly, how that data, once stored, should be
sorted, labeled, and analyzed. Because the discourse of digital phenotyping (and computational
psychiatry as a whole) erases such marks of human decision-making from the process of turning
digital data into symptomatic signs, ethnographic attention to the choices researchers make and
the alternative possibilities and connections these choices foreclose upon is crucial.
51 Recall from Chapter 1 that digital phenotyping is an example of a data-driven method for conducting psychiatric
research, and therefore falls under the umbrella of Computational Psychiatry.
Researchers also make choices that could have been otherwise about how to make bipolar
disorder audible. Packaged within these choices are claims about what bipolar is, and what kind
of data human speech contains. While the research teams discussed in other chapters emphasize
the ability of the technologies that they are building to capture sonic features of mental illness
that surpass human attention or awareness, members of the BPU cell phone study take a
different approach. Their goal is to calculate and train an algorithm to pick up on changes in the
voice that human observers can hear but cannot systematically describe or put into words.
Therefore, in addition to a case study of the infrastructural arrangements, categorizing practices,
and labor required to make digital phenotyping possible, in this chapter, I focus on the figure of
bipolar disorder as a mood disorder that causes audible changes in the quality of speech. As
scholars such as Emily Martin (2007) have explored, bipolar disorder is riddled with assumptions
about what mood and emotion even are (for instance, that they can be distinguished from and
are opposed to reason and logic) and about the dividing line between normal and pathological
affective experiences. In addition to grappling with these assumptions in operation in the cell
phone study, I introduce and distill a third: the assumption that affect is suspended in speech and
can be made knowable through listening.
Unpacking this third assumption requires probing the relationship between affect and
speech in the Euro-American imaginary and in biomedical framings of illness and the body.
Anthropological studies of affect have, through a comparative lens, helped to clarify and situate
the model of emotions as interior, private, and individual states within the cultural legacies of
North American psychiatry and psychology (Rosaldo 1984; Lutz and White 1986; Lutz and
Abu-Lughod 1990). As the linguistic anthropological literature on ideologies of linguistic opacity in
Oceania suggests (Duranti 1992; Rosaldo 1982; Keane 2008; Robbins 2008; Silverstein
2001[1981]; Throop 2010), ideologies of linguistic transparency,52 or the notion that speech has
the potential to carry forth emotions and therefore set free an individual's unique, core, authentic
self, reinforce a model of emotions as residing inwards and essentially linked to personhood.
The foundational psychological research on emotions and affect in the U.S. defines affective
experiences as those that exceed rational control, and as emanating from a panhuman core
embedded in the body. Consider, for example, the influential work of psychologist Paul Ekman
and his efforts to categorize affective experiences universal to all humans. The basis for his
findings involved, in part, analyzing movements of facial musculature in response to visual
stimuli (Ekman and Friesen 1971; Ekman 1989; Ekman 1999). Referring to these responsive
facial expressions as "reflexes" places emotions so deep in the body that they are unreachable by
culture, and unmovable by conscious, agentive control. In other words, according to this model,
people do not learn to perform emotions through bodily movements, such as the patterned
coordination of the speech organs; emotions are of the body and happen to the body.53
These formulations of emotions as interior, private, and traceable, bodily reflexes that
operate beyond conscious control depend upon patently Eurocentric binaries and divisions (the
body and the mind, feeling and cognition, matter and spirit, rational and irrational). Moreover,
because it is beyond (or before) consciousness and rational control, the affective realm is beyond
language. Spoken utterances signifying emotion are hence called "paralinguistic" cues
(breathiness, speed, timbre, pitch, and so on): the body's "pre-lingual" ruminations, sounds that
bear communicative significance but are adjacent to language proper. Suspended in the body,
affect is therefore traceable in the "grain of the voice" (Barthes 1977): beyond words, beyond
reason, but within speech.54 The cell phone study reifies and reinforces the grain of the voice as
affect's siting, entrenching it further in biological essentialism, promising to make pathological
emotional experiences tractable and predictable by quantifying what is heard in affect.
52 These scholars contrast ideologies of linguistic transparency, in operation among speakers of Euro-American
English, with ideologies of linguistic opacity more prevalent in Oceania. For instance, Jason Throop (2003; 2010)
has found that among communities of speakers on the island of Yap in the Federated States of Micronesia, it is
unethical to ask after or pursue the connections between interior states and speech produced in conversations or
public settings. Speakers, in turn, should work to ensure that their speech is as opaque as possible by carefully
controlling the semantic meaning of utterances, and by evacuating from their speech any signs that might reveal
inner states.
53 Dror (2001) provides a history of the role of the behavioral sciences in the U.S. in fomenting the notion that
emotions have a bodily trace that can be tracked down and quantified.
Researchers strive not only to quantify what is heard in affect. The goal is also to
quantify intuitive interpretations of how affect sounds. The study takes as its starting point
conventional therapeutic wisdom about the observable, indexical signs that indicate bipolar
disorder's two pathological poles: quick speech evinces mania, and slow speech evinces
depression. These were the same observations set down in Emil Kraepelin's Manic-Depressive
Insanity (2002 [1921]), which provides a long-term and detailed observational study of la folie
circulaire or "manic-depressive disorder," what would come to be called bipolar disorder.55
However, it was the observations of "lay experts" (Wynn 1996) that inspired the BPU's PI:
people close with patients and people who themselves are "living under the description of bipolar
disorder" (Martin 2007:10). They would tell the PI that they could sense when their family
member or loved one was on the brink of a pathological mood episode because there had been
something ineffable about the person's voice: they had just sounded off. The PI wanted to take
this intuition and concretize it, in his words, "teach the computer to listen like a human brain."
54 Note that the distinction between language (grammar, semantic meaning) and speech (the act of producing vocal
utterances), akin to the Saussurean distinction between langue and parole, maps on to the distinction between reason
and affect, content and form, and also on to the distinction between mind (immaterial, emergent) and brain (material,
embodied).
55 Emil Kraepelin is the 20th-century German psychiatrist whom contemporary psychiatric researchers credit with
developing the basic nosological infrastructure of U.S. psychiatry, in part by dividing mood disorders from thought
disorders (Decker 2004). Emily Martin (2007) provides a comprehensive history of how the diagnostic category of
bipolar disorder has transformed over time, in terms of its nomenclature, theories of its etiology, and its associated
diagnostic criteria.
What is this intuitive audition? What work does it take to design a system that can listen
for and distill that ineffable thing some people-loved ones, therapists-can hear? This brings
back a question that resonates throughout the dissertation: what exactly does it mean to listen? In
this instance, what kind of listening is required to "teach a computer to listen like a brain"?
Although the PI's formulation takes the aural equivalent of the gut instinct or hunch and
collapses it back into a biological reduction that the neuroscientists and engineers of East Coast
University used regularly (the "listening brain"56), the clinicians and engineers working behind
the screen on the cell phone study put into practice much more nuanced and complicated
conceptualizations of what it means to listen, and what listening entails. The engineers and
clinical team members butted up against the limits of what listening can capture from the voice,
implicitly challenging the biological essentialism of the "listening brain" in their day-to-day
dealings with the research subjects' voice data.
In order to develop the app's predictive capacities, the BPU team would first need to
gather and categorize data. This was the stage at which the study stood when I arrived for my
fieldwork. The team had developed an app that recorded all of the research subjects' outgoing
calls. The engineering team was asking a sub-level question of the audio data gathered using this
app: can a non-machine listener consistently identify any common features in the voices of
people diagnosed with bipolar disorder? To that end, the research study in its current phase
revolves around two different listening practices, assigned to the two teams of experts involved:
clinical assessment (conducted weekly by members of the clinical team) and annotation (or the
56 Recall, from Chapter 2, that the listening brain is opposed to the hearing ear. In this formulation, the brain is
germinal to and responsible for the act of listening itself, because the brain actively processes and analyzes sound,
whereas the ear passively receives and absorbs sound. Therefore, the brain/ear distinction replicates the
listening/hearing distinction that sound studies scholars, most notably Jonathan Sterne (2003), have located in
strands of early Christian theology.
quantification of the sounds of research subjects' speech, conducted by members of the
engineering team). I will compare these two listening practices, reviewing the skills they require
and the assumptions about voice, emotion, truth, and listening itself that are constituted in each,
thinking through them as two overlapping but conflicting acoustemological (and ethical) modes
of attending to the data. This means, in part, distilling the disciplinary tensions between
psychiatry and computer science, and their willingness to embrace or contest the "fuzzy"
realness of emotions.
To reiterate, part of what is at stake in the BPU's work is the semantic ambiguity and
polysemy not only of emotional terms like "mania" and "depression," but of the term "listening"
itself, especially with respect to agency and intentionality. One question that remains unanswered
is why people like Jacob consented to having their phone calls listened to in the first place.
Perhaps their willingness can be attributed to culturally specific expectations, discussed in the
Introduction, about the distinction between "hearing" (an unintentional reflex) and "listening"
(an intentional action) and the extent to which it was possible for researchers to turn off their
hearing in the act of listening. Researchers themselves, especially members of the engineering
team, grappled with the extent to which "listening like a computer"-detaching form from
content, ignoring the semantic substance of speech altogether-was humanly possible. Training
a computer to "listen like a brain" required demanding that the annotators (myself included)
listen like a computer, through ultimately insufficient tactics of "pure listening" to sound alone.
In lively debates about the intertwined relationship between speech form and speech content,
members of the engineering team drew from their own experiences as non-native speakers of
English struggling to understand emotional expression among native speakers. In these debates,
the BPU engineers theorized listening (and hearing) as culturally mediated rather than reflexive
or biological capacities alone. Such conversations about the limits of "listening like a computer,"
and frustrations over the annotation task itself, provided engineers a means through which to
subtly and quietly critique the technological prototype the study was supposed to produce, while
also critiquing psychiatric conceptualizations of affect and emotion.
The cell phone study does not only require collaboration between clinicians, engineers,
PIs, post-docs, and research assistants. It also brought the research subjects and members of the
research team into asymmetrical proximity and cooperation with each other. Aside from Jacob, I
never met or encountered any of the research subjects whose voices I listened to, as I split my
time between shadowing clinicians (listening to their questions to the subject) and helping the
engineering team annotate the audio data (listening to the subjects' responses). The
undergraduate researchers and I found ourselves in the center of these faceless research subjects'
lives, absorbed in their calls with clinicians and, as the study's scope widened during the course
of my fieldwork, their personal phone calls. The nature of the annotation task was, contrary to
Jacob's warning, to "forget" the human on the other end, to treat their voice as all form and no
content, and to flag segments containing "identifiable" information that would anchor the speech
to an individual person.
But if the listening associated with annotation has biopolitical implications (Foucault
1978)-in that annotators are to listen to subjects as members of a population rather than
individuals-the listening associated with assessment was not so different. Clinical team
members had to listen in a way that would encourage rapport and therefore the disclosure of
personal information, information that would help them to calculate assessment scores and place
the subject in the category of symptomatic (either manic or depressed) or asymptomatic (i.e.
euthymic, neither manic nor depressed). Because clinicians could not conduct psychotherapy over
the phone and their relationship with the research subjects had to be "non-therapeutic," their
objective was to gather data without providing treatment, avoiding as much as possible feelings
of responsibility toward the wellbeing of the research subjects. Taking into consideration Puig de
la Bellacasa's (2017) invitation to think through care as a matter of maintaining specific
arrangements of relations-whether those relations be liberating, oppressive, or somewhere in
between-I close the chapter by returning to my inquiry after the nature of the connections that
the cell phone study enables or troubles.
THE GAME CHANGER
DSM-5 (American Psychiatric Association 2013), the most recent edition of DSM, classifies
bipolar disorder as a mood disorder, characterized by the oscillation between two mood states-
often referred to as "mood episodes" (Martin 2007:47)-that reach pathological levels in their
depths and heights. The PI, an Icelandic man in his 60s who is also the director of the entire
BPU, likes to say that the two "poles" of bipolar represent the poles of possible human
experience: depression (devastating, debilitating sadness) and mania (soaring, incandescent
euphoria). The basic diagnostic criteria for bipolar disorder stipulate that depression and mania
must present in succession, sometimes spaced out across months, in order for a patient to be
given a bipolar diagnosis. As Martin writes in Bipolar Expeditions (2007), this makes bipolar
disorder a "meta-state" (47): not a member of a class of affective experiences but a condition that
includes classes of affective experiences typically thought to stand in opposition to each other.57
57 Although previous editions of DSM posited a stark distinction between the disorder's associated affective poles,
DSM-5 and contemporary research suggest that the division is not so easily identifiable, and that people can
experience "mixed" pathological mood states. Some research even suggests that, because there is such wide variety
in disease manifestation and due to the limitations of traditional diagnostic methods, people diagnosed with bipolar
As Martin notes, while bipolar is defined in DSM and in American psychiatry writ large
by the conjoined presence of these opposing affective states and subsequent behaviors (like
inhibition and excitement), basic diagnostic criteria for bipolar leave "emotion" and "mood"
undefined. The black boxing of these terms had consequences for the research team. As we shall
see, the distinguishing boundary between mood and emotion, the universality of these affective
states, and their "fuzzy" ontological status are issues central to the disciplinary frictions between
the engineers and clinically trained professionals working on the cell phone project.
Nevertheless, the team did share working, relational definitions of mood and emotion, which
they put into practice in the research design, data gathering, and data filtering practices.
Given my background in writing and my lack of training in either engineering or
computer science, engineering team members called upon me to help them write a training video
and a training guidebook for the annotation task. Co-writing the guidebook was an exercise in
distilling what mood and emotion meant for the team. However, since neither I nor my
engineering co-authors had significant clinical training, nor could claim any expertise on human
emotion, we relied on a series of texts from the Internet and a collection of the lead engineer's
grant proposals to write the guidebook.58 Over several weeks, we worked on the same Google
Document in the engineering office on our individual laptops with our backs turned to each
other, never meeting eyes but occasionally sharing a few appraising chuckles at our collective
ability to bricolage and bullshit. As the guidebook states, for the purpose of the study, mood is a
disorder are actually experiencing biologically distinct, heterogeneous disorders that cut across the singular category
(see, for example, Clementz et al. 2016). Researchers often cite the heterogeneity of bipolar disorder as evidence for
the necessity of an RDoC approach to studying mental illness.
58 The engineering team did not consult the clinical team in the drafting of the guidebook. Clinical team members
tended to take on more administrative tasks and were stretched thin across several different BPU projects. For this
reason, the engineering team was hesitant to contact them for assistance-they wanted to avoid adding more work to
their already heavy load. Best intentions aside, the guidebook remained a source of unspoken tension between the
two teams.
"deep and lasting, enduring cognitive state." Emotion, on the other hand, is fleeting and reactive,
"action-oriented, observable expressed behavior that can be described in terms of valence
(positive vs. negative) and activation (calm vs. excited)." Emotion changes from moment to
moment, is more superficial and therefore easier to detect in speech than mood. The lead
engineer, the PI, and other bipolar researchers theorize that mood and emotion are interrelated,
such that changes in emotion precipitate changes in mood. Therefore, and since emotion is
supposedly more observable (including more audible) than mood, these researchers theorize that
detecting changes in emotion might be a way to anticipate and predict the extreme changes in
mood that define bipolar disorder. In other words, as the guidebook explained, "emotion is a
useful meta-feature for detecting changes in mood state using the speech signal."
The cell phone study is the newest layer of the multi-modal "data onion," as researchers
call it, of BPU's ten-year longitudinal study, all part of an effort to gather and analyze high
volumes of data. Researchers hope such Big Data holds the key to novel understandings of the
relationship between bipolar symptom expression and genetic predispositions for the disorder,
the kinds of findings that traditional, smaller-scale research projects have hitherto been unable to
produce. The active research cohort of over a thousand subjects includes entire families, because
bipolar disorder is thought to have a strong hereditary component, and the PI, trained in genetic
psychiatry, has been pursuing the genetic basis of bipolar since the start of his career. The
longitudinal study requires multiple, multidisciplinary teams to gather, manage, sort and analyze
its various streams and types of data: the neuropsychiatric team, the stem cell research team, the
microbiome team (looking at brain-gut interactions), and, for the cell phone study, a team of
engineers, a data scientist, and a mathematician, known throughout BPU as the "engineering
team." The clinical team, comprising around 15 members, assists and provides administrative
support to each of these other teams. While all other research teams are made up of MWU
faculty and students (including undergraduates, graduate students, post-docs, and visiting
research students), the clinical team is made up of mostly female "staff" rather than students,
people who have just completed their BS in psychology or BA in social work, or once-practicing
licensed clinical social workers and psychiatric nurses who have shifted from practice to research
work.59 Not unlike East Coast University and West Coast University, clinically trained team
members or members with more qualitative rather than quantitative training tended also to be
predominantly women, and were responsible for the face-to-face interactions and emotional
labor necessary to the research.
Before potential research subjects can enroll in the longitudinal study, members of the clinical team psychiatrically
assess them to confirm their diagnosis of bipolar. Subjects must agree to the
collection of a wide variety of biological samples (blood, urine, saliva, skin, feces). In addition to
providing this biological data, longitudinal subjects also participate in a yearly and lengthy life
history interview and psychiatric assessment with a BPU clinical staff member, usually
conducted over the phone. These interviews produce assessment scores that quantify the
participants' fluctuations in bipolar symptoms from year to year. Although subjects do not
always speak to the same clinical team member every year, the length of the longitudinal study
and depth of these phone calls mean that research subjects become fairly used to sharing
personal information over the phone with someone they have never met and may never speak to
again. Thus, for the cell phone study, which requires subjects to undergo weekly phone
assessments with a BPU clinician, BPU primarily recruits subjects from the longitudinal study
59 The three or four student members of the clinical team present during my time there were visiting students. Two
thirds of them were men pursuing research-based graduate degrees either in clinical social work or clinical
neuroscience.
cohort who already feel comfortable with phone-based assessments. Another advantage of
drawing from the active subject cohort is that these subjects had been definitively diagnosed with
bipolar disorder.
The cell phone study's basic research question resembles the question that research teams
at the other two sites are pursuing, with a unique emphasis on prediction: are there acoustic
features in the speech of people diagnosed with bipolar disorder that indicate a pathological
mood episode is forthcoming? In other words, are there vocal-acoustic harbingers of mania or
depression? The BPU's PI envisions the study producing a kind of early warning system: a cell
phone application that detects these predictive acoustic features using an algorithm that the
engineering team has designed, trained on numerical ratings of the sound of subjects' speech and
their weekly psychological assessment scores. If the app detects these telltale warning sounds, it
will send a "warning signal" (e.g., a text notification) that a mood episode is imminent to a
designated list of individuals, like the user's clinician or their family and friends.
Beyond research questions and the desired end use of the application, BPU researchers
suggested to me behind closed office doors and in the privacy of carpooled rides between MWU
and the Depression Center that the PI-with his fiery, entrepreneurial charm-was seeking a big
breakthrough, a disciplinary "game-changer," a term he used often. He was nearing the end of
his career and his pursuit of the "gene for bipolar," which had once been his life's project, had
proved fruitless.60 Moreover, because of the BPU's reliance on philanthropic funding and
because the PI was constantly searching for commercial sponsorship, some guessed that the PI
60 In the shadow of the Human Genome Project, scientists have come to accept that the relationship between the
genetic code and genetic expression is far less linear and much more stochastic than initially anticipated. This is
especially the case when it comes to mental illness, since the methods for phenotypic description of mental illness
remain hotly debated. Hence, anthropologists and historians of the life sciences refer to the contemporary moment as
the "postgenomic" era (e.g. Richardson and Stevens 2015).
cooked up the cell phone study in order to produce an innovative and flashy prototype that would
satisfy donors and attract a high-powered sponsor in the technology or health insurance sector.
Cynicism and office gossip aside, in our interview the PI told me, with his customary frankness,
that his ultimate goal with the cell phone study was to acquire the funding and produce basic
science findings that would help his patients as much as possible. To the PI, impressing donors
and sponsors, accruing research funds, building a game-changer prototype, and helping his
patients-and anyone else diagnosed with bipolar-were pragmatically and indissolubly
entangled.
LISTENING/NOT LISTENING
Just as the cell phone study's existence cannot be traced back to a singular, motivating force
"internal" to science but rather to a series of interlinked initiatives-attract funders, acquire funds,
help patients, innovate-so the project's sole emphasis on acoustic features of speech was
not driven by pure scientific curiosity alone. Yes, the BPU was interested in better understanding
the connection between speech, emotion, and mood in people living under the diagnosis of
bipolar disorder. The lead of the engineering team, Meredith-an intense but kind-hearted New
Englander in her late thirties-was especially interested in this connection, and the role that
machine learning could play in locating it. But the emphasis on acoustic features rather than
semantic or syntactical structure of language-i.e., speech form rather than linguistic content-
had much to do with the team's interpretations of their own IRB protocol and concerns about
violating subjects' privacy.61
61 Recall that, within the communication engineering and speech signal processing communities, syntax is a language-
dependent component of communication. Syntax provides structure to semantic meaning and so it closely hews to
Meredith did her graduate training at WCU and completed a post-doctoral fellowship at
East Coast University under Ted, the prestigious behavioral speech signal processing scholar. In
our one-on-one interview, she explained to me that the cell phone study was exciting from a
speech engineering point of view because it offered an opportunity to capture relatively
unstructured, "natural" speech, "in the wild" or in situ.62 That being said, she confessed that she
herself would not consent to the ubiquitous monitoring and passive audio recording that the
study required-she was a self-professed conservative when it came to data privacy issues. In the
early days of the cell phone study, Meredith and the engineering team operated under the
assumption that no humans could listen directly to any of the audio recordings of subjects' phone
calls, and that the content could not be analyzed in any capacity, even by an automatic speech
recognition algorithm. Proceeding forward with the interpretation that any form of analyzing
audio content would infringe upon research subjects' privacy, the team used an "off-the-shelf"
(preassembled, gold-standard) algorithm to analyze the rhythmic patterns of research subjects'
speech in the assessment phone calls with the BPU clinicians.
Two or three years into the study, however, Adele and others on the clinical team pointed
out that, technically speaking, members of the clinical team could probably overhear the phone
calls with research subjects and possibly even the research subjects' end of the conversation. The
desks in the office where most of the clinical staff sat were spaced tightly together. The sounds
speech content. Because syntax varies cross-culturally, it lacks a stable biological basis. Communication engineers
and speech signal processors tend to find the Chomskian notion of a universal grammar apparatus to be a cognitivist
historical relic with no scientifically observable basis. On the other hand, paralinguistic variations in the production
of speech have a traceable, material existence-these features can be distilled down to sound waves, which have
properties that can be analyzed mathematically without regard to the sound's association with the meaning of an
utterance.
62 The majority of other data sets that speech signal processing researchers use to study the relationship between
acoustic features and psychological or affective states are "unnatural": either structured responses to interview
between two people with no known psychiatric diagnosis or actors acting out lines or scenarios, rather than someone
with a clinician-confirmed diagnosis responding to an open-ended question or talking informally to someone they
know (a form of "natural" speech).
of the calls flooded this shared space in a way that no one could control. It was inevitable, Adele
and others argued, that members of the research team were "listening" in some capacity to the
phone calls, simply because no one could control when or if they overheard the subjects
speaking on the other end of the phone, and because there was no way of knowing if the
officemates were actively (and secretively) attending to or processing whatever content of the
calls resounded throughout the cloistered office. The clinicians suggested, then, that the clinical
researchers on the team might already be taking part in the very act that Meredith worried would
overstep research subjects' privacy: listening to their phone calls. In other words, in the cramped
space of the clinician office, the calls (in terms of their semantic content) were never really
private to begin with, so there would be no ethically significant breach of privacy if the
engineering team members were to start listening to the calls as well. This argument probed and
dissolved the distinction between hearing (a happenstance absorption of sound that occurs
automatically, simply due to being in the presence of sound) and listening (an agentive,
intentional and directed auditory processing), underscoring that it is difficult to say when hearing
alone morphs into listening.
Heeding Adele and her colleagues, Meredith and the PI moved to petition the MWU IRB
to add in a line to the study's protocol and informed consent form specifying that, at some point
down the research pipeline, members of the study team might listen to subjects' phone calls, though
they would only listen to the sounds of speech and not the speech's content. The amendment was
passed, without the specification as to how researchers might simultaneously listen (to sound)
and not listen (to content), or which of the research team members would be doing the
listening/not listening. Beyond the lack of details, however, the amendment pried apart listening
from hearing once again. Meredith and the PI's edits distinguished listening from hearing
through the re-inscription of agency to the auditory act in the splicing up of speech into sound
and meaning. Specifying that researchers would be directing auditory attention to speech sound
while consciously re-directing attention away from speech meaning imputes agency and
intentionality to the listener: the listener must work to break apart speech into distinct
components, rather than absorb speech holistically in the way that "overhearing" implies.
Anthropologist and sound studies scholar Tom Rice notes that different uses of the term
listening in the U.S. and U.K. "imply subtle shifts in acoustical agency, which references
nuanced varieties of active-receptive and passive-receptive auditory agency" (Rice 2015:100).
The IRB amendment illustrates how rhetorically tweaking the valences of passivity, agency, and
activeness involved in the auditory up-take of sound can be used to manage expectations for
privacy and the team's commitment to it, in part by reinforcing the notion that the most
meaningful core of speech resides in semantic content alone. Adele and others seized upon the
vagueness of the hearing/listening distinction to push for making the recorded assessment calls
available to the engineers' ears. Meanwhile, in the revised IRB protocol, "listening" loses its
vagueness, and comes packaged with an implicit ethics: researchers will not attend to what
subjects say but how they say it, giving the impression that the content of speech will be
protected or somehow blocked out, and that form and content are distinct entities. Thus, while
Meredith and the PI made the revision because clinicians realized that the calls, as an auditory
event, could never be contained or kept away from the ears of others, the revision itself gives the
impression that the analysis of speech data will somehow offer privacy by keeping the real meat
of conversation-the content-out of audible reach. In so doing, the IRB revision reinforces the
semiotic ideology (Keane 2003) that language is primarily referential-that speech form is an
accessory to real speech meaning, and that the sounds of speech are a superficial "garb" that can
be stripped away from signification (Keane 2005).
Immediately following IRB approval of the amendment, the engineering team's own
attention shifted to the kind of data that humans, rather than preassembled algorithms, could
gather if they listened to the calls. Hassan, a post-doc in his early 30s working under Meredith
who had completed his PhD in the Middle East, insisted that the team gather together a group of
undergraduates to join the engineering team and begin labeling the audio recordings so that they
could start to build a predictive algorithm based on "human judgment." Making an application
that captured and could replicate an "average" person's intuitive interpretation of how a voice
sounds required a complex, heteromated assemblage and the coordination of many moving parts.
There were the cell phones themselves, the databases through which the calls passed, the audio
recorded call files which had to be sorted, filtered through, and categorized. There were the
diagnostic inventories for interpreting the content of subjects' speech and determining how
"bipolar" they were that week, the systems for labeling the sounds of subjects' speech, and the
different and sometimes competing modes of listening and interpretation that these two
quantification processes entailed. In order to lay out the stakes at play in the cell phone project
and foreshadow the issues that will be probed in subsequent sections, the following section
reviews the data collection process, tracing the research subject's speech as they utter it into their
phone's microphone, as it moves across databases, between BPU offices, and through different
classificatory regimes.
MAPPING THE DATA PIPELINE
Subjects enrolled in the cell phone study are given a retrofitted smart phone within which the
engineering team has installed the study app. The app, which is always running so long as the
phone is powered on, appears discreetly on the home screen of the study phone as a simplified
version of the MWU logo. During the 6 to 12 months in which subjects are active in the study, the
team encourages subjects to use the study phone in place of their personal phone, as often as
possible. In addition to making phone calls and sending text messages, this includes using social
media applications installed on the phone and the phone's web browser. For some subjects,
enrolling in the study afforded access to a smart phone for the first time, or access to their own
smart phone for the first time. Many had only ever owned a flip phone with no Internet
capabilities or had a single smart phone that they shared among spouses, partners, or dependents.
The BPU paid for unlimited cellular and data plans on the study phones, and so most research
subjects were enthusiastic to use the phone and its Internet capabilities without having to worry
about running out of data or minutes. This indeed was one of the allures of participating in the
study. It was not uncommon for research subjects to try and strike a bargain at the end of the
study, proposing that they keep the cell phone in place of collecting the stipend they had earned
for their participation.63
In addition to using the study phone as often as possible, subjects had to participate in
weekly phone calls with a BPU clinician, which the team referred to as "assessment calls."
During these 20- to 30-minute assessment calls, the clinical team member or staff-person
(typically, staff-woman) would ask the subject a series of questions based on two gold-standard
assessment scales: the Hamilton Depression Rating Scale (HAM-D) and the Young Mania
63 Participation in medico-scientific trials often unlocks access to resources that are unevenly distributed-
particularly access to medical resources (Rapp 2000; Petryna 2009; Nguyen 2010; Benton 2015). Cell phone study
subjects' pleas to forgo cash payments in exchange for continued use of the Internet-enabled study phone speak to
how desirable and how unevenly distributed digital communication resources are in the United States.
Rating Scale (YMRS). Based on the subjects' answers to questions from these scales, clinical
team personnel would assign two numerical scores to the call, quantifying how manic or
depressed the subject was that week. These scales are so-called clinician-rated inventories rather
than patient-rated inventories like the Beck Depression Inventory (discussed in Chapter 2). In
other words, clinical professionals (social workers, diagnosticians, clinicians, etc.) generate a
score according to their interpretation of a patient's responses, rather than patients themselves
generating the score. The accuracy of the scores generated through clinician-rated inventories
hinges on the clinical professionals' interpretive abilities as well as their verbal strategies for
establishing trust and rapport with research subjects. The clinical team's scores had broad
implications for the study as a whole, and the design specifications of the study's eventual
outcome: the cell phone application.
Assessment calls were divided into three "mood" classes, concurrent with the scoring
criteria associated with the YMRS and HAM-D. If the call had a YMRS score of 10 or more and
a HAM-D score less than 10, then the call fell into the "manic" category. If the call had a YMRS
score lower than 10 and a HAM-D score greater than 10, then the call was "depressed." If both the
YMRS and HAM-D score were less than 7, then the call was "euthymic," meaning
asymptomatic or "normal," neither definitely manic nor depressed. Although all of the subjects
needed to have a diagnosis of bipolar disorder in order to participate in the study, in order for
their calls to be included in the dataset, the clinical team member had to score at least two of
their calls in the "symptomatic" range. Otherwise, the subject and their hundreds of phone calls
were excluded from the dataset altogether.64
64 Calls were also excluded if the audio quality was poor-if there was too much background noise that made it
difficult for the engineering team members to hear the subject's voice. For instance, if the subject used headphones
during the call or if they made the call on speakerphone, the microphone picked up more background noise than
usual.
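The sorting rules for assessment calls described above can be restated as a short sketch. This is an illustration only, not the BPU team's code: the names `AssessmentCall`, `classify_call`, and `included_subjects` are hypothetical, the thresholds are taken literally from the description above, and "symptomatic" is assumed here to mean a call classed as either manic or depressed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class AssessmentCall:
    """One weekly assessment call with its two clinician-assigned scores (hypothetical type)."""
    subject_id: str
    ymrs: int  # Young Mania Rating Scale score
    hamd: int  # Hamilton Depression Rating Scale score


def classify_call(ymrs: int, hamd: int) -> Optional[str]:
    """Return the mood class for a call, or None when none of the stated rules applies."""
    if ymrs >= 10 and hamd < 10:
        return "manic"
    if ymrs < 10 and hamd > 10:  # "greater than 10", read literally
        return "depressed"
    if ymrs < 7 and hamd < 7:
        return "euthymic"
    return None  # e.g. both scores high, or scores between the thresholds


def included_subjects(calls: Iterable[AssessmentCall], min_symptomatic: int = 2) -> Set[str]:
    """Subjects with at least `min_symptomatic` calls classed as manic or depressed (assumed reading)."""
    symptomatic = Counter(
        call.subject_id
        for call in calls
        if classify_call(call.ymrs, call.hamd) in ("manic", "depressed")
    )
    return {subject for subject, n in symptomatic.items() if n >= min_symptomatic}


# Made-up scores, for illustration only.
calls = [
    AssessmentCall("A", ymrs=12, hamd=3),   # manic
    AssessmentCall("A", ymrs=4, hamd=15),   # depressed
    AssessmentCall("B", ymrs=2, hamd=3),    # euthymic
    AssessmentCall("B", ymrs=11, hamd=5),   # manic, but B's only symptomatic call
]
print(included_subjects(calls))
```

Under these rules as stated, a call scored 8 on both scales, or one scored high on both, matches none of the three classes, so `classify_call` returns `None` for it.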
In this way, the clinical team played an incredibly crucial role. Their judgment of the
subjects' weekly symptoms determined which calls were bipolar enough to even count as data.
Their scores determined which of the calls the engineering team would annotate, which had
implications for the capabilities and limitations of the predictive cell phone app's algorithmic
infrastructure. If the data upon which the app was built only included extreme cases of bipolar
symptom manifestation, then the app itself would only be capable of identifying these extreme
cases. The clinical team's interpretive practices and the broader teams' classificatory strategies
also concretized the image of bipolar disorder as a disorder of extremes, and of clear-cut
binaries, an image that does not align neatly with the experiences of all people living under the
diagnosis of bipolar disorder. The team excluded what one of Emily Martin's ethnographic
interlocutors referred to as the "white spaces" of being bipolar: moments of stasis, in-
betweenness that do not fall clearly on one end of the two pathological poles or another, or lapses
in symptomal experiences altogether (2007: 187).
As soon as the research subject presses the "call" button on their BPU-supplied phone,
either answering or initiating a call, the app begins making a recording of the sound that the
phone's microphone captures. The app is designed to only record the sound captured by the
study phone's microphone. It does not record the audio of the subject's interlocutor or the voice
of whomever else is on the line (because only the research subject has consented to have their
calls recorded, and it would be prohibitive to try and consent anyone and everyone that the
subject spoke to over the phone). After the call ends and either the subject or their interlocutor
presses the "end" button, the recorded audio file is encrypted, transformed into a data file which
cannot be listened to, and then stored directly on the phone. Once the phone is connected to Wi-
Fi, the encrypted file is uploaded from the phone's storage to a secure server somewhere on the
main MWU campus. The files are then decrypted and arrive at their final destination in the
form of a listenable audio file: a Depression Center database managed by Chen, a Chinese data
scientist in his late twenties on the engineering team. The uploading and decryption process
takes 24 hours, after which subjects are given the option to delete the data file from their phone's
storage folder.
When clinical personnel first sign research subjects on to the study, reviewing the IRB
protocol and securing their informed consent in one of the BPU's windowless offices, Chen joins
them, explaining this whole process, demonstrating how to delete the files, and how to connect to
Wi-Fi if the subject has never done so. If they like, he helps them set up a passcode on their
phone to help ensure their privacy. This is one of the rare occasions in which an engineering
team member interacts face-to-face with research subjects. Otherwise, engineering team
members interact only with a recording of the research subjects' speech.
A considerable amount of time passes between the point at which the calls arrive at
Chen's database and the point at which engineering team members begin listening to and
annotating them. Only when the subject has completed the study and they have an entire 6-12-
month corpus of calls does Hassan begin processing the calls of subjects that fit the
"symptomatic" criteria. Hassan uses an algorithm of his choosing (the COMB-SAD algorithm)
to comb through the hundreds of calls and cut the audio into short segments, 3 to 30 seconds
long, that contain continuous speech. Despite the aid of the algorithm, Hassan's task of
filtering and splitting the phone calls was time-consuming and laborious. He would work through
the day and into the night until dawn, monitoring the algorithm as he ran it over a single research
subject's audio files, checking to see that the algorithm had segmented the calls correctly by
listening to samples of segments to confirm that the sample of segments he selected contained
more speech than sound. As a final step and in an effort to discourage annotators from paying
attention to the content of the call, Hassan shuffled the segments out of chronological order.
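The last two steps of this workflow, keeping only segments of 3 to 30 seconds and shuffling them out of chronological order, can be sketched as follows. This is an illustration only: it does not reproduce the COMB-SAD algorithm, which is treated here as an upstream step that has already produced candidate segments, and the `Segment` class, the `prepare_for_annotation` function, and the fixed random seed are all hypothetical names and choices of mine rather than anything described by the engineering team.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    """One stretch of audio cut from a call (hypothetical structure)."""
    call_id: str
    start_s: float  # offset of the segment start within the call, in seconds
    end_s: float    # offset of the segment end within the call, in seconds

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def prepare_for_annotation(
    segments: Sequence[Segment],
    min_s: float = 3.0,
    max_s: float = 30.0,
    seed: int = 0,
) -> List[Segment]:
    """Keep segments within the stated length bounds, then shuffle them.

    Shuffling removes chronological order, so an annotator cannot follow
    the thread of a conversation from one segment to the next.
    """
    kept = [s for s in segments if min_s <= s.duration_s <= max_s]
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    rng.shuffle(kept)
    return kept
```

The sketch handles duration and ordering only. Checking that a segment contains enough speech, the part of the work that Hassan verified by ear, is not something it attempts.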
Even after his long hours and careful efforts, Hassan and his algorithm did not always
catch all of the errant, noisy or speech-less segments. Sometimes, annotators would come across
segments that contained no speech and only contained ambient sounds (the sound of a car
backing up, the sound of a door opening and closing, the sound of a dog barking). One subject
had a large collection of pet birds. In many of that subject's segments, the overlapping
conversations of their pet birds overpowered the sound of the subject's one-sided conversation
with the clinical team member; the human's speech was inaudible over the birds' speech. Other
segments contained elongated sighs, or laughs, or coughs. Hassan and Chen instructed annotators
to manually mark these kinds of segments to be excluded from the dataset, categorizing them as
either "too noisy" or "not enough speech." There had to be enough speech-like sounds for the
annotators to listen/not listen to and judge; it was too difficult to determine the emotional feel of
a sigh, cough, or laugh in abstraction, without the presence of other forms of speech. Thus, while
the clinical team members controlled the boundaries and definition of what constituted
symptomatic speech, the engineering team was responsible for distinguishing the boundary
between significant sound and insignificant sound, defining what counted as "noise." It was up
to the annotators, myself included, to demarcate the threshold between meaningless speech
(containing only sounds) and meaningful speech (containing enough words).
All of this interviewing, scoring, categorizing, filtering, excluding, and shuffling
produced the segments that I helped to listen to and score yet again alongside two MWU
undergraduates, Josh and Aubrey, in our role as annotators on the engineering team. When I
arrived at the BPU in May 2017, MWU's IRB had since approved the revised protocol and the
engineering team had begun analyzing and adding their own numerical value to segments of the
assessment calls that were symptomatic enough. In total, at the time of my fieldwork, the dataset
contained audio-recorded calls from 43 participants, with about 21 weeks per subject, totaling
39,445 calls featuring over 2,880 hours of speech. Within the dataset of symptomatic subjects,
audio recordings were divided up into two categories: assessment calls and personal calls.
Assessment calls, which made up 933 items in the data set, are calls conducted with a BPU
clinical team member. 65 Personal calls, on the other hand, are all other calls made with the study
phone. Toward the end of my fieldwork, the BPU made yet another IRB amendment, asking for
permission to access the personal phone calls of consenting participants. After a clinical staff
person individually called and re-consented all 43 of the participants in the data set, during my
last month at the BPU, the engineering team and I began to sift through these personal phone
calls, processing them and preparing them for annotation by removing segments that contained
indefinable information from the dataset.
[Figure: user interface of the annotation software, showing the two numbered rating scales.]
65 These calls were conducted around 2015, when the IRB passed the amendment. A number of the BPU clinicians
who conducted them had long since left the BPU to pursue graduate careers.
The user interface of the annotation software features two scales, visualized with the schematized outline of
a person. On the activation scale, a small grain within the center of the person's torso grows more and more erratic
as the numbers increase. The person above the 9 rating, the highest possible rating for activation, has exploded. On
the valence rating scale, the person above the number one is frowning deeply. The frown slowly transforms into a
smile, with the broadest smile occurring above 9.
Although both modes of listening (annotation and assessment) were necessary to
assembling the basic skeleton of what would one day be the predictive algorithm, clinical and
engineering team members listened with different and at times conflicting ethical and
acoustemological imperatives, ideas about the relationship between mood, emotion, and speech,
and standards for achieving objectivity. The teams were physically disconnected, sat in separate
offices, and did not cross paths often, aside from Chen's brief interface with research subjects
during the consenting process. Although the PI held monthly meetings for everyone at BPU to
attend for the purpose of updating each other on their work, the engineering team rarely if ever
attended these meetings. Members of the clinical team involved in the cell phone study also
attended infrequently. Josh, Aubrey, Hassan, Chen and I all sat in the same office, talked often,
and at times passed the headphones around so that we could all listen to and discuss the same
troublesome, noisy, or difficult to annotate audio segment. We would eat lunch together in the
office or at a picnic table overlooking the rippling grass fields behind the geriatric hospital.
Hassan would lead mini lectures on machine learning, guiding Josh, Aubrey, and Chen through
calculating the agreement across our labels, and helping Chen troubleshoot data management
issues. Afterwards, I would brew the team a new pot of coffee in the communal kitchen, ferrying
the coffee back to the engineering office in old BPU mugs and single-use paper cups.
The clinical team, on the other hand, was far larger and far less cohesive, consisting
primarily of university employees rather than students, post-docs, or faculty. Though many
junior clinical team members (like Lauren, whom I will introduce in the next section) sat in the
same shared, open-plan office, they were all responsible for different tasks. More senior team
members (Adele, Cheryl, and Rochelle) had individual offices and held managerial or
supervisory positions over the junior women. If they visited the open-plan office, it was to
delegate tasks to the junior women who worked there, or to partake in the snacks that women in
the office left on a communal table in the middle of the office.
ASSESSMENT: PROFESSIONAL LISTENING
In this section, I focus on the work of conducting weekly assessments. This will serve two
purposes. First, it will help set up a comparison between assessment and annotation, two ways of
listening to research subjects' speech that are necessary to building the app's predictive
algorithm. Secondly, it offers an expanded case study on the expertise and skills that conducting
effective psychiatric assessment requires, a counterpoint to the treatment of assessment as
unskilled labor. I home in on the assessment call techniques of Lauren and Rochelle, two
clinical team members at different stages of their careers with different degrees of
professional experience. In so doing, I underline that assessment is a skilled practice, while
distilling the ethical commitments that the listening of psychiatric assessment (versus the
listening involved in psychiatric treatment) entails for seasoned clinicians on the teams.
I find the term "professional listening," a play on Goodwin's "professional vision"
(1994), useful for making sense of what it is BPU clinical team members do when they lead an
interview over the phone with a research subject whom they may have never met and to whom they
may never speak again. Echoing Haraway's (1988) contention that there is no "god's eye
view" (no objective or neutral standpoint from which to interpret the world around us),
Goodwin defines professional vision as "socially organized ways of seeing and understanding
events that are answerable to the distinctive interests of a particular social group" (606). As
Cristina Grasseni (2004) argues, seeing is always looking: the pointed directing of sensory
attention to some components of a phenomenon, marking them as significant while letting other
components fall into the background and to the edges of attention. Goodwin is not only
interested in asserting that all sensory interpretation is situated and "perspectival" (606). His
work is often concerned with detailing how ways of interpreting and making meaningful streams
of sensory phenomena are "lodged within endogenous communities of practice," that is, how they are
indebted to professional norms and chained to historically articulated, disciplinary conventions
(ibid.). In other words, he writes against the privileging of expert interpretations, like the
testimonies of expert witnesses in court, as superior and objective simply because they bear the
moniker of expertise.
At the same time, Goodwin forwards an anti-essentialist approach to studying how
professional frameworks of interpretation are related to a professional's object of scrutiny. He
emphasizes that professional modes of sensory interpretation are reinforced and relayed through
material artifacts and materially grounded practices (like coding schemes, pointing, and
highlighting), which play a pivotal role in guiding, hewing, and regimenting what the
professional expert ultimately sees or hears. For instance, Tom Rice (2010) provides an excellent
example of the formation of "professional listening" as it unfolds in the day-to-day practices of
medical auscultation training, in which students learn how to interpret the meaning of the body's
internal sounds mediated by a stethoscope. For seasoned physicians and
apprentices of medical auscultation alike, as for my informants, biological sounds are not inherently meaningful on
their own. Instructors must work to make bodies audible to their students, directing them on how
to listen, and how to listen through technical apparatuses. In the same way, clinical team
members at BPU always listen to research subjects' calls through their disciplinary frameworks,
such as psychiatric conceptualizations of mood, emotion, and personhood, and through the
discipline's tools for quantifying fluctuations in affective states. Thus, "professional listening" is
an especially useful analytic for understanding how the assessment inventories used in the
interviews (HAM-D and YMRS) structure and guide how clinical team members attend to and
interpret speech, shaping what the clinician is listening for.
HAM-D and YMRS are tools of standardization and alignment, technologies for
converting subjects' responses into numerical values that represent how depressed, manic, or
"normal" the subject is on any given week. In order to complete the task of assessment,
clinicians must learn to interpret subjects' responses through the inventories, and through the
dual, opposing lenses of "mania" or "depression." Inventories not only shape what clinical team
members listen for through the phone (what information is salient, what is less important, and
what should be foregrounded); together with the task of assessment itself, they also shape what the
clinician says and guide how the clinician converses with the subject. Because clinical team
members must figure out how to encourage the disclosure of personal information that will help
them assess the subject's mood, they must learn to deploy verbal tactics for establishing rapport
and trust in the absence of any other cues beyond their own voice.
Lauren, a junior member of the clinical team in her early 20s who was new to conducting
assessment calls, had begun working part-time at the BPU while finishing her Bachelor of
Science degree in psychology at MWU. She had graduated in the spring and had transitioned to a
full-time position in the summer of 2015, planning to take a year off from school while applying
to graduate programs in social work. Outgoing and not the least bit self-conscious, Lauren agreed
right away to let me sit next to her at her desk and observe her conducting assessment calls,
scheduled for Wednesday afternoons and early Friday mornings.
Lauren sat in the open-plan office alongside five or six other junior clinical staff. Like
Lauren, most of the women were white, recent college graduates who had majored in
psychology. Their desks were positioned closely together, offering little personal space. All at
once and without appearing distracted, Lauren and her office mates would take phone calls
related to the longitudinal study: making or canceling appointments, clearing up billing errors,
scheduling life history interviews, helping with recruitment, and so on. At any given time during
the day the office was buzzing with their various phone conversations, or with their casual talk as
they stood around a long, rectangular table in the center of the office, labeling and packing
research subjects' blood samples into Styrofoam containers filled with dry ice. This table sat next
to the table where they would gather snacks to be shared with everyone else at BPU. Aside from
their desks, the table for sorting blood and the snack table, the room was filled with filing
cabinets containing reams of old paperwork and the "swag" subjects received commemorating
various participation milestones (a pen for one year, a water bottle for five years, a tote bag for
ten years).
I struggled to concentrate on Lauren's voice alone as I listened alongside her every
Wednesday and Friday. It was easy to become preoccupied with the other, unrelated calls or
conversations going on around me, including the conversations of people milling in and out of
the office to peruse the snacks or chat with the women sorting blood. Lauren's calls were short,
sometimes under 20 minutes. Lauren blamed the short length of the calls on the subjects she was
assigned to: two curt, middle-aged men who were not very symptomatic. She said the men
seemed annoyed with the calls, which I found odd. If they didn't like conducting these weekly
interviews, why had they signed on for the study? Eventually, one of the men dropped out,
telling Lauren that he found it tiring to constantly answer the same questions about how his week
had been and how he was feeling, especially since he felt that he was doing quite well. Only
when this subject exited the study did I start to consider how difficult it might be to keep an
interlocutor engaged in a conversation that followed the same sequence and the same series of
questions every time, including questions that covered extremely personal topics (bowel
movements, thoughts of self-harm, shifts in libido, among other things).
As time went on, I found that Lauren followed more or less the same pattern for each
call, day after day, week after week, subject after subject. She would ask the questions in the
numerical order in which they appeared in the HAM-D, would pause waiting for the subject to
answer, and move on to the next question, asking a follow-up if the subject gave a one-word
response. Adele, Cheryl, and Rochelle told me that this rigid, recipe-like approach was a
necessary step in the process of memorizing and internalizing the inventory. Novices had to be
sure that the conversation was as standardized as possible and had to be sure that they asked all
of the questions. Only after they had committed the inventory to memory by rehearsing and
repeating it in the same generic format could they afford to be a little more creative with their
approach.
To conduct the assessment call, clinical team members used a supplemental HAM-D
guide called the SIGH-D (Structured Interview Guide for the Hamilton Depression Rating Scale).
HAM-D contains 21 "areas" associated with DSM-defined symptoms of depression (such as
depressed mood, insomnia, feelings of guilt, work and interests, motor control, suicide). The
SIGH-D breaks these areas up into categories of questions and follow-up questions. For
example, for the HAM-D "feelings of guilt" area, SIGH-D instructs the interviewer to ask the
interviewee, verbatim, "Have you been especially critical of yourself this past week, feeling
you've done things wrong, or let others down?" Underneath the main questions, which should be
asked using the exact language that appears on SIGH-D, are a series of suggested follow-up
questions or "anchors." The interviewer can use the anchors to encourage the interviewee to
respond more specifically to the primary question, for example: "if YES [to the first question]:
what have your thoughts been?" "Have you thought that you've brought (THIS DEPRESSION)
on yourself in some way?" "Do you feel you're being punished by being sick?" Anchors are tied
directly into the scoring guidelines. For "feelings of guilt," a score of zero indicates the absence
of feelings of guilt, a score of 1 indicates "self-reproach, feel he [sic] has let people down," a
score of 2 is "ideas of guilt or rumination over past errors, sinful deeds," while a score of 3 is
"present illness is a punishment, delusion of guilt." Note that the guidelines for a score of 3
mirror the anchor for the "feelings of guilt" area ("do you feel you're being punished by being
sick?"). Cheryl, Adele, and Rochelle told me that novices tend to rely more heavily on the
anchors precisely because the questions prompt the interviewee to respond in a way that
corresponds directly with a score. While bordering on the tautological, this correspondence helps
train the novices in interpreting the subjects' responses.
[Figure: page from the SIGH-D.] The SIGH-D (HAM-D interview guide), featuring anchors in the left column and scoring criteria in the
right.
Because they had mastered the HAM-D, senior staff took a more holistic approach to the
assessment calls. They were required to ask questions in the exact language that appears on the
SIGH-D at some point during the call. But in so doing, they would collaborate with the subject to
craft a narrative 66 about the past week that just so happened to contain answers to the questions,
the information that they needed. This is what I heard when shadowing Rochelle, a senior
clinical team member in her forties who was previously employed as a social worker. She had
her own, private office, situated one long, air-conditioned hallway away from the junior staff
66 Mattingly uses the term "therapeutic emplotment" (1994) to describe how clinicians and patients co-construct
meaning out of injury by collaboratively creating a narrative trajectory of the patient's experiences from illness to
wellness. Discursively inserting the patient into this narrative structure and then accounting for their place within it
throughout the process of treatment, Mattingly argues, is central to the healing process in and of itself. While
interviewers like Rochelle cannot conduct therapy with research subjects over the phone, they display techniques of
therapeutic emplotment in order to make meaning out of the subject's report of their symptoms over the past week.
This is yet another method through which they establish rapport. Co-narrating the subject's experience transforms
the conversation from a one-sided interview to a more collaborative endeavor, giving the sense that the interviewer
and the subject are working together, rather than the interviewer guiding the interaction in pursuit of the information
she needs.
office. On the wall outside her office door hung a single painting, Magritte's Golconda, an
eerie and lonely image that contrasted so strongly with Rochelle's warm personality that it must
not have been hanging there by her choice.
Like Lauren, Rochelle was responsible for conducting assessment calls with two research
subjects. Rochelle would use the initial questions on SIGH-D but never used the anchors. She
would weave through the SIGH-D, flipping back and forth between the pages of the guide as the
conversation progressed, filling out the score as the subject spoke without interrupting. She
would always respond to whatever the research subject said, and asked follow-up questions by
inviting the subject to tell a story, even if what the subject had said had nothing to do with the
question she was focused on ("what happened next? How did you feel after your boss canceled
the meeting?"). She opened and closed the calls with questions about things that had happened to
the subjects the week before, or with topics that had nothing to do with mental health at all.
When she learned that one of her subjects had the same breed of dog as her, she interlaced bits
and pieces about their dogs into the call, sharing training tips or stories about trips to the dog
park. Although I knew that Rochelle was using the SIGH-D and HAM-D to structure the
conversation, it was hard for me to keep track of the inventory and guide as I listened; they
would melt away.
Senior clinical team members like Rochelle and Adele would tweak and manage the
impression of what they were looking for, carrying on the conversation by responding to the
subject's speech as if it were tied to the subject's self rather than tied to data or information
pertinent to the assessment scales. Adele had been a social worker for years, not only caring for
patients in a state psychiatric hospital but also conducting field research and site visits for a
federal research institute. She was the most experienced interviewer on the team, and so she
oversaw assessment call training, which involved junior members shadowing senior members
and then conducting a mock interview with a senior staff member. Adele admitted that novices
struggled the most with quickly establishing trust and rapport with a research subject, which
could have consequential results. Without a sense of rapport, in her experience, the subject was
more likely to give single-word, recalcitrant responses to the interview questions.
In my conversations with senior team members about the techniques I had observed,
they would cite the notion of "rapport" as the grounds for ensuring that research subjects disclose
private, personal information. In so doing, they would invoke the ideology of inner reference
(Carr 2010), a language ideology that circulates in American mental health care contexts that is
linked to, as we saw in Chapter 3, Social Penetration Theory (the "onion" theory of the self). As
discussed in the Introduction, according to the ideology of inner reference, speech is primarily
referential, and expresses a speaker's otherwise interior, hidden self. Adele often described
personal details as interior, as hidden, or below the easily observable surface, and described the
interactional achievement of trust and closeness in language that coincided with that of the psychology
team at the West Coast University Research Institute. As Adele put it, the goal of rapport is to
coax the speaker to "open up," to create conditions that enable the plumbing and excavation
of personal details buried "deep" within the subject. One way to open a subject up was to draw
on what you knew about them (their age, their gender) and go "off script" by asking them about
a subject matter that they might find interesting and that had nothing to do with their mental
health or the SIGH-D questions. If the research subject was male and the same age as Adele's
two sons, then she would draw on conversations she had overheard between them about the
hottest, newest video game to make small talk with the subject. This, she explained, would
impress the subject and put them at ease, giving them a chance to talk about something that they
knew and enjoyed: something that was personal but positive.
Adele and others also explained that an interviewer could achieve rapport through the
performance of symmetrical transparency of self, an explicit reference to techniques of Rogerian
psychotherapy (see Smith 2005). According to Adele, the best way to get a subject to give the
information that an interviewer needed was to "meet them halfway" in their excavation of self by
"giving a little bit of yourself" in return, that is, by sharing personal details. This could be as simple as
sharing that you have sons, or as benign as mentioning that you have a dog of a certain breed.
Rochelle and Adele both told me that mentioning these small details helped make the
conversation feel more than one-sided, giving the sense that both parties were sharing personal
information, rather than one party asymmetrically extracting data from the other. Note that these
are the same verbal practices that animated the bodily movements and non-verbal
communication of Abby, the virtual human interface of the system built at West Coast
University. This highlights, once again, that empathy is not necessarily an affectively motivated
state alone; it can be formulated through linguistic practices that shape a speaker's
interpretation of the listener's subject position.
At the same time, Adele was aware that sharing personal information to encourage
disclosure was only viable because clinical team members did not have a "clinical relationship"
with the subject. In the context of psychotherapy or psychiatric treatment, clinical professionals
must tightly guard and maintain the boundary between their self and the subject's self, always
attentive to how the professional's conceptualization of self might torque or contour the terms of
their relationship and therefore the nature of the conversation and the act of therapy itself. Since
the clinical team at BPU was not allowed to perform psychotherapy over the phone, they were in
no way responsible for or accountable to the subject's mental health and how the weekly phone
conversations might impact it. Their task at hand was to gather the data they needed for that
week: the HAM-D and YMRS scores. They could of course care about the subjects as people,
but the nature of their task at hand discouraged them from caring about the subjects as patients,
or as expressly clinical subjects.
Listening alongside the psychology team members as they conduct assessment calls
demonstrates that psychiatric assessment is a skilled activity, rather than a mechanical one.
Indeed, assessment is most successful when the inventories (the tools that guide and shape what
the listener is listening for) melt away into the background, giving the impression that
conversation is "just talk" rather than a genre of interaction. Assessment is a complex practice
that requires training, can be done poorly or well, and depends just as much on the abilities of the
interviewer to internalize the assessment scales as it does on the willingness or openness of the
interviewee. In observing and conversing with senior BPU clinical team members, it also
becomes clear that conducting assessment over the phone requires constructing yourself as a
very specific kind of listening subject vis-à-vis the research subject. The listening subject of
psychotherapy or other forms of psychiatric treatment is not the same as the listening subject of
psychiatric assessment. Unlike psychotherapy, assessment requires a shallow investment in the
research subject's wellbeing, not just at BPU but also across the discipline and practice of
psychiatry in general. 67 Clinical team members could deploy verbal strategies for encouraging
disclosure-such as the strategic performance of disclosing their own personal details-precisely
because they bore no responsibility over the terms of their relationship with the research
67 See, for example, the way assessment and diagnosis unfold in resource-poor public health settings, like the
emergency psychiatric unit that is the subject of Lorna Rhodes's ethnography, Emptying Beds (1991). In these kinds
of settings, clinical personnel do not treat assessment or diagnosis as epistemological practices but as bureaucratic
ones, aimed at moving patients through (or out of) the hospital. Given the uneven ratio of patients to personnel and
hospital rooms, in public mental health contexts, diagnosis and assessment become tools for divvying up vital
resources, rather than illuminating the inner truths of psychic pathology.
subjects, and how this relationship might impact their mental health. It wasn't that they didn't
care about research subjects. To say that they have a shallow investment in subjects' wellbeing is
not meant to be an indictment of their individual moral characters. Rather, as a matter of doing
their job and just as other people conducting psychiatric assessment do, they had to follow the
moral framework that their professional task at hand dictated: guiding the interview in order to
gather the data that they had been instructed to gather. In order to listen professionally, they had
to avoid listening personally. Here, "thin listening" is less about respecting the speaker's privacy,
and more about the listener taking care to not overstep professional boundaries, or perhaps, to
protect their own psychological wellbeing: a form of distancing to avoid getting overly attached
to the many research subjects whose troubles they cannot soothe.
ANNOTATION: LISTENING LIKE A COMPUTER
Recall the end goal of the cell phone study: map speech to emotion, and then map emotion to
mood, so that BPU researchers can track changes in vocal emotional patterns to anticipate
changes in mood. The purpose of assessment is not to provide treatment to research subjects but
to calculate-and numerically represent-fluctuations in research subjects' mood states between
mania and depression. In turn, when annotating the very same calls in which clinical team
members interview research subjects, annotators are supposed to calculate the emotional nuances
audible in research subjects' voices across the entire corpus of calls produced during their
enrollment in the study. While clinical team members focus on the semantic content of research
subjects' speech (their responses to the assessment questions), annotators are directed to ignore
semantic content and assign a rating that corresponds to the emotional "activation" (energy) and
"valence" (positivity or negativity) of the sound of the research subjects' speech. Altogether,
working on the same calls yet listening to them in different ways and with different
quantification scales, the engineering team and the clinical team render mood, emotion, and the
relationship between the two calculable.
As I have argued above, conducting assessment requires "professional listening" because
the professional norms, guidelines, definitions, and tools of psychiatry dictate how clinical team
members direct their conversations with research subjects, and how they coax out and interpret
subjects' speech. Assessment also requires professional listening because team members are
calculating changes in mood rather than changes in emotion, and mood occupies a more stable,
well-defined position within psychiatry as opposed to emotion. Bipolar disorder is defined by
and diagnosed due to radical changes in mood state, and mood as a category of experience
therefore requires more professional expertise in order to spot and understand. By contrast, the
listening associated with annotation-geared toward quantifying changes in emotion-is
decidedly un-professional. As the guidebook I helped the engineers write states, emotion (in the
context of psychological research on bipolar disorder) is easier to observe than mood. More than
that, since the PI wanted to capture and operationalize a "gut instinct," the lack of professional
attunement to emotion was a benefit rather than a barrier. The PI's directive of "training a
computer to listen like a brain" suggests a mode of listening that is radically transparent-a
mode of listening that is im-mediate. Training in clinical psychology and a familiarity with the
various scales and instruments for measuring mood might present another layer of mediation.
Thus, under the PI's direction, Hassan had selected two undergraduate interns in computer
science to assist with the annotation task: Aubrey, a Chinese-American junior at MWU majoring
in computer science, and Josh, her white American cohort-mate. Like Aubrey and Josh, my own
lack of training in psychology is part of what made me such a viable candidate to lend another
set of ears to the annotation task.
In order to teach a computer to listen like a brain, however, the annotators found
themselves in a position of having to "listen like computers," a compelling phrase that Chen first
invoked when directing us to avoid paying attention to speech content and base our ratings as
much as possible on speech sound alone. Thus, annotation differs not only from assessment, but
also from the other listening practices in operation at the two other fieldsites. For instance, at
West Coast University, the goal was to build a machine interface that convincingly performed a
certain image of "empathic" listening as an automatic act, an image that depended upon keeping
hidden the young female researchers, whose own listening animated and guided subjects'
interactions with the VHI, and upon reproducing a professionally and racio-ethnically marked
listening habitus. At the BPU, on the other hand, rather than building a machine that appears to
be listening like a human, human annotators were asked to build a machine that listens like a
human by listening like machines: attending to sound without processing, internalizing, or
understanding content, disavowing the person attached to the voice. This presented an
insurmountable tension: the directive was to listen intuitively, to go with our guts, but also to
listen to speech in a way that was completely alien if not altogether impossible. In trying (and
failing) to listen like computers, the annotators eroded the model of language upon which the
entire project rests: a model of language in which form can be wrested apart and held separate
from meaning. In their trials and tribulations of trying to listen like a computer, the engineering
team also deflated the lofty promise of an im-mediated "listening brain."
Not unlike Meredith, Hassan was a self-described conservative engineer. In his opinion,
before the app could be built or deployed in a clinical context outside of an experimental set-up,
the team would first need to assess whether or not a human listener could consistently, auditorily
interpret emotion in a uniform way. Thus, the many labels that Aubrey, Josh and I added to the
audio segments would not be used to build the cell phone app-Hassan had us label the data
instead to calculate the statistical agreement across our labels (quantifying the extent to which we
all agreed with each other's labels). Within the first few weeks of meeting him, Hassan admitted
to me that he had put together the annotation task precisely because he thought it was impossible.
He did not anticipate that Aubrey, Josh and I would agree with each other's labels in a
statistically significant way, and his hunch proved correct. If humans were incapable of
consistently agreeing upon the emotional texture of the sounds they heard in speech, Hassan
would say, it was unreasonable to expect that an algorithm could identify these features in a
robust, meaningful, or accurate way. His cynicism about the project's overall goal was born from
skepticism about the status of emotions as stable objects of analysis. He would compare
automated emotion recognition with voice-to-text translation, and with speaker recognition, the
two main problem spaces of his doctoral thesis work. Unlike automated emotion recognition,
automated speaker recognition "can be done, because at least we know that the speaker actually
exists and we know who he is"-the speaker's identity, and his existence, can be confirmed.
Emotions, on the other hand, reside in a world beyond the calculable, material realm. He would
gesture to the door of the engineering office, which he left half-open if we weren't having a
meeting: "We cannot say that the door is open or closed. It is something in between. It is fuzzy."
To Hassan, the existence of emotions was an ontologically uncertain matter, not fit for computer
science.
Nevertheless, if Hassan was going to prove the entire premise of the study wrong, he was
going to do it in a consistent, systematic way. Hassan designed the annotation task and software
interface we used to rate segments in an effort to make emotion less fuzzy. Just as HAM-D and
YMRS make mood tractable, stable, and quantifiable, holding it in place, the annotation task
required a technology through which the team could attempt to pin down, concretize, and reify
emotion. Hassan opted to use the dimensional model of emotion for the annotation task, a model
popular among computer scientists and speech signal processing experts studying the
relationship between emotion and speech quality. In the dimensional model of emotion, all
potential human emotions can be plotted within four quadrants, defined by an X-axis of
activation (speech energy) and a Y-axis of valence (speech "color" or "charge"). 68 For the task,
Aubrey, Josh and I listened to the same set of audio segments, and rated the "activation" and
"valence" of each segment on a scale of one to nine. For example, a "one" corresponded to low
activation and low valence (low levels of energy with direly negative sounding speech) while a
"nine" corresponded to high activation and high valence (high levels of energy with effusively
positive sounding speech). Over time, as we annotated more and more segments and even began
to re-annotate segments we had already listened to, we began to hear subjects' speech through
these numbers, rather than listening to the speech and figuring out how it might fit into the
rating scale.
What do activation and valence mean, and how did the engineering team come to make
activation and valence meaningful for themselves? Early one morning about a month into
annotating the audio segments, the PI had walked in through the office's half-open door just as
Aubrey and I had taken our respective seats in front of the two desktop computers designated for
annotation. The PI had come looking for Hassan, who had not yet arrived, and in Hassan's
68 The dimensional model of emotion offers a more analog, less binary, and broader space for defining emotion as
opposed to the more traditional linear model that posits a set series of possible emotions ("anger," "happiness,"
"disgust," etc.)
absence he struck up conversation with Aubrey and me. Groggy from the hour-long bus ride
from the main campus area to the BPU, Aubrey and I vaguely relayed that our work was
difficult, and we were often unsure of ourselves. It wasn't easy to concretely quantify activation
and valence, two scales of qualification that we had never used (at least not consciously) when
making sense of speech in our day-to-day lives. The PI wondered if we knew the historical
origins of these terms and asked for a pen and piece of paper to diagram out a lesson for us. He
explained that the terms can be traced back to the criteria that Kraepelin had used to plot out
fluctuations in his patients' moods in his long-term study of "manic-depressive insanity." He
suggested we read Kraepelin's book of the same title to help us swim through the annotation task
with more ease. Kraepelin's category of "volition," he told us, is related to "activation," and
what he called "emotion" is linked to "valence." Satisfied with his work, the PI left, and I set
aside the college-ruled diagrams he drew for us.
When Hassan arrived two hours later, we presented him with the drawings and asked if he
knew of these origin stories, or if he had ever read Kraepelin's book. He took the ruled paper in
his hands, turning it upside-down. "I myself have no idea what these words mean, activation and
valence, and I can't even read this. Aubrey, when I hired you, did I draw this diagram and make
you read some 300-page book from 200 years ago?" "All I remember," answered Aubrey, "is
that you showed up like twenty minutes late."
Despite his uncertainty, as the most senior member of the engineering team, Hassan
trained incoming annotators. It fell upon him to define these terms for the people working under
him. Like "emotion" and "affect," the team had its own working definitions of activation and
valence. When I first met Hassan, he explained that, "activation means excitement-does the
speech sound calm or excited? And valence means the negativity or positivity of the speech
signal-is the emotion in the signal negative, neutral, or positive?" He would demonstrate the
distinction between activation and valence, as captured in the "speech signal," using himself and
his own voice as an example. "For instance," he would say, "you might not know that I am
depressed, because you can hear, right now, that I sound happy, relaxed. Low activation. High
valence." His example resembled something he said often when he arrived in the office for the
day, usually after a night of working late until sunrise and video-chatting with his wife, who was
finishing her PhD in computer science many states and time-zones away. He'd throw up his
hands, smile, and announce as he stood in the doorway, "Hey guys, life is great! I am
miserable!" Considering what I knew about him-he was always working, he had to live a
three-hour plane ride away from his wife, his future job prospects were uncertain, he and his wife
could not return to Iran for the foreseeable future and struggled financially-it was hard to
determine when his sarcasm veered into genuine truth.
I do know that his declaration was meant to make us laugh rather than pity him. Hassan's
humor tended toward the dire, and often hinged on a rift between what he said, how he said it,
and what he believed. This disconnect, he would tell the engineering team, was characteristic of
what he called "Persian humor," akin to sarcasm in U.S. American English. He would often state
something very serious and sincere sounding, in a grim, austere tone, only later to reveal that he
had been joking, and that he did not hold dear whatever he had said. For example, although he
had consented to be my research subject and agreed to allow me to record our day-to-day
activities and conversations in the engineering office, every now and then he would pick up my
audio recorder, point to it, and ask, in an astonished, accusatory tone, "what is this? You are
recording me? I never agreed to this. Turn it off. This is a disgrace. How can you ask me to
participate and give me nothing in return?" When I would frantically apologize and jump from
my seat to turn off the recorder, he would say, quietly, with a smile, "Beth, Beth. I'm kidding. I
am joking," and everyone else in the office would groan and laugh. Like his demonstration of
activation and valence, Hassan's joke inadvertently challenges the idea that a listener-like an
annotator-can arrive at the sincere, authentic core of speech. It forwards a kind of opacity claim
about emotion: there can be a discord between what a person says, how they sound, and how
they feel or what they truly believe. Indeed, Hassan demonstrated that the correspondence
between beliefs and practice can be shaky and murky with his participation in the entire project
altogether: he did not believe its central mission was possible, and yet he helped the team pursue
it nonetheless.
Hassan and Chen also offered critiques of the project's central claims through their own
inability to participate in the annotation task, which they found impossible to do. Both of these
men were English-language-learners. They had only recently come to the U.S. and were required to
speak English in order to get through their days and move about the world. Discussing whether
or not it was possible for them to "listen like a computer" enacted a critique of the universal
subject position that the formulation of "the listening brain" implies. One such conversation
came up during one of Hassan's machine learning training sessions. We were all huddled around
the small whiteboard in the engineering office, taking notes as Hassan explained how to calculate
the concordance correlation coefficient (or CCC) of all the annotation ratings that Aubrey, Josh,
and I had produced (in the service of calculating the extent to which we agreed with each other).
During a lull in the conversation, I suggested that it might be an interesting experiment, just for
fun, to have Hassan and Chen annotate the segments, calculate their agreement, and then
calculate their agreement with the three native English-speaking annotators. Though Hassan
replied with his usual pessimism, Chen saw an interesting opportunity in my thought experiment
(my emphasis added):
Hassan: of course it's make a difference [if it is Chen and Hassan annotating], I mean...
Beth: ((laughs))
Hassan: It [the agreement] will be zero
Aubrey, Chen: ((laughing))
Hassan: "The agreement will be zero"...I cannot even understand that sentence
Chen: That's good!!! We don't want you [to] understand it!
Beth: But that's what I mean! Maybe that's
Chen: We don't want to [understand]!
Hassan: --even, I cannot understand emotion
Beth: yeah
Hassan: I, I listened to a couple of them [segments], I ((ughh)) what is this? ((laughing))
Chen: That's computer!
Hassan: for example Josh and Aubrey--
Chen: You're a good computer system!
By Hassan's estimation, he and Chen will not agree with each other at all because of his
difficulties with English. He jokes that he struggles to understand the semantic content of
descriptive sentences ("the agreement will be zero") and so understanding the emotional nuances
of a sentence is out of the question. Understanding the emotional nuances expressed in speech is
an even higher order, demanding challenge. However, Chen recognizes Hassan's lack of
understanding, especially of semantic content, to be a tantalizing opportunity, a resource for
"pure" listening to sound alone. Chen imagines that it would be relatively easy for Hassan to
disentangle sound from content, since he does not intuitively combine the two when he listens to
and interprets streams of speech.
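The agreement measure named in the whiteboard session above, the concordance correlation coefficient, can be illustrated with a minimal sketch. The text does not record how Hassan implemented the calculation, so what follows assumes Lin's standard definition of the coefficient for a pair of raters; the function name, the annotator labels, and the ratings are all hypothetical and are not data from the study.

```python
from itertools import combinations


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for two equal-length rating lists.

    Uses population (1/n) moments. Returns a value between -1 and 1, where 1
    means the two raters assigned identical scores to every segment.
    """
    if len(x) != len(y) or not x:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        # Both raters gave the same constant score to every segment.
        return 1.0
    return 2 * cov_xy / denominator


# Hypothetical activation ratings (scale of one to nine) for six segments.
ratings = {
    "annotator_a": [5, 7, 3, 8, 5, 2],
    "annotator_b": [5, 6, 4, 9, 4, 2],
    "annotator_c": [6, 5, 5, 7, 6, 4],
}

# Agreement for every pair of annotators.
pairwise = {
    (first, second): concordance_correlation(ratings[first], ratings[second])
    for first, second in combinations(sorted(ratings), 2)
}

for pair, value in pairwise.items():
    print(pair, round(value, 3))
```

Under this definition, two annotators whose ratings rise and fall together but sit a constant distance apart are still penalized: the ratings 1, 2, 3 against 3, 4, 5 yield a coefficient of 0.25, although the two sets are perfectly correlated. The measure thus registers whether annotators use the scale in the same way, not only whether they hear the same contours.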
Chen was obsessed with the idea of purity and achieving "pure" listening without
"cheating," i.e., without paying attention to content. Chen's vision of unmediated, pure listening
takes the form of what Chion (1990), Scheffer (1996), and others call "acousmatic listening":
attention to sound without regard to its source, the cause of the sound, or the force motivating it
(see also Kane 2014). Listening to the excerpts acousmatically was central to the task of building
the cell phone study's algorithm, which breaks the acoustic components apart from the
denotational components of discourse. Chen was constantly chiding Josh, Aubrey and me for
cheating, especially when we began discussing segments that we were struggling to annotate.
Our conversations made clear to Chen that we were indeed listening to (and absorbing) the
content of segments, since we would use contextual information about the speaker in order to
refer to their vocal characteristics and the segment in question, like "the guy who works for
Uber," or "the woman who owns many pet birds" or "the guy who has a difficult relationship
with his mom and went to see a psychic about her." Chen would intercede in a hushed voice,
"cut it out guys. Stop that. No more of that," jerking his head in the direction of the office door
left ajar, worried that someone in the hallway might hear our frank discussion of not just the
content of the segments but also the research subject uttering them. Chen imagined that an
English-language-learning speaker could turn this "understanding" off, and much more easily
exit the realm of semantic meaning, protecting the privacy of the subject and preventing
agentive, focused listening from sliding into the unfocused absorption of hearing.
Yet as our conversation in the engineering office progressed, Hassan shattered Chen's
dreams of the American English language-learning speaker as a "good computer system."
Instead, Hassan began to suggest that the capacity to identify and characterize "emotional"
components in speech, or even discern the difference between sad speech and angry speech,
depends on one's native language. In this way, being a non-native speaker is a hindrance rather
than an advantage. This also suggests that emotional features of speech are not universally produced
or universally understood (emphasis added):
Beth: So like [...]let's say you overhear Adele talking in her office you can't hear what she's
saying, but you can like...would you be able to tell, oh she's having a good conversation. Or oh
she's [angry] somebody's in trouble, like...
Hassan: Ehh actually initially I suggested this to Meredith, I suggested that let's uh because we
don't wanna concentrate on content, so let's ask some...non-native--
Beth: Yeah
Hassan: --Speakers to listen to it
Aubrey: Mm
Hassan: Then I, uh listened to a couple of them [segments]
Beth: And it was too--
Hassan: And was like....I have no idea--
Beth: --hard ((laughs))
Hassan: --have no idea what that...so it seems that
Chen: Maybe you're just being honest
Hassan: Yeah we are not focusing on content
Beth: Yeah
Hassan: But we are not eh still...able to focus on acoustic eh acoustic features of emotion
Beth: Yeah...because it's, because it's not, I think you were right
Chen: --is
Hassan: Because it's, it's behind the phonemes...I cannot pronounce phonemes correctly, so
I don't know...the correct place of this phoneme
Beth: Yeah
Hassan: how can I know the correct place of [the] angry version of this phoneme?
Hassan had initially shared Chen's hunch that a non-native speaker might be uniquely situated to
perform the annotation task, but found himself coming up short when he sat down and tried to listen
to and rate the segments for activation and valence. Nevertheless, for a brief instant, Hassan is
seduced by Chen's insistence that perhaps Hassan struggled with the annotation task because he
was too "honest," again, because he was doing such a good job of not understanding the content,
which is the basic requirement of annotation. This reading flips the typical connotations of
"honesty" in Euro-American English on its head: rather than corresponding with transparency,
being "honest" in listening to sound rather than content keeps the semantic, referential meaning
of speech opaque and inaccessible. But Hassan returns to his firm position that he and Chen
cannot identify acoustic features of emotion so long as they do not know the standardized
"placement" 69 of regular phonemes. If they themselves struggle with producing standard
69 Standardized placement here refers to the oral production of speech sounds in a way that corresponds with their
representation in vowel and consonant charts, in which the sounds of American English are plotted according to the
positioning of the lips and tongue associated with their production. For instance, a "low back vowel" is a vowel
pronunciation, then they will struggle to identify and interpret the meaning of non-standard
pronunciation (an "angry" phoneme) in another speaker.
Hassan turned the thought experiment onto us, stating that, "if I talk in my language you
cannot say [if] I'm happy." Even if he were to laugh, this would not be a sure-fire indication of
the affective charge of the conversation. He reiterated to us his struggles to understand the
emotional nuances of paralinguistic components of speech, even when it was socially incumbent
upon him to do so. This was a steep hurdle to cross when he first arrived in the United States:
Hassan: [...] When I first came here, ah, I talk to you know American people and I was like....the
person is mad at me?
Beth and Aubrey: ((laughing))
Hassan: Somehow this person is- is-- it looks like the person is not really happy ((laughs)).
Maybe he's happy maybe he's not happy so, so, it's all-- I always have this problem that-- still I
have this problem that sometimes...I can't understand.
Even when Hassan needed to interpret people's expressions of emotion to get through his day, he
struggled to distinguish angry speech from so-called neutral speech, and he still struggles with
this to this day. Much to the delight of the anthropologist, our conversation was edging toward an
exciting conclusion: the engineers conceding that emotions are not universally expressed, might
not be universal, and therefore cannot be universally interpreted. It was Chen, rather than
Hassan, who put this breakthrough into words, describing what Hassan's experience implies for
the system that the engineers are building:
Chen: Yes, but then, I have a question to your computer system. Is your computer system, your
neural network, [it] has American culture knowledge?
Aubrey: ((laughing))
Beth: that's the--
Hassan: Yeah, yeah, yeah, exactly! It's language dependent
Beth: --but that's the, that's the point
69 (continued) produced with the tongue positioned low in the mouth (relative to the roof of the mouth) and bunched toward the
back of the mouth (relative to the mouth's opening).
Hassan: --language dependent yeah
Chen: so it has the culture, in it?
Beth: yeah
Hassan: Yeah it has a language specific knowledge [...]for example if you train a system based
on, for example, English language? Then you test it for Chinese language--
Chen: yeah
Aubrey: I guess [laughs]
Hassan: --it doesn't work, I mean
Aubrey: yeah
Chen: yeah
Finally, Chen comes out and says it: the neural network-the basis of the cell phone study's
predictive algorithm-does not have general knowledge or does not "listen like a brain." Instead,
the system will have "American culture knowledge" in it, because identifying emotion is a
culturally specific ability. Hassan adamantly agrees, and Aubrey, at first finding the idea funny,
concedes that technically Chen is correct because the system has "language specific knowledge,"
later agreeing that she and I have access to knowledge about the link between emotion, sound,
and speech that the two men do not. The cell phone app, if it ends up being built, will be limited
by the knowledge and experience of the people responsible for building it, and so it will identify
and therefore define emotion according to the limits of this knowledge. As Hassan notes, the
system could not identify the emotions of Mandarin or Cantonese speakers, because it is based
on the language-specific knowledge of native speakers of English.
Together, the engineers unpacked the "human," challenging the cell phone study's
biological essentialism. Pushing this line of thinking to its next step implies that, if the vocalized
expression of emotion is not universal, then perhaps what the ECU team calls "vocal
biomarkers" may not even exist. Even acoustic features of speech are wrapped up in cultural
mediation from which they cannot be disentangled. Moreover, this implies that not everyone has
access to the same "intuitive hearing" because listening is a cultural practice and listening to (and
distinguishing) emotion requires cultural or "language-specific" knowledge. They insinuate that
form and content are connected through cultural mediation that requires communicative
competence (which is more than just a "gut feeling") to grapple with and weed through. This
implies that the central project of the BPU cell phone study-to splice apart form from
meaning-is an impossible one; this goal ignores the very nature of language as both material
and semiotic.
Finally, Chen and Hassan's ruminations strike a chord with the observations of earlier
anthropological studies of technologists and the automated systems they build. Algorithmic
systems, then, are not capable of recognizing patterns beyond human capacities-they merely
reiterate patterned associations between qualities and types that already have a sociopolitical life
and historical trajectory (see Noble 2018). Engineers and computer scientists, the professional
practitioners deep in the weeds of building these systems, have a keen understanding of the
systems' limitations, and of the fact that they are subjective rather than objective eyes or ears
from nowhere.
INFRASTRUCTURES OF FEELING
Josh, Aubrey, and I were asked to listen to the calls and "follow our guts," use our "best
judgment" and "intuition" when scoring the activation and valence of a speech sound. We
internalized this language of intuition, and when debating over what score to assign to a segment,
we would tell each other that it really just felt like a 7 or a 3 or a 2. At the same time, Chen and
Hassan ordered us over and over again to focus on speech sound alone and eschew or at least
avoid discussing the content of the segments, an altogether counterintuitive mode of listening to
speech. In other words, our task was to treat the speech as familiar, drawing upon our intuitive
sense of its emotional sound, while simultaneously treating it as unfamiliar, as pure, acousmatic
sound rather than speech at all. In this section, I expand upon the discussion above about the
"cultural knowledge" that annotators might encode into the infrastructure of the cell phone
study's app. I also show the ethical, affective tensions that attempting to listen like a
computer can bring, through the struggles of both "getting to know" how a research subject's
speech sounded (getting a sense of what their neutral, asymptomatic, or 5-level speech was like)
while also disavowing the particularities and personal details of their conversations and
circumstances.
The team selected Aubrey, Josh, and me to annotate the segments in part because it was
so time consuming, and we were all relatively unskilled and lacked the technical training that
would've enabled us to help out with less menial tasks. Moreover, we were selected due to our
simultaneous (lay) expertise (in American English) and our in-expertise (in psychiatry and
psychology). Nevertheless, the annotation task was extremely difficult. Even as we began to
internalize something about the relationship between vocal qualities and the pathological states
of bipolar disorder, we struggled to put into words and verbalize what it was exactly that we
were rating, what about the subjects' voices motivated our decision to assign them the ratings
that we did. When I first began the annotation task, I spent a long time on each segment,
replaying a single segment over and over again, mulling over my choice. As time went on and
the rating scale began to inhabit my understanding of how bipolar disorder sounds, I could rate
many more segments per day than I had initially been able to. I resigned my judgment to the
scales.
Annotation ratings were supposed to be "subject dependent." The numbers should be
specific to an individual subject's speech patterns, which led to a problem. Ratings were not
supposed to depend on some external, universal standards, such as a general understanding of
how speakers of American English typically speak (in terms of the activation and valence of
their speech). Instead, whenever it was time to begin annotating a new subject, we had to spend a
good amount of time clicking through and replaying the segments without assigning a rating, all
in order to develop a sense of what activation and valence sounded like for that particular person.
This required determining how their most neutral speech sounded-speech that was not
extremely energized or lethargic, and speech that was not clearly exuberant or sad. We referred
to this as figuring out the subject's "five" speech: speech that was rated at a five for activation,
and a five for valence. The concept of five speech ratifies the notion that, for people experiencing
bipolar disorder, non-pathological speech coincides with the absence of emotion, which
insinuates that there is something inherently pathological about any sort of emotional experience.
The annotation software interface reinforced this notion that five ratings coincide with the
absence of activation and valence-the absence of emotion-while also visually reinforcing the
meaning of the numbers themselves and a model of emotions emanating internally, from a
person's self. The schematized figure of the human over the number five for valence wears a
blank expression, and the grain of activation in the center of the torso of the figure above the five
for activation is a reasonable size (unlike the figure above 9, which has been enveloped by an
explosive cloud of energy).
Five speech became an anchoring point for helping us figure out how to annotate
segments in which the activation and valence were unclear. We would pass the headphones to
each other, describing our sense of the segments' proximity or distance from the feeling of their
five. Because we all annotated the same segments, we could consult each other about
troublesome ratings. Aubrey might insist, "well her five speech is kind of activated-sounding,"
or Josh would assert, "he sounds pretty close to neutral." The headphones and the
annotation software also helped to calcify the feeling of fiveness. The ability to focus intently on
the subject's speech, blocking out all other sounds in the office with the sound-canceling
headphones, along with the ability to pause, rewind, and replay the segment an endless number of
times allowed us to auditorily scrutinize the speech in a way that the team members making
assessment calls never could. In addition to passing the headphones around and the affordances
of the annotation software, as we annotated, we would leave notes for each other on a communal
notepad, like notes along a map: this particular subject was "good for depression" (they had
many segments with low activation and low valence). This other subject had lots of noisy
segments that were difficult to hear. In this way, we began to collectively establish the meaning
of fiveness, forging tacit knowledge about the data set. We brought the rating scale into
existence by turning annotation into a social activity.
By the end of my four months at the BPU, we could describe subjects' speech to each
other using the numbers alone. A calm and contented subject was a 6-8 (relatively neutral
energy level in the voice, relatively positive sound of the voice). A disgusted or annoyed subject
was an 8-4 (fired up with aggravated energy, slightly perturbed coloring to the voice). We would
often tease each other by rating each other's speech or process a tense moment at a BPU-wide
meeting by later joking with each other about the activation and valence of two people who had
been arguing. I once said something incredibly embarrassing to the PI in the hallway and later, as
I rolled around on the office floor in shame, the others stood around me laughing and debating
about my speech: my activation was definitely at a 9, but valence was confusing-I was
mortified but also, humbly, laughing at myself.
The numbers took on an affective charge, a sensation rather than, ironically, something
that we could quantify. The annotation software with its 1-9 scales, our conversations, our notes,
and our own ideas about the slowness of depressed speech and the quickness of manic speech
scaffolded and constituted our "intuition." Altogether, these technologies co-constituted the
affective texture of the subject's speech. This process of internalization underscores that affect is
not something inevitable or pre-lingual, but a feeling that must be held together by systems of
quantification. If assessment depends on professional listening, then annotation depends on
annotators collectively building and fortifying infrastructures of feeling, practices and scales that
make sound meaningful but that can also successfully meld into the background, disappearing
altogether.
In addition to a feeling for the five of the research subjects' speech, Aubrey, Josh and I
shared something else, something much more intimate: an understanding of just how sick some
of the research subjects were, how much some of them struggled, how much some of them
appreciated their weekly calls with the psychology team member. We might also learn about
how well they were doing, about high points in their lives, or about how much they wanted the
phone at the end of the study rather than the participation stipend. While we could strive as much
as possible to listen like a computer, it was impossible to fully reject the presence of the person
uttering the sound. There was something weighty about the whole process, and while Aubrey and
Josh appeared to be managing better than me, at times, the research subjects' audio excerpts
would arrest me, piling up on me in an invisible way. One subject had such frenetic, frenzied
speech, that only hours later when trying to fall asleep in my apartment did I realize with an ache
that I had been clenching my jaw all day while listening, hiking my shoulders up to my neck.
The subject's anxiety had made its way into my head, into my body. Some subjects spoke
frankly about their desire to take their own lives, with organized, detailed plans. For those
subjects, I was relieved that the BPU is a mental health care facility and that while they cannot
provide psychotherapy over the phone, the team does have trained clinicians who can and will
assist subjects who are in this much distress. These kinds of segments were particularly hard for
me to forget, to pull the sound away from the semantics, the sentiment away from the person.
Kraepelin wrote of his patients, during bouts of mania, hearing auditory hallucinations,
voices they described as emanating from God, from spirits, or voices as if through a telephone.
During my four months annotating segments at the BPU, it was I who heard telephone voices at
night, voices I was supposed to be ignoring and forgetting but voices I could not fully sever a
connection from. As I wrote one sleepless evening in my fieldnotes, "like a drop in barometric
pressure, they squeeze me, I am contained by them" (July 19, 2017). I lacked the kind of training
that Adele and Rochelle had, the kind that enabled them to keep a part of themselves closed off
and safe from the affective weight of conducting psychiatric assessment. The very thing that
made me an excellent in-expert subject for annotation was also what kept me up at night. During
these moments, I would remember what Jacob had told me, like an incantation: do not forget
about the people on the other end. Even if absorbing the audio segment's contents was
"cheating" and I was potentially violating the study's IRB protocol during these late-night
remembrances, perhaps I was also honoring Jacob's request, holding space for the humanness
that would be built into the study's algorithmic system.
CONCLUSION: TECHNOLOGIES OF CARE
Some scholars argue that the ubiquitous presence of sensors and personal computers that track
and capture vast volumes of data, or communication technologies that offer always-open
channels of contact, leads to a world of disconnection, in which people lose the capacity to feel
authentic, genuine intimacy (Turkle 2011). But, as I have hoped to show, technologies like cell
phones offer a form of connectivity that can be life-saving, as was the case with Jacob, as well as
the research subjects whose participation in the study gave them the opportunity to chat with a
mental health care worker once a week. That so many of the research subjects wanted to keep the
BPU data and internet-enabled smart phone in lieu of the payment speaks to the fact that, like so
many other resources, the hyper-connectivity of communication technologies is asymmetrically
distributed. Access to a smart phone is a luxury for some before it even has the chance of
transforming into a source of pathology.
Other times, the kind of connectivity that these technologies (and the making of them)
require is too close. In the annotation task, Josh, Aubrey and I had to fight a losing battle with
disconnection. We were supposed to keep our selves separate from the lives and stories of the
research subjects but listening/not listening to the calls granted us a strange, inescapable
intimacy. The problem, then, is not that cell phone applications disconnect people. It is that, in
building the kinds of apps and devices that my informants at the BPU sought to develop, users
and builders become radically connected, and are strung together in a relation that (at least for
me) can feel overwhelming. By ending on an ambivalent note and speaking honestly about how
the annotation task at times disturbed me, I hope to emphasize the extent to which building any
sort of voice-analysis technology, whether for mental health interventions or not, is no light or
inconsequential matter. Annotating the audio segments fundamentally changed my perspective
on devices like Google Home or the Amazon Echo. I now understand that, regardless of what the
companies producing these devices insist, human listening plays an unavoidable role in their
development. The presence of a human listener somewhere in the data pipeline, who listens to
and annotates audio segments, is a design feature.
Recent investigative reporting (Day et al 2019; Van Hee et al 2019; Vincent 2019) has
indeed revealed that both Google and Amazon rely on outsourced laborers to auditorily weed
through and annotate the audio segments that users freely pass on to these companies through
their use of the technologies-by interacting with them, speaking to them, the devices capture
and process the user's voice segments. In this coverage, the "eavesdropping human" is
contrasted with the "listening machine." I have hoped to show that this opposition is a false one.
These two are one and the same-in order to make machines listen, you need human listeners.
Unlike my informants at the BPU, who are ultimately committed to improving the lives of
people living with bipolar disorder and interrupting pathological experiences before they can
begin, Amazon and Google have no ethical review structure, no IRB to answer to. Under a
neoliberal model of consumer choice in which consent is far murkier, and the terms of service
are buried in pages of text, the protection of users' privacy (and the mental health of annotators
who listen to their speech) is far shakier.
Within this unregulated space, the outsourced annotators are at great risk as well. Just as
scholars conducting ethnographic research with the content moderators who keep social media
sites like Facebook safe, clean, and free of disturbing images have called for greater regulation
and access to mental health services for content moderators (Roberts 2019), my fieldwork
suggests the dire need for occupational hazard oversight in commercial applications of machine
listening. Strides toward this goal are indeed being made in academia. Several researchers have
testified that building voice analysis technologies for mental health applications does indeed
carry the potential for psychological harm (see Wolters, Mkulo and Boynton 2017). For example,
writing in an article that reviews efforts to develop voice analysis technologies for suicide and
risk assessment, a group of computer scientists and engineers warn,
There are a range of potential health risks to investigators associated with collection of
severely depressed and suicidal speech...direct interaction with depressed and suicidal
individuals during collection or subsequent exposure to recorded data during tasks such
as annotation can lead to research health risks including vicarious trauma and depression.
The risk is magnified in researchers with non-clinical backgrounds, who might be
unfamiliar with either condition (Cummins et al 2015: 37-38).
The authors suggest a variety of best practices that involve explaining mental health risks to
investigators, minimizing exposure to audio recordings and avoiding the use of headphones.
They also suggest that investigators preview the excerpts and consult with a trauma psychologist
before agreeing to take on the work, and that they regularly consult with psychologists and colleagues
while conducting the work. My fieldwork in the academic realm, with its ethically squeamish
moments and my complicity in them, is a microcosm of the troubles and perils at play on the
global scale of so-called smart speakers, listening devices, and the outsourced listening/not
listening that enables them. The connectivity and closeness that unsettled me should spur us into
action to call for the reform of Big Tech, and to suggest that Big Tech look to the academic
realm for insight on how to build these technologies with greater concern for the privacy and
safety of everyone involved.
Moreover, it was only in shadowing the members of the psychology team at the BPU that
I came to better understand-and respect-the complex skills that psychiatric assessment
requires. I gained a respect for this job that informed how I looked at the data gathered at my
other sites. It was only after meeting and shadowing Adele, Rochelle, and Lauren that I began to
realize the extent to which automating psychiatric assessment delegitimizes this job. Learning
from them led me to double back on the data I had gathered at the other sites and deepened my
analysis of the VHI. Even though they were not in a position to conduct psychotherapy over the
phone-to care for the subjects as patients-they were all extremely committed to the larger
project of finding a way to help people who live under the diagnosis of bipolar disorder. I often
felt during my fieldwork that I was in no place to critique the cell phone study and some of its
more ethically questionable components, such as listening to people's phone calls. Everyone on the
team was committed to making a material difference in the lives of their patients, via the cell
phone study. Like the team members at other sites, people working at BPU tended to be
motivated by their own encounters with mental illness-among family members, classmates,
siblings, partners, friends-which made their work on the cell phone study quite literally close to
home. For instance, one of Adele's first jobs was at a long-since-closed state mental hospital.
She witnessed the treacherous depths and teetering, dangerous heights of bipolar disorder first-
hand while working this job. As she went about her day-to-day tasks, sometimes administering
injections of the anti-psychotic drug Thorazine, she encountered severe cases of patients whose
conditions were full-blown. She returns to these encounters, she told me, to keep her motivated.
These memories compel her to work at the BPU, and to assist with the cell phone study.
At the same time, it is key to avoid sentimentalizing efforts to provide care and to resist
taking for granted that well-intentioned motivations absolve providers of care and those
wrapped up in building mental health care interventions from critique. If anything, what my
fieldwork at the BPU shows is that care itself is ambivalent, murky, poly-vocal, and
contradictory. Adele and others could not, technically or legally, care for the research subjects on
the phone. The annotators and I, technically, were not supposed to care about the content of the
calls we rated. In an effort to pin down what "care" can contain and contradict, Martin, Myers
and Viseu (2015) write frankly about the ambiguous and sometimes violent "politics of care in
technoscience":
acts of care are always embroiled in complex politics. Care is a selective mode of
attention: it circumscribes and cherishes some things, lives, or phenomena as its objects.
In the process, it excludes others. Practices of care are always shot through with
asymmetrical power relations: who has the power to care? Who has the power to define
what counts as care and how it should be administered? Care can render a receiver
powerless or otherwise limit their power. It can set up conditions of indebtedness or
obligation. It can also sediment these asymmetries by putting recipients in situations
where they cannot reciprocate. Care organizes, classifies, and disciplines bodies. Colonial
regimes show us precisely how care can become a means of governance. It is in this
sense that care makes palpable how justice for some can easily become injustice for
others (627)
The fact that Adele and her colleagues did not invest themselves in the research subjects'
emotional wellbeing is part and parcel of care. By this, I mean that neglect and harm are not
opposed to care-they are care's constituents. To parse out what we might call the attentional
mechanisms of the two modes of listening (annotation and assessment, both of which require
selectively ignoring some components of speech while attending to others) of building the predictive
algorithm for the cell phone study, is not to diminish the study's harmful consequences but
merely to always "stay with the trouble" (Haraway 2016) with care. Thus, I hope to have made a
case for the fruitfulness to be had in "unsettling care"-to poke at its taken-for-granted
implications and, through my ethnography, to "situate affection, attention, attachment, intimacy,
feelings, healings, and responsibility as non-innocent orientations circulating within larger
formations, instead of as attributes of individual scientists" (Murphy 2015: 6).
References
American Psychiatric Association. 2013. Diagnostic and statistical manual of mental disorders
(5th ed.). Arlington: American Psychiatric Publishing.
Barthes, Roland. 1977. Image-Music-Text. Stephen Heath, trans. New York: Hill and Wang.
Chion, Michel. 1990. Audio-Vision: Sound on Screen. Claudia Gorbman, trans., ed. New York:
Columbia University Press.
Clementz, B, and JA Sweeney, JP Hamm, EI Ivleva, LE Ethridge, GD Pearlson, MS Keshavan,
and CA Tamminga. 2016. "Identification of Distinct Psychosis Biotypes using Brain-Based
Biomarkers." American Journal of Psychiatry 173(4): 373-84.
Cummins, Nicholas, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and
Thomas F. Quatieri. 2015. "A review of depression and suicide risk assessment using speech
analysis." Speech Communication 71: 10-49.
Day, Matt, Giles Turner, and Natalia Drozdiak. 2019. "Amazon Workers Are Listening to What
You Tell Alexa." Bloomberg Technology, April 10.
<https://www.bloomberg.com/news/articles/2019-04-10/is-anyone-listening-to-you-on-alexa-a-global-team-reviews-audio> (accessed July 23, 2019).
Decker, Hannah. 2004. "The Psychiatric Works of Emil Kraepelin: A Many-Faceted Story of
Modern Medicine." Journal of the History of the Neurosciences 13(3): 248-276.
Dror, Otniel. 2001. "Counting the Affects: Discoursing in Numbers." Social Research
68(2):357-378.
Duranti, Alessandro. 1992. "Intentions, Self, and Responsibility: An Essay in Samoan
Ethnometapragmatics." In Responsibility and Evidence in Oral Discourse. Jane H. Hill and Judith T.
Irvine, eds. Pp. 24-47. Cambridge: Cambridge University Press.
Ekman, Paul and W.V. Friesen. 1971. "Constants across cultures in the face and emotion."
Journal of Personality and Social Psychology 17:124-129.
Ekman, Paul. 1989. "The argument and evidence about universals in facial expressions of
emotion." In Handbook of social psychology (Vol. 2). H. Wagner and A. Manstead, eds. Pp.
143-164. Chichester: Wiley.
Ekman, Paul. 1999. "Basic Emotions." In Handbook of Cognition and Emotion. T. Dalgleish
and M. Power, eds. Pp. 45-60. Sussex: John Wiley and Sons Co.
Foucault, Michel. 1978. The history of sexuality. New York: Pantheon Books.
Goffman, Erving. 1981. Forms of Talk. Philadelphia: University of Pennsylvania Press.
Goodwin, Charles. 1994. "Professional Vision." American Anthropologist 96(3): 606-663.
Haraway, Donna. 1988. "Situated Knowledges: The Science Question in Feminism and the
Privilege of Partial Perspective." Feminist Studies 14(3): 575-599.
Haraway, Donna. 2016. Staying With the Trouble: Making Kin in the Chthulucene. Durham:
Duke University Press.
Insel, Thomas R. 2017. "Digital Phenotyping: Technology for a New Science of Behavior."
JAMA 318(13):1215-1216.
Kane, Brian. 2014. Sounds Unseen: Acousmatic Sound in Theory and Practice. Oxford, UK:
Oxford University Press.
Keane, Webb. 2003. "Semiotics and the social analysis of material things." Language and
Communication 23: 409-425.
Keane, Webb. 2005. "Signs are Not the Garb of Meaning: On the Social Analysis of Material
Things." In Materiality. Daniel Miller, ed. Pp.182-205. Durham: Duke University Press.
Keane, Webb. 2008. "Others, Other Minds, and Others' Theories of Other Minds: An Afterward
on the Psychology and Politics of Opacity Claims." Anthropological Quarterly 81(2): 473-482.
Kraepelin, Emil. 2002[1921]. Manic-depressive Insanity and Paranoia. Reprint, Bristol, U.K.:
Thoemmes Press.
Lutz, Catherine and G.M. White. 1986. "The Anthropology of Emotions." Annual Review of
Anthropology 15: 405-436.
Lutz, Catherine and Lila Abu-Lughod, eds. 1990. Language and the Politics of Emotion.
Cambridge: Cambridge University Press.
Martin, Aryn, Natasha Myers, and Ana Viseu. 2015. "The politics of care in technoscience." Social Studies of Science 45(5): 625-641.
Martin, Emily. 2007. Bipolar Expeditions: Mania and Depression in American Culture.
Princeton: Princeton University Press.
Mattingly, Cheryl. 1994. "The concept of therapeutic 'emplotment."' Social Science & Medicine
38(6): 811-822.
Murphy, Michelle. 2015. "Unsettling care: Troubling transnational itineraries of care in feminist
health practices." Social Studies of Science 45(5): 717-737.
Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism.
New York: New York University Press.
Onnela, J. & Rauch, S. 2016. "Harnessing Smartphone-Based Digital Phenotyping to Enhance
Behavioral and Mental Health." Neuropsychopharmacology 41:1691-1696.
Puig de la Bellacasa, Maria. 2011. "Matters of care in technoscience: assembling neglected
things." Social Studies of Science 41(1): 85-106.
Puig de la Bellacasa, Maria. 2017. Matters of Care: Speculative Ethics in More Than Human
Worlds. Minneapolis: University of Minnesota Press.
Rhodes, Lorna. 1991. Emptying Beds: The Work of an Emergency Psychiatric Unit. Oakland:
University of California Press.
Rice, Tom. 2010. "Learning to listen: auscultation and the transmission of auditory knowledge."
Journal of the Royal Anthropological Institute 16(s1): 41-61.
Richardson, Sarah S., and Hallam Stevens. Postgenomics: Perspectives on Biology after the
Genome. Durham: Duke University Press.
Robbins, Joel. 2008. "On Not Knowing Other Minds: Confession, Intention, and Linguistic
Exchange in a Papua New Guinea Community." Anthropological Quarterly 81(2):421-429.
Roberts, Sarah T. 2019. Behind the Screen: Content Moderation in the Shadows of Social
Media. New Haven: Yale University Press.
Rosaldo, Michelle Zimbalist. 1982. "The things we do with words: Ilongot speech acts and
speech act theory in philosophy." Language in Society 11(2):203-237.
Rosaldo, Michelle Zimbalist. 1984. "Toward an anthropology of self and feeling." In Culture
and Theory: Essays on Mind, Self and Emotion. R. Shweder and R. LeVine, eds. Pp. 137-157.
Cambridge, U.K.: Cambridge University Press.
Schaeffer, Pierre. 1966. Traite des objets musicaux. Paris: Le Seuil.
Silverstein, Michael. 2001[1981]. "The Limits of Awareness." In Linguistic Anthropology: A
Reader. Alessandro Duranti, ed. Pp. 382-401. Malden: Blackwell Publishing.
Smith, Benjamin. 2005. "Ideologies of the speaking subject in the psychotherapeutic theory and
practice of Carl Rogers." Journal of Linguistic Anthropology 15:258-72.
Sterne, Jonathan. 2003. The Audible Past: Cultural Origins of Sound Reproduction. Durham:
Duke University Press.
Throop, Jason. 2010. Suffering and Sentiment: Exploring the Vicissitudes of Experience and
Pain in Yap. Berkeley: University of California Press.
Throop, Jason. 2003. "Articulating experience." Anthropological Theory 3:219-41.
Torous, John, and Adam C. Powell. 2015. "Current research and trends in the use of smartphone
applications for mood disorders." Internet Interventions 2(2):169-173.
Turkle, Sherry. 2011. Alone Together: Why We Expect More from Technology and Less from
Each Other. New York: Basic Books.
Van Hee, Lente, Ruben Van Den Heuvel, Tim Verheyden, and Denny Baert. 2019. "Google
employees are eavesdropping, even in your living room, VRT NWS has discovered." VRT NWS,
July 10. <https://www.vrt.be/vrtnws/en/2019/07/10/google-employees-are-eavesdropping-even-in-flemish-living-rooms/> (accessed July 23, 2019).
Vincent, James. 2019. "Yep, human workers are listening to recordings from Google Assistant,
too." The Verge, July 11.
<https://www.theverge.com/2019/7/11/20690020/google-assistant-home-human-contractors-listening-recordings-vrt-nws> (accessed July 23, 2019).
Wolters, Maria K, Zawadhafsa Mkulo, and Petra M. Boynton. 2017. "The Emotional Work of
Doing eHealth Research." Proceedings of the '17 CHI Conference Extended Abstracts of Human
Factors in Computing Systems. Pp. 826-846. Denver, CO, June 5.
Wynne, Brian. 1996. "May the Sheep Safely Graze? A Reflexive View of the Expert-Lay
Knowledge Divide." In Risk, Environment and Modernity: Towards a New Ecology. Scott Lash,
Bronislaw Szerszynski and Brian Wynne, eds. Pp. 44-83. London: Sage.
Zhan, Andong, and Srihari Mohan, Christopher Tarolli, Ruth B. Schneider, Jamie L. Adams,
Saloni Sharma, Molly J. Elson, Kelsey L. Speaker, Alistair M. Glidden, Max A. Little, Andreas
Terzis, E. Ray Dorsey, Suchi Saria. 2018. "Using Smartphones and Machine Learning to
Quantify Parkinson Disease Severity: The Mobile Parkinson Disease Score." JAMA Neurology
75(7):876-880.
Conclusion: An Ironic Dream of a Common Language
"He listened with grave interest. 'It is strange to see the mysteries of my discipline from outside,
through your eyes. I've only seen them from within, as a discipline.'
'If you permit-if you wish, Faxe, I should like to communicate with you in mindspeech.'
I was sure now that he was a natural Communicant; his consent and a little practice should serve
to lower his unwitting barrier.
'Once you did that, I should hear what others think?'
'No, no. No more than you do already as an empath. Mindspeech is communication, voluntarily
sent and received.'
'Then why not speak aloud?'
'Well, one can lie, speaking.'
'Not mindspeaking?'
'Not intentionally."'
- (Ursula K. Le Guin, The Left Hand of Darkness, 1969: 56)
In the exhibition hall where I stood with Hillary (WCU) to demonstrate the android for the study
that never happened, we were met with many questions. Many of the attendees were horrified,
showing their disgust on their faces as they listened to Hillary and me rattle off the script we had
agreed upon using to describe the android, its relationship to the VHI, and the study. These kinds
of attendees accused us of attempting to build a therapeutic system that would replace humans,
destroying job opportunities and outsourcing the fragile work of psychiatric care to a cold,
uncaring, inert object.
Sometimes, I shared people's disdain, and felt dissatisfied with the answers that I was
supposed to give as an erstwhile member of the team, working alongside Hillary and lending her
a hand. Secretly, I agreed with some people's moral outrage; I knew from our private
conversations in the long car rides we shared together that Hillary did, too. Although we were
mere, low-level representatives of the team, during the hours we stood flanking the android,
attendees would question the ethics of the entire VHI system. They would ask us how the
privacy of research subjects could ever be fully protected; some guessed that Hillary and I had
listened to the research subject's stories (which we had, because the team's IRB protocol enabled
us to, and because we had to, in order to "get to know the data"). Others questioned whether or
not it was right to use research subjects as data without providing them access to mental health
resources. One person wondered why we were not focused on the elimination of war and
imperial occupation itself, which is the root cause of veteran mental health issues. We were
scratching the surface, they would imply, without remedying the underlying cause.
One woman wearing a synthetic fur coat kept returning to ask these kinds of questions,
over and over again, a gadfly disrupting our script. She expressed open dissatisfaction at the
sound bites Hillary and I gave that promoted the study while re-directing people's fears and
anxieties. She told us that she did not buy our answers-she did not believe what we were
saying. It was her final remark on that day, however, that stunned me the most, shorting my
ethnographic circuits, transporting me out of the thicket of my fieldwork and back into the
broader, contemporary moment in which it was unfolding. "You're going to try to make them
our slaves," she said, venom in her eyes as she jerked her chin toward the robot's placid face.
"I've read enough sci-fi to know what happens next."
Unlike the vast majority of the other attendees, and unlike anyone else I encountered in
my fieldwork and the years following it, this woman was concerned not with the humans who
robots will "replace," but with the robots who humans create to take up humankind's boring,
dirty work. The woman's sentiment resonates with that of Rick Deckard, the protagonist of Do
Androids Dream of Electric Sheep?, an android bounty hunter who develops empathy for the
androids who have attempted to seek sovereignty from their human oppressors, and whom he is
tasked with killing. This is his job, the source of income to provide for him and his wife, and
hence the central, ethical conundrum that fuels the story. As Haraway notes, "the boundary
between science fiction and social reality is an optical illusion," an auditory hallucination (1991:
149). The accusations from the woman in the synthetic fur coat were perhaps the most accurate of
all, inasmuch as we take the figure of the robot to align with an abject genre of the human:
dehumanized, skilless, suited for repetitive labor and servitude.
Nevertheless, this woman most likely relies on and interacts with heteromated systems as
a seamless part of her life. Like so many of us, she no doubt depends upon and benefits from
mechanized labor and hidden, dehumanizing work, whether from the content moderators who
keep social media spaces like Facebook free of graphic images (Roberts 2019) or the miners who
pry from the earth bits of minerals that will be used to form the minuscule batteries and lenses of
an iPhone (Joler and Crawford 2018). This is precisely the point: the kinds of technologies my
informants are building are an abundant feature of contemporary life in the United States. Even if
we find them-and what it takes to make them-morally abhorrent, they are impossible to avoid,
and it is imperative to understand them from the ground-up, ethnographically, to do the
demystifying work of showing the humans in the loop.
To conclude, I settle upon this seeming contradiction-the woman, like Deckard, holding
sympathy for the robots and by extension, the people who run and resemble them. I meditate on
some of the dissertation's larger themes, stumbling toward a diagnosis of the present state of
language, linguistic labor, psychiatry, automation and care in the United States. Or perhaps, I
offer not a diagnosis but an assessment. After all, the purpose of psychiatric assessment is to
determine what kinds of questions to ask next. Given these readings of the data, where do we go
from here? What is the next move? If care is a selective mode of attention, where (and with
whom) should we direct our attention?
UNCANNY VALLEYS
Part of the tension I felt as an ethnographer at the symposium, and throughout my fieldwork,
came from my inability to answer some of the questions posed to me, and a realization that the
accusations thrown at my informants included my own actions as well. My position as a meta-
scientist, a hybrid ethnographer-researcher, was not an innocent one. As a member of the team, I
was implicated in their work, including the ethically squeamish portions of it. In the exhibition
hall, I could not respond authentically or honestly to the attendees' commentary-I could not
respond the way that I normally would. This was not the time to be openly critical about my
interlocutors' work. My job at the symposium was to assist Hillary and make life easier for her
and everyone else back at WCU, to do my part in avoiding bad publicity or any negative press
coverage of the project or the Institute. Given the precarious nature of funding at WCU and, to
an extent, across the three sites, people's livelihoods-and sometimes also their immigration
statuses-were on the line. Still, people held critical opinions about the very work that sustained
their livelihood and kept them in the country in which they wanted to live. They shared these
critiques with me, either explicitly or implicitly, through study design choices, small acts of
refusal, or in our everyday conversations about living in the United States. I have captured some
of these voices throughout the dissertation. Sometimes, for the sake of my interlocutors, I speak
them in my own voice.
When my interlocutors would share their critiques with me explicitly, the disclosure was
often followed by an insistence that I dig critically into the very work that we were all
participating in. As one informant remarked to me, there were many difficult stories to tell about
their technologies, the teams, the treatment of research subjects in psychiatric research, and the
institutions they were tangled with, but the stories needed to be told. The question that some
people I interviewed at WCU would ask-"am I allowed to say this?"-itself discloses much.
The question was a rhetorical one, because they were going to say whatever "this" was anyways,
regardless of my answer. People told me their secrets knowing that it was my job to tell others-
or at least, to tell anyone who is reading this thesis. They trusted that I would do so in a way that
could protect them from individual criticism as best as I could. Thus, to conclude from reading
this thesis that my interlocutors are all bad people, doing bad things, is to miss the point, and to
let everyone else (myself, you, dear reader) off too easily.
To return to Goffman's participation framework, in fieldwork as a member of the teams, I
was a mouthpiece-an animator-of their logic, of the technologies' proposed positive impacts.
I am also an animator of the critiques that they preferred I utter for them. Like my role at the
exhibition hall promoting the study, like my volunteer work at the community forum in the
Midwest, and when tucking subjects into the scanner on the East Coast, my language was not
always my own. To assist the teams with their research in exchange for participant observation,
to learn alongside them, required adopting and becoming conversant in my interlocutors'
epistemological life worlds (regarding psychiatry, assessment, language, signals) as well as their
ethico-moral life worlds (regarding the distinction between hearing and listening, the pragmatic
function of the IRB protocol, research subjects' sensitive, intimate stories). As Carol Cohn
(1987) writes with reference to her fieldwork alongside nuclear strategic analysts, whom she
caught herself understanding, getting along with, and even liking, I often caught myself in a
moment of slippage, in which the initial absurdness of my interlocutors' research melted away.
In trying to follow and understand vocal biomarkers, virtual humans, and telephone voices, I
would internalize the researchers' worldviews. I would justify that listening to people's phone
calls was not spying, because they had consented and the data had to be gathered somehow, that
it was inevitable to laugh at subjects in the scanner, and that I could watch as many videos of
research subjects as I deemed necessary because I was a member of the team. I truly felt the
feeling of a five.
Writing fieldnotes in the various apartments I occupied over these twelve months, I'd
drop my pen in revelation, wondering: maybe there are indeed biologically universal
components of mental illness. If so many therapists report that "everyone knows" depressed
people speak more slowly, then maybe there are universal vocal biomarkers, and maybe unlocking
them really could improve the lives of thousands, if not millions. My interlocutors were trying
to make a difference in the world, while I merely watched, an outsider come to gawk. Who was I
to learn from them and then walk away, only to critique the very people who had shown me
kindness and trust, with whom I had built rapport?
Michael M.J. Fischer uses the concept of the "ethical plateau" to describe "domains of
ethical challenge" in which it is difficult to know which direction to go in; ethical plateaus arise
when "new technological politics that initially seem like warning flags rapidly become absorbed
into routine markers of a changed common sense" (2001: 362). Think, for instance, of smart
listening devices like the Amazon Echo, discussed in Chapter 4, and how my informants' work
can be read as a sentinel, warning us of these far less regulated devices, attuning us to the chains
of labor and histories of de-humanization to which they are attached. These kinds of technologies
form "the ladder of the ethical plateau," which, Fischer suggests, "might provide a way to think
about how traditional critical social theories are being challenged to evolve in new directions"
(2001: 368). In addition to the ethical plateau, I encountered another topological formation in my
fieldwork, what I call ethical uncanny valleys, in which ethical frameworks are at once familiar
but strange. Like the example Freud uses in his original essay (2003[1919]), being lost and
returning to the same place, again and again, in an attempt to find our way, provokes a feeling of
the uncanny-a sense of strange return, of I've been here before. My fieldwork was full of eerily
familiar terrain, including the crisscrossing histories of psychiatry and computing discussed in
Chapter 1.
My use of "uncanny valley" offers a playful stretching of the original meaning of the
term. Masahiro Mori, a Japanese roboticist, originally developed the term in 1970, and the
translation into English as "uncanny" linked this concept with Freud's (Mori 2012).70 Mori
developed this term to describe the relationship between human-likeness and human affinity for
non-human objects. The closer a non-human object approaches a living, healthy human being in
its movements, appearance, and sound, the more grotesque it becomes, plunging below the level
of neutral affinity and into the negative zone. In Mori's uncanny valley, we find horrific
distortion with recognition at its center. For Freud, the uncanny is about a confrontation with the
darkest, seamiest, and inescapable part of the self.
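[Figure: graph of the uncanny valley. Horizontal axis: human likeness (50%, 100%). Labeled points: industrial robot, toy robot, prosthetic hand, healthy person. Labeled region: uncanny valley.]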
70 Norri Kageki (2012) suggests that the relationship between Mori's original essay and Freud's has been over-
determined, due to a less-than-accurate translation.
Mori's graph of the uncanny valley, depicting "the proposed relation between the human likeness of an entity and
the perceiver's affinity for it" (2012: 2). At a crucial vector, an object approaches almost complete human likeness
and the perceiver's affinity for it plummets. The uncanny valley exists in this below-zero zone; the resemblance
provokes horror and disgust.
In the ethically uncanny valleys of my fieldwork, there were gravitational wells that pulled
me deeper in, places in which to get stuck. Once stuck, I could better attune myself to the forces
that had drawn me there while also better understanding my own position and the surrounding
architecture. In these sinking spaces that are simultaneously home but also not home, where
fieldwork is also homework, it was difficult to tell: was I studying up? Studying sideways? Who
wielded power over whom? My informants occupied this position with me, a position that is
complicit but also ironic: one that is not entirely sincere. People-myself as an
ethnographer, my informants as research subjects with research subjects of their own-do not
always say what they mean or mean what they say. Through their actions, silences, and their own
misdirection, they can enact subtle critiques, subversions from the inside, flipping the script.
Moreover, rather than "studying those who study us," to use Forsythe's (2001) description of doing
ethnography with computer scientists who employ social science methods, I was studying those
who study like us. In observing my interlocutors build devices that captured people's speech and
enabled its circulation far beyond its context of utterance, subjecting it to analytic and theoretical
re-mediations that the person could never have anticipated, it felt like looking in a mirror, like
seeing my own discipline (anthropology) from the outside-in, even as I participated in it.
ETHICAL SOUNDSCAPES AND THE GOOD LISTENER
How to pull oneself out of an uncanny valley-that is, to recognize that alternative arrangements
are possible, ones that are new rather than oddly familiar? The first step, I believe, is to sit with
the queasiness these moments cause, taking them as opportunities for reflection rather than cause
for recoil. This means recognizing that the formations of these dips and dark places are not of
individual doing, but are structural, epochal, and tied to broader forces. As Puig de la Bellacasa
notes, "the purpose of showing how things are constructed"-and connected-"is not to
dismantle things" by denying their reality (2011: 82). To show how things-facts, affects,
ideologies, connections between states and sounds-are constructed rather than existing ab ovo
is not to reject their reality nor to "undermine...the powerful (human) interests they might reflect
and convey" (ibid). Instead, showing these connections is to "affirm their reality by adding
further articulations" (ibid).
In this spirit, Hirschkind's (2006) notion of an ethical soundscape-a sonic landscape
that surrounds us all-offers not an exit strategy from ethically uncanny valleys per se, but an
invitation to tune in to the ways that modes of listening and modes of self-fashioning run
together with politics and power, and how they impact our interactions with and response to the
people we share our spaces with. Hirschkind uses the ethical soundscape to describe how aural
media-in his case, cassette tape Muslim sermons-contribute to the "shaping of the
contemporary moral and political landscape" (2). He invites us to think through listening not as
passive reception, but as an active process that can be shaped by and also shape one's "ethical
sensibilities undergirding moral action" (9). The ethical soundscape does not just surround
us-it is more than an ineffable milieu. Like habitus, it is constituted by, circulated, and shaped
through repetitive practices, attuning one's mind, body, and affect in a way that encourages a
"technique of self-fashioning" (22) and "ethical sedimentation" (28). Think, for instance, of
Nava and Taylor repeatedly listening to research subjects' stories, and how this experience
informed the way in which they would interact with subjects after the interview. Think as well
about the other annotators and me, whose listening/not listening suggests the impossibility of
protecting user privacy in machine listening systems.
What else was so familiar, in its strangeness, about the teams' efforts to develop speech
analysis technologies for psychiatric screening, and the crashing together of language ideologies
and listening practices that they imply? What other modes of listening and ethical sedimentations
do the listening practices of the teams-and the intended mode of listening of their
technologies-point to, and reproduce? As already referenced, there is Alexa, the voice-activated
assistant of the Amazon Echo. The gendering of Alexa-the work it takes to avoid calling the
device a "she," and referring to what the device does as "listening"-displays the same kind of
coalescence of gender and labor engineered into figures like Abby. Moreover, scholars like
Virginia Eubanks (2017) argue that the automation of decision-making work in the service sector
and the corresponding devaluing of people conducting that labor (and devaluing of those who are
on the receiving end of the service work) is a feature rather than a bug in the United States. Her
discussion of how the automation of the welfare eligibility process changed the relationship
between caseworkers and their clients is another warning sign for what the automation of
psychiatric assessment might look like. Under eligibility automation, caseworkers no longer have
a single case assigned to them based on their location and the location of the client; a loss of
shared locality leads to a loss of contextual information about the client and their case.
Caseworkers are instead assigned a case through a workflow management system. Remarks one
of the caseworkers Eubanks interviewed for her study, "If I wanted to work in a factory, I would
have worked in a factory...You were expected to produce, and you couldn't do that if you
listened to the client's story" (Eubanks 2017: 63). Tweak the terms of the relationship, and the
caseworker must listen in a different way-listening to move the call along, listening
pragmatically and strategically rather than "to the client's story," to its personalized, narrative
texture.
The duplicitous nature of the listening involved in building speech analysis technologies
is also not unique to my informants' projects. They would prompt the subjects to participate in
producing speech or interactional encounters by emphasizing the very components of speech
they sought to downplay: its referential function. The study and development of deception work
in psychiatry has a rich and varied history in the United States. Like signal processing,
psychiatry is also wrapped up in the military industrial complex, and the extraction of
"intelligence" (meaningful, enemy data) from information.
In 1997, reporters from the Baltimore Sun retrieved the KUBARK Counterintelligence
Interrogational Manual through a Freedom of Information Act (FOIA) request. Originally
produced in 1963, KUBARK (the CIA's codename for itself) references the rapport-building
skills of psychotherapists as a source of inspiration for the interrogation tactics described in its
pages. For instance, in the annotated bibliography reference for Harry Stack Sullivan's
guidebook on the psychiatric interview (1954), the KUBARK authors note,
Any interrogator reading this book will be struck by the parallels between the psychiatric
interview and the interrogation. The book is also valuable because the author, a
psychiatrist of considerable repute, obviously had a deep understanding about the nature
of the inter-personal relationships and of resistance.
The release of the Hoffman report71 in 2015 detailed, with no minced words, the extent to which
the CIA relied on the American Psychological Association to justify and promote torture
interrogation tactics that violated international human rights standards. The APA also provided the
71 On July 2, 2015, David H. Hoffman of Sidley Austin, LLP, published an independent review that he conducted
with his legal team. The investigation uncovered, among other things, that the APA had re-written its code of ethics
to enable its members to participate in Bush Administration-sanctioned torture, and that members of significant
influence had been tapped to lead interrogation training efforts and develop torture tactics. See:
https://www.apa.org/independent-review/revised-report.pdf
government with psychologists-in-training (its younger members) to conduct interrogation
interviews. Under the Obama Administration, starting in 2009, the CIA moved away from
coercive (i.e., torture-driven) interrogation practices and toward non-coercive, research-based
strategies that emphasized rapport building, listening rather than questioning, and guiding a
suspect's impression as to how they were being listened to (Watkins 2017). This brings a whole
new, suspicious reading to the emphasis on rapport building and trust that is built into the VHI's
interface. Researchers designed their studies, as I have shown, in an effort to grasp the truth of
speech through practices that restrain the speaker's agency. The research validates a language
ideology-and attendant set of practices-which implies that the heart of language lies within a
secreted space, which must be wrenched open sometimes against the speaker's will. We should
be cautious in considering how this research might be taken up to justify tactics that the
researchers themselves would not agree with, without regard to their agency or their desire to
help rather than harm people.
If a central point of my ethnography has been to explore what it means to listen, issues of
ethics and responsibility bring a related question: what makes a good listener? I invoke
"goodness" here with reference to measures of skill and expertise within a professional
framework, and measures of moral and ethical goodness (the "good listener" in pursuit of
eudemonia-human flourishing, "the good life"). Within the cultural legacy of psychotherapy in
the United States, these two are interlinked. That is to say, the ideology of inner reference
(psychiatry's hegemonic language ideology) implies a moral framework and a set of attendant
linguistic and listening practices, with the implications that language (being primarily referential
and anchored in a speaker's self) is the grounds of intersubjectivity, and therefore leads to
empathy. At the same time, as the skilled listening of interviewers like Adele and Rochelle as
well as Abby illustrated, the intersubjective sharing of empathy can be an illusion, a strategic
performance. Being a good listener is both a professional practice and a skill that needs to be
socially reproduced. In my fieldsites, where the linguistic labor of being an empathic listener is
assigned to less experienced, lower-level researchers, because it is a task that "anyone can do," we
can see the consequences of these two frameworks fusing together. To be a good listener as a
social worker becomes indistinguishable from being a good listener as a human being. Listening
to the content of speech (and giving the impression that this is how speech is being listened to)
no longer appears to be a professional practice, or a professional skill that must be cultivated.
Good listening appears as a human capacity-something anyone who is human can do, and yet
also, at the same time, something that can be easily replicated, mimed, and performed by a non-
human machine.
FURTHER ARTICULATIONS
This dissertation has shown how dominant Euro-American language ideologies (of speech's
relationship to interior states) are the guiding logics behind the automation of psychiatric
screening, even while this ideology gradually unravels in the building of the technologies.
Through these technologies and their attendant labor relations, researchers seek to peel from the
core of speech a "layer" of pure affect, of the most universal components of mental illness. On
the one hand, machine listening implies a radical mode of im-mediation, and one that defeats the
human exceptionalism of language. To attempt to render speech into mere sound-a wave-is to
equalize it, to empty it of its species-specific particularities, to equate it with any other kind of
sound. On the other hand, what we might call "human listening"-listening to language
semantically, as an emanation of the speaker's self-is posed as the unique domain of the
human, a mode of listening that a machine might mimic but never fully recreate.
This contradiction highlights something crucial about the nature of language ideologies,
which are, as Kathryn Woolard writes, "the mediating link between social form and forms of
talk," between ideals and practices (1998: 3). Woolard reminds us that the "ideology" of
language ideologies has multiple meanings. One such meaning implies the commonsense, that
which is known and taken for granted about the world, "derived from, rooted in, reflective of, or
responsive to the experiences or interest of a particular social position," although ideologies
often move through the world as if "universally true" (6). There is also a Marxist tradition to the
study of language ideologies: language ideologies provide distorted, illusory rationalizations of
how language works, including the relationship between language and interaction. They are, as
Foucault might put it, "power-linked discourses" that map incompletely onto the world as people
experience it (Woolard 1998: 7). Thus, language ideologies justify practices that are attached to
power, but they are not inevitable.
My dissertation has largely argued that the language ideologies my interlocutors pursue
and that fuel their efforts are entrenched in histories and hierarchies of value that stretch beyond
the individual projects, that are calcified and continually reinforced in psychiatric practices, and
in the practices involved in gathering and classifying research subjects' speech data. But are
there any new, further articulations to be found? What are the contemporary inflections of these
ideologies, at a time when speech has moved from the medium of the air to the medium of the
Internet, an era of "alternative facts," fake news, Deep Fakes, conspiracy theorist "crisis actors"?
And in an era where the relationship between utterances, intentions, authenticity, and truth, is
newly in discussion?
Perhaps the projects do indeed point to something new. But I also believe that what feels
new about the current state of things in the United States for some, feels old for others. Maybe
what is new is the dawning realization that the relationship between utterances and action,
speech and sincerity, is a figment of the liberal democratic imagination that has never universally
held true for all. Following the election of Donald Trump, indigenous STS scholar Kim TallBear
(2017) has asserted that native people living in the place sometimes referred to as the United
States have long been living in a post-truth world since the arrival of settlers, and since the
blatant, open, and continual violation of land treaties. Likewise, for so many living in the United
States-disabled, trans, gender non-binary, non-white-the word of the law extends
asymmetrically.
My informants, in their technological intervention into the relationship between speech
and agency, show the fabricated nature of this connection. There is justice work to be had in the
exposing of these seams. This is also why the invocation of "post-humanism" gives me such
caution and pause. The post-ness of the post-human implies a completion, a sense of finishing. In
the words of Ruha Benjamin,
"posthumanist visions assume that we all have had a chance to be human. How nice it
must be...to be so tired of living mortally that one dreams of immortality. Like so many
other 'posts' (post racial, postcolonial, etc.), post humanism grows out of the Man's
experience. This means that, by decoding the racial dimension of technology and the
ways in which different genres of humanity are construed in the process, we gain a
keener sense of the architecture of power-and not simply as a top-down story of
powerful tech companies imposing coded inequality onto an innocent public. This is also
about how we (click) submit, because of all that we see to gain by having our choices and
behaviors tracked, predicted, and radicalized" (Benjamin 2019: 32; emphasis original).
Looking ahead and imagining alternative visions of the present is a productive exercise. Sci-fi-
science fiction, speculative fabulation-invites us to imagine life in the future. It also can
provide a means of doubling back-have I been here before?-reflecting on the connections
between what seems new and unprecedented, and the lesser-examined portions of the past.
References
Benjamin, Ruha. 2019. Race After Technology: Abolitionist Tools for the New Jim Code.
Cambridge, UK: Polity Press.
Cohn, Carol. 1987. "Sex and Death in the Rational World of Defense Intellectuals." Signs
12(4): 687-718.
Dick, Philip K. 1968. Do Androids Dream of Electric Sheep? New York: Random House.
Eubanks, Virginia. 2017. Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor. New York: St. Martin's Press.
Fischer, Michael M.J. 2001. "Ethnographic Critique and Technoscientific Narratives: The Old
Mole, Ethical Plateaux, and the Governance of Emergent Biosocial Polities." Culture, Medicine,
and Psychiatry 25: 355-393.
Forsythe, Diana. 2001. Studying Those Who Study Us: An Anthropologist in the World of
Artificial Intelligence. Stanford: Stanford University Press.
Freud, Sigmund. 1919 [2003]. The Uncanny. D. McLintock, trans. New York: Penguin.
Kageki, Norri. 2012. "An Uncanny Mind: Masahiro Mori on the Uncanny Valley and Beyond."
IEEE Spectrum, 12 June. <https://spectrum.ieee.org/automaton/robotics/humanoids/an-uncanny-mind-masahiro-mori-on-the-uncanny-valley> (accessed August 7, 2019).
Haraway, Donna. 1991. Simians, Cyborgs, and Women: The Reinvention of Nature. London:
Routledge.
Hirschkind, Charles. 2006. The Ethical Soundscape: Cassette Sermons and Islamic Counterpublics.
New York: Columbia University Press.
Irani, Lilly. 2015. "The cultural work of microwork." New Media and Society 17(5): 720-739.
Le Guin, Ursula K. [1969]2016. The Left Hand of Darkness. New York: Penguin Books.
Joler, Vladan, and Kate Crawford. 2018. "Anatomy of an AI System: The Amazon Echo As An
Anatomical Map of Human Labor, Data and Planetary Resources," AI Now Institute and Share
Lab, September 7. <https://anatomyof.ai> (accessed August 6, 2019).
Mori, Masahiro. 2012. "The Uncanny Valley." K.J. MacDorman and Norri Kageki, trans. IEEE
Robotics and Automation 12(2): 98-100.
Puig de la Bellacasa, Maria. 2011. "Matters of care in technoscience: assembling neglected
things." Social Studies of Science 41(1): 85-106.
Stack Sullivan, Harry. 1954. The Psychiatric Interview. New York: W.W. Norton and Co.
Roberts, Sarah T. 2019. Behind the Screen: Content Moderation in the Shadows of Social Media.
New Haven: Yale University Press.
TallBear, Kim. 2017. "Interrogating 'the Threat'." Presidential Plenary, Society for Social
Studies of Science Annual Meeting, Denver, CO, August 30.
Watkins, Ali. 2017. "Elite terrorist interrogation team withers under Trump." Politico, December
5. <https://www.politico.com/story/2017/12/05/elite-terrorist-interrogation-trump-279930>
(accessed August 7, 2019).
Woolard, Kathryn. 1998. "Language Ideology as a Field of Inquiry." In Language Ideologies:
Practice and Theory. Bambi B. Schieffelin, Kathryn A. Woolard, and Paul V. Kroskrity, eds.
Pp.3-47. New York: Oxford University Press.