NOVEMBER 01 2024
The JIBO Kids Corpus: A speech dataset of child-robot
interactions in a classroom environment
Natarajan Balaji Shankar; Amber Afshan; Alexander Johnson; Aurosweta Mahapatra; Alejandra Martin;
Haolun Ni; Hae Won Park; Marlen Quintero Perez; Gary Yeung; Alison Bailey; Cynthia Breazeal; Abeer Alwan
JASA Express Lett. 4, 115201 (2024)
https://doi.org/10.1121/10.0034195
 
ARTICLE asa.scitation.org/journal/jel
The JIBO Kids Corpus: A speech dataset of child-robot
interactions in a classroom environment
Natarajan Balaji Shankar,1,a) Amber Afshan,1 Alexander Johnson,1 Aurosweta Mahapatra,1
Alejandra Martin,2 Haolun Ni,1 Hae Won Park,3 Marlen Quintero Perez,2 Gary Yeung,1
Alison Bailey,2 Cynthia Breazeal,3 and Abeer Alwan1
1Department of Electrical and Computer Engineering, University of California Los Angeles, Los Angeles,
California 90095, USA
2Department of Education, University of California Los Angeles, Los Angeles, California 90095, USA
3MIT Media Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
balaji1312@ucla.edu; amberafshan@ucla.edu; ajohnson49@ucla.edu; aurosweta99@ucla.edu; alemartin@ucla.edu;
michaelni12@ucla.edu; haewon@mit.edu; mquint30@ucla.edu; garyyeung@ucla.edu; abailey@gseis.ucla.edu;
cynthiab@media.mit.edu; alwan@ee.ucla.edu
Abstract: This paper describes an original dataset of children’s speech, collected through the use of JIBO, a social robot. The
dataset encompasses recordings from 110 children, aged 4–7 years old, who participated in a letter and digit identification
task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection
period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The data-
set, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance
investigations into child language development. © 2024 Author(s). All article content, except where otherwise noted, is licensed under a
Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
[Editor: Yolanda F. Holt] https://doi.org/10.1121/10.0034195
Received: 21 August 2024 Accepted: 8 October 2024 Published Online: 1 November 2024
1. Introduction
The development of language and literacy skills stands as a cornerstone of elementary education. However, empirical find-
ings from the National Assessment of Educational Progress underscore a concerning reality: 37% of fourth-grade students
in the United States do not demonstrate reading proficiency aligned with grade-level expectations (Irwin et al., 2022). The
foundations of literacy are established in the crucial pre-kindergarten (pre-K) and kindergarten years, where children
develop preliteracy skills such as phonological awareness and letter knowledge (Bus and Van IJzendoorn, 1999). These
early developmental stages, therefore, necessitate focused attention and resources to foster language growth.
To enhance the learning experience, the use of technology-based systems in educational settings has become
commonplace (Williams et al., 2000), but these technologies must still address a significant
hurdle: the inadequate performance of contemporary Automatic Speech Recognition (ASR) techniques when tasked with
scoring children’s responses (Dutta et al., 2022; Yeung and Alwan, 2018). The error-prone nature of automatically generated
children’s speech transcriptions poses a significant challenge for their integration into educational applications.
Research focused on kindergarten-aged children underscores the imperative to specifically tailor ASR systems for this age
group, as preliteracy skills, such as phonological and alphabetic knowledge developed at the pre-K and kindergarten level,
can support the development of literacy skills (Biemiller and Slonim, 2001; Fish and Pinkerman, 2003; Hart et al., 1997;
Paez et al., 2007; Snow et al., 2007). However, a notable scarcity of comprehensive children’s speech databases persists
within the field, particularly with respect to longitudinal datasets.
These longitudinal resources are invaluable for investigating language development and refining child-centered
automatic speech recognition and speaker recognition systems (Dutta et al., 2022; Safavi et al., 2012; Yeung and Alwan,
2018). By tracking the same children over time, researchers can map the trajectories of language acquisition. This under-
standing can guide the development of systems and techniques specifically tailored to the evolving characteristics of
children’s speech (Yeung and Alwan, 2019). Longitudinal data also facilitate the development of educational applications
specifically tailored to children’s voices by offering insight into how children’s speech patterns evolve, supporting applica-
tions in areas such as personalized learning environments and child-robot interaction.
To effectively collect data from children, researchers must design data collection mechanisms that are engaging
and centered on the child’s experience. Social robots, with their ability to engage children interactively, hold significant
potential as a vehicle for implementing these data-driven insights in clinical and educational settings (Kanero et al., 2018;
Kory et al., 2013; Westlund and Breazeal, 2015). Robots can facilitate targeted activities aimed at various objectives, including
the evaluation of speech development and phonetic acquisition and the reinforcement of pronunciation skills.
a) Author to whom correspondence should be addressed.
JASA Express Lett. 4 (11), 115201 (2024) © Author(s) 2024.
Leveraging the interactive capabilities of a social robot, JIBO (Spaulding et al., 2018), this paper presents a novel
children’s speech dataset collected over a 2-year period. JIBO was employed to administer a series of structured and semi-
structured tasks to children in pre-K, kindergarten, and first grade. These tasks included letter and digit identification as
well as explanation tasks. The dataset’s longitudinal component, with a subset of participants returning for follow-up
recordings, facilitates the analysis of developmental trajectories in children’s speech. As part of a larger human-robot
interaction (HRI) study evaluating the effectiveness of social robots in classroom settings (Johnson et al., 2022a, 2022b;
Tran et al., 2020; Yeung et al., 2019a, 2019b), this paper offers a comprehensive discussion
of the dataset’s collection, encompassing design considerations and recording conditions.
2. Related work
Several speech corpora exist for studying English-speaking children. For example, the Providence corpus (Demuth et al.,
2006) offers longitudinal audio recordings of six English-speaking children, aged 1–4 years old, engaged in natural interac-
tions with their mothers. Furthermore, the TBALL dataset (Kazemzadeh et al., 2005) focuses specifically on 256 non-
native English speakers, ranging from kindergarten to fourth grade. The PERCEPT-R (Benway et al., 2022; Benway et al.,
2023) corpus comprises data from 281 children with typical speech and with residual speech sound disorders affecting
rhotics. The UltraSuite dataset (Eshky et al., 2018) is a repository of ultrasound and acoustic data, collected from recordings
of child speech therapy sessions of 86 children. The SEED dataset (Speights Atkins et al., 2020) comprises 58 children
with and without speech disorders. Meanwhile, the CID children’s speech corpus (Lee et al., 1999) comprises a collection
of read speech samples produced by 436 children, aged 5–17 years old. Similarly, the CMU Kids corpus (Eskenazi, 1996)
comprises read speech from 76 children with a narrower age focus of 6–11 years old. Additionally, the AusKidTalk corpus
(Ahmed et al., 2021) is a large scale corpus of Australian children between the ages of 3 and 12 years old, comprising a
collection of single words, utterances, and narrative speech. Likewise, the CSLU OGI Kids’ Speech corpus (Shobaki et al.,
2000) contains read speech samples from participants encompassing a broad age range from kindergarten to grade ten.
Moreover, the CU Kid’s Prompted and Read Speech (Cole et al., 2006) corpus comprises read speech data from 663
American English-speaking children aged between 4 and 11 years old, whereas the CU Kid’s Read and Summarized Story
(Cole and Pellom, 2006) corpus consists of spontaneous speech recordings from 326 children aged between 6 and 11 years
old. The Birmingham subset of the PF-STAR corpus (Russell, 2006) contains samples from 150 British children between
the ages of 4 and 15 years old. The MyST corpus (Ward et al., 2011) contains 499 h of audio recordings from 1300 chil-
dren from third to fifth grades, interacting with a virtual tutor for science topics. However, to our knowledge, there are no
publicly available English language databases that combine child speech data and a longitudinal component across a large
range of speakers, necessitating the creation of the JIBO Kids Corpus.
3. Data collection
3.1 JIBO
The JIBO Kids Corpus is constructed as a series of structured and semi-structured sessions conducted between a social
robot, JIBO, and a child, following a protocol detailed in Bailey and Heritage (2014) and Bailey and Heritage (2018).
Initially conceived as a domestic personal assistant robot, JIBO (Spaulding et al., 2018) is a social robot capable of 360-deg
rotation for expressive body language, including head-tilting, directional gaze shifts, and dance-like movements. To support
interaction and assessment delivery, JIBO possesses a small embedded facial screen that primarily displays JIBO’s animated
eye, synchronized with its physical movements, and also provides a visual interface for presenting text, images, or videos
as needed. JIBO’s dual cameras are located above the screen, and its microphone array rotates the head to facilitate sound
source detection. Two lateral loudspeakers enable audio output for speech and music playback. JIBO’s design, including its
expressive movements and child-like voice, positions it as the child’s peer-like learning companion. This use of social
robotics aims to foster a natural conversational setting that encourages spontaneous speech production in young
participants.
3.2 Recording setup
The audio recording setup employed a Logitech C390e webcam (Lausanne, Switzerland), positioned at a 30°–45° angle relative
to the child and approximately 50 cm away (see Fig. 1). Recordings took place in an unoccupied office during regular
school hours with some background noise present from the surrounding school environment. The JIBO social robot was
placed on a table or desk in front of the child at an approximate distance of 50 cm. Adjacent to the child, a researcher
assumed the role of an “instructor,” engaging in interactive activities with the child and JIBO. Simultaneously, another
researcher, designated as the “operator,” oversaw the computational aspects and item display through a connected computer
interface using a Wizard-of-Oz setup (Dahlbäck et al., 1993). This setup facilitated real-time adjustment of the items
displayed by JIBO. In instances in which unexpected interactions arose between the child and social robot, the instructor
intervened verbally to assist the child in navigating any difficulties encountered during the session. These include instances
where the prompt required repetition or children were distracted or reticent. This intervention ensured the smooth progression
of the interaction between the child and JIBO.
Fig. 1. An example recording session.
3.3 Participants
The participants for the study comprised children proficient in English, residing in Southern California. Approximately
40% of the participating children reported exposure to additional languages—predominantly Spanish—at home. Nearly
one-third of the children were enrolled in a Spanish-English dual language program, and two-thirds were enrolled in an
English-only instructional environment. Informed consent was obtained from students and parents for study participation
and dissemination of data, following school site and institutional review board procedures.
Data collection proceeded over a 2-year period. In year 1, sessions were recorded with a cohort consisting of 38
pre-K and 55 kindergarten students. Year 2 of the study involved a cohort of 35 kindergarten and 27 first-grade partici-
pants. A subset of children from year 1 returned the following year, forming a longitudinal cohort. This longitudinal facet
enables an exploration of developmental patterns across time with data available for 22 children progressing from pre-K
to kindergarten and 23 children advancing from kindergarten to first grade. A breakdown of speaker grade and gender is
presented in Tables 1 and 2.
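The cohort sizes above can be reconciled with the 110 children reported in the abstract. The following is a minimal sketch of that arithmetic, assuming that every returning child is counted once in each yearly cohort; the variable and function names are illustrative and are not part of the corpus release.

```python
# Cohort sizes as reported in the text (year 1, year 2, and returning children).
year1 = {"pre-K": 38, "kindergarten": 55}
year2 = {"kindergarten": 35, "grade 1": 27}
returning = {"pre-K -> kindergarten": 22, "kindergarten -> grade 1": 23}


def unique_speakers(y1, y2, longitudinal):
    """Count distinct children.

    Assumption: each returning child appears in both yearly cohorts,
    so the longitudinal subset is subtracted once to avoid double counting.
    """
    return sum(y1.values()) + sum(y2.values()) - sum(longitudinal.values())


total = unique_speakers(year1, year2, returning)
print(total)  # 110
```

Under this assumption, 93 year-1 sessions plus 62 year-2 sessions minus 45 returning children gives 110 distinct speakers.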
To ensure participant privacy, the dataset contains no personally identifiable information, and participants are
represented solely by anonymized codes. Each child was anonymized in the format TXYY7ZZZ, where X in {1,2} is the
year of the study; YY in {01,02,03} is the child’s year in school, 01 is pre-K (ages 4–5 years old), 02 is kindergarten (ages
5–6 years old), and 03 is grade 1 (ages 6–7 years old). ZZZ is a unique identifier for each child. Boys are odd numbered
and girls are even numbered. Thus, T1027235 is child 235, a male kindergartener whose data were collected in year one of
the study.
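As an illustration of the naming scheme described above, the following sketch decodes a speaker code into its fields. The function and class names (parse_speaker_id, Speaker) are hypothetical and are not part of the released corpus tooling; only the TXYY7ZZZ layout and the odd/even gender convention are taken from the text.

```python
import re
from dataclasses import dataclass

# Grade codes as defined in the text: 01 pre-K, 02 kindergarten, 03 grade 1.
GRADES = {"01": "pre-K", "02": "kindergarten", "03": "grade 1"}

# T, study year X, school year YY, a literal 7, then the three-digit child identifier ZZZ.
ID_PATTERN = re.compile(r"T([12])(0[123])7(\d{3})")


@dataclass(frozen=True)
class Speaker:
    study_year: int
    grade: str
    child: int
    gender: str


def parse_speaker_id(code: str) -> Speaker:
    """Decode an anonymized speaker code of the form TXYY7ZZZ."""
    match = ID_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid speaker code: {code!r}")
    year, grade, child = match.groups()
    # Boys are odd numbered and girls are even numbered.
    gender = "male" if int(child) % 2 == 1 else "female"
    return Speaker(int(year), GRADES[grade], int(child), gender)


print(parse_speaker_id("T1027235"))
# Speaker(study_year=1, grade='kindergarten', child=235, gender='male')
```

Whether the ZZZ field alone is sufficient to link a returning child across the two study years is an assumption that should be checked against the released metadata.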
3.4 Data preprocessing
To ensure the quality of the data, all audio recordings were subjected to a preprocessing stage. Initial recordings were cap-
tured at a uniform sampling rate of 48 kHz prior to subsequent downsampling to 16 kHz. All recordings were scrutinized
to remove any sessions marred by poor audio quality.
Table 1. Breakdown of participants per year of study. Audio lengths are reported after preprocessing.
Year  Grade         Male  Female  Audio length (hh:mm)
1     Pre-K         19    19      4:10
1     Kindergarten  23    32      6:57
2     Kindergarten  20    15      5:42
2     Grade 1       13    14      4:18
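The audio lengths in Table 1 can be summed to check the overall figure of roughly 21 h of session data. The sketch below is only a worked check of that arithmetic; the helper names are illustrative.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hh:mm' duration string to whole minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Format whole minutes as 'h:mm'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Audio lengths per grade and study year, copied from Table 1.
table1_audio = ["4:10", "6:57", "5:42", "4:18"]

total = sum(to_minutes(d) for d in table1_audio)
print(to_hhmm(total))  # 21:07
```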
Table 2. Breakdown of speakers for longitudinal study. Audio lengths are reported after preprocessing.
Cohort                               Gender  Participants  Year 1 audio length (hh:mm)  Year 2 audio length (hh:mm)
Cohort 1 (Pre-K → Kindergarten)      Male    14            1:25                         2:09
Cohort 1 (Pre-K → Kindergarten)      Female  8             0:56                         1:24
Cohort 2 (Kindergarten → Grade 1)    Male    10            1:16                         1:25
Cohort 2 (Kindergarten → Grade 1)    Female  13            1:28                         2:19
Sessions exhibiting a substantial degree of audio clipping were identified and removed from the dataset. An
exception to this criterion was granted for recordings obtained during the “blocks” task, in which pronounced clacking
noises are present as participants interact with physical cube toys.
Long periods of silence at the beginnings and ends of sessions were identified, and the respective sessions were
trimmed. Sessions characterized by excessively muted audio levels or instances of inaudible or hushed speech by the partic-
ipants were also identified and removed. Additionally, sessions contaminated by excessive background noise were elimi-
nated from the dataset. In total, this data cleaning process removed 4:38 h of noisy audio.
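The text does not specify how clipped or silent regions were detected, and the screening may have been done by listening. Purely as an illustration of how such checks could be automated, the sketch below computes the fraction of full-scale samples in a recording and trims low-amplitude samples from both ends. The 16-bit sample format, the thresholds, and the function names are all assumptions made for this example.

```python
from typing import List, Sequence

INT16_MAX = 32767  # full-scale magnitude for 16-bit PCM (assumed sample format)


def clipping_ratio(samples: Sequence[int], margin: int = 0) -> float:
    """Return the fraction of samples whose magnitude reaches full scale.

    `margin` (hypothetical parameter) lowers the limit slightly so that
    near-full-scale samples are also counted.
    """
    if not samples:
        return 0.0
    limit = INT16_MAX - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def trim_silence(samples: Sequence[int], threshold: int = 500) -> List[int]:
    """Drop leading and trailing samples whose magnitude is below `threshold`.

    The threshold is an illustrative value, not one reported for the corpus.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])


print(clipping_ratio([0, 32767, -32768, 100]))           # 0.5
print(trim_silence([0, 10, 900, -1200, 20, 800, 3, 0]))  # [900, -1200, 20, 800]
```

A session could then be flagged for review when its clipping ratio exceeds a chosen cutoff, with an exemption for the “blocks” task as described above.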
3.5 Transcription
During transcription of all experimental sessions, the speech of child participants was annotated at the word level.
Instructor and JIBO speech were excluded from the transcription. All personally identifiable information was redacted
from the audio recordings and transcripts. However, sections that exhibited significant crosstalk between the instructor or
JIBO and the child remain embedded within the accompanying audio recordings.
4. Corpus composition
JIBO was loaded with educational materials to conduct a letter and number identification task along with explanation
tasks, using its screen to display visual stimuli. Audio instructions and prompts were recorded by a female researcher with
a pitch-shift to emulate a child’s voice. Participants were given 2 min of exposure to JIBO’s voice through interactive voice
prompts prior to the start of the exercises. JIBO intermittently provided positive reinforcement during the session, praising
correct answers and fostering engagement. Each session was limited to 30 min to maintain consistency and prevent fatigue.
The breakdown of average session length, as well as total data collected per task, is presented in Table 3.
4.1 Letter and digit identification
In the interviews conducted during the first year of the study, children were asked to identify a sequence of the digits 0–9
and the letters of the English alphabet displayed on JIBO’s screen. Accompanying each display was a prompt tailored to
the content, such as “What letter is this?” or “What number is this?” On the child’s identification of the presented letter
or number, JIBO transitioned to the next item in the sequence, presenting a new prompt without further intervention or
supplementary cues.
Table 3. Breakdown of audio length by recorded session.
Task              Year of study  Grade         Total audio length (hh:mm)  Average session length (mm:ss)
Letter and digit  1              Pre-K         2:41                        4:53
Letter and digit  1              Kindergarten  3:59                        4:53
Letter and digit  2              Kindergarten  3:02                        5:52
Letter and digit  2              Grade 1       2:19                        5:34
Teeth             1              Pre-K         0:41                        1:53
Teeth             1              Kindergarten  1:26                        2:06
Teeth             2              Kindergarten  1:01                        2:16
Teeth             2              Grade 1       0:46                        2:18
Blocks            2              Kindergarten  1:39                        3:11
Blocks            2              Grade 1       1:12                        3:48
Colors            1              Pre-K         0:48                        2:10
Colors            1              Kindergarten  1:32                        2:09
Total                                          21:07                       3:18
Fig. 2. Sample display of JIBO during the alphabet train game.
For participants in pre-K and kindergarten in year 2 of the study, an “alphabet train game” was introduced to
evaluate the child’s proficiency in letter identification (see Fig. 2). Here, JIBO presented a scrolling train on its screen,
where each train car was adorned with a distinct letter. The child’s task involved identifying each letter as it scrolled by.
For children in grade 1, a “spelling game” was used to gauge their aptitude in basic word decoding. In this task, JIBO dis-
played a word accompanied by a corresponding image, prompting the child to read the word aloud before subsequently
attempting to spell it.
For children in pre-K and kindergarten in year 2 of the study, JIBO presented a “finger game,” an image of hands
with raised fingers, prompting the child to orally count out the quantity of raised fingers. For children in grade 1, a “real
world math game” tested the math abilities of the children through a series of tasks ranging from simple addition involving
candy to more complex scenarios such as determining the age of a child depicted in a birthday celebration scene.
4.2 Explanations
The session recordings for both years of the study featured an interactive component designed to capture spontaneous
responses from the participants through extended discourse tasks. In year 1, children engaged in interviews revolving
around two distinct explanation tasks: “brushing their teeth” (referred to as “teeth”) and “mixing paint into colors”
(“colors”). The children were prompted to articulate their approaches to executing these tasks (“How do you clean your
teeth?”), elucidate the rationale behind the task (“Why do you brush your teeth?”), expand on how they would explain the
task to a peer (“How would you teach your friend to brush their teeth?”), and justify why their peer should undertake the
task in the manner proposed (“Why should they brush their teeth?”).
During year 2 of the study, the teeth-brushing task was revisited (teeth), allowing for longitudinal examination
of responses. Additionally, participants were presented with a novel task involving an undisclosed number of cubes, which
could be joined together or separated, and were tasked with determining the number of cubes provided (blocks).
Following this task, the same series of questions from the teeth task were posed to elicit insights into participants’
approaches, reasoning, and strategies for communicating the counting task to a friend. This semi-structured approach to
interactive sessions across both study years offers insights into the developmental trajectories of children’s explanation dis-
course skills and communicative abilities over time.
5. Conclusion
This paper introduces the JIBO Kids Corpus, a unique longitudinal corpus of child-robot interaction speech. This dataset
presents a resource for investigating linguistic development in children and advancing automatic speech recognition and
speaker verification systems for children. We anticipate that this publicly available dataset will contribute to research on
language acquisition and inform the development of educational applications for children.
Acknowledgments
This work was supported in part by the National Science Foundation (NSF). Amber Afshan, Alexander Johnson, Alejandra
Martin, Marlen Quintero Perez and Gary Yeung completed this work while they were students at UCLA.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The research presented in this paper was conducted in accordance with Institutional Review Board (IRB) guidelines. All
participants, including students and their parents, provided informed consent for the collection and distribution of anony-
mized speech data. No personally identifiable information was included in the dataset, ensuring participant privacy.
Data Availability
The data that support the findings of this study are openly available at https://github.com/balaji1312/Jibo_Kids and
https://doi.org/10.5281/zenodo.13964791.
References
Ahmed, B., Ballard, K. J., Burnham, D., Sirojan, T., Mehmood, H., Estival, D., Baker, E., Cox, F., Arciuli, J., Benders, T., Demuth, K., Kelly, B., Diskin-
Holdaway, C., Shahin, M., Sethu, V., Epps, J., Lee, C. B., and Ambikairajah, E. (2021). “AusKidTalk: An auditory-visual corpus of 3- to 12-year-old
Australian children’s speech,” in Proceedings of Interspeech 2021, 30 August–3 September, Brno, Czechia (ISCA, Belgium), pp. 3680–3684.
Bailey, A. L., and Heritage, M. (2014). “The role of language learning progressions in improved instruction and assessment of English lan-
guage learners,” TESOL Q. 48(3), 480–506.
Bailey, A. L., and Heritage, M. (2018). Progressing Students’ Language Day by Day (Corwin, Thousand Oaks, CA).
Benway, N. R., Preston, J. L., Hitchcock, E., Rose, Y., Salekin, A., Liang, W., and McAllister, T. (2023). “Reproducible speech research with the
artificial intelligence-ready PERCEPT corpora,” J. Speech. Lang. Hear. Res. 66(6), 1986–2009.
Benway, N., Preston, J. L., Hitchcock, E., Salekin, A., Sharma, H., and McAllister, T. (2022). “PERCEPT-R: An open-access American English
child/clinical speech corpus specialized for the audio classification of /ɹ/,” in Proceedings of Interspeech 2022, 18–22 September, Incheon,
Korea (ISCA, Belgium), pp. 3648–3652.
Biemiller, A., and Slonim, N. (2001). “Estimating root word vocabulary growth in normative and advantaged populations: Evidence for a
common sequence of vocabulary acquisition,” J. Educ. Psychol. 93(3), 498–520.
Bus, A. G., and Van IJzendoorn, M. H. (1999). “Phonological awareness and early reading: A meta-analysis of experimental training studies,”
J. Educ. Psychol. 91(3), 403–414.
Cole, R., Hosom, P., and Pellom, B. (2006). “University of Colorado prompted and read children’s speech corpus,” Technical Report TR-
CSLR-2006-03, Center for Spoken Language Research, Boulder, CO.
Cole, R., and Pellom, B. (2006). “University of Colorado read and summarized stories corpus,” Technical Report TR-CSLR-2006-03, Center
for Spoken Language Research, Boulder, CO.
Dahlbäck, N., Jönsson, A., and Ahrenberg, L. (1993). “Wizard of Oz studies: Why and how,” in Proceedings of the 1st International
Conference on Intelligent User Interfaces, 4–7 January, Orlando, FL (Association for Computing Machinery, New York), pp.
193–200.
Demuth, K., Culbertson, J., and Alter, J. (2006). “Word-minimality, epenthesis and coda licensing in the early acquisition of English,” Lang.
Speech 49(2), 137–173.
Dutta, S., Tao, S. A., Reyna, J. C., Hacker, R. E., Irvin, D. W., Buzhardt, J. F., and Hansen, J. H. L. (2022). “Challenges remain in building ASR
for spontaneous preschool children speech in naturalistic educational environments,” in Proceedings of Interspeech 2022, 18–22 September,
Incheon, Korea (ISCA, Belgium), pp. 2706–2710.
Eshky, A., Ribeiro, M. S., Cleland, J., Richmond, K., Roxburgh, Z., Scobbie, J., and Wrench, A. (2018). “Ultrasuite: A repository of ultrasound
and acoustic data from child speech therapy sessions,” in Proceedings of Interspeech 2018, 2–6 September, Hyderabad, India (ISCA,
Belgium), pp. 1888–1892.
Eskenazi, M. S. (1996). “Kids: A database of children’s speech,” J. Acoust. Soc. Am. 100(4), 2759.
Fish, M., and Pinkerman, B. (2003). “Language skills in low-SES rural Appalachian children: Normative development and individual
differences, infancy to preschool,” J. Appl. Dev. Psychol. 23(5), 539–565.
Hart, B., Risley, T. R., and Kirby, J. R. (1997). “Meaningful differences in the everyday experience of young American children,” Can. J. Educ.
22(3), 323.
Irwin, V., De La Rosa, J., Wang, K., Hein, S., Zhang, J., Burr, R., Roberts, A., Barmer, A., Bullock Mann, F., Dilig, R., and Parker, S. (2022).
“Report on the condition of education 2022. NCES 2022-144,” National Center for Education Statistics, Washington, DC, p. 232.
Johnson, A., Fan, R., Morris, R., and Alwan, A. (2022a). “LPC augment: An LPC-based ASR data augmentation algorithm for low and zero-
resource children’s dialects,” in Proceedings of ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 23–27 May, Singapore (IEEE, New York), pp. 8577–8581.
Johnson, A., Martin, A., Quintero, M., Bailey, A., and Alwan, A. (2022b). “Can social robots effectively elicit curiosity in stem topics from K-1
students during oral assessments?,” in 2022 IEEE Global Engineering Education Conference (EDUCON), 28–31 March, Tunis, Tunisia (IEEE,
New York), pp. 1264–1268.
Kanero, J., Geçkin, V., Oranç, C., Mamus, E., Küntay, A. C., and Göksun, T. (2018). “Social robots for early language learning: Current
evidence and future directions,” Child Dev. Perspect. 12(3), 146–151.
Kazemzadeh, A., You, H., Iseli, M., Jones, B., Cui, X., Heritage, M., Price, P., Anderson, E., Narayanan, S., and Alwan, A. (2005). “TBALL data
collection: The making of a young children’s speech corpus,” in Proceedings of Interspeech 2005, 4–8 September, Lisbon, Portugal (ISCA,
Belgium), pp. 1581–1584.
Kory, J. M., Jeong, S., and Breazeal, C. L. (2013). “Robotic learning companions for early language development,” in Proceedings of the 15th
ACM on International Conference on Multimodal Interaction, 9–13 December, Sydney, Australia (Association for Computing Machinery,
New York), pp. 71–72.
Lee, S., Potamianos, A., and Narayanan, S. (1999). “Acoustics of children’s speech: Developmental changes of temporal and spectral parame-
ters,” J. Acoust. Soc. Am. 105(3), 1455–1468.
Paez, M. M., Tabors, P. O., and Lopez, L. M. (2007). “Dual language and literacy development of Spanish-speaking preschool children,”
J. Appl. Dev. Psychol. 28(2), 85–102.
Russell, M. (2006). The PF-STAR British English Children’s Speech Corpus (The Speech Ark Limited, Edinburgh, UK).
Safavi, S., Najafian, M., Hanani, A., Russell, M., Jančovič, P., and Carey, M. (2012). “Speaker recognition for children’s speech,” in Proceedings
of Interspeech 2012, 9–13 September, Portland, OR (ISCA, Belgium), pp. 1836–1839.
Shobaki, K., Hosom, J.-P., and Cole, R. (2000). “The OGI kids’ speech corpus and recognizers,” in Proceedings of ICSLP 2000, 16–20 October,
Beijing, China (ISCA, Belgium), pp. 564–567.
Snow, C. E., Porche, M. V., Tabors, P. O., and Harris, S. R. (2007). Is Literacy Enough? Pathways to Academic Success for Adolescents (Brookes
Publishing Co., Baltimore, MD).
Spaulding, S., Chen, H., Ali, S., Kulinski, M., and Breazeal, C. (2018). “A social robot system for modeling children’s word pronunciation,” in
Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’18), 10–15 July, Stockholm,
Sweden (IFAAMAS, Richland, SC), pp. 1658–1666.
Speights Atkins, M., Bailey, D. J., and Boyce, S. E. (2020). “Speech exemplar and evaluation database (seed) for clinical training in articulatory
phonetics and speech science,” Clin. Linguist. Phonet. 34(9), 878–886.
Tran, T., Tinkler, M., Yeung, G., Alwan, A., and Ostendorf, M. (2020). “Analysis of disfluency in children’s speech,” in Proceedings of
Interspeech 2020, 25–29 October, Shanghai, China (ISCA, Belgium), pp. 4278–4282.
Ward, W., Cole, R., Bolaños, D., Buchenroth-Martin, C., Svirsky, E., Van Vuuren, S., Weston, T., Zheng, J., and Becker, L. (2011). “My science
tutor: A conversational multimedia virtual tutor for elementary school science,” ACM Trans. Speech Lang. Process. 7(4), 1–29.
Westlund, J. K., and Breazeal, C. (2015). “The interplay of robot language level with children’s language learning during storytelling,” in
Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, 3–6 March,
Portland, OR (ACM, New York), pp. 65–66.
Williams, S. M., Nix, D., and Fairweather, P. (2000). “Using speech recognition technology to enhance literacy instruction for emerging read-
ers,” in International Conference of the Learning Sciences, 12–16 June, Ann Arbor, MI (Psychology Press, Hove, UK), pp. 115–120.
Yeung, G., Afshan, A., Quintero, M., Martin, A., Spaulding, S., Park, H. W., Bailey, A., Breazeal, C., and Alwan, A. (2019a). “Towards the
development of personalized learning companion robots for early speech and language assessment,” in Proceedings of the 2019 Annual
Meeting of the American Educational Research Association (AERA), 5–9 April, Toronto, Canada (AERA, Washington, DC).
Yeung, G., and Alwan, A. (2018). “On the difficulties of automatic speech recognition for kindergarten-aged children,” in Proceedings of
Interspeech 2018, 2–6 September, Hyderabad, India (ISCA, Belgium), pp. 1661–1665.
Yeung, G., and Alwan, A. (2019). “A frequency normalization technique for kindergarten speech recognition inspired by the role of f0 in
vowel perception,” in Proceedings of Interspeech 2019, 15–19 September, Graz, Austria (ISCA, Belgium), pp. 1671–1675.
Yeung, G., Bailey, A. L., Afshan, A., Tinkler, M., Perez, M. Q., Martin, A., Pogossian, A. A., Spaulding, S., Park, H. W., Muco, M., and Alwan,
A. (2019b). “A robotic interface for the administration of language, literacy, and speech pathology assessments for children,” in Proceedings
of 8th ISCA Workshop on Speech and Language Technology in Education (SLaTE 2019), 20–21 September, Graz, Austria (ISCA, Belgium),
pp. 41–42.