The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment
Natarajan Balaji Shankar,1,a) Amber Afshan,1 Alexander Johnson,1 Aurosweta Mahapatra,1 Alejandra Martin,2 Haolun Ni,1 Hae Won Park,3 Marlen Quintero Perez,2 Gary Yeung,1 Alison Bailey,2 Cynthia Breazeal,3 and Abeer Alwan1
1Department of Electrical and Computer Engineering, University of California Los Angeles, Los Angeles, California 90095, USA
2Department of Education, University of California Los Angeles, Los Angeles, California 90095, USA
3MIT Media Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
balaji1312@ucla.edu; amberafshan@ucla.edu; ajohnson49@ucla.edu; aurosweta99@ucla.edu; alemartin@ucla.edu; michaelni12@ucla.edu; haewon@mit.edu; mquint30@ucla.edu; garyyeung@ucla.edu; abailey@gseis.ucla.edu; cynthiab@media.mit.edu; alwan@ee.ucla.edu
ARTICLE, JASA Express Lett. 4, 115201 (2024), asa.scitation.org/journal/jel, https://doi.org/10.1121/10.0034195
Abstract: This paper describes an original dataset of children’s speech, collected through the use of JIBO, a social robot.
The dataset encompasses recordings from 110 children, aged 4–7 years old, who participated in a letter and digit identification task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The dataset, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance investigations into child language development.
© 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
[Editor: Yolanda F. Holt] https://doi.org/10.1121/10.0034195
Received: 21 August 2024 Accepted: 8 October 2024 Published Online: 1 November 2024
1. Introduction
The development of language and literacy skills stands as a cornerstone of elementary education. However, empirical findings from the National Assessment of Educational Progress underscore a concerning reality: 37% of fourth-grade students in the United States do not demonstrate reading proficiency aligned with grade-level expectations (Irwin et al., 2022). The foundations of literacy are established in the crucial pre-kindergarten (pre-K) and kindergarten years, where children develop preliteracy skills such as phonological awareness and letter knowledge (Bus and Van IJzendoorn, 1999). These early developmental stages, therefore, necessitate focused attention and resources to foster language growth.
To enhance the learning experience, the use of technological systems in educational settings has become commonplace (Williams et al., 2000), but these systems must still overcome a significant hurdle: the inadequate performance of contemporary automatic speech recognition (ASR) techniques when tasked with scoring children’s responses (Dutta et al., 2022; Yeung and Alwan, 2018). The error-prone nature of automatically generated children’s speech transcriptions poses a significant challenge for their integration into educational applications. Research focused on kindergarten-aged children underscores the imperative to specifically tailor ASR systems for this age group, as preliteracy skills, such as phonological and alphabetic knowledge developed at the pre-K and kindergarten level, can support the development of literacy skills (Biemiller and Slonim, 2001; Fish and Pinkerman, 2003; Hart et al., 1997; Paez et al., 2007; Snow et al., 2007).
However, a notable scarcity of comprehensive children’s speech databases persists within the field, particularly with respect to longitudinal datasets. These longitudinal resources are invaluable for investigating language development and refining child-centered automatic speech recognition and speaker recognition systems (Dutta et al., 2022; Safavi et al., 2012; Yeung and Alwan, 2018). By tracking the same children over time, researchers can map the trajectories of language acquisition. This understanding can guide the development of systems and techniques specifically tailored to the evolving characteristics of children’s speech (Yeung and Alwan, 2019). Longitudinal data also facilitate the development of educational applications specifically tailored to children’s voices by offering insight into how children’s speech patterns evolve, supporting applications in areas such as personalized learning environments and child-robot interaction.
To effectively collect data from children, researchers must design data collection mechanisms that are engaging and centered on the child’s experience. Social robots, with their ability to engage children interactively, hold significant potential as a vehicle for implementing these data-driven insights in clinical and educational settings (Kanero et al., 2018; Kory et al., 2013; Westlund and Breazeal, 2015). Robots can facilitate targeted activities aimed at various objectives, including the evaluation of speech development and phonetic acquisition and the reinforcement of pronunciation skills.
Leveraging the interactive capabilities of a social robot, JIBO (Spaulding et al., 2018), this paper presents a novel children’s speech dataset collected over a 2-year period. JIBO was employed to administer a series of structured and semi-structured tasks to children in pre-K, kindergarten, and first grade. These tasks included letter and digit identification as well as explanation tasks. The dataset’s longitudinal component, with a subset of participants returning for follow-up recordings, facilitates the analysis of developmental trajectories in children’s speech. The data were collected as part of a larger human-robot interaction (HRI) study evaluating the effectiveness of social robots in classroom settings (Yeung et al., 2019b; Yeung et al., 2019a; Tran et al., 2020; Johnson et al., 2022b; Johnson et al., 2022a). This paper offers a comprehensive discussion of the dataset’s collection, encompassing design considerations and recording conditions.
a)Author to whom correspondence should be addressed.
2. Related work
Several speech corpora exist for studying English-speaking children.
For example, the Providence corpus (Demuth et al., 2006) offers longitudinal audio recordings of six English-speaking children, aged 1–4 years old, engaged in natural interactions with their mothers. Furthermore, the TBALL dataset (Kazemzadeh et al., 2005) focuses specifically on 256 non-native English speakers, ranging from kindergarten to fourth grade. The PERCEPT-R corpus (Benway et al., 2022; Benway et al., 2023) comprises data from 281 children, covering typical speech and residual speech sound disorders affecting rhotic sounds. The UltraSuite dataset (Eshky et al., 2018) is a repository of ultrasound and acoustic data, collected from recordings of child speech therapy sessions of 86 children. The SEED dataset (Speights Atkins et al., 2020) comprises 58 children with and without speech disorders. Meanwhile, the CID children’s speech corpus (Lee et al., 1999) comprises a collection of read speech samples produced by 436 children, aged 5–17 years old. Similarly, the CMU Kids corpus (Eskenazi, 1996) comprises read speech from 76 children with a narrower age focus of 6–11 years old. Additionally, the AusKidTalk corpus (Ahmed et al., 2021) is a large-scale corpus of Australian children between the ages of 3 and 12 years old, comprising a collection of single words, utterances, and narrative speech. Likewise, the CSLU OGI Kids’ Speech corpus (Shobaki et al., 2000) contains read speech samples from participants encompassing a broad age range from kindergarten to grade ten. Moreover, the CU Kid’s Prompted and Read Speech corpus (Cole et al., 2006) comprises read speech data from 663 American English-speaking children aged between 4 and 11 years old, whereas the CU Kid’s Read and Summarized Story corpus (Cole and Pellom, 2006) consists of spontaneous speech recordings from 326 children aged between 6 and 11 years old. The Birmingham subset of the PF-STAR corpus (Russell, 2006) contains samples from 150 British children between the ages of 4 and 15 years old.
The MyST corpus (Ward et al., 2011) contains 499 h of audio recordings from 1300 children from third to fifth grades, interacting with a virtual tutor for science topics. However, to our knowledge, there are no publicly available English language databases that combine child speech data and a longitudinal component across a large range of speakers, necessitating the creation of the JIBO Kids Corpus.
3. Data collection
3.1 JIBO
The JIBO Kids Corpus is constructed as a series of structured and semi-structured sessions conducted between a social robot, JIBO, and a child, following a protocol detailed in Bailey and Heritage (2014) and Bailey and Heritage (2018). Initially conceived as a domestic personal assistant robot, JIBO (Spaulding et al., 2018) is a social robot capable of 360-deg rotation for expressive body language, including head-tilting, directional gaze shifts, and dance-like movements. To support interaction and assessment delivery, JIBO possesses a small embedded facial screen that primarily displays JIBO’s animated eye, synchronized with its physical movements, and also provides a visual interface for presenting text, images, or videos as needed. JIBO’s dual cameras are located above the screen, and its microphone array rotates the head to facilitate sound source detection. Two lateral loudspeakers enable audio output for speech and music playback. JIBO’s design, including its expressive movements and child-like voice, positions it as the child’s peer-like learning companion. This use of social robotics aims to foster a natural conversational setting that encourages spontaneous speech production in young participants.
3.2 Recording setup
The audio recording setup employed a Logitech C390e webcam (Lausanne, Switzerland), positioned at a 30°–45° angle relative to the child and approximately 50 cm away (see Fig. 1).
Recordings took place in an unoccupied office during regular school hours with some background noise present from the surrounding school environment. The JIBO social robot was placed on a table or desk in front of the child at an approximate distance of 50 cm. Adjacent to the child, a researcher assumed the role of an “instructor,” engaging in interactive activities with the child and JIBO. Simultaneously, another researcher, designated as the “operator,” oversaw the computational aspects and item display through a connected computer interface using a Wizard-of-Oz setup (Dahlbäck et al., 1993). This setup facilitated real-time adjustment of the items displayed by JIBO. In instances in which unexpected interactions arose between the child and social robot, the instructor intervened verbally to assist the child in navigating any difficulties encountered during the session. These included instances where the prompt required repetition or children were distracted or reticent. This intervention ensured the smooth progression of the interaction between the child and JIBO.
Fig. 1. An example recording session.
3.3 Participants
The participants for the study comprised children proficient in English, residing in Southern California. Approximately 40% of the participating children reported exposure to additional languages (predominantly Spanish) at home. Nearly one-third of the children were enrolled in a Spanish-English dual language program, and two-thirds were enrolled in an English-only instructional environment. Informed consent was obtained from students and parents for study participation and dissemination of data, following school site and institutional review board procedures.
Data collection proceeded over a 2-year period. In year 1, sessions were recorded with a cohort consisting of 38 pre-K and 55 kindergarten students.
Year 2 of the study involved a cohort of 35 kindergarten and 27 first-grade participants. A subset of children from year 1 returned the following year, forming a longitudinal cohort. This longitudinal facet enables an exploration of developmental patterns across time with data available for 22 children progressing from pre-K to kindergarten and 23 children advancing from kindergarten to first grade. A breakdown of speaker grade and gender is presented in Tables 1 and 2.
To ensure participant privacy, the dataset contains no personally identifiable information, and participants are represented solely by anonymized codes. Each child was anonymized in the format TXYY7ZZZ, where X in {1,2} is the year of the study; YY in {01,02,03} is the child’s year in school (01 is pre-K, ages 4–5 years old; 02 is kindergarten, ages 5–6 years old; 03 is grade 1, ages 6–7 years old); and ZZZ is a unique identifier for each child. Boys are odd numbered and girls are even numbered. Thus, T1027235 is child 235, a male kindergartener whose data were collected in year one of the study.

Table 1. Breakdown of participants per year of study. Audio lengths are reported after preprocessing.
| Year | Grade | Male | Female | Audio length (hh:mm) |
|---|---|---|---|---|
| 1 | Pre-K | 19 | 19 | 4:10 |
| 1 | Kindergarten | 23 | 32 | 6:57 |
| 2 | Kindergarten | 20 | 15 | 5:42 |
| 2 | 1 | 13 | 14 | 4:18 |

Table 2. Breakdown of speakers for longitudinal study. Audio lengths are reported after preprocessing.
| Cohort | Gender | Participants | Year 1 audio length (hh:mm) | Year 2 audio length (hh:mm) |
|---|---|---|---|---|
| Cohort 1 (Pre-K → Kindergarten) | Male | 14 | 1:25 | 2:09 |
| Cohort 1 (Pre-K → Kindergarten) | Female | 8 | 0:56 | 1:24 |
| Cohort 2 (Kindergarten → 1) | Male | 10 | 1:16 | 1:25 |
| Cohort 2 (Kindergarten → 1) | Female | 13 | 1:28 | 2:19 |

3.4 Data preprocessing
To ensure the quality of the data, all audio recordings were subjected to a preprocessing stage. Initial recordings were captured at a uniform sampling rate of 48 kHz prior to subsequent downsampling to 16 kHz. All recordings were scrutinized to remove any sessions marred by poor audio quality. Sessions exhibiting a substantial degree of audio clipping were identified and removed from the dataset. An exception to this criterion was granted for recordings obtained during the “blocks” task, in which pronounced clacking noises are present as participants interact with physical cube toys. Long periods of silence at the beginnings and ends of sessions were identified, and the respective sessions were trimmed. Sessions characterized by excessively muted audio levels or instances of inaudible or hushed speech by the participants were also identified and removed. Additionally, sessions contaminated by excessive background noise were eliminated from the dataset. In total, this data cleaning process removed 4 h 38 min of noisy audio.
3.5 Transcription
During transcription of all experimental sessions, the speech of child participants was annotated at the word level. Instructor and JIBO speech were excluded from the transcription. All personally identifiable information was redacted from the audio recordings and transcripts. However, sections that exhibited significant crosstalk between the instructor or JIBO and the child remain embedded within the accompanying audio recordings.
4. Corpus composition
JIBO was loaded with educational materials to conduct a letter and number identification task along with explanation tasks, using its screen to display visual stimuli. Audio instructions and prompts were recorded by a female researcher with a pitch-shift to emulate a child’s voice. Participants were given 2 min of exposure to JIBO’s voice through interactive voice prompts prior to the start of the exercises. JIBO intermittently provided positive reinforcement during the session, praising correct answers and fostering engagement. Each session was limited to 30 min to maintain consistency and prevent fatigue.
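The anonymized code format described in Sec. 3.3 (TXYY7ZZZ) lends itself to a small decoder. The following is a minimal Python sketch under stated assumptions: the names `Participant`, `parse_code`, and `GRADES` are hypothetical and introduced here only for illustration, the sketch assumes codes appear exactly as described in the text, and it is not part of any tooling released with the corpus.

```python
import re
from dataclasses import dataclass

# Grade codes as described in Sec. 3.3 (names below are illustrative only).
GRADES = {"01": "pre-K", "02": "kindergarten", "03": "grade 1"}

# T, study year (1 or 2), school year (01-03), a literal 7, three-digit child id.
_CODE = re.compile(r"T([12])(0[123])7(\d{3})")


@dataclass(frozen=True)
class Participant:
    study_year: int
    grade: str
    child_id: int
    gender: str


def parse_code(code: str) -> Participant:
    """Decode an anonymized participant code of the form TXYY7ZZZ."""
    match = _CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a TXYY7ZZZ participant code: {code!r}")
    year, grade, child = match.groups()
    child_id = int(child)
    # Odd identifiers denote boys, even identifiers denote girls (Sec. 3.3).
    gender = "male" if child_id % 2 else "female"
    return Participant(int(year), GRADES[grade], child_id, gender)


print(parse_code("T1027235"))
```

Under these assumptions, the example from Sec. 3.3, T1027235, decodes to child 235, a male kindergartener recorded in year 1 of the study.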
The breakdown of average session length, as well as total data collected per task, is presented in Table 3.
4.1 Letter and digit identification
In the interviews conducted during the first year of the study, children were asked to identify a sequence of the digits 0–9 and the letters of the English alphabet displayed on JIBO’s screen. Accompanying each display was a prompt tailored to the content, such as “What letter is this?” or “What number is this?” Upon the child’s identification of the presented letter or number, JIBO transitioned to the next item in the sequence, presenting a new prompt without further intervention or supplementary cues.

Table 3. Breakdown of audio length by recorded session.
| Task | Year of study | Grade | Total audio length (hh:mm) | Average session length (mm:ss) |
|---|---|---|---|---|
| Letter and digit | 1 | Pre-K | 2:41 | 4:53 |
| Letter and digit | 1 | Kindergarten | 3:59 | 4:53 |
| Letter and digit | 2 | Kindergarten | 3:02 | 5:52 |
| Letter and digit | 2 | 1 | 2:19 | 5:34 |
| Teeth | 1 | Pre-K | 0:41 | 1:53 |
| Teeth | 1 | Kindergarten | 1:26 | 2:06 |
| Teeth | 2 | Kindergarten | 1:01 | 2:16 |
| Teeth | 2 | 1 | 0:46 | 2:18 |
| Blocks | 2 | Kindergarten | 1:39 | 3:11 |
| Blocks | 2 | 1 | 1:12 | 3:48 |
| Colors | 1 | Pre-K | 0:48 | 2:10 |
| Colors | 1 | Kindergarten | 1:32 | 2:09 |
| Total | | | 21:07 | 3:18 |

Fig. 2. Sample display of JIBO during the alphabet train game.

For participants in pre-K and kindergarten in year 2 of the study, an “alphabet train game” was introduced to evaluate the child’s proficiency in letter identification (see Fig. 2). Here, JIBO presented a scrolling train on its screen, where each train car was adorned with a distinct letter. The child’s task involved identifying each letter as it scrolled by. For children in grade 1, a “spelling game” was used to gauge their aptitude in basic word decoding. In this task, JIBO displayed a word accompanied by a corresponding image, prompting the child to read the word aloud before subsequently attempting to spell it.
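The participant counts and audio totals reported in Sec. 3.3 and Tables 1 to 3 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `to_minutes` and `to_hhmm` and the dictionary layout are hypothetical, and it assumes that each returning child in Table 2 is counted once in year 1 and once in year 2 of Table 1.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hh:mm' string into a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Convert a number of minutes back into an 'hh:mm' string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Table 1: (study year, grade) -> (male, female, audio length in hh:mm)
TABLE_1 = {
    (1, "pre-K"): (19, 19, "4:10"),
    (1, "kindergarten"): (23, 32, "6:57"),
    (2, "kindergarten"): (20, 15, "5:42"),
    (2, "grade 1"): (13, 14, "4:18"),
}

# Table 2: longitudinal cohort -> (male, female) returning children
TABLE_2 = {
    "pre-K to kindergarten": (14, 8),
    "kindergarten to grade 1": (10, 13),
}

# Per-year enrolments count returning children twice, so subtract them once.
enrolments = sum(male + female for male, female, _ in TABLE_1.values())
returning = sum(male + female for male, female in TABLE_2.values())
unique_children = enrolments - returning

total_audio = to_hhmm(sum(to_minutes(length) for *_, length in TABLE_1.values()))

print(enrolments, returning, unique_children, total_audio)
```

With the values copied from Tables 1 and 2, the sketch gives 155 enrolments, 45 returning children, and 110 unique children, in agreement with the abstract. The Table 1 audio lengths sum to 21:07, in agreement with the total reported in Table 3.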
For children in pre-K and kindergarten in year 2 of the study, JIBO presented a “finger game,” an image of hands with raised fingers, prompting the child to orally count out the quantity of raised fingers. For children in grade 1, a “real world math game” tested the math abilities of the children through a series of tasks ranging from simple addition involving candy to more complex scenarios such as determining the age of a child depicted in a birthday celebration scene.
4.2 Explanations
The session recordings for both years of the study featured an interactive component designed to capture spontaneous responses from the participants through extended discourse tasks. In year 1, children engaged in interviews revolving around two distinct explanation tasks: “brushing their teeth” (referred to as “teeth”) and “mixing paint into colors” (“colors”). The children were prompted to articulate their approaches to executing these tasks (“How do you clean your teeth?”), elucidate the rationale behind the task (“Why do you brush your teeth?”), expand on how they would explain the task to a peer (“How would you teach your friend to brush their teeth?”), and justify why their peer should undertake the task in the manner proposed (“Why should they brush their teeth?”).
During year 2 of the study, the teeth-brushing task was revisited (teeth), allowing for longitudinal examination of responses. Additionally, participants were presented with a novel task involving an undisclosed number of cubes, which could be joined together or separated, and were tasked with determining the number of cubes provided (blocks). Following this task, the same series of questions from the teeth task were posed to elicit insights into participants’ approaches, reasoning, and strategies for communicating the counting task to a friend.
This semi-structured approach to interactive sessions across both study years offers insights into the developmental trajectories of children’s explanation discourse skills and communicative abilities over time.
5. Conclusion
This paper introduces the JIBO Kids Corpus, a unique longitudinal corpus of child-robot interaction speech. This dataset presents a resource for investigating linguistic development in children and advancing automatic speech recognition and speaker verification systems for children. We anticipate that this publicly available dataset will contribute to research on language acquisition and inform the development of educational applications for children.
Acknowledgments
This work was supported in part by the National Science Foundation (NSF). Amber Afshan, Alexander Johnson, Alejandra Martin, Marlen Quintero Perez, and Gary Yeung completed this work while they were students at UCLA.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The research presented in this paper was conducted in accordance with Institutional Review Board (IRB) guidelines. All participants, including students and their parents, provided informed consent for the collection and distribution of anonymized speech data. No personally identifiable information was included in the dataset, ensuring participant privacy.
Data Availability
The data that support the findings of this study are openly available in https://github.com/balaji1312/Jibo_Kids at https://doi.org/10.5281/zenodo.13964791.
References
Ahmed, B., Ballard, K. J., Burnham, D., Sirojan, T., Mehmood, H., Estival, D., Baker, E., Cox, F., Arciuli, J., Benders, T., Demuth, K., Kelly, B., Diskin-Holdaway, C., Shahin, M., Sethu, V., Epps, J., Lee, C. B., and Ambikairajah, E. (2021).
“AusKidTalk: An auditory-visual corpus of 3- to 12-year-old Australian children’s speech,” in Proceedings of Interspeech 2021, 30 August–3 September, Brno, Czechia (ISCA, Belgium), pp. 3680–3684.
Bailey, A. L., and Heritage, M. (2014). “The role of language learning progressions in improved instruction and assessment of English language learners,” TESOL Q. 48(3), 480–506.
Bailey, A. L., and Heritage, M. (2018). Progressing Students’ Language Day by Day (Corwin, Thousand Oaks, CA).
Benway, N. R., Preston, J. L., Hitchcock, E., Rose, Y., Salekin, A., Liang, W., and McAllister, T. (2023). “Reproducible speech research with the artificial intelligence-ready PERCEPT corpora,” J. Speech. Lang. Hear. Res. 66(6), 1986–2009.
Benway, N., Preston, J. L., Hitchcock, E., Salekin, A., Sharma, H., and McAllister, T. (2022). “PERCEPT-R: An open-access American English child/clinical speech corpus specialized for the audio classification of /ɹ/,” in Proceedings of Interspeech 2022, 18–22 September, Incheon, Korea (ISCA, Belgium), pp. 3648–3652.
Biemiller, A., and Slonim, N. (2001). “Estimating root word vocabulary growth in normative and advantaged populations: Evidence for a common sequence of vocabulary acquisition,” J. Educ. Psychol. 93(3), 498–520.
Bus, A. G., and Van IJzendoorn, M. H. (1999). “Phonological awareness and early reading: A meta-analysis of experimental training studies,” J. Educ. Psychol. 91(3), 403–414.
Cole, R., Hosom, P., and Pellom, B. (2006). “University of Colorado prompted and read children’s speech corpus,” Technical Report TR-CSLR-2006-03, Center for Spoken Language Research, Boulder, CO.
Cole, R., and Pellom, B. (2006). “University of Colorado read and summarized stories corpus,” Technical Report TR-CSLR-2006-03, Center for Spoken Language Research, Boulder, CO.
Dahlbäck, N., Jönsson, A., and Ahrenberg, L. (1993).
“Wizard of Oz studies: Why and how,” in Proceedings of the 1st International Conference on Intelligent User Interfaces, 4–7 January, Orlando, FL (Association for Computing Machinery, New York), pp. 193–200.
Demuth, K., Culbertson, J., and Alter, J. (2006). “Word-minimality, epenthesis and coda licensing in the early acquisition of English,” Lang. Speech 49(2), 137–173.
Dutta, S., Tao, S. A., Reyna, J. C., Hacker, R. E., Irvin, D. W., Buzhardt, J. F., and Hansen, J. H. L. (2022). “Challenges remain in building ASR for spontaneous preschool children speech in naturalistic educational environments,” in Proceedings of Interspeech 2022, 18–22 September, Incheon, Korea (ISCA, Belgium), pp. 2706–2710.
Eshky, A., Ribeiro, M. S., Cleland, J., Richmond, K., Roxburgh, Z., Scobbie, J., and Wrench, A. (2018). “Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,” in Proceedings of Interspeech 2018, 2–6 September, Hyderabad, India (ISCA, Belgium), pp. 1888–1892.
Eskenazi, M. S. (1996). “Kids: A database of children’s speech,” J. Acoust. Soc. Am. 100(4), 2759.
Fish, M., and Pinkerman, B. (2003). “Language skills in low-SES rural Appalachian children: Normative development and individual differences, infancy to preschool,” J. Appl. Dev. Psychol. 23(5), 539–565.
Hart, B., Risley, T. R., and Kirby, J. R. (1997). “Meaningful differences in the everyday experience of young American children,” Can. J. Educ. 22(3), 323.
Irwin, V., De La Rosa, J., Wang, K., Hein, S., Zhang, J., Burr, R., Roberts, A., Barmer, A., Bullock Mann, F., Dilig, R., and Parker, S. (2022). “Report on the condition of education 2022. NCES 2022-144,” National Center for Education Statistics, Washington, DC, p. 232.
Johnson, A., Fan, R., Morris, R., and Alwan, A. (2022a).
“LPC augment: An LPC-based ASR data augmentation algorithm for low and zero-resource children’s dialects,” in Proceedings of ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 23–27 May, Singapore (IEEE, New York), pp. 8577–8581.
Johnson, A., Martin, A., Quintero, M., Bailey, A., and Alwan, A. (2022b). “Can social robots effectively elicit curiosity in STEM topics from K-1 students during oral assessments?,” in 2022 IEEE Global Engineering Education Conference (EDUCON), 28–31 March, Tunis, Tunisia (IEEE, New York), pp. 1264–1268.
Kanero, J., Geçkin, V., Oranç, C., Mamus, E., Küntay, A. C., and Göksun, T. (2018). “Social robots for early language learning: Current evidence and future directions,” Child Dev. Perspect. 12(3), 146–151.
Kazemzadeh, A., You, H., Iseli, M., Jones, B., Cui, X., Heritage, M., Price, P., Anderson, E., Narayanan, S., and Alwan, A. (2005). “TBALL data collection: The making of a young children’s speech corpus,” in Proceedings of Interspeech 2005, 4–8 September, Lisbon, Portugal (ISCA, Belgium), pp. 1581–1584.
Kory, J. M., Jeong, S., and Breazeal, C. L. (2013). “Robotic learning companions for early language development,” in Proceedings of the 15th ACM on International Conference on Multimodal Interaction, 9–13 December, Sydney, Australia (Association for Computing Machinery, New York), pp. 71–72.
Lee, S., Potamianos, A., and Narayanan, S. (1999). “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” J. Acoust. Soc. Am. 105(3), 1455–1468.
Paez, M. M., Tabors, P. O., and Lopez, L. M. (2007). “Dual language and literacy development of Spanish-speaking preschool children,” J. Appl. Dev. Psychol. 28(2), 85–102.
Russell, M. (2006). The PF-STAR British English Children’s Speech Corpus (The Speech Ark Limited, Edinburgh, UK).
Safavi, S., Najafian, M., Hanani, A., Russell, M., Jančovič, P., and Carey, M. (2012). “Speaker recognition for children’s speech,” in Proceedings of Interspeech 2012, 9–13 September, Portland, OR (ISCA, Belgium), pp. 1836–1839.
Shobaki, K., Hosom, J.-P., and Cole, R. (2000). “The OGI kids’ speech corpus and recognizers,” in Proceedings of ICSLP 2000, 16–20 October, Beijing, China (ISCA, Belgium), pp. 564–567.
Snow, C. E., Porche, M. V., Tabors, P. O., and Harris, S. R. (2007). Is Literacy Enough? Pathways to Academic Success for Adolescents (Brookes Publishing Co., Baltimore, MD).
Spaulding, S., Chen, H., Ali, S., Kulinski, M., and Breazeal, C. (2018). “A social robot system for modeling children’s word pronunciation,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’18), 10–15 July, Stockholm, Sweden (IFAAMAS, Richland, SC), pp. 1658–1666.
Speights Atkins, M., Bailey, D. J., and Boyce, S. E. (2020). “Speech exemplar and evaluation database (SEED) for clinical training in articulatory phonetics and speech science,” Clin. Linguist. Phonet. 34(9), 878–886.
Tran, T., Tinkler, M., Yeung, G., Alwan, A., and Ostendorf, M. (2020). “Analysis of disfluency in children’s speech,” in Proceedings of Interspeech 2020, 25–29 October, Shanghai, China (ISCA, Belgium), pp. 4278–4282.
Ward, W., Cole, R., Bolaños, D., Buchenroth-Martin, C., Svirsky, E., Van Vuuren, S., Weston, T., Zheng, J., and Becker, L. (2011). “My science tutor: A conversational multimedia virtual tutor for elementary school science,” ACM Trans. Speech Lang. Process. 7(4), 1–29.
Westlund, J. K., and Breazeal, C. (2015). “The interplay of robot language level with children’s language learning during storytelling,” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, 3–6 March, Portland, OR (ACM, New York), pp. 65–66.
Williams, S. M., Nix, D., and Fairweather, P. (2000).
“Using speech recognition technology to enhance literacy instruction for emerging readers,” in International Conference of the Learning Sciences, 12–16 June, Ann Arbor, MI (Psychology Press, Hove, UK), pp. 115–120.
Yeung, G., Afshan, A., Quintero, M., Martin, A., Spaulding, S., Park, H. W., Bailey, A., Breazeal, C., and Alwan, A. (2019a). “Towards the development of personalized learning companion robots for early speech and language assessment,” in Proceedings of the 2019 Annual Meeting of the American Educational Research Association (AERA), 5–9 April, Toronto, Canada (AERA, Washington, DC).
Yeung, G., and Alwan, A. (2018). “On the difficulties of automatic speech recognition for kindergarten-aged children,” in Proceedings of Interspeech 2018, 2–6 September, Hyderabad, India (ISCA, Belgium), pp. 1661–1665.
Yeung, G., and Alwan, A. (2019). “A frequency normalization technique for kindergarten speech recognition inspired by the role of f0 in vowel perception,” in Proceedings of Interspeech 2019, 15–19 September, Graz, Austria (ISCA, Belgium), pp. 1671–1675.
Yeung, G., Bailey, A. L., Afshan, A., Tinkler, M., Perez, M. Q., Martin, A., Pogossian, A. A., Spaulding, S., Park, H. W., Muco, M., and Alwan, A. (2019b). “A robotic interface for the administration of language, literacy, and speech pathology assessments for children,” in Proceedings of 8th ISCA Workshop on Speech and Language Technology in Education (SLaTE 2019), 20–21 September, Graz, Austria (ISCA, Belgium), pp. 41–42.