Validation of Babbly’s AI Algorithm for the Classification of Infant Language Milestones

Study in collaboration between Babbly and Healthcare Innovation and Technology Lab (HITLAB) led by Dr. Stan Kachnowski, Director of Digital Health at Columbia University

Report authors: Iva Brunec, Carla Margalef Bentabol, Stan Kachnowski

Correspondence: or


Introduction. The present study validates Babbly’s AI algorithm to detect developmentally relevant infant vocalization patterns in audio recordings. The algorithm predicts the presence of 4 unique infant vocalization classes (cooing, single-syllable babbling, duplicated babbling, and variegated babbling).

Methods. Human annotations of different babbling classes were compared against model predictions for audio recordings. Classification statistics were then calculated for the vocalization classes present in the annotations (ground truth) vs. the model predictions.

Results. The overall accuracy (F-1 score) was 0.86 per audio and 0.91 per infant (collapsing across all recordings per infant). Prediction accuracy was similar across the age span and across sexes. We observed high precision and recall (both 0.84) for duplicated (canonical) babbling, which is a particularly important marker of typical language development.

Discussion. Babbly’s AI model produces high accuracy results from short audio recordings taken by caregivers during normal family activities without restriction on location or activity. Babbly’s algorithm can therefore contribute to early detection of language delays and developmental disorders, such as autism, which are highly correlated with language delay. This can ultimately foster more effective interventions and improved outcomes for infants at risk.

Note: Throughout this white paper, we use the term “we” for clarity and ease of reading. Data collection and processing were conducted independently by the external team at HITLAB, at arm’s length from Babbly, to ensure objectivity and impartiality.

Figure 1. A graphical summary of the present validation study.


Speech and language delays in early childhood are prevalent. Approximately 7.5% of preschool children are estimated to have developmental language delay (McGregor, 2020; Norbury et al., 2016), with an estimated prevalence of approximately 10-20% at 2 years of age (Collisson et al., 2016; Zubrick et al., 2007). Recent studies suggest that the proportion of 1-year-olds missing language milestones has increased following the COVID-19 pandemic (Byrne et al., 2023). It is commonly assumed that most late talkers catch up to their peers as they approach school age, but this ‘wait-and-see’ approach has been criticized (Singleton, 2018; Buschmann et al., 2008). As many as 20-30% of late talkers do not catch up (Ellis and Thal, 2008), highlighting the importance of early detection and intervention for offsetting later knock-on effects on vocabulary and literacy (Carson et al., 2022; Ward, 1999; Elbro, Dalby, & Maarbjerg, 2011).

Babbling and language development

Signs of potential language delay can be detected even before the onset of first words, laying the groundwork for early intervention. The trajectory of infant vocalizations, extending from cooing to babbling, is a fundamental precursor to language acquisition and the development of communication (McGillion et al., 2017; Oller et al., 1999; Oller, 1980). Through babbling, infants gradually transition from reflexive vocalizations to intentional communication, reflecting an innate drive to engage with their environment and caregivers (Jusczyk, 1997). Research across multiple languages and cultures shows that while infants’ babbling characteristics increasingly align with the linguistic structure of their native language, they follow universal patterns in the developmental milestones from babbling to language over time (Levitt & Utman, 1992; Werker & Tees, 1999).

Infants with delays in babbling in their first year of life are more likely to have smaller vocabularies and delayed language development between 18-30 months of age (McGillion et al., 2017; Oller et al., 1999). Notably, the absence of or significant reduction in babbling can serve as an early indicator of potential developmental disorders or delays. Infants later diagnosed with autism exhibited lower canonical babbling rates compared to typically developing infants (Yankowitz et al., 2022; Patten et al., 2014). Reduced or absent canonical babbling was also related to language delay in general (Lohmander et al., 2017), and infants with lower babbling rates and lower overall vocal output (volubility) were more likely to later be diagnosed with apraxia (Overby et al., 2020). Identifying deviations from typical babbling trajectories can thus hold crucial diagnostic value and support early intervention strategies.

Babbly’s AI algorithm

Babbly’s artificial intelligence (AI) algorithm takes audio clips of baby vocalizations as input and identifies key early speech milestones. First, the algorithm passes the audio to an utterance detection model, which identifies which segments of the audio contain baby and/or adult voice. Then, the audio segments where a baby voice was detected (either on its own or overlapping with adult voices) are passed to a speech milestones classification model. This model classifies each baby vocalization as one of 5 speech classes: cooing, single-syllable babbling, duplicated babbling, variegated babbling, or ‘other’. The first four classes are developmentally relevant patterns of infant vocalizations, covering the range of vocalizations from birth to first words. These milestones are often difficult to spot for non-experts, making the application of AI technology a valuable aid in their detection.
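The two-stage flow described above can be sketched roughly as follows. This is only an illustration of the control flow, not Babbly’s implementation: the `Segment` type and the `detect_utterances` and `classify_segment` callables are hypothetical placeholders standing in for the utterance detection model and the speech milestones classification model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Set


@dataclass
class Segment:
    """Hypothetical container for one detected utterance."""
    start_s: float
    end_s: float
    speakers: Set[str]  # e.g. {"baby"}, {"adult"}, or {"baby", "adult"}


def classes_in_recording(
    audio: Any,
    detect_utterances: Callable[[Any], Iterable[Segment]],
    classify_segment: Callable[[Any, Segment], str],
) -> Set[str]:
    """Return the unique classes assigned to baby-voice segments."""
    detected = set()
    for segment in detect_utterances(audio):
        # Only segments containing a baby voice (alone or overlapping
        # with an adult voice) reach the milestone classifier.
        if "baby" in segment.speakers:
            detected.add(classify_segment(audio, segment))
    return detected


# Usage with stand-in models (assumed behaviour, for illustration only).
fake_segments = [
    Segment(0.0, 1.0, {"adult"}),
    Segment(1.0, 2.0, {"baby"}),
    Segment(2.0, 3.0, {"baby", "adult"}),
]
result = classes_in_recording(
    audio=None,
    detect_utterances=lambda audio: fake_segments,
    classify_segment=lambda audio, seg: (
        "cooing" if seg.start_s < 2.0 else "duplicated babbling"
    ),
)
```

Passing the two models in as callables keeps the sketch independent of how either model is actually built or served.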

Validation Study

In the present study, we aimed to validate the accuracy of this algorithm in detecting the presence of different vocalization patterns in newly collected real-world recordings of infants between 4-16 months of age. For each recording and each infant, we compared the predictions of the algorithm to annotations made by three independent trained human observers and assessed the level of alignment between the labels provided by the annotators and the predictions made by the algorithm. To do so, we compared the unique labels present in the annotations for each audio, and the unique labels present in the predictions for each audio, and applied established classification statistics to these values (see Figure 1).

To preview, we observed good agreement between model predictions and human annotations. A high level of accuracy was achieved both for predicting annotated labels when they were present in the annotations, and for avoiding false positives when they were not present. We observed no meaningful differences in prediction accuracy based on sex, and observed comparable levels of performance for infants younger than a year and those a year old or older, highlighting that the model can be adopted across all stages of preverbal development.


The participants were recruited and new recordings were collected by the Healthcare Innovation and Technology Lab (HITLAB). HITLAB conducts digital health research to assess the potential outcome efficacy and economic value of new technologies. Researchers were given access to Babbly’s AI remote service to evaluate the model’s prediction accuracy.


Families with infants aged 4-16 months with no diagnosis of language delay were recruited through social media mother/parent groups and with physical flyers. The families signed virtual consent forms prior to participating. This study was approved by the Biomedical Research Alliance of New York (BRANY) Institutional Review Board (BABBLY_AI_0523/BRANY #23-08-378-681).

Of the 105 families who initially responded, 90 were invited for interviews, and 75 were screened to ensure they matched the inclusion criteria. Of these, 47 expressed interest in participating. A final sample of 35 families was recruited, and 2-5 recordings were collected from each family. After recordings with annotator disagreement were excluded (see Audio Data), 101 recordings remained for analysis. Each family was compensated with a $50 gift card upon completion of the study.

Thirteen of the infants were female and 22 were male. Nineteen of the participants were 11 months old or younger, and 16 were 12 months old or older.

Data Annotation

Each audio was annotated by 3 independent annotators who were unaware of the labels provided by other annotators. A team of 5 annotators was recruited and trained on annotation guidelines for 7 unique classes. Three of the annotators were graduate (MSc) students in Speech-Language Pathology, one was a recent graduate with prior annotation experience, and one was an active Speech-Language Pathologist with prior annotation experience.

All annotators were trained on 7 unique vocalization classes:

  1. Cooing: strings of primarily vowels with occasional early consonants (‘ooo’, ‘aaa’, ‘goo’)
  2. Single-syllable babbling: distinct single syllables, separated by a pause or breath (‘da’,
  3. Duplicated babbling: canonical babbling, repetitive consonant-vowel syllables not
    separated by a pause or breath (‘mamama’, ‘dadada’)
  4. Variegated babbling: babbling with different consonant-vowel combinations, akin to
    language (‘dabadu’, ‘badupi’)
  5. Other baby voice: any other sounds generated by an infant (grunts, cries, whining,
  6. Adult voice: any adult voice in the recording, whether it was directed at the infant or not
  7. Child voice: children whose voices are clearly distinct from adults, but their speech is
    fluent and easily understandable (for example, older siblings).

Only categories 1-5 were included in the subsequent validation analysis; the remaining categories were annotated for completeness and to ensure agreement among annotators.

For each audio recording, each annotator provided a list of unique classes present. Each recording could contain any number of classes, up to a maximum of 7. Once three annotators labeled each audio, a majority vote agreement between annotators was established as ground truth. This means that for each audio recording, the final annotation classes were those where at least two of the three annotators agreed. All audio recordings where the annotators disagreed on more than 1 class were removed from the analysis entirely.
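A minimal sketch of the majority-vote rule and the exclusion criterion is shown below. The function names are illustrative, and one detail is an assumption: a class is counted as a disagreement when at least one, but not all, of the annotators marked it as present. The study does not spell out how disagreements were counted.

```python
from collections import Counter


def label_counts(annotations):
    """Count how many annotators marked each class as present."""
    return Counter(label for labels in annotations for label in set(labels))


def majority_vote(annotations, min_votes=2):
    """Classes chosen by at least `min_votes` annotators (2 of 3 here)."""
    return {label for label, n in label_counts(annotations).items() if n >= min_votes}


def count_disagreements(annotations):
    """Assumed definition: classes marked by some, but not all, annotators."""
    n_annotators = len(annotations)
    return sum(1 for n in label_counts(annotations).values() if n < n_annotators)


def ground_truth_or_none(annotations, max_disagreements=1):
    """Return majority-vote labels, or None if the audio should be excluded."""
    if count_disagreements(annotations) > max_disagreements:
        return None
    return majority_vote(annotations)


# One disagreement (duplicated babbling): the audio is kept.
kept = ground_truth_or_none([
    {"cooing", "duplicated babbling"},
    {"cooing", "duplicated babbling"},
    {"cooing"},
])

# Three disagreements: the audio is removed from the analysis.
removed = ground_truth_or_none([
    {"cooing"},
    {"duplicated babbling"},
    {"variegated babbling"},
])
```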

Audio Data

Participants were asked to provide recordings consisting of between 30 seconds and 3 minutes of continuous audio. They were instructed to minimize background noise, make sure only one infant’s voice was present in each recording, and try to avoid overlaps between adult and baby voices (adult voices were allowed in the recordings).

Participants uploaded their audio recordings using a secure link to a HIPAA-compliant repository. A total of 128 files were received from the 35 infants. Of these, 27 had more than one disagreement among the annotators, and were removed from subsequent analysis, leaving 101 audio recordings.

Within the final dataset of 101 files, the average number of recordings per infant was 2.88 (minimum 1 after removing audios with annotator disagreement, maximum 4). The average audio recording duration was 42.9 seconds (range 30.0-76.4s), and the average total amount of audio per infant was 123.9 seconds (range 32.7-214.3s).

Babbly’s AI Model

The audio recordings were processed by Babbly’s AI remote service, which provided instantaneous results to the researchers. The output of the AI algorithm was a list of classes detected in each audio recording, which was subsequently compared to human expert annotations.


Alignment Analysis

We assessed the agreement between model predictions and human annotations by comparing the predicted vs. annotated lists of unique classes. This analysis was applied 1) to the lists of predicted and annotated classes in each audio, and 2) to the lists of unique predicted and annotated classes per infant (taking into account all classes across all audios). The analysis per infant was applied to validate whether a larger amount of data would lead to more reliable predictions of babbling patterns. Standard classification statistics in machine learning (ML) evaluation were then computed for these list-pairs.
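The per-infant view can be illustrated with a short sketch that pools the unique classes from all of an infant’s audios. The identifiers (`a1`, `inf_01`, and so on) are made up for the example, and the same pooling would be applied separately to the annotated and the predicted lists.

```python
from collections import defaultdict


def labels_per_infant(labels_per_audio, infant_of_audio):
    """Union of the unique classes across all audios belonging to each infant."""
    pooled = defaultdict(set)
    for audio_id, labels in labels_per_audio.items():
        pooled[infant_of_audio[audio_id]] |= set(labels)
    return dict(pooled)


# Hypothetical example: two audios from one infant, one from another.
labels_per_audio = {
    "a1": {"cooing"},
    "a2": {"cooing", "single-syllable babbling"},
    "a3": {"duplicated babbling"},
}
infant_of_audio = {"a1": "inf_01", "a2": "inf_01", "a3": "inf_02"}

per_infant = labels_per_infant(labels_per_audio, infant_of_audio)
```

Because classes are pooled as sets, a class heard in any one of an infant’s recordings counts as present for that infant.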

Classification Statistics

In each of the sections below, we report the following classification statistics:


Precision

Proportion of correct predictions out of all predictions. Reflects how frequently the model predictions contain a class that is indeed present in the annotations.


Recall

Proportion of correctly identified positive cases out of all actual occurrences of a class in the annotations. Reflects how frequently a class present in the annotations is included in the predictions.

F1 Score

The harmonic mean of precision and recall. Provides a balanced overall measure of the model’s accuracy.
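As an illustration of these definitions, the sketch below computes precision, recall, and F1 for a single class from paired lists of annotated and predicted label sets. The data are invented for the example and are not taken from the study, and returning 0.0 when a denominator is zero is a convention chosen here, not something the study specifies.

```python
def class_stats(truth_sets, pred_sets, label):
    """Precision, recall, F1 and support for one class over paired label sets."""
    pairs = list(zip(truth_sets, pred_sets))
    tp = sum(1 for t, p in pairs if label in t and label in p)      # predicted and annotated
    fp = sum(1 for t, p in pairs if label not in t and label in p)  # predicted, not annotated
    fn = sum(1 for t, p in pairs if label in t and label not in p)  # annotated, not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Invented example: three audios with annotated vs. predicted classes.
truth = [{"cooing"}, {"cooing", "duplicated babbling"}, {"duplicated babbling"}]
pred = [{"cooing"}, {"cooing"}, {"cooing", "duplicated babbling"}]

cooing = class_stats(truth, pred, "cooing")
duplicated = class_stats(truth, pred, "duplicated babbling")
```

In this toy example, cooing has one false positive (precision 2/3, recall 1), while duplicated babbling has one miss (precision 1, recall 1/2).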

Results Per Audio

Table 1 below quantifies whether each annotated class was predicted in each audio, whether it was missed by the model, or whether a class not present in the annotations was predicted (false positive). The total ‘support’ value therefore reflects the total number of labels across all audios, and the class-specific ‘support’ values reflect the total count of occurrences of a given class.

The overall weighted average was 0.82 for precision, and 0.91 for recall. The overall F-1 score was 0.86, with a total of 305 label-prediction samples.
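The report does not state how the overall weighted average was formed. A common convention, assumed here, is to weight each class’s statistic by its support (the number of times the class occurs in the annotations). The sketch below shows that calculation with made-up per-class values; they are not the values from Table 1.

```python
def weighted_average(per_class, metric):
    """Support-weighted mean of one metric across classes (assumed convention)."""
    total_support = sum(stats["support"] for stats in per_class.values())
    if total_support == 0:
        return 0.0
    weighted_sum = sum(stats[metric] * stats["support"] for stats in per_class.values())
    return weighted_sum / total_support


# Made-up per-class values, for illustration only.
per_class = {
    "cooing": {"precision": 0.8, "support": 30},
    "duplicated babbling": {"precision": 0.9, "support": 10},
}

overall_precision = weighted_average(per_class, "precision")  # (0.8*30 + 0.9*10) / 40
```

Under this convention, frequent classes influence the overall figure more than rare ones, so a weighted F-1 score need not equal the harmonic mean of the weighted precision and recall.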

Table 1. Classification statistics per audio.

Results Per Infant

Table 2 below provides results on data summarized for each infant. Instead of calculating the presence/absence of each class in each audio, these statistics were calculated for each infant, collapsing across all audio recordings. The maximum support value per class is therefore equivalent to the number of infants in the study.

The overall weighted average was 0.87 for precision, and 0.97 for recall. The overall F-1 score was 0.91, with a total of 128 infant-prediction samples.

Table 2. Classification statistics per infant.

Results By Sex & Age

We next split the per-infant results above by sex (M vs. F), and by age group (<= 11 months vs. >= 12 months). Table 3 contains the results split by sex, and Table 4 contains the results split by age group.
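The grouping step can be sketched as follows, using the age cut-off stated above (11 months or younger vs. 12 months or older). The infant records are hypothetical; once infants are grouped, the same per-infant statistics would be computed within each group.

```python
from collections import defaultdict


def age_group(age_months):
    """Age groups used for the split: <= 11 months vs. >= 12 months."""
    return "<= 11 months" if age_months <= 11 else ">= 12 months"


def split_infants(infants, key):
    """Group infant IDs by the value returned from `key`."""
    groups = defaultdict(list)
    for infant in infants:
        groups[key(infant)].append(infant["id"])
    return dict(groups)


# Hypothetical infant records, for illustration only.
infants = [
    {"id": "inf_01", "sex": "F", "age_months": 6},
    {"id": "inf_02", "sex": "M", "age_months": 12},
    {"id": "inf_03", "sex": "M", "age_months": 11},
]

by_sex = split_infants(infants, key=lambda infant: infant["sex"])
by_age = split_infants(infants, key=lambda infant: age_group(infant["age_months"]))
```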

The overall F-1 score was 0.93 for female infants (precision 0.89, recall 0.98), and 0.91 for male infants (precision 0.85, recall 0.96).

Table 3. Classification statistics per infant, split by sex.

The overall F-1 score was 0.89 for infants 11 months old or younger (precision 0.85, recall 0.94), and 0.94 for infants 12 months old or older (precision 0.90, recall 1.0). The algorithm therefore performed marginally better for the older group, but it is worth noting that both groups were fairly small (19 and 16 infants, respectively).

Table 4. Classification statistics per infant, split by age group.


The validation of AI technologies against human expertise is an essential step in establishing their credibility and reliability for clinical applications. In the present study, we evaluated the prediction accuracy of Babbly’s AI babbling milestones classification algorithm against human experts. A high degree of accuracy was achieved, suggesting that the algorithm reliably and accurately predicts 4 developmentally relevant infant vocalization patterns in audio recordings of 4-16-month-old infants. Three key conclusions can be drawn: 1) The model performs with high accuracy on new recordings from infants’ everyday environments; 2) Precision and recall are both high and balanced for duplicated (canonical) babbling; and 3) The model’s performance is balanced across the age range and across sex in the present study.

The high level of accuracy suggests that Babbly’s model is sufficiently sensitive and robust in generalizing to audio recordings collected independently of prior training/validation/test sets used to build the model. The audio recordings were made in a variety of environments and with no restriction on recording device, highlighting the utility of the algorithm in detecting infant vocalization in real-world settings and contexts.

Notably, good and balanced precision and recall values were achieved for duplicated (canonical) babbling. This result highlights the utility of our algorithm in detecting preverbal language skills especially relevant to subsequent language development. Canonical babbling is a particularly important marker of typical language development in the first year of life, as its absence or reduction can indicate subsequent language delay or developmental disorders such as autism (Oller et al., 1999; Patten et al., 2014; Yankowitz et al., 2022). Reliable detection of the presence or absence of canonical babbling could therefore contribute significantly to the early detection of potential language development issues.

We found no meaningful differences in prediction accuracy based on sex or age, suggesting that the model can be applied equally reliably across the developmental trajectory and across sexes. Some research suggests that language delays are somewhat more frequent in boys compared to girls (Adani & Cepanec, 2019), and that boys produce first words slightly later than girls, on average (Zubrick et al., 2007). While these differences tend to be small and are considered to be driven by environmental factors (Van Hulle et al., 2004), the present results suggest that the algorithm can be used to detect potential missed milestones with balanced accuracy for boys and girls.

The present study has two limitations to be addressed in future work. First, the algorithm has not been validated on bilingual families. While the model was built using a variety of languages in its training set, it has most frequently been applied to English-speaking recordings. Second, the present study did not test the prediction accuracy on recordings of premature vs. full-term infants due to limited scope (all but one infant were full-term). There is some evidence that infants born prematurely (before 37 weeks of gestation) are more likely to have delays in language development (Zimmerman, 2018; Sansavini et al., 2011). However, the fact that the algorithm is equally reliable across the age range in the present study suggests that the absence of age-appropriate babbling milestones would be detected regardless of premature status.

In sum, the present results show promise in using Babbly’s AI technology to detect language milestones before the onset of first words. The AI model’s capacity to accurately detect these patterns lays the groundwork for faster identification and intervention in potential language delays. Moving forward, further research could refine the AI model’s performance by incorporating larger and more diverse datasets, including more linguistic contexts. This iterative process of improvement will continue to bridge the gap between AI capabilities and human expertise, ultimately enhancing our ability to detect and address language development challenges in a holistic manner. Reliable detection and assessment could also save clinicians time in evaluating children and provide clinicians with more accurate information than is otherwise available in short visits.


Adani, S., & Cepanec, M. (2019). Sex differences in early communication development: behavioral and neurobiological indicators of more vulnerable communication system development in boys. Croatian Medical Journal, 60(2), 141-149.

Buschmann, A., Jooss, B., Rupp, A., Dockter, S., Blaschtikowitz, H., Heggen, I., & Pietz, J. (2008). Children with developmental language delay at 24 months of age: results of a diagnostic work‐up. Developmental Medicine & Child Neurology, 50(3), 223-229.

Byrne, S., Sledge, H., Franklin, R., Boland, F., Murray, D. M., & Hourihane, J. (2023). Social communication skill attainment in babies born during the COVID-19 pandemic: a birth cohort study. Archives of Disease in Childhood, 108(1), 20-24.


Carson, L., Baker, E., & Munro, N. (2022). A systematic review of interventions for late talkers: intervention approaches, elements, and vocabulary outcomes. American Journal of Speech-Language Pathology, 31(6), 2861-2874.

Collisson, B. A., Graham, S. A., Preston, J. L., Rose, M. S., McDonald, S., & Tough, S. (2016). Risk and protective factors for late talking: An epidemiologic investigation. The Journal of Pediatrics, 172, 168–174.


Elbro, C., Dalby, M., & Maarbjerg, S. (2011). Language‐learning impairments: a 30‐year follow‐up of language‐impaired children with and without psychiatric, neurological and cognitive difficulties. International Journal of Language & Communication Disorders, 46(4), 437-448.


Ellis, E. M., & Thal, D. J. (2008). Early language delay and risk for language impairment. Perspectives on Language Learning and Education, 15(3), 93-100.


Jusczyk, P. W. (1997). The discovery of spoken language. The MIT Press.

Levitt, A. G., & Utman, J. G. A. (1992). From babbling towards the sound systems of English and French: A longitudinal two-case study. Journal of Child Language, 19(1), 19-49.


Lohmander, A., Holm, K., Eriksson, S., & Lieberman, M. (2017). Observation method identifies that a lack of canonical babbling can indicate future speech and language problems. Acta Paediatrica, 106(6), 935-943.


McGillion, M., Herbert, J. S., Pine, J., Vihman, M., DePaolis, R., Keren‐Portnoy, T., & Matthews, D. (2017). What paves the way to conventional language? The predictive value of babble, pointing, and socioeconomic status. Child Development, 88(1), 156-166.


McGregor, K. K. (2020). How we fail children with developmental language disorder. Language, Speech, and Hearing Services in Schools, 51(4), 981-992.


Norbury, C. F., Gooch, D., Wray, C., Baird, G., Charman, T., Simonoff, E., … & Pickles, A. (2016). The impact of nonverbal ability on prevalence and clinical presentation of language disorder: Evidence from a population study. Journal of Child Psychology and Psychiatry, 57(11), 1247-1257.


Oller, D. K. (1980). The emergence of the sounds of speech in infancy. Child Phonology, 1, 93–112.


Oller, D. K., Eilers, R. E., Neal, A. R., & Schwartz, H. K. (1999). Precursors to speech in infancy: The prediction of speech and language disorders. Journal of Communication Disorders, 32(4), 223-245.


Overby, M., Belardi, K., & Schreiber, J. (2020). A retrospective video analysis of canonical babbling and volubility in infants later diagnosed with childhood apraxia of speech. Clinical Linguistics & Phonetics, 34(7), 634-651.


Patten, E., Belardi, K., Baranek, G. T., Watson, L. R., Labban, J. D., & Oller, D. K. (2014). Vocal patterns in infants with autism spectrum disorder: Canonical babbling status and vocalization frequency. Journal of Autism and Developmental Disorders, 44, 2413-2428.


Sansavini, A., Guarini, A., Savini, S., Broccoli, S., Justice, L., Alessandroni, R., & Faldella, G. (2011). Longitudinal trajectories of gestural and linguistic abilities in very pre-term infants in the second year of life. Neuropsychologia, 49(13), 3677-3688.


Singleton, N. C. (2018). Late talkers: Why the wait-and-see approach is outdated. Pediatric Clinics, 65(1), 13-29.


Van Hulle, C., Goldsmith, H., & Lemery, S. (2004). Genetic, environmental, and gender effects on individual differences in toddler expressive language. Journal of Speech, Language, and Hearing Research, 47(4), 904-912.


Ward, S. (1999). An investigation into the effectiveness of an early intervention method for delayed language development in young children. International Journal of Language & Communication Disorders, 34(3), 243-264.


Werker, J. F., & Tees, R. C. (1999). Influences on infant speech processing: Toward a new synthesis. Annual Review of Psychology, 50(1), 509-535.


Yankowitz, L. D., Petrulla, V., Plate, S., Tunc, B., Guthrie, W., Meera, S. S., … & Parish-Morris, J. (2022). Infants later diagnosed with autism have lower canonical babbling ratios in the first year of life. Molecular Autism, 13(1), 1-16.


Zimmerman, E. (2018). Do infants born very premature and who have very low birth weight catch up with their full term peers in their language abilities by early school age? Journal of Speech, Language, and Hearing Research, 61, 53-65.


Zubrick, S. R., Taylor, C. L., Rice, M. L., & Slegers, D. W. (2007). Late language emergence at 24 months: An epidemiological study of prevalence, predictors and covariates. Journal of Speech, Language, and Hearing Research, 50(6), 1562-1592.