

Automated vocal analysis of naturalistic recordings from children with autism, language delay, and typical development

D. K. Oller a,b,1, P. Niyogi c, S. Gray d, J. A. Richards d, J. Gilkerson d, D. Xu d, U. Yapanel d, and S. F. Warren e

a School of Audiology and Speech-Language Pathology, University of Memphis, Memphis, TN 38105; b Konrad Lorenz Institute for Evolution and Cognition Research, Altenberg, Austria A-3422; c Departments of Computer Science and Statistics, University of Chicago, Chicago, IL 60637; d LENA Foundation, Boulder, CO 80301; and e Department of Applied Behavioral Science and Institute for Life Span Studies, University of Kansas, Lawrence, KS 66045

Edited* by E. Anne Cutler, Max Planck Institute for Psycholinguistics, Heilig Landstichting, Netherlands, and approved June 21, 2010 (received for review April 21, 2010)

For generations the study of vocal development and its role in language has been conducted laboriously, with human transcribers and analysts coding and taking measurements from small recorded samples. Our research illustrates a method to obtain measures of early speech development through automated analysis of massive quantities of day-long audio recordings collected naturalistically in children’s homes. A primary goal is to provide insights into the development of infant control over infrastructural characteristics of speech through large-scale statistical analysis of strategically selected acoustic parameters. In pursuit of this goal we have discovered that the first automated approach we implemented is not only able to track children’s development on acoustic parameters known to play key roles in speech, but also is able to differentiate vocalizations from typically developing children and children with autism or language delay. The method is totally automated, with no human intervention, allowing efficient sampling and analysis at unprecedented scales. The work shows the potential to fundamentally enhance research in vocal development and to add a fully objective measure to the battery used to detect speech-related disorders in early childhood. Thus, automated analysis should soon be able to contribute to screening and diagnosis procedures for early disorders, and more generally, the findings suggest fundamental methods for the study of language in natural environments.

vocal development | automated identification of language disorders | all-day recording | automated speaker labeling | autism identification

Typically developing children in natural environments acquire a linguistic system of remarkable complexity, a fact that has attracted considerable scientific attention (1–6). Yet research on vocal development and related disorders has been hampered by the requirement for human transcribers and analysts to take laborious measurements from small recorded samples (7–11). Consequently, not only has the scientific study of vocal development been limited, but the potential for research to help identify vocal disorders early in life has, with a few exceptions (12–14), been severely restricted.

We begin, then, with the question of whether it is possible to construct a fully automated system (i.e., an algorithm) that uses nothing but acoustic data from the child’s “natural” linguistic environment, isolates child vocalizations, extracts significant features, and assesses vocalizations in a useful way for predicting level of development and detection of developmental anomalies. There are major challenges for such a procedure. First, with recordings in a naturalistic environment, isolating child vocalizations from extraneous acoustic input (e.g., parents, other children, background noise, television) is vexingly difficult. Second, even if the child’s voice can be isolated, there remains the challenge of determining features predictive of linguistic development. We describe procedures permitting massive-scale recording and use the data collected to show it is possible to track linguistic development with totally automated procedures and to differentiate recordings from typically developing children and children with language-related disorders. We interpret this as a proof of concept that automated analysis can now be included in the tool repertoire for research on vocal development and should be expected soon to yield further scientific benefits with clinical implications.

Naturalistic sampling on the desired scale capitalizes on a battery-powered all-day recorder weighing 70 g that can be snapped into a chest pocket of children’s clothing, from which it records in completely natural environments (SI Appendix, “Recording Device”). Recordings with this device have been collected since 2006 in three subsamples: typically developing language, language delay, and autism (Fig. 1A and SI Appendix, “Participant Groups, Recording and Assessment Procedures”). In both phase I (2006–2008) and phase II (2009) studies (SI Appendix, Table S2), parents responded to advertisements and indicated if their children had been diagnosed with autism or language delay. Children in phase I with reported diagnosis of language delay were also evaluated by a speech-language clinician used by our project. Parents of children with language delay in phase II and parents of children with autism in both phases supplied documentation from the diagnosing clinicians, who were independent of the research. Parent-based assessments obtained concurrently with recordings (SI Appendix, Table S3) confirmed sharp group differences concordant with diagnoses; the autism sample had sociopsychiatric profiles similar to those of published autism samples, and low language scores occurred in autism and language delay samples (Fig. 1B and SI Appendix, Figs. S2–S4 and “Analyses Indicating Appropriate Characteristics of the Participant Groups”).

Demographic information (Fig. 1B and SI Appendix, Tables S3 and S4) was collected through both phases: boys appeared disproportionately in the disordered groups, mother’s educational level (a proxy for socioeconomic status) was higher for children with language disorders, and general developmental levels were low in the disordered groups. A total of 113 children were matched

Author contributions: D.K.O., P.N., S.G., J.A.R., J.G., D.X., and S.F.W. designed research; S.G., J.A.R., J.G., D.X., and U.Y. performed research; P.N., S.G., J.A.R., and D.X. contributed new reagents/analytic tools; D.K.O., P.N., S.G., J.A.R., and D.X. analyzed data; and D.K.O., P.N., and S.F.W. wrote the paper.

Conflict of interest statement: The recordings and hardware/software development were funded by Terrance and Judi Paul, owners of the previous for-profit company Infoture. Dissolution of the company was announced February 10, 2009, and it was reconstituted as the not-for-profit LENA Foundation. All assets of Infoture were given to the LENA Foundation. Before dissolution of the company, D.K.O., P.N., and S.F.W. had received consultation fees for their roles on the Scientific Advisory Board of Infoture. J.A.R., J.G., and D.X. are current employees of the LENA Foundation. S.G. and U.Y. are affiliates and previous employees of Infoture/LENA Foundation. None of the authors has or has had any ownership in Infoture or the LENA Foundation.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.

1 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1003882107/-/DCSupplemental.

13354–13359 | PNAS | July 27, 2010 | vol. 107 | no. 30 www.pnas.org/cgi/doi/10.1073/pnas.1003882107


on these variables (Fig. 1C) to allow comparison with results on the entire sample.

Our dataset included 1,486 all-day recordings from 232 children, with more than 3.1 million automatically identified child utterances. The signal processing software reliably identified (SI Appendix, Tables S6 and S7) sequences of utterances from the child wearing the recorder, discarded cries and vegetative sounds, and labeled the remaining consecutive child vocalizations “speech-related child utterances” (SCUs; SI Appendix, Fig. S5). Additional analysis divided SCUs into “speech-related vocal islands” (SVIs), high-energy periods bounded by low-energy periods (SI Appendix, Figs. S6 and S7). Roughly, the energy criterion isolated salient “syllables” in SCUs. Analysis of SVIs focused on acoustic effects of rhythmic “movements” of jaw, tongue, and lips (i.e., articulation) that underlie syllabic organization, and on acoustic effects of vocal quality or “voice.” Infants show voluntary control of syllabification and voice in the first months of life and refine it throughout language acquisition (15, 16). Developmental tracking of these features by automated means at massive scales could add a major new component to language acquisition research. Given their infrastructural character, anomalies in development of rhythmic/syllabic articulation and voice might also suggest an emergent disorder (17, 18).

SVIs were analyzed on 12 infrastructural acoustic features reflecting rhythmic/syllabic articulation and voice and known to play roles in speech development (19) (SI Appendix, “Automated Acoustic Feature Analysis”). The features pertain to four conceptual groupings (SI Appendix, Table S5): (i) rhythm/syllabicity (RhSy), (ii) low spectral tilt/high pitch control (LtHp, designed to reflect certain voice characteristics), (iii) high bandwidth/low pitch control (BwLp, designed to reflect additional voice characteristics), and (iv) duration. The 12 features were assigned to the four groupings based on a combination of infrastructural vocal theory (19) and results of principal component (PC) analysis (PCA; SI Appendix, Table S10 and Fig. S12).
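As a purely illustrative sketch of the per-recording scoring this produces (the paper's analyses use the ratio of feature-positive SVIs to SCUs in each recording), the computation reduces to simple counting; the function and toy data below are hypothetical, not the LENA software:

```python
def svi_scu_ratios(svi_flags, n_scu):
    """svi_flags: one list of 12 booleans per speech-related vocal
    island (SVI), marking presence/absence of each acoustic feature;
    n_scu: number of speech-related child utterances (SCUs) in the
    recording. Returns 12 per-utterance usage scores, one per feature."""
    counts = [0] * 12
    for flags in svi_flags:
        for i, present in enumerate(flags):
            if present:
                counts[i] += 1
    # Dividing by SCU count adjusts for recording length and volubility.
    return [c / n_scu for c in counts]

# Toy recording: 3 SVIs, 4 SCUs.
flags = [[True] * 12, [True] + [False] * 11, [False] * 12]
ratios = svi_scu_ratios(flags, 4)
print(ratios[0])  # first feature present in 2 SVIs -> 2/4 = 0.5
```

Because the scores are ratios rather than raw counts, recordings of different lengths and from more or less voluble children remain comparable.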

The algorithm determined the degree to which each SVI showed presence or absence of each of the 12 features, providing a measure of infrastructural vocal development on each feature. Each SVI was classified as plus or minus (i.e., present or absent) for each feature. The number of “plus” classifications per parameter varied from more than 2,400 per recording on the voicing (VC) parameter to fewer than 100 on the wide formant bandwidth parameter (SI Appendix, Fig. S8). To adjust for differences in length of recordings and amount of vocalization (i.e., “volubility”) across individuals and recordings, we took the ratio for each parameter of the number of SVIs classified as plus to the number of SCUs. This yielded 12 numbers, one for each parameter at each recording, reflecting per-utterance usage of the 12 features. The SVI/SCU ratio also varied widely, from more than 1.3 per utterance for VC to fewer than 0.05 for wide formant bandwidth (SI Appendix, Fig. S9). The 12-dimensional vector composed of SVI/SCU ratios normalized for age was used to predict vocal development and to classify recordings as pertaining to children with or without a language-related disorder (SI Appendix, “Statistical Analysis Summary”).

The acoustic features were thus chosen as developmental indicators and as potential child group differentiators. Aberrations of voice have been noted from the first descriptions of autism spectrum disorders (ASDs) (20, 21); subsequent research has added evidence of abnormalities of prosody (10, 22–25) and articulatory features affecting rhythm/syllabicity (11, 26). Still, the standard diagnostic reference for mental/behavioral disorders (27) does not include vocal characteristics in assessment of ASD. Reasons appear to be that evidence of vocal abnormalities in autism is scattered and often vague, criteria for judgment of vocalization are not typically included in health professionals’ educational curricula, and reliability of judgment for vocal characteristics is problematic given individual variability (10, 26). Autism diagnosis is based heavily on “negative markers” such as joint attention deficit (28–31) and associated communication deficits (32–35). Abnormal

Fig. 1. Demographics. (A) Characteristics of the child groups and recordings. (B) Demographic parameters indicating that groups differed significantly in terms of gender, mother’s education level (a strong indicator of socioeconomic status), and general development age determined by the Child Development Inventory (CDI) (40), an extensive parent questionnaire assessment obtained for 86% of the children, including all those in the matched samples. This CDI age equivalent score is based on 70 items distributed across all of the subscales of the CDI, including language, motor development, and social skills. The groups also differed in age and in scores on the LENA Developmental Snapshot, a 52-item parent questionnaire measure of communication and language, obtained for all participants. (C) Demographic parameters for subsamples matched on gender, mother’s education level, and developmental age.




vocal characteristics would constitute a positive marker that might enhance precision of screening or diagnosis (31).

Most children with language-related disorders, even without autism, also show articulation and voice anomalies (36–39). Our work explores the possibility that the automated approach may be useful in discriminating typical from language-disordered development, as well as in discriminating autism from language delay. The power of the automated approach will presumably depend on its ability to predict changes with age in typically developing children.
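The paper's normative-model strategy (fit a multiple linear regression of age on the 12 ratio scores in typically developing recordings, then apply the coefficients to other groups to obtain a "predicted vocal development age") can be sketched roughly as follows. The generative model, weights, and variable names here are entirely synthetic assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "typically developing" sample: ages in months and 12 ratio
# scores that drift with age (a made-up generative model, not real data).
age = rng.uniform(10, 48, size=200)
weights = rng.uniform(0.5, 1.5, size=12)
X = age[:, None] * weights[None, :] + rng.normal(0, 1, size=(200, 12))

# Fit age ~ intercept + 12 ratio scores by ordinary least squares:
# the "normative model" idea.
A = np.column_stack([np.ones(len(age)), X])
coef, *_ = np.linalg.lstsq(A, age, rcond=None)

# Variance explained on the normative sample.
pred = A @ coef
r2 = 1 - ((age - pred) ** 2).sum() / ((age - age.mean()) ** 2).sum()

# Applying the same coefficients to new recordings (e.g., a clinical
# group) yields a "predicted vocal development age" per recording.
X_new = rng.uniform(5, 75, size=(5, 12))
dev_age = np.column_stack([np.ones(5), X_new]) @ coef
```

A group whose vocalizations develop differently from the normative sample would then show depressed or weakly age-correlated predicted ages, which is the pattern the paper reports for the autism sample.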

Results

Correlations between child age and SVI/SCU ratios for the 12 parameters showed clear evidence that the automated analysis predicts development: for the typically developing and language-delayed groups, five of the 12 correlations were greater than ±0.4, at a high level of statistical significance of P < 10⁻¹⁰ (Fig. 2A and SI Appendix, Table S9 A–D); all 12 correlations with age for the typically developing sample and seven of 12 for the language-delayed sample were statistically significant. The autistic sample, to the contrary, showed little evidence of development on the parameters; all their correlations of acoustic parameters with age were less than ±0.2, and only two were statistically reliable (SI Appendix, Table S9 E and F). The typically developing group showed reliable negative correlations (P < 10⁻⁴) on three parameters for which the other groups showed positive correlations, illustrating that certain vocal tendencies diminished with age in the typically developing group but did not diminish or increased with age in the others. Correlations of the 12 parameters with each other also revealed coherency within the four parameter groupings for all three child groups, but the autistic sample showed many more correlations not predicted by the four parameter groupings than the other child groups (SI Appendix, Table S9 A–F), a result suggesting children with autism organize acoustic infrastructure for vocalization differently from typically developing children or children with language delay.

Multiple linear regression (MLR) also illustrated development and group differentiation. The 12 acoustic parameter ratios (SVIs/SCUs) for each recording were regressed against age for the typically developing sample, yielding a normative model for development of vocalization. Coefficients of the model were used to calculate developmental ages for autistic and language-delayed recordings, displayed along with data for the typically developing recordings in Fig. 2 B and C. The MLR model for the 802 recordings from the typically developing sample accounted for greater than 58% of variance in predicted age. MLR for the 106 typically developing children yielded yet higher prediction, accounting for more than 69% of variance, and suggesting that the automated analysis can provide a strong indicator of vocal development within this age range.

The correlation of predicted vocal developmental age with real age based on MLR was much higher for the typically developing and language-delayed samples (Fig. 2C) than for the autism sample (Fig. 2B), in which the predicted levels were also generally considerably below those of the typically developing sample. The acoustic parameters within the four theoretical groupings were also regressed against age (SI Appendix, Fig. S11 A–D), with results suggesting strong differentiation of the typically developing and autistic samples for three of the four groupings (RhSy, LtHp, and duration). The low correlations with age for the autism sample in these MLR analyses based on the typically developing model do not imply that the autism sample showed no change with regard to the acoustic parameters across age. On the contrary, an independent MLR model for the autism sample alone using all 12 acoustic parameters as predictors accounted for more than 16% of variance (P < 0.001), indicating that the autism group did develop with age on the parameters. However, the nature of the change with regard to the parameters across age was clearly different in the two groups, as indicated by the MLR results.

Linear discriminant analysis (LDA) and linear logistic regression were used to model classification of children into the three groups based on the acoustic parameters. Here the ratio scores (SVIs/SCUs) for each of the 12 parameters were first standardized by calculating z-scores for each recording in monthly age intervals. Leave-one-out cross-validation (LOOCV) was the main holdout method to verify generalization of results to recordings not in the training samples (SI Appendix, “Statistical Analysis Summary”). Six pairwise comparisons (SI Appendix, Fig. S1), each based on 1,500 randomized splits into training and testing (i.e., holdout) samples of varying sizes, confirmed that LOOCV results were consistently representative of midrange randomly selected holdout results, and thus provide a useful standard for the data reported in the subsequent paragraphs. LOOCV results for LDA and linear logistic regression were nearly identical (SI Appendix, Table S1); the remainder of the present report thus focuses on LDA only.

The results indicated strong correct identification of autism versus typical development (χ², P < 10⁻²¹), with sensitivity (i.e., hit rate) of 0.75 and specificity (i.e., correct rejection rate) of 0.98, based on a threshold (i.e., cutoff) posterior classification probability of 0.52 (Fig. 3A). At equal-error probability, sensitivity and specificity were 0.86 (crossover point, Fig. 3B). LDA also differentiated the typically developing sample from a combined autism and language-delay sample. At a cutoff of 0.56, specificity (i.e., correct identification of typically developing children) was 0.90. At the same cutoff, 80% of the autism sample and 62% of the language-delay sample were correctly classified as not being in the typically developing group (Fig. 3C). Sensitivity and specificity at equal error probability for classification of children as being or not

[Fig. 2, panels A–C (figure not reproduced): (A) bar chart of correlations of the 12 acoustic parameters (VC, CS, SE, SQ, LT, HF, GW, WB, S, M, L, XL, grouped as RhSy, LtHp, BwLp, and duration) with age for the typical, delayed, and autistic samples, asterisks marking P < 0.004; (B and C) scatterplots of predicted vocal development age vs. real age in months.]

Fig. 2. Results of correlational analysis and MLR analysis. (A) Correlations of acoustic parameter SVI/SCU ratio scores with age across the 1,486 recordings. In 10 of 12 cases, both typically developing and language-delayed children showed higher absolute values of correlations with age than children with autism (SI Appendix, Table S9 A–F). All 12 correlations for the typically developing sample and seven of 12 for the language-delayed sample were statistically significant after Bonferroni correction (P < 0.004). The autistic sample, conversely, showed little evidence of development on the parameters; all correlations of acoustic parameters with age were lower than ±0.2. (B) MLR for typically developing and autism samples. Blue dots represent real and predicted age (i.e., “predicted vocal development age”) for 802 recordings of typically developing children based on SVI/SCU ratios for the 12 acoustic parameters (r = 0.762, R² = 0.581). Red dots represent 351 recordings of children with autism, for which predicted vocal development ages (r = 0.175, R² = 0.031) were determined based on the typically developing MLR model. Each red diamond represents the mean predicted vocal development level across recordings for one of the 77 children with autism. (C) MLR for the typically developing (blue) and language-delayed samples (333 recordings; gold squares for 49 individual child averages; r = 0.594, R² = 0.353).
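The sensitivity and specificity figures quoted in Results reduce to simple counts once each recording has a cross-validated posterior probability of group membership from the classifier. A minimal sketch with made-up posteriors and labels (not the paper's data or its LDA implementation):

```python
def sens_spec(posteriors, labels, cutoff):
    """posteriors: P(disordered) per child/recording; labels: 1 =
    disordered, 0 = typically developing. A case is classified as
    disordered when its posterior meets the cutoff."""
    tp = sum(p >= cutoff for p, y in zip(posteriors, labels) if y == 1)
    fn = sum(p < cutoff for p, y in zip(posteriors, labels) if y == 1)
    tn = sum(p < cutoff for p, y in zip(posteriors, labels) if y == 0)
    fp = sum(p >= cutoff for p, y in zip(posteriors, labels) if y == 0)
    return tp / (tp + fn), tn / (tn + fp)  # (sensitivity, specificity)

# Toy example at the paper's 0.52 cutoff.
p = [0.9, 0.6, 0.4, 0.3, 0.7, 0.1]
y = [1, 1, 1, 0, 0, 0]
sens, spec = sens_spec(p, y, cutoff=0.52)
print(sens, spec)  # 2/3 hit rate, 2/3 correct-rejection rate
```

The equal-error operating point is found by sweeping the cutoff until sensitivity and specificity cross (0.86 in the paper's autism-vs-typical comparison).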



being typically developing was 0.79 (Fig. 3D). Fig. 4A supplies LDA data on all six possible binary classificatory comparisons of the three child groups and illustrates strongest discrimination of the autism versus typically developing samples (P < 10⁻²¹). For the entire samples (left column), both typically developing and autism groups were also very reliably (P < 10⁻⁵) differentiated from the language-delayed group, although sensitivity/specificity was lower (0.73 and 0.70) than in the differentiation of the typically developing from autism groups or in differentiation of the typically developing from the combined language disordered groups.

Fig. 4A also reveals only slightly lower levels of group differentiation for modeling with LOOCV on the matched sample of 113 children. Holdout models (i.e., not based on LOOCV) with training on phase I data and testing on phase II data revealed similar results. Models based on LOOCV training for all recordings were also used to determine sensitivity and specificity rates on only the first recording available for each child, again with similar results seen in Fig. 4A. The first-recording analysis suggests the automated approach can discriminate groups quite reliably with a single all-day recording for each child.

Fig. 4B compares mean posterior probabilities of autism classification (LOOCV models) in subsamples for one configuration only: typically developing versus autism. Group differentiation was robust, applying to both genders and to higher and lower socioeconomic status (indicated by mother’s reported education level). Further, differentiation was strong for subgroups matched on language development and expressive language measures, for the autism subgroup reported to be speaking in words, as well as for subgroups with high and low within-group language scores. Additional comparisons in Fig. 4B illustrate that subgroups diagnosed with autism through widely used tests documented in clinical reports supplied by parents (SI Appendix, Table S3) were differentiated sharply from the typically developing sample by the automated analysis. All comparisons of typically developing and autistic samples in Fig. 4B were highly statistically reliable (P < 10⁻¹¹). Additional evaluations of posterior probabilities for individuals and subgroups of special interest in the autism and language-delay samples further illustrated the robustness of the automated analysis in group differentiation (SI Appendix, “Results: Individual Children and Subgroups of Particular Interest in the Group Discrimination Analyses”; and SI Appendix, Figs. S14 and S15). Very low correlations of age with posterior probabilities (r = −0.07 and r = 0.11 for autism and typical samples, respectively) suggest group differentiation by this automated method was not importantly dependent on age within these samples (SI Appendix, Fig. S16).

Long-term success of automated vocal analysis may depend on the “transparency” of modeling, the extent to which results are interpretable through vocal development theory. The present work was based on such a theory (SI Appendix, “The 12 Acoustic Parameters” and “Research in Development of Vocal Acoustic Characteristics”), and thus provides a beginning for explaining why the model worked in, for example, predicting age. In pursuit of such explanation we further evaluated correlations among SVI/SCU ratios for the 12 acoustic parameters. To alleviate confounding effects of intercorrelations among the parameters, it was useful to project the data along orthogonal principal components (PCs). The PCA (SI Appendix, Table S10 and Fig. S12 A–C) yielded one dominant PC in the typically developing sample (analysis conducted on all 802 recordings without differentiation of the recordings within child), accounting for 40% of variance, more than double that of any other PC. The dominant PC for the typically developing group was highly correlated with all of the a priori RhSy (rhythm/syllabicity) parameters: (i) canonical transitions (r = 0.88), reflecting for each recording the extent to which “syllables” (i.e., SVIs) had rapid formant transitions required in mature syllables, (ii) voicing (VC; r = 0.78), indicating whether SVIs had periodic vocal energy required for mature syllables, and (iii) spectral entropy (r = 0.71), indicating whether variations in vocal quality of SVIs were typical of variations occurring in mature speech. The dominant PC was also highly correlated with a parameter from the a priori duration grouping: (iv) medium duration (r = 0.91), reflecting SVI durations in the midrange for mature speech syllables.

When the four PCs of the typically developing sample were

regressed against age at the child level, this dominant PC playedthe central role in age prediction (R2 = 0.55, P < 10−19). Theβ-coefficient (0.73) for the dominant PC was more than three


Fig. 3. LDA with LOOCV indicating differentiation of child groups. Estimated classification probabilities were based on the 12 acoustic parameters. Results are displayed in bubble plots, with each bubble sized in proportion to the number of children at each x axis location (the number of children represented by the largest bubble in each line is labeled). The plots indicate classification probabilities (Left) and proportion-correct classification (Right) for LDA in (A and B) children with autism (red) versus typically developing children (blue) and (C and D) a combined group of children with autism (red) or language delay (gold) versus typically developing children (blue).

Oller et al. PNAS | July 27, 2010 | vol. 107 | no. 30 | 13357
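The LDA-with-LOOCV procedure behind the classification probabilities in Fig. 3 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses scikit-learn and randomly generated stand-ins for the 12 acoustic parameters, with group sizes matching the figure (106 typically developing, 77 autism).

```python
# Sketch (assumed workflow, synthetic data): leave-one-out cross-validated LDA
# producing a per-child posterior probability of autism classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n_typ, n_aut, n_params = 106, 77, 12
# Random stand-ins for the 12 acoustic parameters, with a crude group shift.
X = np.vstack([rng.normal(0.0, 1.0, (n_typ, n_params)),
               rng.normal(0.8, 1.0, (n_aut, n_params))])
y = np.array([0] * n_typ + [1] * n_aut)  # 0 = typical, 1 = autism

# Each child is held out once; the model is trained on everyone else.
posteriors = np.empty(len(y))
for train, test in LeaveOneOut().split(X):
    lda = LinearDiscriminantAnalysis().fit(X[train], y[train])
    posteriors[test] = lda.predict_proba(X[test])[0, 1]

pred = (posteriors >= 0.5).astype(int)
sensitivity = np.mean(pred[y == 1] == 1)  # autism children correctly flagged
specificity = np.mean(pred[y == 0] == 0)  # typical children correctly passed
print(f"sens={sensitivity:.2f} spec={specificity:.2f}")
```

Sweeping the 0.5 threshold across [0, 1] yields the sensitivity/specificity trade-off that the bubble plots in Fig. 3 visualize.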



times that of any other PC, suggesting that the typically developing children’s control of the infrastructure for syllabification provided the primary basis for modeling of vocal development age (SI Appendix, “Principal Components Analysis Indicating Empirical and Theoretical Organization of the Parameters”).

PCA at the child level for the autism sample showed a considerably different structure of relations between acoustic parameters and PCs, and importantly, the first PC for the autism sample showed low correlations with the a priori RhSy acoustic parameters (SI Appendix, Table S10 and Fig. S12). When the four PCs for the autism sample were regressed against age, the PC with high correlations on the a priori RhSy parameters (PC2 in this case) yielded a β of 0.16, less than one fourth of that occurring for the typically developing sample. Prediction of age by this autism PC was not statistically significant. Thus, in the typically developing sample, the PC most associated with control of the RhSy parameters was highly predictive of age, but in the autism sample, no such predictiveness was found.

To quantify the role of RhSy parameters in group differentiation, the PCA model for the first four PCs in the typically developing sample was applied to child-level data from all three groups and then converted to standard scores based on the mean and SD for the typically developing group. Analysis of the mean standard scores (SI Appendix, Fig. S13) showed that, indeed, the PC associated with RhSy accounted for more than 3.5 times as much variance in group differentiation (adjusted R2 = 0.21) as any of the other PCs, suggesting that children’s control of the infrastructure for syllabification may have played the predominant role in group differentiation.
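The PCA-then-regression analysis described above can be sketched as follows. This is an assumed workflow with synthetic placeholder data, not the published code: child-level parameter profiles are projected onto orthogonal PCs, and age is regressed on the retained PC scores to see which component carries the age signal.

```python
# Sketch (assumed workflow, synthetic data): project 12 acoustic-parameter
# ratios onto four principal components, then regress age on the PC scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_children, n_params = 106, 12
age = rng.uniform(10, 48, n_children)          # age in months (synthetic)
X = rng.normal(size=(n_children, n_params))
X[:, :4] += 0.05 * age[:, None]                # a few parameters track age

pca = PCA(n_components=4)
scores = pca.fit_transform(X)                  # orthogonal PC scores per child
reg = LinearRegression().fit(scores, age)
r2 = reg.score(scores, age)                    # variance in age explained

# Standardized coefficients indicate which PC dominates the age prediction.
betas = reg.coef_ * scores.std(axis=0) / age.std()
print(pca.explained_variance_ratio_, round(r2, 2), betas.round(2))
```

Because the PC scores are orthogonal, the standardized coefficients can be compared directly, which is the logic behind contrasting the dominant PC's β with those of the other components.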

Discussion

The purpose of our work is fundamentally scientific: to develop tools for large-scale evaluation of vocal development and for research on foundations for language. The outcomes indicate that the time for basic research through automated vocal analysis of massive recording samples is already upon us. A typically developing child’s age can be predicted at high levels (accounting for more than two thirds of variance) based on this first attempt at totally automated acoustic modeling with all-day recordings. The automated procedure has proven sufficiently transparent to offer suggestions on how the developmental age prediction worked—the primary factor appears to have been the child’s command of the infrastructure for syllabification, a finding that should help to guide subsequent inquiries. From a practical standpoint, the present work illustrates the possibility of adding a convenient and fully objective evaluation of acoustic parameters to the battery of tests that is used with increasing success to identify children with language delay and autism. It appears that childhood control of the infrastructural features of syllabification played the central role in group differentiation, just as it played the central role in tracking development. The automated method showed higher accuracy in differentiating children with and without a language disorder than in differentiating the two language disorder groups (autism and language delay) from each other. Future work will be directed to evaluation of additional vocal factors that may differentiate subgroups of language disorder more effectively.

Based on the results reported here, there appears to be little reason for doubt that totally automated analysis of well selected acoustic features from naturalistic recordings can provide a mon-

Fig. 4A table:

                    Entire sample,    Matched sample,   Train on Phase I     Train on all recordings,
                    N=232, LOOCV      N=113, LOOCV      data (N=138), test   test on first recording,
                                                        on Phase II (N=94)   N=232, LOOCV
Configuration       Sens/Spec χ², p<  Sens/Spec χ², p<  Sens/Spec χ², p<     Sens/Spec χ², p<
Aut vs Typ          0.86   5.5E−22    0.82   2.4E−08    0.87   5.0E−10       0.83   6.7E−19
Aut vs Typ + Del    0.79   8.7E−18    0.74   1.1E−06    0.73   1.3E−05       0.78   8.0E−16
Typ vs Aut + Del    0.79   5.4E−19    0.76   2.4E−07    0.80   2.8E−08       0.77   9.0E−17
Typ vs Del          0.73   1.1E−07    0.69   6.8E−04    0.73   0.001         0.68   3.4E−05
Aut vs Del          0.70   9.7E−06    0.66   0.007      0.67   0.0117        0.66   5.0E−04
Del vs Typ + Aut    0.60   0.017      0.54   0.424      0.57   0.247         0.50   0.936
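The p< values in the Fig. 4A table reflect tests of association between true group and classified group. As an illustration of how such a p-value arises (an assumed test on hypothetical counts, not necessarily the authors' exact statistic), a chi-square test on a 2×2 classification table at roughly the reported 0.86 sensitivity/specificity:

```python
# Sketch (hypothetical counts): chi-square test of the association between
# true group (rows) and classified group (columns).
import numpy as np
from scipy.stats import chi2_contingency

# rows: true group (typical, autism); cols: classified as (typical, autism)
table = np.array([[91, 15],   # 106 typical children, ~0.86 correctly passed
                  [11, 66]])  # 77 autism children, ~0.86 correctly flagged
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")
```

Even at moderate sample sizes, classification this far above chance produces the very small p-values listed in the table.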


Fig. 4. LDA showing group discrimination for various configurations and subsamples. (A) LDA data on the six binary configurations of the three groups for the entire sample (N = 232; LOOCV); subsamples matched on gender, mother’s education level, and developmental level (n = 113; LOOCV); a different holdout method in which, instead of LOOCV, training was conducted on phase I (2006–2008) data (n = 138) and testing on phase II (2009) data (n = 94); and a testing sample based for each child on the first recording only (N = 232; LOOCV). (B) Bar graph comparisons for subsamples illustrating robustness of group differentiation in the autism versus typical development configuration only (based on LOOCV modeling). Means were calculated over logit-transformed posterior probabilities (PPs) of autism classification, then converted back to PPs. All comparisons showed robust group differentiation, including (from left to right) the entire sample (N = 232), boys (typical, n = 48; autism, n = 64), girls (typical, n = 58; autism, n = 13), children of higher socioeconomic status (SES) as indicated by mother’s education level ≥6 on an eight-point scale (typical, n = 42; autism, n = 49) and lower SES (typical, n = 64; autism, n = 28). To assess the possibility that “language level” may have played a critical role in automated group differentiation, we compared 35 child pairs matched for developmental age (typical development group, mean age of 22.6 mo; autism group, mean age of 22.7 mo) on the Snapshot, a language/communication measure (SI Appendix, “Participant Groups and Recording Procedures”), and 46 child pairs matched on the raw score from a single subscale of the CDI (40), namely, the expressive language subscale (typical development group, mean score of 21.6; autism group, mean score of 21.5), and we found robust group differentiation on PPs. Similar results were obtained for 48 children in the autism sample whose parents reported they were using spoken words meaningfully and for typical and autism samples split at medians into subgroups of high or low language (High Lang, Low Lang) level on the Snapshot developmental quotient. A subsample of 29 children with autism for whom diagnosis had been based on the Autism Diagnostic Observation Schedule (ADOS) (41) and another of 24 children with autism diagnosed with the Childhood Autism Rating Scale (CARS) (42) also showed robust group discrimination for PPs. Finally, in phase II (typical, n = 30; autism, n = 77) administration of both the Child Behavior CheckList (CBCL) (43) and the Modified Checklist for Autism in Toddlers (MCHAT) (44) had supported group assignment based on diagnoses, and group differentiation of PPs for these children using the automated system was unambiguous.
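The caption notes that subgroup means in Fig. 4B were computed over logit-transformed posterior probabilities and then converted back to PPs. A minimal sketch of that back-transformed averaging, with made-up probabilities:

```python
# Sketch: "logit mean" of posterior probabilities, as described in the
# Fig. 4 caption. The input PPs here are hypothetical.
import numpy as np

def logit_mean(pp, eps=1e-6):
    pp = np.clip(np.asarray(pp, dtype=float), eps, 1 - eps)  # keep logits finite
    logits = np.log(pp / (1 - pp))           # logit transform
    return 1 / (1 + np.exp(-logits.mean()))  # inverse-logit of the mean

pp_autism = [0.92, 0.85, 0.97, 0.78]  # hypothetical PPs of autism classification
print(round(logit_mean(pp_autism), 3))
```

Averaging on the logit scale keeps probabilities near 0 or 1 from being distorted by the bounded [0, 1] scale; the result always lies between the smallest and largest input PP.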

13358 | www.pnas.org/cgi/doi/10.1073/pnas.1003882107 Oller et al.


itoring system for developmental patterns in vocalization as well as significant differentiation of children with and without disorders such as autism or language delay. We are optimistic that the procedure can be improved, as this is our first attempt at automated infrastructural modeling, with all parameters designed and implemented in advance of any analysis of recordings. To date there have still been no adjustments made in the theoretically proposed parameters. Additional modeling (e.g., hierarchical, age-specific, nonlinear) can be invoked, modifications can be explored in acoustic features, and larger samples, especially from the entire spectrum of ASDs and other language-related disorders, can help tune the procedures. The future of research on vocal development will profit from the combination of traditional approaches using laboratory-based exhaustive analysis of small samples of vocalization with the enormous power that is now clearly possible to command through automated analysis of naturalistic recordings.

Materials and Methods

The following details of methods are provided in SI Appendix, SI Materials and Methods: (i) statistical analysis summary, (ii) recording device, (iii) participant groups and recording and assessment procedures, (iv) automated analysis algorithms, (v) the 12 acoustic parameters, and (vi) reliability of the automated analysis. The statistical analysis summary provides details regarding methods of prediction of development using MLR, as well as differentiation of child groups using LDA and logistic regression. An empirical illustration of reasons for using LOOCV in LDA and logistic regression analyses is also given. The recording procedures used in the present work as well as recruitment procedures for children in the three groups are explained along with selection criteria. Additional data are supplied on language and social characteristics of the participating groups. Automated analysis algorithms are described along with reliability data. All procedures were approved by the Essex Institutional Review Board.

ACKNOWLEDGMENTS. Research by D.K.O. for this paper was funded by an endowment from the Plough Foundation, which supports his Chair of Excellence at The University of Memphis.

1. de Villiers JG, de Villiers PA (1974) Competence and performance in child language: Are children really competent to judge? J Child Lang 1:11–22.

2. Slobin D (1973) Cognitive prerequisites for the development of grammar. Studies in Child Language Development, eds Ferguson CA, Slobin DI (Holt, Rinehart & Winston, New York), pp 175–208.

3. Bloom L (1970) Language Development (MIT Press, Cambridge, MA).

4. Brown R (1973) A First Language (Academic Press, London).

5. Pinker S (1994) The Language Instinct (Harper Perennial, New York).

6. Cutler A, Klein W, Levinson SC (2005) The cornerstones of twenty-first century psycholinguistics. Twenty-First Century Psycholinguistics: Four Cornerstones, ed Cutler A (Erlbaum, Mahwah, NJ), pp 1–20.

7. Locke JL (1983) Phonological Acquisition and Change (Academic Press, New York).

8. Oller DK (1980) The emergence of the sounds of speech in infancy. Child Phonology, Vol 1: Production, eds Yeni-Komshian G, Kavanagh J, Ferguson C (Academic Press, New York), pp 93–112.

9. Stark RE, Rose SN, McLagen M (1975) Features of infant sounds: The first eight weeks of life. J Child Lang 2:205–221.

10. Sheinkopf SJ, Mundy P, Oller DK, Steffens M (2000) Vocal atypicalities of preverbal autistic children. J Autism Dev Disord 30:345–354.

11. Wetherby AM, et al. (2004) Early indicators of autism spectrum disorders in the second year of life. J Autism Dev Disord 34:473–493.

12. Oller DK, Eilers RE (1988) The role of audition in infant babbling. Child Dev 59:441–449.

13. Eilers RE, Oller DK (1994) Infant vocalizations and the early diagnosis of severe hearing impairment. J Pediatr 124:199–203.

14. Masataka N (2001) Why early linguistic milestones are delayed in children with Williams syndrome: Late onset of hand banging as a possible rate-limiting constraint on the emergence of canonical babbling. Dev Sci 4:158–164.

15. Oller DK, Griebel U (2008) The origins of syllabification in human infancy and in human evolution. Syllable Development: The Frame/Content Theory and Beyond, eds Davis B, Zajdo K (Lawrence Erlbaum and Associates, Mahwah, NJ), pp 368–386.

16. Vihman MM (1996) Phonological Development: The Origins of Language in the Child (Blackwell Publishers, Cambridge, MA).

17. Stark RE, Ansel BM, Bond J (1988) Are prelinguistic abilities predictive of learning disability? A follow-up study. Preschool Prevention of Reading Failure, eds Masland RL, Masland M (York Press, Parkton, MD).

18. Stoel-Gammon C (1989) Prespeech and early speech development of two late talkers. First Lang 9:207–223.

19. Oller DK (2000) The Emergence of the Speech Capacity (Lawrence Erlbaum Associates, Mahwah, NJ).

20. Kanner L (1943) Autistic disturbances of affective contact. Nerv Child 2:217–250.

21. Asperger H (1991) Autistic “psychopathy” in childhood. Autism and Asperger Syndrome, ed Frith U (Cambridge University Press, Cambridge, UK), pp 37–90.

22. Paul R, Augustyn A, Klin A, Volkmar FR (2005) Perception and production of prosody by speakers with autism spectrum disorders. J Autism Dev Disord 35:205–220.

23. Pronovost W, Wakstein MP, Wakstein DJ (1966) A longitudinal study of the speech behavior and language comprehension of fourteen children diagnosed atypical or autistic. Except Child 33:19–26.

24. McCann J, Peppé S (2003) Prosody in autism spectrum disorders: A critical review. Int J Lang Commun Disord 38:325–350.

25. Peppé S, McCann J, Gibbon F, O’Hare A, Rutherford M (2007) Receptive and expressive prosodic ability in children with high-functioning autism. J Speech Lang Hear Res 50:1015–1028.

26. Shriberg LD, et al. (2001) Speech and prosody characteristics of adolescents and adults with high-functioning autism and Asperger syndrome. J Speech Lang Hear Res 44:1097–1115.

27. American Psychiatric Association (2000) Diagnostic and Statistical Manual of Mental Disorders, DSM-IV-TR (American Psychiatric Association, Arlington, VA).

28. Baron-Cohen S, Allen J, Gillberg C (1992) Can autism be detected at 18 months? The needle, the haystack, and the CHAT. Br J Psychiatry 161:839–843.

29. Baron-Cohen S (2000) Theory of mind and autism: A fifteen year review. Understanding Other Minds: Perspectives from Developmental Cognitive Neuroscience, eds Baron-Cohen S, Tager-Flusberg H, Cohen DJ (Oxford University Press, Oxford), pp 3–20.

30. Loveland KA, Landry SH (1986) Joint attention and language in autism and developmental language delay. J Autism Dev Disord 16:335–349.

31. Mundy P, Kasari C, Sigman M (1992) Nonverbal communication, affective sharing, and intersubjectivity. Infant Behav Dev 15:377–381.

32. Mundy P, Sigman M, Kasari C (1990) A longitudinal study of joint attention and language development in autistic children. J Autism Dev Disord 20:115–128.

33. Rutter M (1978) Diagnosis and definition of childhood autism. J Autism Child Schizophr 8:139–161.

34. Tager-Flusberg H (1989) A psycholinguistic perspective on language development in the autistic child. Autism: Nature, Diagnosis, and Treatment, ed Dawson G (Guilford, New York), pp 92–115.

35. Tager-Flusberg H (2000) Language and understanding minds: Connections in autism. Understanding Other Minds: Perspectives from Developmental Cognitive Neuroscience, eds Baron-Cohen S, Tager-Flusberg H, Cohen DJ (Oxford University Press, Oxford), pp 124–149.

36. Tyler AA, Sandoval KT (1994) Preschoolers with phonological and language disorders: Treating different linguistic domains. Lang Speech Hear Serv Schools 25:215–234.

37. Conti-Ramsden G, Botting N (1999) Classification of children with specific language impairment: Longitudinal considerations. J Speech Lang Hear Res 42:1195–1204.

38. Shriberg LD, Aram DM, Kwiatkowski J (1997) Developmental apraxia of speech: I. Descriptive and theoretical perspectives. J Speech Lang Hear Res 40:273–285.

39. Marshall CR, Harcourt-Brown S, Ramus F, van der Lely HK (2009) The link between prosody and language skills in children with specific language impairment (SLI) and/or dyslexia. Int J Lang Commun Disord 44:466–488.

40. Ireton H, Glascoe FP (1995) Assessing children’s development using parents’ reports. The Child Development Inventory. Clin Pediatr (Phila) 34:248–255.

41. Lord C, Rutter M, DiLavore PC, Risi S (2002) Autism Diagnostic Observation Schedule (Western Psychological Services, Los Angeles).

42. Schopler E, Reichler RJ, DeVellis RF, Daly K (1980) Toward objective classification of childhood autism: Childhood Autism Rating Scale (CARS). J Autism Dev Disord 10:91–103.

43. Achenbach TM (1991) Integrative Guide to the 1991 CBCL/4-18, YSR, and TRF Profiles (University of Vermont, Burlington, VT).

44. Robins DL, Fein D, Barton ML, Green JA (2001) The Modified Checklist for Autism in Toddlers: An initial study investigating the early detection of autism and pervasive developmental disorders. J Autism Dev Disord 31:131–144.

