Wordbank: an open repository for developmental vocabulary data*

MICHAEL C. FRANK, MIKA BRAGINSKY, DANIEL YUROVSKY AND VIRGINIA A. MARCHMAN

Stanford University, USA

(Received July – Revised December – Accepted March – First published online May)

J. Child Lang. © Cambridge University Press. doi:10.1017/S0305000916000209

Downloaded from https://www.cambridge.org/core. University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0305000916000209

[*] This work was supported by a John Merck Scholars award and NSF BCS-. Thanks to Ranjay Krishna for contributions to the initial development of the site, to Rune Nørgaard Jørgensen for helping port data from CLEX, to all of the contributors listed at <http://wordbank.stanford.edu/contributors> for generously sharing their data, and to the Advisory Board of the MacArthur-Bates Communicative Development Inventories, especially Philip Dale and Larry Fenson, for their support. Address for correspondence: Michael C. Frank, Department of Psychology, Jordan Hall (Bldg. ), Serra Mall, Stanford, CA; tel: () -; e-mail: mcfrank@stanford.edu

ABSTRACT

The MacArthur-Bates Communicative Development Inventories (CDIs) are a widely used family of parent-report instruments for easy and inexpensive data-gathering about early language acquisition. CDI data have been used to explore a variety of theoretically important topics, but, with few exceptions, researchers have had to rely on data collected in their own lab. In this paper, we remedy this issue by presenting Wordbank, a structured database of CDI data combined with a browsable web interface. Wordbank archives CDI data across languages and labs, providing a resource for researchers interested in early language, as well as a platform for novel analyses. The site allows interactive exploration of patterns of vocabulary growth at the level of both individual children and particular words. We also introduce wordbankr, a software package for connecting to the database directly. Together, these tools extend the abilities of students and researchers to explore quantitative trends in vocabulary development.

INTRODUCTION

Learning language is one of the most impressive and intriguing human accomplishments, and understanding the processes by which vocabulary grows can provide a window into mechanisms of linguistic and cognitive development more generally (e.g. Bloom). The MacArthur-Bates Communicative Development Inventories (Fenson et al.) are a widely used family of parent-report instruments for easy and inexpensive data-gathering about early language acquisition. CDI data have been used to explore many theoretically rich topics, including variation in early word production (Fenson et al.), vocabulary composition (Bates et al.), the relationship between lexical and grammatical development (Bates & Goodman), and the growth of lexical networks (Hills, Maouene, Maouene, Sheya & Smith). With few exceptions, however, researchers have had to rely on data collected in their own lab. While CDI norms are available (Fenson et al.; Jørgensen, Dale, Bleses & Fenson), no public resource offers researchers the opportunity to share and access raw cross-linguistic data at the scale necessary to address questions about demographic variation, vocabulary composition, relations with grammatical development, and other important issues.

To remedy this issue, we introduce Wordbank (<http://wordbank.stanford.edu>), a structured database of developmental vocabulary data. Building on previous tools like Cross Linguistic Lexical Norms (CLEX; Jørgensen et al.), Wordbank archives raw CDI data across languages and labs, providing a large-scale database of information about children’s vocabulary knowledge. The site hosts an interactive and expandable set of in-depth analyses that can be explored by interested researchers, students, and members of the public. Wordbank lowers the cost of new exploratory analyses by facilitating the productive reuse of data.

The current paper presents the Wordbank site in detail. We begin by discussing the motivations for constructing such a site. The bulk of the paper then describes the Wordbank site, including its database architecture, its web-based front-end, and its extensibility. In particular, we highlight two analysis functions that are provided by the online interface: vocabulary growth norms across individuals and trajectories of acquisition for individual words. These broad analyses allow a very wide range of targeted investigations. Throughout the paper, we use an exploration of gender differences in production vocabulary as a worked case study that illustrates the various features of the site. We end by presenting wordbankr, a package for the R statistical programming language that allows research users to access the database directly.

MOTIVATION AND BACKGROUND

The nature and course of early word learning is an important window into children’s growing understanding of the world. Early words cross-cut a variety of linguistic categories, but generally consist of names for caregivers (e.g. mama), common objects (e.g. bottle, shoe), social expressions (e.g. bye-bye), and actions or routines (e.g. peekaboo, throw) (Nelson; Tardif et al.). New words enter children’s expressive vocabularies slowly at first, but this process accelerates over the second year, such that children reach an average of words by months and more than by the time they graduate from high school (Fenson et al.). At the same time, there are significant individual differences in language acquisition. For example, according to detailed observational studies, although some -month-olds already produce – words, others produce no words at all and will not do so until they are months or older (e.g. Brown; Bloom; Clark). How can such differences be measured accurately and efficiently? And can we promote early detection of differences in vocabulary growth that will be clinically significant later in development?

[Footnote] We use the umbrella abbreviation ‘CDI’ to refer to the broader class of parent-report instruments adapted from the original English version.

Measuring early vocabulary

Traditional studies of language development typically apply a combination of observational assessment and structured tests, frequently relying on short samples of interactions and small samples of children. Discerning both the universal features and natural variation of early lexical development has been greatly facilitated by the development of parent-report instruments like the MacArthur-Bates CDI (Fenson et al.) and the Language Development Survey (LDS; Rescorla). The CDIs in particular were developed across a period of more than forty years. Originally designed for use in a research study (Bates), the instruments have evolved from a structured interview to the current paper-and-pencil format, and are now increasingly administered online (e.g. Kristoffersen et al. for Norwegian, or <http://laboratorium.detskarec.sk> for Slovak). While other assessment tools exist for slightly older children, to our knowledge no other measure allows cost-effective global language assessment for children in the critical age ranges between the emergence of language and the period when children become more able to engage in structured face-to-face activities (around months).

Naturalistic observations are the other leading candidate for measurement of early language, but such observations are extremely costly and time-consuming to transcribe and annotate. These difficulties lead to a trade-off, where most studies either include dense data about a small number of children or smaller amounts of data with a larger sample size. Dense datasets currently provide the best method for in-depth study of the interaction between learning mechanisms and language input in individuals (e.g. Lieven, Salomo & Tomasello; Roy, Frank, DeCamp, Miller & Roy), although the generality of these studies is necessarily limited by their small sample sizes. At the other end of the spectrum, assessment of many individual language samples can yield information about individual variability (e.g. Dickinson & Tabors; Cartmill et al.; Weisleder & Fernald), but at some cost in terms of depth.

In addition, naturalistic observations do not measure children’s language comprehension, a variable of interest for many early language researchers. Estimates of production vocabulary from naturalistic observation are highly correlated with the CDI within studies (e.g. Bornstein & Haynes), but affected substantially by length of the session, context, and interlocutor when comparing across studies. And although there exist methods to extract insights about global vocabulary from naturalistic observation, these statistical extrapolations are relatively new and have not been validated extensively (Hidaka). Other comprehension vocabulary measures are also available across some range of languages (e.g. the Peabody Picture Vocabulary Test; Dunn & Dunn), but these assessments are tailored for substantially older children.

Parent-report measures like the CDI and LDS take advantage of the fact that parents are expert observers of their child. CDI instruments ask about use of communicative gestures, grammar, and symbolic play, as well as vocabulary, which is measured using checklists consisting of representative samples of words. Parents choose the words their child currently ‘understands’ (comprehension, measured for younger children) or ‘says’ (production, measured for both younger and older children). The checklists contain words from many different semantic (e.g. animal names, household items) and syntactic (e.g. action words, connectives) categories, resulting in broader samples of lexical knowledge than are available from other methods. In their English and Spanish instantiations, the instruments come in two versions: Words & Gestures (– months) and Words & Sentences (– months). Originally designed for English, parallel instruments have now been adapted for more than sixty languages (Dale & Penfold, n.d.).

Limitations of parent report

Although the standardization of parent reports using the CDI contributes to the availability of large amounts of data in a comparable format, there are significant limitations to the parent-report methodology as well (Tomasello & Mervis; Feldman et al.). First, parents may be biased observers: some may overestimate, while others likely underestimate, their children’s abilities. There is also some evidence that some variability may be due to reporting biases linked to factors such as socioeconomic status (Feldman et al.; Fenson et al.). Second, parent reports of comprehension for younger children likely suffer from a number of biases, and are probably substantially more accurate for content words than function words. Third, the items on the original CDI instruments were chosen to be a representative sample of vocabulary items for the appropriate age and language (Fenson et al.), not with the intention that they would be a complete set of words that could be compared across instruments, or that they would be individually reliable and license the conclusion that a particular child knows a particular word. Fourth, although the length of the CDI may give the impression that it yields an estimate of the child’s full vocabulary, in fact it likely understates the size of a child’s vocabulary substantially, especially for older children (Mayor & Plunkett).

Despite these limitations, when used appropriately the CDI instruments are an important tool. The instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child’s abilities. They yield reliable and valid estimates of total vocabulary size, with dozens of studies demonstrating concurrent and predictive relations with naturalistic and observational measures in both typically developing and at-risk populations (e.g. Dale & Fenson; Thal, Jackson-Maldonado & Acosta; Marchman & Martínez-Sussmann). In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks (Hills et al.; Hills, Maouene, Riordan & Smith). Such analyses have the potential to be even more powerful when applied to larger samples and across languages.

WORDBANK

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community, we have constructed Wordbank, an open repository for CDI data that allows for interactive analysis and visualization. The main page of the site at time of writing is shown in Figure 1. In this section, we begin by describing technical details of the site’s database architecture. We then describe the two primary analysis tools that form the heart of the site’s interactive functionality. We give a worked example of how to use these, and then end by discussing the extensibility of the Wordbank framework, highlighting opportunities for contributing data and for building new analyses.

Our inspiration for Wordbank comes from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX, <www.cdi-clex.org>; Jørgensen et al.), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. 1. Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure 2 shows a small slice of the original CDI norming data (Fenson et al.). Each row is a child; each column gives a variable, either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
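To make the contrast concrete, the following minimal Python sketch reshapes a small wide-format table (one row per child, one column per variable) into long form, with one record per child-item response. The column names (`child_id`, `age`, `sex`) and the toy rows are hypothetical and invented for illustration; they are not the actual layout of the norming data or of any Wordbank export.

```python
import csv
import io

# Toy wide-format data: one row per child, one column per demographic
# variable or checklist word. All names and values are hypothetical.
WIDE_CSV = """child_id,age,sex,dog,ball,shoe
c1,18,F,1,1,0
c2,24,M,1,0,1
"""

DEMOGRAPHIC_COLUMNS = {"child_id", "age", "sex"}


def wide_to_long(wide_text):
    """Return one record per (child, item) response."""
    records = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        for column, value in row.items():
            if column in DEMOGRAPHIC_COLUMNS:
                continue  # demographics stay attached to the child, not the item
            records.append({
                "child_id": row["child_id"],
                "age": int(row["age"]),
                "item": column,
                "produces": value == "1",
            })
    return records


long_rows = wide_to_long(WIDE_CSV)
for record in long_rows:
    print(record)
```

In long form, adding a second instrument or a repeat administration only adds rows, whereas the wide layout would need a new set of columns for each form.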

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
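The split between common and instrument tables can be sketched with an in-memory SQLite database from the Python standard library. This is an illustration under stated assumptions, not the Wordbank schema: every table and column name below is hypothetical, and the response table is stored in long form for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "Common" tables: shared across instruments (hypothetical names).
CREATE TABLE child (id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE administration (
    id INTEGER PRIMARY KEY,
    child_id INTEGER REFERENCES child(id),
    instrument TEXT,
    age_months INTEGER
);
CREATE TABLE item (id INTEGER PRIMARY KEY, instrument TEXT, definition TEXT);
-- "Instrument" table: one row per response on one form.
CREATE TABLE english_ws_response (
    administration_id INTEGER REFERENCES administration(id),
    item_id INTEGER REFERENCES item(id),
    produces INTEGER
);
""")

conn.execute("INSERT INTO child VALUES (1, 'F')")
# Two longitudinal administrations of the same form for one child.
conn.executemany(
    "INSERT INTO administration VALUES (?, ?, ?, ?)",
    [(1, 1, "English WS", 18), (2, 1, "English WS", 24)],
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [(1, "English WS", "dog"), (2, "English WS", "ball")],
)
conn.executemany(
    "INSERT INTO english_ws_response VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 0), (2, 1, 1), (2, 2, 1)],
)

# Production vocabulary per administration, joined through identifiers.
vocab_by_admin = conn.execute("""
    SELECT a.child_id, a.age_months, SUM(r.produces)
    FROM administration AS a
    JOIN english_ws_response AS r ON r.administration_id = a.id
    GROUP BY a.id
    ORDER BY a.age_months
""").fetchall()
print(vocab_by_admin)
```

Because each administration has its own identifier, repeated administrations for the same child become ordinary rows and need no special handling.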

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages: words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI, across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. 2. Example data from the CDI norming sample (Fenson et al.). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure 3. The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
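The idea of percentile curves split by a demographic field can be illustrated with a much simpler stand-in for the spline-based quantile regression described above: empirical quartiles computed separately within each age bin and group. The Python sketch below uses synthetic data and is not the method used by the norms app; it yields unsmoothed per-age estimates with no monotonicity constraint.

```python
from collections import defaultdict
from statistics import quantiles

# Synthetic (age in months, sex, production vocabulary) records.
ADMINISTRATIONS = [
    (18, "F", 40), (18, "F", 90), (18, "F", 150), (18, "F", 60),
    (18, "M", 20), (18, "M", 70), (18, "M", 110), (18, "M", 50),
    (24, "F", 250), (24, "F", 330), (24, "F", 410), (24, "F", 300),
    (24, "M", 180), (24, "M", 290), (24, "M", 360), (24, "M", 240),
]


def percentile_table(records, split=True):
    """Map (age, group) to the 25th, 50th, and 75th percentile scores."""
    scores = defaultdict(list)
    for age, sex, vocab in records:
        group = sex if split else "all"
        scores[(age, group)].append(vocab)
    # quantiles(n=4) returns the three quartile cut points.
    return {key: quantiles(values, n=4, method="inclusive")
            for key, values in sorted(scores.items())}


norms = percentile_table(ADMINISTRATIONS)
for (age, group), (q25, q50, q75) in norms.items():
    print(age, group, q25, q50, q75)
```

Calling `percentile_table(ADMINISTRATIONS, split=False)` pools the groups, which corresponds to viewing the curves without a split variable.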

The original and updated norming studies (Fenson et al.) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

[Footnote] The only exception to this policy (standardized analyses within each instrument and language) is currently that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson; Jørgensen et al.). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. 3. A screenshot of the Vocabulary Norms analysis tool, showing five percentile curves (the default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

[Footnote] Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.

This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure 4). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
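The quantity plotted for each word is the proportion of children at each age who are reported to produce it. The Python sketch below computes those raw proportions from synthetic item-level responses; it does not apply the loess smoothing mentioned above. The helper `age_of_acquisition` and its 0.5 threshold are assumptions introduced here for illustration, as one simple way an experimenter might match candidate stimuli.

```python
from collections import defaultdict

# Synthetic item-level responses: (age in months, word, produces?).
RESPONSES = [
    (16, "dog", True), (16, "dog", False),
    (16, "table", False), (16, "table", False),
    (20, "dog", True), (20, "dog", True),
    (20, "table", True), (20, "table", False),
    (24, "dog", True), (24, "dog", True),
    (24, "table", True), (24, "table", True),
]


def item_trajectories(responses):
    """Return {word: [(age, proportion producing), ...]} sorted by age."""
    counts = defaultdict(lambda: [0, 0])  # (word, age) -> [producing, total]
    for age, word, produces in responses:
        counts[(word, age)][0] += int(produces)
        counts[(word, age)][1] += 1
    trajectories = defaultdict(list)
    for (word, age), (producing, total) in sorted(counts.items()):
        trajectories[word].append((age, producing / total))
    return dict(trajectories)


def age_of_acquisition(points, threshold=0.5):
    """First age at which the proportion reaches the threshold, else None."""
    for age, proportion in points:
        if proportion >= threshold:
            return age
    return None


curves = item_trajectories(RESPONSES)
for word, points in curves.items():
    print(word, points, age_of_acquisition(points))
```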

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al.; Huttenlocher, Haight, Bryk, Seltzer & Lyons; see Wallentin for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al.).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

[Footnote] The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue: since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in the accompanying figure. Or she could navigate to the ‘Table’ tab of the display window to see tabular-form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund; Song, Shattuck-Hufnagel & Demuth). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the comprehensive R archive network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example part : gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. To perform the analysis, we first use wordbankr to load the data from Wordbank and connect to the tables we need:

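    library(wordbankr)  # data loading functions
    library(dplyr)      # data manipulation verbs and the %>% pipe
    admins <- get_administration_data()
    items <- get_item_data()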

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())


Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.
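The rescaling step in this pipeline can be tried without a database connection. The following is a minimal sketch in base R, assuming two small invented tables that stand in for num_words and vocab_admins; all values are made up for illustration, and merge and aggregate are used here in place of the dplyr verbs so that no packages are required.

```r
# Toy stand-ins for the tables used in the text; all values are invented.
num_words <- data.frame(
  language = c("English", "Italian"),
  n = c(10, 20)                      # hypothetical instrument lengths
)
vocab_admins <- data.frame(
  data_id = 1:8,
  language = c(rep("English", 4), rep("Italian", 4)),
  sex = rep(c("Female", "Female", "Male", "Male"), 2),
  age = 24,
  production = c(4, 6, 2, 4, 10, 14, 8, 12)
)

# Attach each administration to the length of its instrument,
# then rescale raw production counts to proportions.
joined <- merge(vocab_admins, num_words, by = "language")
joined$prop <- joined$production / joined$n

# Median proportion for each language, sex, and age group.
vocab_data <- aggregate(prop ~ language + sex + age, data = joined, FUN = median)
print(vocab_data)
```

Dividing by the number of words on each instrument before taking the median is what makes the curves comparable across languages whose forms differ in length.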

The results of this analysis are shown in the accompanying figure. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect for all ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.


Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language , –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development , –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences , –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers , –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology , –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development , –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development , –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development , –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development .

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language , –.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language , –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science , –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology , –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language , –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language , –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics , –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research , –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science , –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics , –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development , –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science , –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders , –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences , –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics , –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology , –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research , –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development , –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language , –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science , –.

Wickham, H. (). Ggplot: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.


Communicative Development Inventories (Fenson et al.) are a widely used family of parent-report instruments for easy and inexpensive data-gathering about early language acquisition. CDI data have been used to explore many theoretically rich topics, including variation in early word production (Fenson et al.), vocabulary composition (Bates et al.), the relationship between lexical and grammatical development (Bates & Goodman), and the growth of lexical networks (Hills, Maouene, Maouene, Sheya & Smith). With few exceptions, however, researchers have had to rely on data collected in their own lab. While CDI norms are available (Fenson et al.; Jørgensen, Dale, Bleses & Fenson), no public resource offers researchers the opportunity to share and access raw cross-linguistic data at the scale necessary to address questions about demographic variation, vocabulary composition, relations with grammatical development, and other important issues.

To remedy this issue, we introduce Wordbank (<http://wordbank.stanford.edu>), a structured database of developmental vocabulary data. Building on previous tools like Cross Linguistic Lexical Norms (CLEX; Jørgensen et al.), Wordbank archives raw CDI data across languages and labs, providing a large-scale database of information about children’s vocabulary knowledge. The site hosts an interactive and expandable set of in-depth analyses that can be explored by interested researchers, students, and members of the public. Wordbank lowers the cost of new exploratory analyses by facilitating the productive reuse of data.

The current paper presents the Wordbank site in detail. We begin by discussing the motivations for constructing such a site. The bulk of the paper then describes the Wordbank site, including its database architecture, its web-based front-end, and its extensibility. In particular, we highlight two analysis functions that are provided by the online interface: vocabulary growth norms across individuals, and trajectories of acquisition for individual words. These broad analyses allow a very wide range of targeted investigations. Throughout the paper, we use an exploration of gender differences in production vocabulary as a worked case study that illustrates the various features of the site. We end by presenting wordbankr, a package for the R statistical programming language that allows research users to access the database directly.

MOTIVATION AND BACKGROUND

The nature and course of early word learning is an important window into children’s growing understanding of the world. Early words cross-cut a variety of linguistic categories, but generally consist of names for caregivers (e.g. mama), common objects (e.g. bottle, shoe), social expressions (e.g. bye-bye), and actions or routines (e.g. peekaboo, throw) (Nelson; Tardif et al.). New words enter children’s expressive vocabularies slowly at first, but this process accelerates over the second year, such that children reach an average of words by months and more than by the time they graduate from high school (Fenson et al.). At the same time, there are significant individual differences in language acquisition. For example, according to detailed observational studies, although some -month-olds already produce – words, others produce no words at all and will not do so until they are months or older (e.g. Brown; Bloom; Clark). How can such differences be measured accurately and efficiently? And can we promote early detection of differences in vocabulary growth that will be clinically significant later in development?

We use the umbrella abbreviation ‘CDI’ to refer to the broader class of parent-report instruments adapted from the original English version.

Measuring early vocabulary

Traditional studies of language development typically apply a combination of observational assessment and structured tests, frequently relying on short samples of interactions and small samples of children. Discerning both the universal features and natural variation of early lexical development has been greatly facilitated by the development of parent-report instruments like the MacArthur-Bates CDI (Fenson et al.) and the Language Development Survey (LDS; Rescorla). The CDIs in particular were developed across a period of more than forty years. Originally designed for use in a research study (Bates), the instruments have evolved from a structured interview to the current paper-and-pencil format, and are now increasingly administered online (e.g. Kristoffersen et al. for Norwegian, or <http://laboratorium.detskarec.sk> for Slovak). While other assessment tools exist for slightly older children, to our knowledge no other measure allows cost-effective global language assessment for children in the critical age ranges between the emergence of language and the period when children become more able to engage in structured face-to-face activities (around months).

Naturalistic observations are the other leading candidate for measurement of early language, but such observations are extremely costly and time-consuming to transcribe and annotate. These difficulties lead to a trade-off, where most studies either include dense data about a small number of children or smaller amounts of data with a larger sample size. Dense datasets currently provide the best method for in-depth study of the interaction between learning mechanisms and language input in individuals (e.g. Lieven, Salomo & Tomasello; Roy, Frank, DeCamp, Miller & Roy), although the generality of these studies is necessarily limited by their small sample sizes. At the other end of the spectrum, assessment of many individual language samples can yield information about individual variability (e.g. Dickinson & Tabors; Cartmill et al.; Weisleder & Fernald), but at some cost in terms of depth.

In addition, naturalistic observations do not measure children’s language comprehension, a variable of interest for many early language researchers. Estimates of production vocabulary from naturalistic observation are highly correlated with the CDI within studies (e.g. Bornstein & Haynes), but affected substantially by length of the session, context, and interlocutor when comparing across studies. And although there exist methods to extract insights about global vocabulary from naturalistic observation, these statistical extrapolations are relatively new and have not been validated extensively (Hidaka). Other comprehension vocabulary measures are also available across some range of languages (e.g. the Peabody Picture Vocabulary Test; Dunn & Dunn), but these assessments are tailored for substantially older children.

Parent-report measures like the CDI and LDS take advantage of the fact that parents are expert observers of their child. CDI instruments ask about use of communicative gestures, grammar, and symbolic play, as well as vocabulary, which is measured using checklists consisting of representative samples of words. Parents choose the words their child currently ‘understands’ (comprehension, measured for younger children) or ‘says’ (production, measured for both younger and older children). The checklists contain words from many different semantic (e.g. animal names, household items) and syntactic (e.g. action words, connectives) categories, resulting in broader samples of lexical knowledge than are available from other methods. In their English and Spanish instantiations, the instruments come in two versions: Words & Gestures (– months) and Words & Sentences (– months). Originally designed for English, parallel instruments have now been adapted for more than sixty languages (Dale & Penfold, n.d.).

Limitations of parent report

Although the standardization of parent reports using the CDI contributes to the availability of large amounts of data in a comparable format, there are significant limitations to the parent-report methodology as well (Tomasello & Mervis; Feldman et al.). First, parents may be biased observers: some may overestimate, while others likely underestimate, their children’s abilities. There is also some evidence that some variability may be due to reporting biases linked to factors such as socioeconomic status (Feldman et al.; Fenson et al.). Second, parent reports of comprehension for younger children likely suffer from a number of biases, and are probably substantially more accurate for content words than function words. Third, the items on the original CDI instruments were chosen to be a representative sample of vocabulary items for the appropriate age and language (Fenson et al.), not with the intention that they would be a complete set of words that could be compared across instruments, or that they would be individually reliable and license the conclusion that a particular child knows a particular word. Fourth, although the length of the CDI may give the impression that it yields an estimate of the child’s full vocabulary, in fact it likely understates the size of a child’s vocabulary substantially, especially for older children (Mayor & Plunkett).

Despite these limitations, when used appropriately the CDI instruments are an important tool. The instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child’s abilities. They yield reliable and valid estimates of total vocabulary size, with dozens of studies demonstrating concurrent and predictive relations with naturalistic and observational measures, in both typically developing and at-risk populations (e.g. Dale & Fenson; Thal, Jackson-Maldonado & Acosta; Marchman & Martínez-Sussmann). In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks (Hills et al.; Hills, Maouene, Riordan & Smith). Such analyses have the potential to be even more powerful when applied to larger samples and across languages.

WORDBANK

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community, we have constructed Wordbank, an open repository for CDI data that allows for interactive analysis and visualization. The main page of the site at time of writing is shown in the accompanying figure. In this section, we begin by describing technical details of the site’s database architecture. We then describe the two primary analysis tools that form the heart of the site’s interactive functionality. We give a worked example of how to use these, and then end by discussing the extensibility of the Wordbank framework, highlighting opportunities for contributing data and for building new analyses.

Our inspiration for Wordbank comes from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX; <www.cdi-clex.org>; Jørgensen et al.), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. The accompanying figure shows a small slice of the original CDI norming data (Fenson et al.). Each row is a child; each column gives a variable – either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
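To make the contrast concrete, the following base R sketch converts an invented wide-format slice (one row per child, one column per word) into a long format with one row per child-by-item response. The column names child_id, item, and produces are hypothetical and chosen only for illustration; they are not taken from the Wordbank schema.

```r
# Hypothetical wide-format slice: one row per child, one column per word.
wide <- data.frame(
  child_id = c("c1", "c2"),
  age = c(18, 24),
  dog = c(1, 1),
  table = c(0, 1),
  stringsAsFactors = FALSE
)

word_cols <- c("dog", "table")

# Stack the word columns so that each row is one (child, item) response.
long <- data.frame(
  child_id = rep(wide$child_id, times = length(word_cols)),
  item = rep(word_cols, each = nrow(wide)),
  produces = unlist(wide[word_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

In the long layout, adding a new word or a new instrument only adds rows, whereas the wide layout would need a new column for every item on every form.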

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive these non-vocabulary questions have not been digitized, so data on them are sparse at present).
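The idea of tables linked by unique identifiers can be sketched with small data frames in base R. This is an illustration under assumed names, not the actual Wordbank schema: the tables and the child_id and item_id keys below are hypothetical, and all values are invented.

```r
# Hypothetical 'common' tables: children, administrations, and items.
children <- data.frame(child_id = c(1, 2), sex = c("Female", "Male"))
administrations <- data.frame(
  data_id = c(101, 102, 103),
  child_id = c(1, 1, 2),          # child 1 has two longitudinal administrations
  age = c(18, 24, 20)
)
items <- data.frame(item_id = c(1, 2), definition = c("dog", "table"))

# Hypothetical 'instrument' table: one row per administration-by-item response.
responses <- data.frame(
  data_id = c(101, 101, 102, 102, 103, 103),
  item_id = c(1, 2, 1, 2, 1, 2),
  produces = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Follow the identifiers to rebuild a flat view when one is needed.
full <- merge(responses, administrations, by = "data_id")
full <- merge(full, children, by = "child_id")
full <- merge(full, items, by = "item_id")

# Production vocabulary size for each administration.
production <- aggregate(produces ~ data_id, data = full, FUN = sum)
print(production)
```

Because each child is stored once and referenced by identifier, longitudinal administrations for the same child need no special handling.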

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages – words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al., ). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure . The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo, ). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
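The percentile curves described above are fit with gcrq. As a much simpler stand-in, and not the method the app uses, the sketch below computes raw empirical quantiles within each month of age, split by one demographic variable. The data are simulated, and the quantile levels, column names, and growth function are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated administrations (invented data): age in months, sex, words produced
n <- 600
admins <- data.frame(
  age = sample(16:30, n, replace = TRUE),
  sex = sample(c("Female", "Male"), n, replace = TRUE)
)
# Vocabulary grows with age, with wide individual variability
admins$production <- pmax(0, round(rnorm(n, mean = 30 * (admins$age - 15), sd = 80)))

# Quantile levels chosen for illustration
pct <- c(10, 25, 50, 75, 90)

# Empirical quantiles of production within each month of age, split by sex
norms <- aggregate(
  production ~ age + sex,
  data = admins,
  FUN = function(x) quantile(x, probs = pct / 100, names = FALSE)
)

# aggregate() stores the quantiles as a matrix column; flatten it for display
norms <- data.frame(norms[c("age", "sex")], norms$production)
names(norms)[3:7] <- paste0("q", pct)

print(head(norms))
```

Unlike a spline-based quantile regression, these per-month estimates are not smoothed across ages and are not constrained to increase with age, so they become noisy when a cell contains few children.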

The original and updated norming studies (Fenson et al., ) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson, ; Jørgensen et al., ). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing the th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.


This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure ). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
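A minimal sketch of how such a trajectory could be computed is given below, under the assumption that the smoother is applied to per-age proportions; whether the app smooths proportions or raw responses is not stated here. The responses are simulated for a single hypothetical word, and the column names are invented.

```r
set.seed(2)

# Simulated item-level responses (invented data): one row per child for one word
n <- 1500
responses <- data.frame(age = sample(16:30, n, replace = TRUE))
# Probability of producing the word rises with age (made-up logistic curve)
responses$produces <- rbinom(n, size = 1, prob = plogis((responses$age - 22) / 2.5))

# Proportion of children at each age who produce the word
trajectory <- aggregate(produces ~ age, data = responses, FUN = mean)
names(trajectory)[2] <- "prop"

# Local polynomial regression smoother over the per-age proportions
fit <- loess(prop ~ age, data = trajectory)
trajectory$smoothed <- predict(fit)

print(trajectory)
```

A loess fit is not constrained to the 0 to 1 range, so smoothed values near the floor or ceiling should be read with some care.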

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank, ), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al., ; Huttenlocher, Haight, Bryk, Seltzer & Lyons, ; see Wallentin, , for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al., ).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue; since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure . Or she could navigate to the ‘Table’ tab of the display window to see tabular-form data showing the 50th percentile (median) for both females and males, by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing, ), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund, ; Song, Shattuck-Hufnagel & Demuth, ). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois, ), which make manipulating large datasets quick and easy. We describe the commands that the package provides, and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part 2: gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

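    library(wordbankr)  # data-loading functions for Wordbank
    library(dplyr)      # data manipulation verbs used below

    admins <- get_administration_data()
    items <- get_item_data()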

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    # Number of vocabulary items on each language's Words & Sentences form
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())


Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


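    # Words & Sentences administrations with a recorded sex
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words, by = "language") %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))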

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocabgender.html>.
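To see what the pipeline computes without connecting to the database, the following sketch runs the same steps on small invented data frames using base R in place of dplyr. The column names follow the code above; the languages' item counts and all scores are made up, so the output says nothing about real children.

```r
# Invented stand-ins for the wordbankr tables
items <- data.frame(
  language = c(rep("English", 4), rep("Danish", 5)),
  form     = "WS",
  type     = c("word", "word", "word", "other", rep("word", 5))
)
admins <- data.frame(
  data_id    = 1:8,
  language   = rep(c("English", "Danish"), each = 4),
  form       = "WS",
  age        = 24,
  sex        = c("Female", "Female", "Male", NA, "Female", "Male", "Male", "Male"),
  production = c(3, 2, 1, 2, 5, 2, 3, 4)
)

# Step 1: number of vocabulary items on each language's WS form
words <- items[items$form == "WS" & items$type == "word", ]
num_words <- aggregate(type ~ language, data = words, FUN = length)
names(num_words)[2] <- "n"

# Step 2: keep WS administrations with a recorded sex
vocab_admins <- admins[admins$form == "WS" & !is.na(admins$sex), ]

# Step 3: convert raw scores to proportions, then take medians by group
vocab_admins <- merge(vocab_admins, num_words, by = "language")
vocab_admins$production <- vocab_admins$production / vocab_admins$n
vocab_data <- aggregate(production ~ language + sex + age,
                        data = vocab_admins, FUN = median)
names(vocab_data)[names(vocab_data) == "production"] <- "median"

print(vocab_data)
```

Dividing by the number of items on each form before taking medians is what puts languages with different checklist lengths on a common 0 to 1 scale.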

The results of this analysis are shown in Figure . As expected, we replicate the gender differences found in previous work (Eriksson et al., ): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al., ). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.


Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language , –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development , –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences , –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers , –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology , –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development , –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development , –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development , –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development .

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language , –.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language , –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science , –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology , –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language , –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language , –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics , –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research , –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science , –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics , –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development , –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science , –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders , –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences , –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics , –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology , –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research , –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development , –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language , –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science , –.

Wickham, H. (). ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.


caregivers (e.g. mama), common objects (e.g. bottle, shoe), social expressions (e.g. bye-bye), and actions or routines (e.g. peekaboo, throw) (Nelson, ; Tardif et al., ). New words enter children’s expressive vocabularies slowly at first, but this process accelerates over the second year, such that children reach an average of words by months and more than by the time they graduate from high school (Fenson et al., ). At the same time, there are significant individual differences in language acquisition. For example, according to detailed observational studies, although some -month-olds already produce – words, others produce no words at all and will not do so until they are months or older (e.g. Brown, ; Bloom, ; Clark, ). How can such differences be measured accurately and efficiently? And can we promote early detection of differences in vocabulary growth that will be clinically significant later in development?

Measuring early vocabulary

Traditional studies of language development typically apply a combination of observational assessment and structured tests, frequently relying on short samples of interactions and small samples of children. Discerning both the universal features and natural variation of early lexical development has been greatly facilitated by the development of parent-report instruments like the MacArthur-Bates CDI (Fenson et al., ) and the Language Development Survey (LDS; Rescorla, ). The CDIs in particular were developed across a period of more than forty years. Originally designed for use in a research study (Bates, ), the instruments have evolved from a structured interview to the current paper-and-pencil format, and are now increasingly administered online (e.g. Kristoffersen et al., , for Norwegian, or <http://laboratorium.detskarec.sk> for Slovak). While other assessment tools exist for slightly older children, to our knowledge no other measure allows cost-effective global language assessment for children in the critical age ranges between the emergence of language and the period when children become more able to engage in structured face-to-face activities (around months).

Naturalistic observations are the other leading candidate for measurement of early language, but such observations are extremely costly and time-consuming to transcribe and annotate. These difficulties lead to a trade-off, where most studies either include dense data about a small number of children or smaller amounts of data with a larger sample size. Dense datasets currently provide the best method for in-depth study of the interaction between learning mechanisms and language input in individuals (e.g. Lieven, Salomo & Tomasello, ; Roy, Frank, DeCamp, Miller & Roy, ), although the generality of these studies is necessarily limited by their small sample sizes. At the other end of the spectrum, assessment of many individual language samples can yield information about individual variability (e.g. Dickinson & Tabors, ; Cartmill et al., ; Weisleder & Fernald, ), but at some cost in terms of depth.

In addition, naturalistic observations do not measure children’s language comprehension, a variable of interest for many early language researchers. Estimates of production vocabulary from naturalistic observation are highly correlated with the CDI within studies (e.g. Bornstein & Haynes, ), but affected substantially by length of the session, context, and interlocutor when comparing across studies. And although there exist methods to extract insights about global vocabulary from naturalistic observation, these statistical extrapolations are relatively new and have not been validated extensively (Hidaka, ). Other comprehension vocabulary measures are also available across some range of languages (e.g. the Peabody Picture Vocabulary Test; Dunn & Dunn, ), but these assessments are tailored for substantially older children.

Parent-report measures like the CDI and LDS take advantage of the fact that parents are expert observers of their child. CDI instruments ask about use of communicative gestures, grammar, and symbolic play, as well as vocabulary, which is measured using checklists consisting of representative samples of words. Parents choose the words their child currently ‘understands’ (comprehension, measured for younger children) or ‘says’ (production, measured for both younger and older children). The checklists contain words from many different semantic (e.g. animal names, household items) and syntactic (e.g. action words, connectives) categories, resulting in broader samples of lexical knowledge than are available from other methods. In their English and Spanish instantiations, the instruments come in two versions: Words & Gestures (– months) and Words & Sentences (– months). Originally designed for English, parallel instruments have now been adapted for more than sixty languages (Dale & Penfold, n.d.).

Limitations of parent report

Although the standardization of parent reports using the CDI contributes to the availability of large amounts of data in a comparable format, there are significant limitations to the parent-report methodology as well (Tomasello & Mervis, ; Feldman et al., ). First, parents may be biased observers: some may overestimate, while others likely underestimate, their children’s abilities. There is also some evidence that some variability may be due to reporting biases linked to factors such as socioeconomic status (Feldman et al., ; Fenson et al., ). Second, parent reports of comprehension for younger children likely suffer from a number of biases, and are probably substantially more accurate for content words than function words. Third, the items on the original CDI instruments were chosen to be a representative sample of vocabulary items for the appropriate age and language (Fenson et al., ), not with the intention that they would be a complete set of words that could be compared across instruments, or that they would be individually reliable and license the conclusion that a particular child knows a particular word. Fourth, although the length of the CDI may give the impression that it yields an estimate of the child’s full vocabulary, in fact it likely understates the size of a child’s vocabulary substantially, especially for older children (Mayor & Plunkett, ).

Despite these limitations, when used appropriately, the CDI instruments are an important tool. The instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child’s abilities. They yield reliable and valid estimates of total vocabulary size, with dozens of studies demonstrating concurrent and predictive relations with naturalistic and observational measures, in both typically developing and at-risk populations (e.g. Dale & Fenson; Thal, Jackson-Maldonado & Acosta; Marchman & Martínez-Sussmann). In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks (Hills et al.; Hills, Maouene, Riordan & Smith). Such analyses have the potential to be even more powerful when applied to larger samples and across languages.

WORDBANK

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community, we have constructed Wordbank, an open repository for CDI data that allows for interactive analysis and visualization. The main page of the site at time of writing is shown in Figure. In this section, we begin by describing technical details of the site’s database architecture. We then describe the two primary analysis tools that form the heart of the site’s interactive functionality. We give a worked example of how to use these, and then end by discussing the extensibility of the Wordbank framework, highlighting opportunities for contributing data and for building new analyses.

Our inspiration for Wordbank comes from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX, <www.cdi-clex.org>; Jørgensen et al.), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure shows a small slice of the original CDI norming data (Fenson et al.). Each row is a child; each column gives a variable: either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
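The layout described above can be sketched with a few toy data frames in base R. This is only an illustration of tables linked by identifiers: every table name, column name, and value below is hypothetical and does not reproduce Wordbank's actual schema.

```r
# Toy sketch of "common" and "instrument" tables; all names and values
# are hypothetical, not Wordbank's real schema.

# Common tables: children, administrations, items.
children <- data.frame(child_id = c(1, 2), sex = c("Female", "Male"))
administrations <- data.frame(
  admin_id = c(10, 11, 12),
  child_id = c(1, 1, 2),   # child 1 has two longitudinal administrations
  age      = c(18, 24, 20)
)
items <- data.frame(item_id = c(100, 101), definition = c("dog", "table"))

# Instrument table: one row per administration-by-item response.
responses <- data.frame(
  admin_id = c(10, 10, 11, 11, 12, 12),
  item_id  = c(100, 101, 100, 101, 100, 101),
  produces = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Link the tables through their unique identifiers.
linked <- merge(responses, administrations, by = "admin_id")
linked <- merge(linked, children, by = "child_id")
linked <- merge(linked, items, by = "item_id")

# Production vocabulary per administration, derived from the linked rows.
vocab <- aggregate(produces ~ admin_id + child_id + age, data = linked, FUN = sum)
vocab <- vocab[order(vocab$admin_id), ]
print(vocab)
```

Because each response row carries only identifiers, a second administration for the same child is simply another row in the administrations table, which is the kind of longitudinal bookkeeping the flat one-row-per-child format handles poorly.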

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages: words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI, across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al.). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure. The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
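To make the idea of by-age percentile curves concrete, the following minimal sketch computes empirical percentiles separately within each month of age on simulated data. It is a deliberately cruder technique than the spline-based quantile regression described above, and it does not call gcrq. The data, the age range, and the particular set of percentiles are all assumptions made for illustration.

```r
# Simulated production scores (hypothetical, not Wordbank data).
set.seed(1)
toy <- data.frame(
  age        = rep(16:18, each = 50),
  production = c(rpois(50, 40), rpois(50, 70), rpois(50, 110))
)

# Percentiles to estimate; this particular set is an assumption.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# Empirical quantiles within each month of age:
# one row per age, one column per percentile.
by_age <- t(sapply(split(toy$production, toy$age), quantile, probs = probs))
print(round(by_age))
```

Per-age empirical quantiles are noisy when an age bin holds few children and need not increase smoothly with age, which is one motivation for fitting smooth monotonic curves across ages instead.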

The original and updated norming studies (Fenson et al.) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson; Jørgensen et al.). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.

This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
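A trajectory of this kind can be sketched in a few lines of base R: compute the proportion of children producing an item at each age, then smooth it with loess. The counts, the age range, and the span value below are made up for illustration and are not taken from Wordbank.

```r
# Hypothetical by-age counts for a single item (not real Wordbank values).
traj <- data.frame(
  age        = 16:30,
  n_children = rep(40, 15),
  n_produce  = c(4, 6, 9, 12, 15, 19, 22, 25, 28, 30, 32, 34, 35, 36, 37)
)

# Proportion of children reported to produce the item at each age.
traj$prop <- traj$n_produce / traj$n_children

# Local polynomial regression of proportion on age; span is an assumed value.
fit <- loess(prop ~ age, data = traj, span = 0.75)
traj$smoothed <- predict(fit)

print(traj[, c("age", "prop", "smoothed")])
```

Note that loess does not constrain fitted values to lie between 0 and 1, so a smoothed curve can stray slightly outside that range near the extremes.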

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al.; Huttenlocher, Haight, Bryk, Seltzer & Lyons; see Wallentin for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al.).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue; since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure. Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund; Song, Shattuck-Hufnagel & Demuth). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data-loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part : gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need.

    library(wordbankr)
    library(dplyr)

    admins <- get_administration_data()
    items <- get_item_data()

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset.

    # Number of vocabulary items on the Words & Sentences form in each language
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())

Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

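    # Words & Sentences administrations for which the child's sex is recorded
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of checklist words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))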

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.
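The core of this computation, converting raw production counts to proportions of each language's checklist length and then taking group medians, can be tried without a database connection. The sketch below uses base R and a tiny made-up dataset; the language labels, checklist lengths, and scores are hypothetical and chosen only to show why the normalization step is needed.

```r
# Hypothetical administrations and checklist lengths (not Wordbank data).
toy_admins <- data.frame(
  language   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  sex        = c("Female", "Female", "Male", "Male",
                 "Female", "Female", "Male", "Male"),
  age        = 24,
  production = c(300, 340, 250, 310, 200, 260, 180, 220)
)
toy_num_words <- data.frame(language = c("A", "B"), n = c(600, 500))

# Join checklist length onto each administration, convert raw counts to
# proportions, then take the median within language, sex, and age.
joined <- merge(toy_admins, toy_num_words, by = "language")
joined$production <- joined$production / joined$n
toy_medians <- aggregate(production ~ language + sex + age,
                         data = joined, FUN = median)
print(toy_medians)
```

Dividing by the number of items puts forms of different lengths on a common 0 to 1 scale, although, as noted earlier, equal proportions on different instruments still should not be read as equivalent vocabulary sizes.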

The results of this analysis are shown in Figure. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect for all ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. Language and context: the acquisition of pragmatics. New York, NY: Academic Press.

Bates, E. & Goodman, J. On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language. Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language.

Bloom, P. How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (eds), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences.

Clark, E. First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html>.

Dale, P. S. & Fenson, L. Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf>.

Dickinson, D. K. & Tabors, P. O. Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. Concurrent and predictive validity of parent reports of child language at ages and years. Child Development.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. Reply: measuring variability in early child language: don’t shoot the messenger. Child Development.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language.

Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. Early vocabulary growth: relation to language input and gender. Developmental Psychology.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. CLEX: a cross-linguistic lexical norms database. Journal of Child Language.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language.

Lieven, E., Salomo, D. & Tomasello, M. Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics.

MacWhinney, B. The CHILDES Project: tools for analyzing talk, 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research.

Mayor, J. & Plunkett, K. A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics.

Nelson, K. Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development.

Norrman, G. & Bylund, E. The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science.

R Foundation for Statistical Computing. R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. Baby’s first words. Developmental Psychology.

Thal, D., Jackson-Maldonado, D. & Acosta, D. Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research.

Tomasello, M. & Mervis, C. B. The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development.

Wallentin, M. Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language.

Weisleder, A. & Fernald, A. Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science.

Wickham, H. Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. Dplyr: a grammar of data manipulation. R package. Online: <https://cran.r-project.org/web/packages/dplyr>.


DeCamp, Miller, & Roy), although the generality of these studies is necessarily limited by their small sample sizes. At the other end of the spectrum, assessment of many individual language samples can yield information about individual variability (e.g. Dickinson & Tabors; Cartmill et al.; Weisleder & Fernald), but at some cost in terms of depth.

In addition, naturalistic observations do not measure children’s language comprehension, a variable of interest for many early language researchers. Estimates of production vocabulary from naturalistic observation are highly correlated with the CDI within studies (e.g. Bornstein & Haynes), but are affected substantially by length of the session, context, and interlocutor when comparing across studies. And although there exist methods to extract insights about global vocabulary from naturalistic observation, these statistical extrapolations are relatively new and have not been validated extensively (Hidaka). Other comprehension vocabulary measures are also available across some range of languages (e.g. the Peabody Picture Vocabulary Test; Dunn & Dunn), but these assessments are tailored for substantially older children.

Parent-report measures like the CDI and LDS take advantage of the fact that parents are expert observers of their child. CDI instruments ask about use of communicative gestures, grammar, and symbolic play, as well as vocabulary, which is measured using checklists consisting of representative samples of words. Parents choose the words their child currently ‘understands’ (comprehension, measured for younger children) or ‘says’ (production, measured for both younger and older children). The checklists contain words from many different semantic (e.g. animal names, household items) and syntactic (e.g. action words, connectives) categories, resulting in broader samples of lexical knowledge than are available from other methods. In their English and Spanish instantiations, the instruments come in two versions: Words & Gestures (– months) and Words & Sentences (– months). Originally designed for English, the instruments now have parallel adaptations in more than sixty languages (Dale & Penfold, n.d.).

Limitations of parent report

Although the standardization of parent reports using the CDI contributes to the availability of large amounts of data in a comparable format, there are significant limitations to the parent-report methodology as well (Tomasello & Mervis; Feldman et al.). First, parents may be biased observers: some may overestimate, while others likely underestimate, their children’s abilities. There is also some evidence that some variability may be due to reporting biases linked to factors such as socioeconomic status (Feldman et al.; Fenson et al.). Second, parent reports of comprehension for younger children likely suffer from a number of biases, and are probably substantially more accurate for content words than function words. Third, the items on the original CDI instruments were chosen to be a representative sample of vocabulary items for the appropriate age and language (Fenson et al.), not with the intention that they would be a complete set of words that could be compared across instruments, or that they would be individually reliable and license the conclusion that a particular child knows a particular word. Fourth, although the length of the CDI may give the impression that it yields an estimate of the child’s full vocabulary, in fact it likely understates the size of a child’s vocabulary substantially, especially for older children (Mayor & Plunkett).

Despite these limitations, when used appropriately, the CDI instruments are an important tool. The instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child’s abilities. They yield reliable and valid estimates of total vocabulary size, with dozens of studies demonstrating concurrent and predictive relations with naturalistic and observational measures, in both typically developing and at-risk populations (e.g. Dale & Fenson; Thal, Jackson-Maldonado, & Acosta; Marchman & Martínez-Sussmann). In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks (Hills et al.; Hills, Maouene, Riordan, & Smith). Such analyses have the potential to be even more powerful when applied to larger samples and across languages.

WORDBANK

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community, we have constructed Wordbank, an open repository for CDI data that allows for interactive analysis and visualization. The main page of the site at time of writing is shown in Figure. In this section, we begin by describing technical details of the site’s database architecture. We then describe the two primary analysis tools that form the heart of the site’s interactive functionality. We give a worked example of how to use these, and then end by discussing the extensibility of the Wordbank framework, highlighting opportunities for contributing data and for building new analyses.

Our inspiration for Wordbank comes from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX, <www.cdi-clex.org>; Jørgensen et al.), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure shows a small slice of the original CDI norming data (Fenson et al.). Each row is a child; each column gives a variable: either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
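
To make the contrast concrete, the following minimal sketch reshapes a toy wide-format table (one row per child, one column per word) into a long format with one row per child-by-word response. The column names and values are invented for illustration and are not Wordbank's actual schema.

```r
# Toy wide-format CDI data: one row per child, one column per word.
# All names and values are made up for illustration.
wide <- data.frame(
  child_id = c(1, 2, 3),
  age      = c(16, 18, 24),
  sex      = c("F", "M", "F"),
  dog      = c(1, 0, 1),
  ball     = c(0, 0, 1),
  table    = c(0, 1, 1)
)

word_cols <- c("dog", "ball", "table")

# Stack the word columns: one row per (child, word) response.
long <- data.frame(
  child_id = rep(wide$child_id, times = length(word_cols)),
  word     = rep(word_cols, each = nrow(wide)),
  produces = unlist(wide[word_cols], use.names = FALSE)
)

# Demographics now live in their own table, keyed by child_id.
children <- wide[c("child_id", "age", "sex")]

print(long)
```

In the long layout, adding a second instrument or a new demographic field means adding rows or a new keyed table rather than redefining the columns of one large spreadsheet.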

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that are shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
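
The idea of tables linked by identifiers can be sketched in a few lines of base R. The table and column names below are hypothetical stand-ins chosen for readability; they are not the names used in the Wordbank database.

```r
# Hypothetical miniature versions of the two table groups.
# Table and column names are illustrative, not Wordbank's actual schema.

# "Common" tables: children, administrations, items.
children <- data.frame(child_id = c(1, 2), sex = c("F", "M"))
administrations <- data.frame(
  admin_id = c(10, 11, 12),
  child_id = c(1, 1, 2),      # child 1 has two longitudinal administrations
  age      = c(16, 20, 18)
)
items <- data.frame(
  item_id    = c(100, 101),
  definition = c("dog", "ball")
)

# "Instrument" table: one row per administration-by-item response.
responses <- data.frame(
  admin_id = c(10, 10, 11, 11, 12, 12),
  item_id  = c(100, 101, 100, 101, 100, 101),
  produces = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Link the tables through their identifiers.
full <- merge(responses, administrations, by = "admin_id")
full <- merge(full, children, by = "child_id")
full <- merge(full, items, by = "item_id")

# Number of words produced in each administration.
vocab <- aggregate(produces ~ admin_id + child_id + age, data = full, FUN = sum)
print(vocab)
```

Because each administration carries its own identifier, a child with several longitudinal administrations is handled by the same join as a child with one.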

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages: words, and even whole sections, are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’, and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al.). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure. The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello, & Calvo). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
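
As a rough illustration of what percentile curves summarize, the sketch below computes empirical quantiles of a production score within each month-wide age bin. This is a deliberately simpler stand-in for the spline-based quantile regression described above: it does not use gcrq, the resulting curves are neither smoothed nor constrained to be monotonic in age, and the data, growth function, and choice of percentiles are all invented for the example.

```r
set.seed(1)

# Simulated administrations: age in months and a production score.
# The growth function and noise level are arbitrary, chosen only for illustration.
n     <- 600
age   <- sample(16:30, n, replace = TRUE)
score <- pmax(0, round(25 * (age - 14) + rnorm(n, sd = 60)))
admins <- data.frame(age = age, production = score)

# Empirical quantiles within each month-wide age bin.
probs  <- c(0.10, 0.25, 0.50, 0.75, 0.90)
by_age <- tapply(admins$production, admins$age, quantile, probs = probs)
norms  <- do.call(rbind, by_age)   # rows = ages, columns = percentiles

print(round(norms))
```

Splitting by a demographic variable amounts to repeating the same computation within each level of that variable.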

The original and updated norming studies (Fenson et al.) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson; Jørgensen et al.). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.


This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
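
The computation behind an item trajectory can be sketched as follows: take the proportion of children who produce a word at each age, then smooth those proportions with loess. The data here are simulated from an arbitrary logistic curve, and the variable names are hypothetical; this is a sketch of the general technique, not the code of the Wordbank app.

```r
set.seed(2)

# Simulated item-level responses for one word: each element is one administration.
# The logistic "true" curve is an arbitrary assumption for illustration.
n        <- 800
age      <- sample(16:30, n, replace = TRUE)
p_true   <- plogis((age - 22) / 2.5)
produces <- rbinom(n, size = 1, prob = p_true)

# Proportion of children producing the word at each age (in months).
traj <- aggregate(produces ~ age, data = data.frame(age, produces), FUN = mean)
names(traj)[2] <- "prop"

# Smooth the proportions with a local polynomial regression (loess).
fit <- loess(prop ~ age, data = traj)
traj$smoothed <- predict(fit)

print(round(traj, 2))
```

Note that a loess fit to proportions is not constrained to stay between 0 and 1, so smoothed values near the floor or ceiling should be read with that in mind.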

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman, & Frank), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al.; Huttenlocher, Haight, Bryk, Seltzer, & Lyons; see Wallentin for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al.).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue: since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure. Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund; Song, Shattuck-Hufnagel, & Demuth). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data to retrieve information about each CDI item, including its text and categories; and get_instrument_data to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part 2: gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

    library(wordbankr)
    library(dplyr)

    # Load the administration and item tables from Wordbank
    admins <- get_administration_data()
    items <- get_item_data()

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    # Number of vocabulary items on each language's Words & Sentences (WS) form
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())


Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


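    # WS administrations for which the child's sex was reported
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))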

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.
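
The logic of the pipeline above can be checked by hand on a few rows. The sketch below mirrors the same three steps (count words per language, filter administrations, take the median proportion produced) in base R on toy tables, so that it runs without a database connection. The toy values, and the ‘Female’/‘Male’ coding of sex, are assumptions made for illustration only.

```r
# Toy stand-ins for the tables returned by wordbankr; all values are invented.
items <- data.frame(
  language = c(rep("A", 4), rep("B", 2)),
  form     = "WS",
  type     = c("word", "word", "word", "other", "word", "word")
)
admins <- data.frame(
  data_id    = 1:6,
  language   = c("A", "A", "A", "B", "B", "B"),
  form       = "WS",
  age        = 24,
  sex        = c("Female", "Female", "Male", "Female", "Male", NA),
  production = c(3, 1, 1, 2, 1, 2)
)

# Number of vocabulary items per language on the WS form.
words <- items[items$form == "WS" & items$type == "word", ]
num_words <- as.data.frame(table(language = words$language),
                           responseName = "n", stringsAsFactors = FALSE)

# Keep WS administrations with known sex, then scale by instrument length.
vocab_admins <- admins[admins$form == "WS" & !is.na(admins$sex), ]
vocab <- merge(vocab_admins, num_words, by = "language")
vocab$production <- vocab$production / vocab$n

# Median proportion produced for each language, sex, and age.
vocab_data <- aggregate(production ~ language + sex + age,
                        data = vocab, FUN = median)
print(vocab_data)
```

Dividing by the number of words on each form is what puts instruments of different lengths on a common 0 to 1 scale before medians are compared.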

The results of this analysis are shown in Figure. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in all ten languages studied, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.


Bates, E., & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H., & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A., & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings, & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N., & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S., & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S., & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K., & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M., & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E., & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E., & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S., & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S., & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S., & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.


(Feldman et al., ; Fenson et al., ). Second, parent reports of comprehension for younger children likely suffer from a number of biases and are probably substantially more accurate for content words than function words. Third, the items on the original CDI instruments were chosen to be a representative sample of vocabulary items for the appropriate age and language (Fenson et al., ), not with the intention that they would be a complete set of words that could be compared across instruments, or that they would be individually reliable and license the conclusion that a particular child knows a particular word. Fourth, although the length of the CDI may give the impression that it yields an estimate of the child’s full vocabulary, in fact it likely understates the size of a child’s vocabulary substantially, especially for older children (Mayor & Plunkett, ).

Despite these limitations, when used appropriately, the CDI instruments are an important tool. The instruments were designed to minimize bias by targeting current behaviors and asking parents about highly salient features of their child’s abilities. They yield reliable and valid estimates of total vocabulary size, with dozens of studies demonstrating concurrent and predictive relations with naturalistic and observational measures in both typically developing and at-risk populations (e.g. Dale & Fenson, ; Thal, Jackson-Maldonado & Acosta, ; Marchman & Martínez-Sussmann, ). In addition, a variety of recent work has shown that individual item-level responses can yield exciting new insights, for example about the growth patterns of semantic networks (Hills et al., ; Hills, Maouene, Riordan & Smith, ). Such analyses have the potential to be even more powerful when applied to larger samples and across languages.

WORDBANK

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community, we have constructed Wordbank, an open repository for CDI data that allows for interactive analysis and visualization. The main page of the site at time of writing is shown in Figure . In this section, we begin by describing technical details of the site’s database architecture. We then describe the two primary analysis tools that form the heart of the site’s interactive functionality. We give a worked example of how to use these, and then end by discussing the extensibility of the Wordbank framework, highlighting opportunities for contributing data and for building new analyses.

Our inspiration for Wordbank comes from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney, ). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX, <www.cdi-clex.org>; Jørgensen et al., ), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure shows a small slice of the original CDI norming data (Fenson et al., ). Each row is a child; each column gives a variable – either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
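To make the contrast concrete, the following base R sketch converts a toy wide-format table (one row per child, one column per word) into a long format with one row per child-by-word response. The column names and values are invented for illustration and are not taken from the norming data.

```r
# Toy wide-format CDI data: one row per child, one column per word.
# All names and values here are hypothetical.
wide <- data.frame(
  child_id = c(1, 2, 3),
  age      = c(16, 18, 24),
  sex      = c("F", "M", "F"),
  dog      = c(1, 0, 1),
  ball     = c(0, 0, 1),
  stringsAsFactors = FALSE
)

word_cols <- c("dog", "ball")

# Stack the word columns so that each row is one child-by-word response.
long <- data.frame(
  child_id = rep(wide$child_id, times = length(word_cols)),
  item     = rep(word_cols, each = nrow(wide)),
  produces = unlist(wide[word_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```

In the long layout, adding a new word or a second instrument only adds rows, whereas the wide layout would need new columns for every form.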

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
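As a rough illustration of tables linked by unique identifiers, the base R sketch below builds three toy "common" tables and one toy "instrument" table and joins them on their keys. The table and column names are assumptions made for this example; they are not Wordbank's actual schema.

```r
# Hypothetical common tables (names and columns are illustrative only).
children <- data.frame(child_id = c(1, 2), sex = c("F", "M"),
                       stringsAsFactors = FALSE)
administrations <- data.frame(
  admin_id = c(10, 11, 12),
  child_id = c(1, 1, 2),    # child 1 has two longitudinal administrations
  age      = c(16, 20, 18)
)
items <- data.frame(item_id = c(100, 101), definition = c("dog", "ball"),
                    stringsAsFactors = FALSE)

# Hypothetical instrument table: one row per administration-by-item response.
responses <- data.frame(
  admin_id = c(10, 10, 11, 11, 12, 12),
  item_id  = c(100, 101, 100, 101, 100, 101),
  produces = c(1, 0, 1, 1, 0, 0)
)

# Link the tables through their identifiers.
joined <- merge(responses, administrations, by = "admin_id")
joined <- merge(joined, children, by = "child_id")
joined <- merge(joined, items, by = "item_id")

# Production vocabulary size for each administration.
vocab <- aggregate(produces ~ admin_id + child_id + age, data = joined, FUN = sum)
print(vocab)
```

Because each administration carries its own identifier, the two administrations for child 1 remain distinct rows, which is the longitudinal case that the flat format handles poorly.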

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages – words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI, across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al., ). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure . The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo, ). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
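For readers who want a feel for what percentile curves summarize, the sketch below computes raw empirical quantiles within each month of age on simulated scores. This is a deliberately simpler stand-in, not the spline-based quantile regression that the site fits with gcrq: the curves here are neither smoothed nor constrained to increase with age, and the simulated data are invented.

```r
set.seed(1)

# Simulated production scores for ages 16-30 months (hypothetical data).
ages  <- rep(16:30, each = 40)
vocab <- pmax(0, round((ages - 14) * 25 + rnorm(length(ages), sd = 60)))

# Quantiles to estimate within each age bin.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# One row per age in months, one column per quantile.
curves <- t(sapply(split(vocab, ages), quantile, probs = probs))
print(round(curves))
```

Splitting `vocab` and `ages` by a demographic variable before this step would give one set of curves per group, which is the kind of comparison the split option in the app provides.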

The original and updated norming studies (Fenson et al., ) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson, ; Jørgensen et al., ). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.

This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure ). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
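The smoothing step can be sketched with base R's loess function. The proportions below are invented for a single hypothetical word, and the span is simply R's default; neither is taken from Wordbank.

```r
# Hypothetical proportion of children producing one word at each age (months).
item <- data.frame(
  age  = 16:30,
  prop = c(0.05, 0.08, 0.12, 0.20, 0.26, 0.35, 0.41, 0.52,
           0.58, 0.66, 0.71, 0.79, 0.83, 0.88, 0.91)
)

# Local polynomial regression of proportion on age.
fit <- loess(prop ~ age, data = item, span = 0.75)

# Evaluate the smoothed trajectory on a finer age grid.
grid <- data.frame(age = seq(16, 30, by = 0.5))
grid$smoothed <- predict(fit, newdata = grid)

# First age on the grid at which the smoothed proportion reaches one half.
age_50 <- grid$age[which(grid$smoothed >= 0.5)[1]]
print(age_50)
```

A threshold crossing of this kind is one simple way to attach an approximate age of acquisition to a word when matching experimental stimuli.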

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank, ), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al., ; Huttenlocher, Haight, Bryk, Seltzer & Lyons, ; see Wallentin, , for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al., ).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue: since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure . Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.
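Once the raw data have been downloaded, the same median-by-gender table can be recomputed offline. The sketch below uses a small invented data frame in place of the downloaded file; the column names (age, sex, production) are assumptions and may not match the headers in the actual CSV export.

```r
# Stand-in for a downloaded raw-data table: one row per child.
# Values and column names are hypothetical.
raw <- data.frame(
  age        = c(18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24),
  sex        = rep(c("Female", "Male"), times = 6),
  production = c(60, 40, 90, 50, 120, 80, 300, 250, 340, 280, 400, 310),
  stringsAsFactors = FALSE
)

# Median production vocabulary for each age-by-sex cell.
medians <- aggregate(production ~ age + sex, data = raw, FUN = median)
print(medians)
```

With a real download, the data frame would instead come from a call such as read.csv on the saved file, after checking the column headers it actually contains.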

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing, ), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund, ; Song, Shattuck-Hufnagel & Demuth, ). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois, ), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example part gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we first begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

    library(wordbankr)
    library(dplyr)
    admins <- get_administration_data()
    items <- get_item_data()

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    # Number of vocabulary items on each language's Words & Sentences form
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())

Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


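    # Words & Sentences administrations with a recorded sex
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))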

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <httpmikabrgithubiodemo-vocabgenderhtml>.

The results of this analysis are shown in Figure . As expected, we replicate the gender differences found in previous work (Eriksson et al., ): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect for all ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al., ). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.
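One way to summarize such a comparison numerically, rather than by eye from the plot, is to pair the female and male medians at each age and average the gap within each language. The base R sketch below does this on an invented data frame with the same columns as vocab_data; the languages and values are placeholders, not Wordbank results.

```r
# Toy stand-in for vocab_data: median proportion produced by language, sex, age.
# "Language A" and "Language B" and all values are hypothetical.
vocab_data <- data.frame(
  language = rep(c("Language A", "Language B"), each = 4),
  sex      = rep(c("Female", "Male"), times = 4),
  age      = rep(c(18, 18, 24, 24), times = 2),
  median   = c(0.10, 0.07, 0.45, 0.38, 0.08, 0.08, 0.40, 0.41),
  stringsAsFactors = FALSE
)

# Pair female and male medians within each language-by-age cell.
female <- vocab_data[vocab_data$sex == "Female", ]
male   <- vocab_data[vocab_data$sex == "Male", ]
paired <- merge(female, male, by = c("language", "age"), suffixes = c("_f", "_m"))

# Female-minus-male gap, averaged over ages within each language.
paired$gap <- paired$median_f - paired$median_m
advantage  <- aggregate(gap ~ language, data = paired, FUN = mean)
print(advantage)
```

A positive mean gap indicates a female advantage for that language in the toy data; a descriptive summary like this does not by itself establish reliability, which would require a model of the underlying administrations.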

CONCLUSION

In this paper we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.


Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., ... Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., ... Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., ... Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). Ggplot: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.


community, with many contributors and affiliated projects. The second is the Cross Linguistic Lexical Norms site (CLEX; <www.cdi-clex.org>; Jørgensen et al., ), which is closer in content to Wordbank and effectively our precursor. CLEX archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank builds on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal is to extend beyond the norming data provided by the developers of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are likely more heterogeneous, they nevertheless have the potential to be considerably larger and more representative than the individual norming datasets. Wordbank provides tools that enable more powerful, flexible, and nuanced analyses of general trends and comparisons across sub-populations in a variety of different languages.

While the general Wordbank architecture enables a huge variety of analyses in principle, some illustrative examples are helpful for understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Fig. . Screenshot of the Wordbank main page. Visitors can navigate from this page to the interactive reports, as well as to a statistics page that shows the database composition, a contributors page that shows citation information, and a blog that highlights recent updates.

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure shows a small slice of the original CDI norming data (Fenson et al., ). Each row is a child; each column gives a variable – either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
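To make the contrast concrete, the following is a minimal sketch in base R of turning a wide child-by-word table into separate demographic and response tables. The data frame, its column names (child_id, produces, and the three words), and all values are invented for illustration; they are not the Wordbank schema.

```r
# Toy wide-format CDI data: one row per child, one column per word.
# All names and values here are hypothetical.
wide <- data.frame(
  child_id = c(1, 2, 3),
  age      = c(18, 20, 24),
  sex      = c("Female", "Male", "Female"),
  dog      = c(1, 0, 1),
  ball     = c(0, 0, 1),
  milk     = c(1, 1, 1)
)

word_cols <- c("dog", "ball", "milk")

# Long format: one row per child-by-item response.
# unlist() stacks the word columns one after another, so the item label
# is repeated once per child and the child ids are recycled per word.
responses <- data.frame(
  child_id = rep(wide$child_id, times = length(word_cols)),
  item     = rep(word_cols, each = nrow(wide)),
  produces = unlist(wide[word_cols], use.names = FALSE)
)

# Demographics move to their own table, keyed by child_id.
children <- wide[, c("child_id", "age", "sex")]

print(responses)
```

In the long layout, adding a second instrument or language only adds rows to the response table, whereas the wide layout would need a new set of columns for every form.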

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
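The idea of tables linked by unique identifiers can be sketched with a few toy data frames joined in base R. This is an illustration only: the table layouts, the key names child_id and item_id, and all values are assumptions made for the example, not the actual Wordbank table definitions.

```r
# Hypothetical "common" tables: children, administrations, and items.
children <- data.frame(child_id = c(1, 2), sex = c("Female", "Male"))
administrations <- data.frame(
  data_id  = c(101, 102, 103),
  child_id = c(1, 1, 2),          # child 1 was tested twice (longitudinal)
  age      = c(18, 24, 20)
)
items <- data.frame(item_id = c(1, 2), definition = c("dog", "ball"))

# Hypothetical "instrument" table: one row per administration-by-item response.
responses <- data.frame(
  data_id  = c(101, 101, 102, 102, 103, 103),
  item_id  = c(1, 2, 1, 2, 1, 2),
  produces = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Follow the identifiers to attach administration, child, and item details.
linked <- merge(responses, administrations, by = "data_id")
linked <- merge(linked, children, by = "child_id")
linked <- merge(linked, items, by = "item_id")

print(linked[order(linked$data_id, linked$item_id), ])
```

Because each administration carries its own identifier, a child with several longitudinal administrations is simply several rows in the administrations table rather than a special case.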

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories or norms like concreteness and imageability could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages – words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale and Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. . Example data from the CDI norming sample (Fenson et al., ). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure . The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo, ). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
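As a rough illustration of what percentile curves summarize, the sketch below computes raw empirical percentiles of production scores within each month of age, using base R and simulated data. This is a deliberately simpler stand-in: it does not use the spline-based quantile regression of the quantregGrowth package described above, so the resulting curves are neither smoothed nor constrained to increase with age. The scores and their distribution are invented.

```r
set.seed(1)

# Simulated administrations: 40 children at each month from 16 to 30.
# Scores grow with age; the numbers are invented for illustration.
ages <- rep(16:30, each = 40)
production <- pmax(0, round(20 * (ages - 15) + rnorm(length(ages), sd = 30)))
admins <- data.frame(age = ages, production = production)

# Percentiles to estimate within each age bin.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# One row per month of age, one column per percentile.
norms <- t(sapply(split(admins$production, admins$age), quantile, probs = probs))

print(round(norms))
```

Splitting by a demographic field amounts to running the same computation separately within each level of that field, for example once per value of a sex column.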

The original and updated norming studies (Fenson et al., ) gathered data from a diverse (though not nationally representative) sample and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson, ; Jørgensen et al., ). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. . A screenshot of the Vocabulary Norms analysis tool, showing th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.


This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure ). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
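A minimal sketch of this kind of smoothing, using the loess function from base R on simulated data for a single hypothetical word, might look as follows. The sample sizes, the logistic growth curve used to simulate responses, and the default loess settings are assumptions for illustration and may differ from what the site uses.

```r
set.seed(2)

# Simulated item data: for one hypothetical word, the number of children
# (out of 50 at each month of age) reported to produce it.
ages <- 16:30
n_children <- rep(50, length(ages))
true_prop <- plogis((ages - 22) / 2)   # invented underlying growth curve
n_producing <- rbinom(length(ages), size = n_children, prob = true_prop)

item <- data.frame(age = ages, prop = n_producing / n_children)

# Local polynomial regression of proportion producing on age.
fit <- loess(prop ~ age, data = item)
item$smoothed <- predict(fit)

print(round(item, 2))
```

The smoothed column can then be drawn as a line over the raw proportions. Note that loess does not restrict fitted values to the 0 to 1 range, so values near the floor or ceiling can slightly overshoot.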

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank, ), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al., ; Huttenlocher, Haight, Bryk, Seltzer & Lyons, ; see Wallentin, , for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al., ).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue; since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure . Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. . A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing, ), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund, ; Song, Shattuck-Hufnagel & Demuth, ). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois, ), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. . A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the comprehensive R archive network. To install, simply type: install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part : gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we first begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

library(wordbankr)
library(dplyr)
admins <- get_administration_data()
items <- get_item_data()

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

num_words <- items %>%
  filter(form == "WS", type == "word") %>%
  group_by(language) %>%
  summarise(n = n())


Fig. . Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


vocab_admins <- admins %>%
  filter(form == "WS", !is.na(sex)) %>%
  select(data_id, language, form, age, sex, production)

vocab_data <- vocab_admins %>%
  group_by(language, sex, age) %>%
  left_join(num_words) %>%
  mutate(production = production / n) %>%
  summarise(median = median(production))

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <httpmikabrgithubiodemo-vocabgenderhtml>.
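For readers who want to trace the computation without a database connection, the sketch below reproduces the same steps in base R on a few invented rows. The toy admins and items data frames only mimic the columns used in the analysis (language, form, type, age, sex, production); the languages, counts, and scores are made up, and merge and aggregate stand in for the dplyr verbs.

```r
# Toy item table: 4 English and 5 Danish vocabulary items (invented counts).
items <- data.frame(
  language = c(rep("English", 4), rep("Danish", 5)),
  form = "WS",
  type = "word"
)

# Toy administration table; one child has no recorded sex.
admins <- data.frame(
  data_id = 1:8,
  language = rep(c("English", "Danish"), each = 4),
  form = "WS",
  age = rep(c(18, 18, 24, 24), times = 2),
  sex = c("Female", "Male", "Female", NA, "Female", "Male", "Female", "Male"),
  production = c(2, 1, 4, 3, 1, 1, 5, 3)
)

# Number of vocabulary items per language.
items$n <- 1
word_items <- items[items$form == "WS" & items$type == "word", ]
num_words <- aggregate(n ~ language, data = word_items, FUN = sum)

# Keep administrations with a recorded sex, then attach item counts.
vocab_admins <- admins[admins$form == "WS" & !is.na(admins$sex), ]
vocab <- merge(vocab_admins, num_words, by = "language")

# Proportion of items produced, then the median per language, sex, and age.
vocab$prop <- vocab$production / vocab$n
vocab_data <- aggregate(prop ~ language + sex + age, data = vocab, FUN = median)

print(vocab_data)
```

Dividing by the per-language item count is what puts instruments of different lengths on a common 0 to 1 scale before medians are compared.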

The results of this analysis are shown in Figure As expected wereplicate the gender differences found in previous work (Eriksson et al) females showed a small but highly reliable advantage in earlyproduction This effect is highly consistent and clearly visible in elevenout of twelve languages with Italian being the only exception Forcomparison the previous work found a positive female effect for all tenout of ten languages but the size of the effect was close to zero for two ofthese Observational data such as those contained in Wordbank allow usonly to speculate about the origins of this gender difference or the sourcesof cross-linguistic variation (for some discussion see Eriksson et al) But the Wordbank platform dramatically facilitates the formulationand testing of analyses of this sort allowing hypotheses to be testedquickly and easily against large datasets

CONCLUSION

In this paper we have presented Wordbank an open repository forparent-report vocabulary data from the MacArthur-Bates CDI Theinteractive analysis tools available on the Wordbank site allow interestedresearchers to explore a wide variety of phenomena in vocabularydevelopment quickly and easily exporting data and downloadingpresentation-quality graphics that document their analysis In additionusers can contribute new analyses and data to the site and connect to itdirectly using an R package for data loading These functions all facilitategreater sharing and reuse of existing data on childrenrsquos vocabularyenabling new discoveries in the future

REFERENCES

Bates E () Language and context the acquisition of pragmatics (Vol ) New York NYAcademic Press

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-U.S. English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing / Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). dplyr: a grammar of data manipulation. R package. Online: <https://cran.r-project.org/web/packages/dplyr>.


understanding the site. Consider an experimenter constructing a new set of stimuli for a word recognition experiment: the appropriate tool for this task would be the Item Trajectories analysis, which shows the trajectory of acquisition for individual words. The experimenter could explore different combinations of items using this tool and match them for age of acquisition. Or consider a researcher interested in gender effects on vocabulary growth: the appropriate tool would be the Vocabulary Norms analysis, which shows percentile curves for a particular instrument. (We walk through detailed instructions for how such an analysis would be conducted below.)

Database architecture

Why use a database to store vocabulary data? Consider the standard format of raw CDI data. Figure shows a small slice of the original CDI norming data (Fenson et al.). Each row is a child; each column gives a variable – either a demographic variable or the result of a particular word being administered to a particular child. Although this format is useful for homogeneous administrations of a single instrument, it cannot accommodate multiple instruments, multiple languages, or datasets with different sources or kinds of demographic information. Consolidating data across different instruments is very difficult in this format, and tracking data on children with multiple longitudinal administrations of a single instrument must also be done in an ad-hoc manner. The move to a database format allows far more flexible and programmatic handling of heterogeneous data structures from different sources.
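To make the contrast concrete, the following sketch reshapes a toy wide-format table (one row per child, one column per word) into a long child-by-item table plus a separate demographics table. The column names and values are invented for illustration and are not Wordbank's actual schema; base R is used so that the example runs without extra packages.

```r
# Toy wide-format CDI data: one row per child, one column per word.
# All names and values are invented for illustration.
wide <- data.frame(
  child_id = c(1, 2, 3),
  age      = c(16, 18, 24),
  sex      = c("F", "M", "F"),
  dog      = c(1, 0, 1),
  ball     = c(0, 0, 1),
  table    = c(0, 1, 1)
)

word_cols <- c("dog", "ball", "table")

# Stack the word columns so that each row is one child-by-item response.
long <- data.frame(
  child_id = rep(wide$child_id, times = length(word_cols)),
  item     = rep(word_cols, each = nrow(wide)),
  produces = unlist(wide[word_cols], use.names = FALSE)
)

# Demographic variables live in their own table, keyed by child_id.
children <- wide[c("child_id", "age", "sex")]

print(long)
```

In the long layout, a form with a different word list simply contributes different rows, so no column has to be added or left empty.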

A relational database such as Wordbank is, at its heart, a series of tables linked by unique identifiers. There are two primary groups of tables in Wordbank. The COMMON tables store data that is shared between CDI instruments, including information about children, administrations (individual instances of a form being filled out for a child), and items (words and other questions on a form). The INSTRUMENT tables store response data for particular CDI instruments. We currently include all items on CDI instruments, including questions about communication, gesture, morphology, and grammar (though in many of the datasets that we archive, these non-vocabulary questions have not been digitized, so data on them are sparse at present).
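The linking of tables by identifiers can be sketched with a few toy data frames. The table and column names below (children, administrations, items, responses, admin_id, item_id) are hypothetical stand-ins rather than the actual Wordbank schema; the point is only that longitudinal administrations and item-level responses are recovered by joining on keys.

```r
# Hypothetical "common" tables (all names and values invented).
children <- data.frame(child_id = c(1, 2), sex = c("F", "M"))
administrations <- data.frame(
  admin_id = c(10, 11, 12),
  child_id = c(1, 1, 2),   # child 1 has two longitudinal administrations
  age      = c(16, 20, 18)
)
items <- data.frame(item_id = c(100, 101), definition = c("dog", "ball"))

# Hypothetical "instrument" table: one row per administration-by-item response.
responses <- data.frame(
  admin_id = c(10, 10, 11, 11, 12, 12),
  item_id  = c(100, 101, 100, 101, 100, 101),
  produces = c(1, 0, 1, 1, 0, 0)
)

# Follow the identifiers to assemble an analysis-ready table.
full <- merge(responses, administrations, by = "admin_id")
full <- merge(full, children, by = "child_id")
full <- merge(full, items, by = "item_id")

# Vocabulary size for each administration.
vocab <- aggregate(produces ~ admin_id + age, data = full, FUN = sum)
print(vocab)
```

Because each child appears once in the children table, a second administration for the same child adds rows only to the administrations and responses tables.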

One strength of the Wordbank framework is that it allows the storage of subsidiary information about the words that are included in a particular instrument, so that this information can be used in future analyses. For example, information about grammatical and semantic categories, or norms like concreteness and imageability, could all be appended to particular words. This functionality is not yet present in Wordbank, however. The difficulty of compiling this kind of information for a particular set of words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages – words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’ and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al.). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis, pictured in Figure. The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello & Calvo). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
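As a rough illustration of what percentile curves summarise, the sketch below computes empirical quantiles of a production score within each month of age on simulated data. This is a deliberately simpler stand-in for the spline-based quantile regression (gcrq) described above, which additionally smooths across ages; the simulated scores and all variable names are assumptions.

```r
set.seed(1)

# Simulated administrations: age in months and a production score.
# The data-generating formula is arbitrary and only for illustration.
n <- 600
admins <- data.frame(age = sample(16:30, n, replace = TRUE))
admins$production <- round(pmax(0, (admins$age - 12) * 25 + rnorm(n, sd = 60)))

probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)

# Empirical quantiles of production within each month of age.
# Rows are ages, columns are percentiles.
by_age <- split(admins$production, admins$age)
norms  <- t(sapply(by_age, quantile, probs = probs))

print(round(norms))
```

With few children per month the empirical curves are jagged, which is one motivation for fitting smooth, monotonic quantile curves instead.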

The original and updated norming studies (Fenson et al.) gathered data from a diverse (though not nationally representative) sample, and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to the policy of restricting analyses to a single instrument is currently that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson; Jørgensen et al.). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.


This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
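The computation behind an item trajectory can be sketched as follows: take the proportion of children reported to produce a word at each age, then smooth those proportions with loess. The responses are simulated from an arbitrary logistic curve and the variable names are invented; only the aggregate-then-smooth logic is meant to carry over.

```r
set.seed(2)

# Simulated responses for one word: 40 children at each age in months,
# coded 1 = produces the word, 0 = does not. The logistic curve is arbitrary.
responses <- data.frame(age = rep(16:30, each = 40))
p_true <- plogis((responses$age - 22) / 2.5)
responses$produces <- rbinom(nrow(responses), size = 1, prob = p_true)

# Proportion of children producing the word at each age.
prop_by_age <- aggregate(produces ~ age, data = responses, FUN = mean)
names(prop_by_age)[2] <- "prop"

# Local polynomial regression (loess) over the monthly proportions.
fit <- loess(prop ~ age, data = prop_by_age)
prop_by_age$smoothed <- predict(fit)

print(prop_by_age)
```

Matching stimuli for age of acquisition then amounts to comparing such curves across candidate words, for example at the age where each curve crosses one half.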

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al.; Huttenlocher, Haight, Bryk, Seltzer & Lyons; see Wallentin, for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al.).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue: since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure. Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund; Song, Shattuck-Hufnagel & Demuth). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example part 2: gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we first begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

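    library(wordbankr)  # data-loading functions
    library(dplyr)      # %>%, filter, group_by, etc., used below
    admins <- get_administration_data()
    items <- get_item_data()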

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())

Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))
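For readers without a database connection, the same join, normalise, and summarise logic can be tried offline. The sketch below uses base R (merge and aggregate rather than dplyr) on invented stand-ins for the items and administrations tables; the values are not Wordbank data, and the column names simply mirror those used in the code above.

```r
# Invented stand-ins for the items and administrations tables.
items <- data.frame(
  language = c(rep("English", 4), rep("Danish", 5)),
  form     = "WS",
  type     = "word"
)
admins <- data.frame(
  language   = c(rep("English", 4), rep("Danish", 4)),
  form       = "WS",
  age        = 20,
  sex        = c("Female", "Female", "Male", NA, "Female", "Male", "Male", "Male"),
  production = c(4, 2, 2, 3, 5, 1, 2, 3)
)

# Number of words on each language's form.
num_words <- aggregate(n ~ language, data = transform(items, n = 1), FUN = sum)

# Drop administrations with missing sex, then attach the form length.
vocab_admins <- admins[admins$form == "WS" & !is.na(admins$sex), ]
vocab_admins <- merge(vocab_admins, num_words, by = "language")

# Convert raw scores to proportions and take the median in each cell.
vocab_admins$prop <- vocab_admins$production / vocab_admins$n
vocab_data <- aggregate(prop ~ language + sex + age, data = vocab_admins,
                        FUN = median)

print(vocab_data)
```

Dividing by the number of words on each form puts instruments of different lengths on a common 0 to 1 scale, although it does not by itself make scores comparable across languages, since item difficulty still differs between forms.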

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.

The results of this analysis are shown in Figure. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is highly consistent, and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect for all ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.


Wallentin M () Putative sex differences in verbal abilities and language cortex a criticalreview Brain and Language ndash

Weisleder A amp Fernald A () Talking to children matters early language experiencestrengthens processing and builds vocabulary Psychological Science ndash

Wickham H () Ggplot elegant graphics for data analysis New York NY SpringerScience amp Business Media

Wickham H amp Francois R () Dplyr a grammar of data manipulation R packageversion middotmiddotmiddot Online lthttpscranr-projectorgwebpackagesdplyrgt


words is compounded by the large number of languages that the database includes. We hope that in future this functionality will allow the gradual accumulation of information about the words included in the database.

Technical details. Wordbank is constructed using free, open-source tools. The database is a standard MySQL database, managed using Python and Django. Analysis apps are constructed using the Shiny package for R, an open-source statistical programming language. The code is hosted in a GitHub repository (<http://github.com/langcog/wordbank>), where interested users can browse, leave comments, and contribute modifications.

All data uploaded to Wordbank are open and freely available for download, both through the site itself and through the GitHub repository. The site includes only de-identified data that cannot be linked to the parents and children who provided it. Because of these features, the Stanford Institutional Review Board has determined that the Wordbank project does not constitute human subjects research.

Cross-linguistic and cross-instrument architecture. The general philosophy of creating CDIs for new languages has been summarized as “adaptation, not translation” (Dale, n.d.). In other words, CDIs are a useful tool for many languages, but the forms differ between languages – words and even whole sections are added, dropped, and modified to ensure that the form captures the details of the particular language for which it is designed. To date, more than sixty adaptations of the original English CDI have been documented (Dale & Penfold, n.d.). These forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original words on the English Words & Sentences form; others are so-called ‘short forms’, and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages three to four years, while others are focused on very early development (like the English Words & Gestures form, designed for ages – months). All of these differences make it problematic to compare scores and score distributions across forms, even using percentile ranks, since some instruments will have more, or more difficult, items than others.

Wordbank is designed so that it can accommodate data from a wide variety of instruments, both within and across languages. Indeed, at the time of writing, the site includes data from more than administrations of the CDI across fourteen different languages and twenty-four different instruments. But because of the difficulties in comparison across instruments, our approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, our primary exploratory visualization tools in general do not allow comparison across languages, and we urge users to interpret cross-linguistic and cross-instrument differences with caution. Developing statistical techniques to facilitate these comparisons is a current focus of our research.

Fig. Example data from the CDI norming sample (Fenson et al.). Each row has a unique child identifier, demographics, and word-by-word checklist data.

Interactive analysis tools

The primary method for users to interact with Wordbank is through interactive analysis tools that are hosted on the website. These tools allow for fast and flexible exploration of the dataset, the results of which can be exported in tabular and graphical formats for further analysis and presentation.

Vocabulary Norms. One of the primary purposes of the CDI instruments is to provide percentile ranks for vocabulary growth across ages, both for visualizing the variability of early vocabulary growth and for examining differences in these growth patterns due to individual differences and demographic variables. Accordingly, Wordbank provides a Vocabulary Norms analysis (pictured in the Vocabulary Norms screenshot). The inset plot shows all administrations of a particular CDI instrument within the instrument’s valid age range. Dots show individual children, with age binned by month and jittered to avoid overplotting. Lines on the plot indicate estimates of percentiles, fit using quantile regression with monotonic polynomial splines as the base function (using the gcrq function of the quantregGrowth package; Muggeo, Sciandra, Tomasello, & Calvo). An important feature of the norms app is that it can be split by any demographic field in the data, so that comparisons on variables like gender, birth order, or maternal education can be conducted.
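As a rough illustration of what percentile curves split by a demographic field summarize, the sketch below computes empirical quantiles of production scores within each month of age, separately for each sex. It uses base R on a small invented data frame; the column names (age, sex, production) and all values are assumptions made for illustration. It computes plain per-bin quantiles only and does not implement the monotonic spline quantile regression performed by gcrq.

```r
# Toy administrations: one row per child (values invented for illustration).
admins <- data.frame(
  age        = c(16, 16, 16, 16, 17, 17, 17, 17),
  sex        = c("Female", "Female", "Male", "Male",
                 "Female", "Female", "Male", "Male"),
  production = c(40, 60, 20, 40, 80, 100, 50, 70)
)

# Empirical quantiles of production within each age-by-sex cell.
# This is a per-bin summary, not the smoothed gcrq fit used by the app.
probs <- c(0.25, 0.50, 0.75)
norms <- aggregate(
  production ~ age + sex,
  data = admins,
  FUN  = function(x) quantile(x, probs = probs, names = FALSE)
)

# aggregate() stores the three quantiles as a matrix column; flatten it.
norms <- cbind(norms[c("age", "sex")], as.data.frame(norms$production))
names(norms)[3:5] <- c("q25", "q50", "q75")
print(norms)
```

With real data, each cell would contain many more children, and a smoothed fit across ages would generally be preferable to independent per-month quantiles.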

The original and updated norming studies (Fenson et al.) gathered data from a diverse (though not nationally representative) sample, and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to the policy of not comparing across instruments is that we currently allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g., the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson; Jørgensen et al.). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. A screenshot of the Vocabulary Norms analysis tool, showing the five default percentile curves for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.


This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (see the Item Trajectories screenshot). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
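The computation behind an item trajectory can be sketched in base R as follows: compute the proportion of children at each age who produce a word, then smooth those proportions with loess. The data frame, its column names (age, produces), and the span and degree settings are assumptions chosen for a small toy example; they are not the settings used by the Wordbank app.

```r
# Toy subject-by-item data for one word: 1 = child produces it, 0 = does not
# (ages and responses are invented for illustration).
responses <- data.frame(
  age      = rep(16:25, each = 4),
  produces = c(0,0,0,0, 0,0,0,1, 0,0,1,0, 0,1,0,1, 1,0,1,0,
               1,1,0,1, 0,1,1,1, 1,1,1,0, 1,1,1,1, 1,1,1,1)
)

# Proportion of children at each age who produce the word.
trajectory <- aggregate(produces ~ age, data = responses, FUN = mean)
names(trajectory)[2] <- "proportion"

# Local polynomial regression (loess) over the by-age proportions.
fit <- loess(proportion ~ age, data = trajectory, span = 0.9, degree = 1)
trajectory$smoothed <- predict(fit)
print(trajectory)
```

Because loess is an unconstrained smoother, fitted values near the ends of the age range can fall slightly outside the interval from 0 to 1; a plot would typically clip or ignore such values.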

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman, & Frank), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g., Fenson et al.; Huttenlocher, Haight, Bryk, Seltzer, & Lyons; see Wallentin for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al.).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue; since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the downloaded plot of gender differences for English-speaking children. Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields, and is increasing in use in child language research (e.g., Norrman & Bylund; Song, Shattuck-Hufnagel, & Demuth). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois), which make manipulating large datasets quick and easy. We describe the commands that the package provides, and then give a worked example of using the package for a simple analysis.

Fig. A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part 2: gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

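    library(wordbankr)  # data loading functions
    library(dplyr)      # %>%, filter(), group_by(), summarise()
    admins <- get_administration_data()  # one row per CDI administration
    items <- get_item_data()             # one row per CDI item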

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    # Number of word items on each language's Words & Sentences (WS) form
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())


Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).


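    # WS administrations for which the child's sex is recorded
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))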
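For readers who want to check the logic of this pipeline without connecting to the database, the following base R sketch reproduces the same three steps (join the per-language word counts, rescale raw production scores to proportions, take the median within each language, sex, and age) on invented data. The language labels, word counts, and scores are all hypothetical, and merge and aggregate are used here as stand-ins for the dplyr verbs.

```r
# Toy stand-ins for the tables used above (all values invented).
num_words <- data.frame(language = c("A", "B"), n = c(100, 200))
vocab_admins <- data.frame(
  language   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  sex        = c("Female", "Female", "Male", "Male",
                 "Female", "Female", "Male", "Male"),
  age        = 20,
  production = c(50, 70, 40, 60, 80, 120, 60, 100)
)

# Attach each language's word count, then rescale raw scores to proportions
# so that instruments of different lengths are on a common scale.
joined <- merge(vocab_admins, num_words, by = "language")
joined$production <- joined$production / joined$n

# Median proportion produced for each language, sex, and age.
vocab_data <- aggregate(production ~ language + sex + age,
                        data = joined, FUN = median)
names(vocab_data)[names(vocab_data) == "production"] <- "median"
print(vocab_data)
```

Dividing by the instrument's word count is what makes medians comparable in scale across forms of different lengths, although it does not make the items themselves equivalent across languages.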

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.

The results of this analysis are shown in the figure plotting median production vocabulary by age for each language. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect for all ten out of ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.


Bates, E., & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (Ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., … Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H., & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A., & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings, & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N., & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S., & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S., & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K., & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M., & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., … Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E., & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E., & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S., & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S., & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., … Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S., & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.


Hills, T. T., Maouene, J., Riordan, B., & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A., & Smith, L. (). Longitudinal analysis of early semantic networks: Preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D., & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A., & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D., & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A., & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J., & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A., & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G., & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M., & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S., & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N., & Marchman, V. A. (). Baby’s first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D., & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M., & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A., & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H., & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 9: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

writing the site includes data from more than administrations of theCDI across fourteen different languages and twenty-four differentinstruments But because of the difficulties in comparison acrossinstruments our approach to cross-linguistic and cross-instrument data is toprovide standardized analyses within each instrument and language withoutassuming equivalence across words instruments or populations Thus ourprimary exploratory visualization tools in general do not allow comparisonacross languages and we urge users to interpret cross-linguistic and cross-instrument differences with caution Developing statistical techniques tofacilitate these comparisons is a current focus of our research

Interactive analysis tools

The primarymethod for users to interact withWordbank is through interactiveanalysis tools that are hosted on the website These tools allow for fast andflexible exploration of the dataset the results of which can be exported intabular and graphical formats for further analysis and presentation

Vocabulary Norms One of the primary purposes of the CDI instruments isto provide percentile ranks for vocabulary growth across ages both forvisualizing the variability of early vocabulary growth and for examiningdifferences in these growth patterns due to individual differences anddemographic variables Accordingly Wordbank provides a VocabularyNorms analysis pictured in Figure The inset plot shows alladministrations of a particular CDI instrument within the instrumentrsquos validage range Dots show individual children with age binned by month andjittered to avoid overplotting Lines on the plot indicate estimates ofpercentiles fit using quantile regression with monotonic polynomial splinesas the base function (using the gcrq function of the quantregGrowthpackage Muggeo Sciandra Tomasello amp Calvo ) An important featureof the norms app is that it can be split by any demographic field in the dataso that comparisons on variables like gender birth order or maternaleducation can be conducted

The original and updated norming studies (Fenson et al., ) gathered data from a diverse (though not nationally representative) sample, and used these data to construct normative curves from which percentile ranks could be derived. In contrast to these studies, Wordbank is not explicitly designed to provide stable, clinically relevant norms. Wordbank’s sample is heterogeneous and continually growing, and its analyses are subject to revision and update. Thus, Wordbank does not currently generate percentile ranks, and we do not recommend that Wordbank-generated norming values be used for research or clinical purposes in which the goal is to evaluate children’s performance in reference to an established normative standard. For these types of applications, users should refer to the published norms in the appropriate language.

The only exception to this policy currently is that we allow users to see responses across instruments for individual words in the Item Trajectories analysis (e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms).

Item Trajectories. A second function of the CDI instruments is to provide aggregate data on the proportion of children at a particular age who know a specific word (Dale & Fenson, ; Jørgensen et al., ). Such analyses can be extremely helpful for the design and evaluation of materials for young children, including experimental stimuli. Accordingly, the second major interactive visualization in Wordbank is the Item Trajectories analysis tool.

Fig. . A screenshot of the Vocabulary Norms analysis tool, showing the th, th, th, th, and th percentiles (default) for English production scores. Dots show individual administrations, jittered slightly to avoid overplotting. Curves show polynomial spline fits. (See text for more details; color online.)

Users can always generate percentile ranks themselves, and this may be desirable or necessary for research purposes, but we caution against the clinical use of such ad-hoc norms.

This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure ). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
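To make the smoothing step concrete, the sketch below fits a loess curve to the proportion of children producing a single word at each age. The item data frame and its values are simulated assumptions for illustration, and the span setting is an arbitrary choice rather than the one used by the Item Trajectories tool.

```r
# Simulated (hypothetical) item-level data: proportion of children
# producing one word at each age in months.
set.seed(2)
item <- data.frame(age = 16:30)
item$prop <- plogis((item$age - 22) / 2.5) + rnorm(nrow(item), sd = 0.03)
item$prop <- pmin(pmax(item$prop, 0), 1)  # keep proportions in [0, 1]

# Local polynomial regression (loess) of proportion on age.
fit <- loess(prop ~ age, data = item, span = 0.75)

# Smoothed trajectory evaluated at the observed ages.
item$smoothed <- predict(fit, newdata = item)
```

Because loess is not constrained to the unit interval, smoothed values near floor or ceiling can stray slightly outside 0 to 1; clamping the predictions is one simple remedy when plotting.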

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank, ), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectories tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectories tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.
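The relationship between the raw subject-by-item table and the two aggregated views can be sketched in a few lines of R. The raw data frame below, including its column names and values, is a made-up assumption and does not reproduce the layout of the CSV files that Wordbank exports.

```r
# Hypothetical subject-by-item data: one row per child per word,
# with produces = 1 if the parent reports that the child says the word.
raw <- data.frame(
  data_id  = rep(1:4, each = 3),
  age      = rep(c(18, 18, 24, 24), each = 3),
  item     = rep(c("dog", "table", "choo choo"), times = 4),
  produces = c(1, 0, 0,
               1, 0, 1,
               1, 1, 1,
               1, 1, 0)
)

# Child-level view: vocabulary size per administration.
by_child <- aggregate(produces ~ data_id + age, data = raw, FUN = sum)

# Item-level view: proportion of children producing each word at each age.
by_item <- aggregate(produces ~ item + age, data = raw, FUN = mean)
```

The child-level table is the kind of summary that feeds a norms-style plot, and the item-level table is the kind that feeds an item trajectory.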

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al., ; Huttenlocher, Haight, Bryk, Seltzer & Lyons, ; see Wallentin, , for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al., ).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu), to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue; since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure . Or she could navigate to the ‘Table’ tab of the display window to see the data in tabular form, showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. . A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing, ), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund, ; Song, Shattuck-Hufnagel & Demuth, ). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois, ), which make manipulating large datasets quick and easy. We describe the commands that the package provides, and then give a worked example of using the package for a simple analysis.

Fig. . A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type: install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part : gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need:

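    library(wordbankr)  # data loading functions
    library(dplyr)      # data manipulation verbs and the %>% pipe
    admins <- get_administration_data()
    items <- get_item_data()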

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset:

    # Number of word items on each language's Words & Sentences (WS) form
    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())

    # WS administrations for which sex is recorded
    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    # Median proportion of words produced, by language, sex, and age
    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))

Fig. . Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.

The results of this analysis are shown in Figure . As expected, we replicate the gender differences found in previous work (Eriksson et al., ): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in all ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al., ). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.
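One simple way to summarize a per-language comparison of this kind is to average the female-minus-male gap in median production across ages. The sketch below does this in base R on a made-up table shaped like vocab_data; the language labels and all values are hypothetical, and the averaging rule is an illustrative assumption rather than the procedure used to produce the figure.

```r
# Hypothetical summary table in the shape of vocab_data: median
# proportion of words produced, by language, sex, and age.
vocab_data <- data.frame(
  language = rep(c("Language A", "Language B"), each = 6),
  sex      = rep(rep(c("Female", "Male"), each = 3), times = 2),
  age      = rep(c(18, 24, 30), times = 4),
  median   = c(0.10, 0.45, 0.80, 0.07, 0.38, 0.76,
               0.12, 0.40, 0.70, 0.12, 0.41, 0.69)
)

# Put female and male medians side by side for each language and age.
wide <- reshape(
  vocab_data,
  idvar = c("language", "age"),
  timevar = "sex",
  direction = "wide"
)

# Female-minus-male gap at each age, then the mean gap per language.
wide$diff <- wide$median.Female - wide$median.Male
advantage <- aggregate(diff ~ language, data = wide, FUN = mean)
```

A positive mean gap indicates a female advantage in that language; a fuller analysis would also take sample sizes and variability into account.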

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.

Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., … Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., … Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., … Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.

Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.


Dunn L M amp Dunn L M () Peabody Picture Vocabulary Test th ed ParsippanyNJ AGS Publishing Pearson Assessments

Eriksson M Marschik P B Tulviste T Almgren M Peacuterez Pereira M Wehberg S Gallego C () Differences between girls and boys in emerging language skills evidencefrom language communities British Journal of Developmental Psychology ndash

Feldman H M Dale P S Campbell T F Colborn D K Kurs-Lasky M RocketteH E amp Paradise J L () Concurrent and predictive validity of parent reports of childlanguage at ages and years Child Development ndash

Feldman H M Dollaghan C A Campbell T F Kurs-Lasky M Janosky J E ampParadise J L () Measurement properties of the MacArthur CommunicativeDevelopment Inventories at ages one and two years Child Development ndash

Fenson L Bates E Dale P Goodman J Reznick J S amp Thal D () Replymeasuring variability in early child language donrsquot shoot the messenger ChildDevelopment ndash

Fenson L Dale P S Reznick J S Bates E Hartung J P Pethick S amp Reilly J() MacArthur Communicative Development Inventories userrsquos guide and technicalmanual Baltimore MD Paul H Brookes Publishing Co

Fenson L Dale P Reznick J Bates E Thal D Pethick S Stiles J ()Variability in early communicative development Monographs of the Society for Researchin Child Development

Fenson L Marchman V A Thal D Dale P Reznick J S amp Bates E ()MacArthur-Bates Communicative Development Inventories userrsquos guide and technicalmanual nd ed Baltimore MD Brookes Publishing Company

Hidaka S () Estimating the latent number of types in growing corpora with reducedcostndashaccuracy trade-off Journal of Child Language ndash

WORDBANK

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Hills T T Maouene J Riordan B amp Smith L B () The associative structure oflanguage contextual diversity in early word learning Journal of Memory and Language ndash

Hills T T Maouene M Maouene J Sheya A amp Smith L () Longitudinal analysisof early semantic networks Preferential attachment or preferential acquisitionPsychological Science ndash

Huttenlocher J Haight W Bryk A Seltzer M amp Lyons T () Early vocabularygrowth relation to language input and gender Developmental Psychology ndash

Joslashrgensen R N Dale P S Bleses D amp Fenson L () CLEX a cross-linguistic lexicalnorms database Journal of Child Language ndash

Kristoffersen K E Simonsen H G Bleses D Wehberg S Joslashrgensen R N EieslandE A amp Henriksen L Y () The use of the Internet in collecting CDI data ndash anexample from Norway Journal of Child Language ndash

Lieven E Salomo D amp Tomasello M () Two-year-old childrenrsquos production ofmultiword utterances a usage-based analysis Cognitive Linguistics ndash

MacWhinney B () The CHILDES Project tools for analyzing talk rd ed MahwahNJ Lawrence Erlbaum Associates

Marchman V A amp Martiacutenez-Sussmann C () Concurrent validity of caregiverparentreport measures of language for children who are learning both English and SpanishJournal of Speech Language and Hearing Research ndash

Mayor J amp Plunkett K () A statistical estimate of infant and toddler vocabulary sizefrom CDI analysis Developmental Science ndash

Muggeo V M Sciandra M Tomasello A amp Calvo S () Estimating growth chartsvia nonparametric quantile regression a practical framework with application in ecologyEnvironmental and Ecological Statistics ndash

Nelson K () Structure and strategy in learning to talk Monographs of the Society forResearch in Child Development ndash

Norrman G amp Bylund E () The irreversibility of sensitive period effects in languagedevelopment evidence from second language acquisition in international adopteesDevelopmental Science ndash

R Foundation for Statistical Computing () R a language and environment for statisticalcomputing Software online lthttpwwwr-projectorggt

Rescorla L () The language development survey a screening tool for delayed languagein toddlers Journal of Speech and Hearing Disorders ndash

Roy B C Frank M C DeCamp P Miller M amp Roy D () Predicting the birth of aspoken word Proceedings of the National Academy of Sciences ndash

Song J Y Shattuck-Hufnagel S amp Demuth K () Development of phonetic variants(allophones) in -year-olds learning American English a study of alveolar stop t d codasJournal of Phonetics ndash

Tardif T Fletcher P Liang W Zhang Z Kaciroti N amp Marchman V A ()Babyrsquos first words Developmental Psychology ndash

Thal D Jackson-Maldonado D amp Acosta D () Validity of a parent-report measure ofvocabulary and grammar for Spanish-speaking toddlers Journal of Speech Language andHearing Research ndash

Tomasello M amp Mervis C B () The instrument is great but measuringcomprehension is still a problem Monographs of the Society for Research in ChildDevelopment ndash

Wallentin M () Putative sex differences in verbal abilities and language cortex a criticalreview Brain and Language ndash

Weisleder A amp Fernald A () Talking to children matters early language experiencestrengthens processing and builds vocabulary Psychological Science ndash

Wickham H () Ggplot elegant graphics for data analysis New York NY SpringerScience amp Business Media

Wickham H amp Francois R () Dplyr a grammar of data manipulation R packageversion middotmiddotmiddot Online lthttpscranr-projectorgwebpackagesdplyrgt

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 11: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

This tool allows exploration of growth curves for individual words on a CDI form. Users can select a language and instrument (and choose production or comprehension, where available), and then select or input a list of words whose trajectories are plotted (Figure ). The ‘both’ measure option shows data from multiple forms for the same language, with different markers for each item. In general, our exploration suggests that there are only small differences across different instruments for the same item and age. Lines on the plot show a local polynomial regression smoothing line (loess in R).
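As a rough illustration of the smoothing described above, the following R sketch fits a loess curve to the trajectory of a single item. The data frame item_curve, its column names, and all of its values are invented for illustration and are not Wordbank data; the span and degree settings are simply R's defaults written out, and are not necessarily the settings used on the site.

```r
# Toy item-level data: proportion of children producing one word at each age
# (in months). These numbers are invented, not taken from Wordbank.
item_curve <- data.frame(
  age = 16:30,
  prop_producing = c(0.10, 0.14, 0.21, 0.25, 0.34, 0.41, 0.47, 0.58,
                     0.62, 0.70, 0.76, 0.80, 0.86, 0.88, 0.93)
)

# Fit a local polynomial regression (loess) of proportion on age.
# span and degree are the defaults of stats::loess, stated explicitly.
fit <- loess(prop_producing ~ age, data = item_curve, span = 0.75, degree = 2)

# Evaluate the smoothed curve at the observed ages.
item_curve$smoothed <- predict(fit, newdata = item_curve)

print(head(item_curve))
```

The smoothed column can then be drawn as a line over the raw points, which is the kind of display the Item Trajectories tool produces.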

Other features: static reports and tabular data download. In addition to the interactive analysis tools described above, Wordbank also includes a number of non-interactive but continuously updated reports on features like vocabulary composition across languages, links between grammar and the lexicon (Braginsky, Yurovsky, Marchman & Frank, ), and gender differences in vocabulary growth (see below). On the Analyses page (<http://wordbank.stanford.edu/analyses>), we provide a gallery of both interactive and non-interactive analyses.

Wordbank also allows raw tabular data to be browsed and downloaded for subsequent analysis in all popular statistical packages. Using the same basic interface as the Vocabulary Norms and Item Trajectory tools, users can browse raw data aggregated across children (similar to the Vocabulary Norms tool), across items (similar to the Item Trajectory tool), or even view the raw subject-by-item data. All data in these ‘standard’ reports can be downloaded in CSV format.

A worked example: gender differences. Imagine a student interested in gender differences in production vocabulary size, perhaps for a class project. Gender differences in language production are commonly found in individual studies (e.g. Fenson et al., ; Huttenlocher, Haight, Bryk, Seltzer & Lyons, ; see Wallentin, , for review), and one large-scale previous study found differences in production vocabulary in ten languages (Eriksson et al., ).

To explore these differences using Wordbank, the student would navigate from the home page to the Vocabulary Norms report. English is the default language for the report, but the student could in principle select any language in the database. Similarly, she could select her desired instrument in the ‘Forms’ menu (Words & Sentences is the default). She would then select ‘Gender’ as a split variable for the data (in the ‘Split Variable’ menu) to see normative curves and sample sizes for each part of the dataset. Or, to make a plot that enabled comparison of the median level of production vocabulary, she could select ‘Median’ in the ‘Quantiles’ menu.

[Footnote] The distinction between sex (biological characteristic) and gender (social characteristic) is complex and not well understood in early childhood. We defer discussion of this issue: since the CDI is a parent-report form, we do not have access to either sex or gender information directly.

Selecting ‘Download Plot’ would result in the plot shown in Figure . Or she could navigate to the ‘Table’ tab of the display window to see tabular form data showing the 50th percentile (median) for both females and males by age. These tabular summary data are available for download via the ‘Download Table’ button, and the raw data (with a row for each one of the children represented in the plot) are available via the ‘Download Raw Data’ button. In sum, this graphical workflow allows interested users to manipulate and download individual parts of the dataset, as well as to create visualizations of basic analyses.
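Once downloaded, a table of this kind can be summarised in a few lines of R. The sketch below is a minimal example under stated assumptions: the inline text stands in for a file saved via ‘Download Raw Data’, the column names (data_id, age, sex, production) are assumed rather than taken from the site, and every value is invented.

```r
# Stand-in for a downloaded CSV file. Column names are assumptions and the
# values are invented for illustration.
raw_csv <- "data_id,age,sex,production
1,18,Female,45
2,18,Female,60
3,18,Male,30
4,18,Male,50
5,24,Female,300
6,24,Female,340
7,24,Male,250
8,24,Male,310
9,24,Male,280"

# With a real download, read.csv("path/to/file.csv") would replace this call.
raw <- read.csv(text = raw_csv, stringsAsFactors = FALSE)

# Median production vocabulary for each age-by-sex cell.
medians <- aggregate(production ~ age + sex, data = raw, FUN = median)
medians <- medians[order(medians$age, medians$sex), ]

print(medians)
```

This reproduces, for a toy table, the kind of median-by-group summary shown in the ‘Table’ tab.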

Extensibility

Extensibility is one of the major strengths of Wordbank. Although programming knowledge is not necessary for interacting with Wordbank, interested researchers with programming skills can contribute to the development effort by adding new analyses. Each Wordbank analysis app is constructed as a standalone script or set of scripts in the R language. Constructing an interactive analysis requires specifying a visualization and some interactive functionality using Shiny. Non-interactive analyses can be constructed as R Markdown documents that execute scripts using the Wordbank database. Both of these have the virtue of rerunning on the newest version of the database whenever they are opened, so they do not go out of date as new data are added.

Fig. . A screenshot of the Item Trajectories analysis tool, showing a visualization of the developmental trajectory of production for three words (dog, choo choo, and table) across both Words & Gestures and Words & Sentences forms.

In addition, we encourage contributions of individual datasets. Wordbank currently imports data from Excel and CSV formats via automated import scripts. Individuals or labs interested in contributing should consult with the authors for advice about data formatting and upload.

WORDBANKR: AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needs, researchers interested in detailed quantitative or cross-linguistic analyses may wish to connect directly to the Wordbank database and manipulate the data directly. Making use of the R programming language (R Foundation for Statistical Computing, ), we provide the wordbankr package to help researchers accomplish this task. R is an open-source, extensible statistical computing environment that is rapidly growing in popularity across fields and is increasing in use in child language research (e.g. Norrman & Bylund, ; Song, Shattuck-Hufnagel & Demuth, ). The wordbankr package abstracts away the details of connecting to the database. Users can take advantage of the SQL tools developed in the popular dplyr package (Wickham & Francois, ), which make manipulating large datasets quick and easy. We describe the commands that the package provides and then give a worked example of using the package for a simple analysis.

Fig. . A downloaded plot of gender differences in production language for English-speaking children (color online).

Package details

The wordbankr package is easily installed via CRAN, the Comprehensive R Archive Network. To install, simply type install.packages("wordbankr"). After installation, users can use the three main data loading functions provided by wordbankr: get_administration_data, to retrieve information about each CDI administration, including the child’s demographics and vocabulary sizes; get_item_data, to retrieve information about each CDI item, including its text and categories; and get_instrument_data, to retrieve administration-by-item response values. Each of these can be run in remote mode, which loads data from the Wordbank server, or in local mode, if the user has a copy of the database set up on their local machine. For more detailed documentation, see the package repository (<http://github.com/langcog/wordbankr>).

Worked example, part 2: gender differences across languages

We next demonstrate the analytic potential of direct manipulation of the Wordbank database using wordbankr, by using the package to extend the worked example of gender differences above. This section also replicates a large-scale analysis by Eriksson et al. (). To perform the analysis, we begin by using wordbankr to load the data from Wordbank and connect to the tables we need.

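    library(wordbankr)  # data loading functions
    library(dplyr)      # %>% and the data manipulation verbs used below
    admins <- get_administration_data()  # one row per CDI administration
    items <- get_item_data()             # one row per CDI item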

We next use a series of dplyr calls to compute the number of words in each language, select the appropriate subset of the data, and calculate the proportion of words produced for this data subset.

    num_words <- items %>%
      filter(form == "WS", type == "word") %>%
      group_by(language) %>%
      summarise(n = n())

Fig. . Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

    vocab_admins <- admins %>%
      filter(form == "WS", !is.na(sex)) %>%
      select(data_id, language, form, age, sex, production)

    vocab_data <- vocab_admins %>%
      group_by(language, sex, age) %>%
      left_join(num_words) %>%
      mutate(production = production / n) %>%
      summarise(median = median(production))
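To make the join and normalisation steps concrete without a database connection, the following sketch re-expresses the same computation in base R on two small invented tables. The language labels, item counts, and production scores are hypothetical, and merge() and aggregate() are swapped in here for the dplyr verbs used above; it is an illustration of the logic, not the code used for the analysis.

```r
# Hypothetical number of word items per language (invented values).
num_words <- data.frame(
  language = c("lang_A", "lang_B"),
  n = c(500, 400)
)

# Hypothetical administrations: one row per child (invented values).
vocab_admins <- data.frame(
  language = c("lang_A", "lang_A", "lang_A", "lang_A", "lang_B", "lang_B"),
  sex = c("Female", "Female", "Male", "Male", "Female", "Male"),
  age = c(24, 24, 24, 24, 24, 24),
  production = c(340, 300, 250, 200, 200, 160)
)

# Attach each language's item count to its administrations.
joined <- merge(vocab_admins, num_words, by = "language")

# Convert raw production scores to a proportion of the words on the form.
joined$production <- joined$production / joined$n

# Median proportion produced for each language-by-sex-by-age cell.
vocab_data <- aggregate(production ~ language + sex + age,
                        data = joined, FUN = median)
names(vocab_data)[names(vocab_data) == "production"] <- "median"

print(vocab_data)
```

Dividing by the per-language item count is what makes the medians comparable across forms of different lengths.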

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocabgender.html>.

The results of this analysis are shown in Figure . As expected, we replicate the gender differences found in previous work (Eriksson et al., ): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in all ten languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al., ). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.

Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., … Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language , –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development , –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences , –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers , –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-U.S. English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., … Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology , –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development , –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development , –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don’t shoot the messenger. Child Development , –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., … Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language , –.

Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language , –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science , –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology , –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language , –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language , –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics , –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research , –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science , –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics , –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development , –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science , –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders , –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences , –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics , –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby’s first words. Developmental Psychology , –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research , –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development , –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language , –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science , –.

Wickham, H. (). Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 12: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

make a plot that enabled comparison of the median level of productionvocabulary she could select lsquoMedianrsquo in the lsquoQuantilesrsquo menu

Selecting lsquoDownload Plotrsquo would result in the plot shown in Figure Orshe could navigate to the lsquoTablersquo tab of the display window to see tabularform data showing the th percentile (median) for both females andmales by age These tabular summary data are available for download viathe lsquoDownload Tablersquo button and the raw data (with a row for each oneof the children represented in the plot) are available via thelsquoDownload Raw Datarsquo button In sum this graphical workflow allowsinterested users to manipulate and download individual parts of thedataset as well as to create visualizations of basic analyses

Extensibility

Extensibility is one of the major strengths of Wordbank Althoughprogramming knowledge is not necessary for interacting with Wordbankinterested researchers with programming skills can contribute to the

Fig A screenshot of the Item Trajectories analysis tool showing a visualization of thedevelopmental trajectory of production for three words (dog choo choo and table) acrossboth Words amp Gestures and Words amp Sentences forms

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

development effort by adding new analyses Each Wordbank analysis app isconstructed as a standalone script or set of scripts in the R languageConstructing an interactive analysis requires specifying a visualization andsome interactive functionality using Shiny Non-interactive analyses can beconstructed as R Markdown documents that execute scripts using theWordbank database Both of these have the virtue of rerunning on thenewest version of the database whenever they are opened so they do notgo out of date as new data are added

In addition we encourage contributions of individual datasets Wordbankcurrently imports data from Excel and CSV formats via automated importscripts Individuals or labs interested in contributing should consult withthe authors for advice about data formatting and upload

WORDBANKR AN R PACKAGE FOR ACCESSING WORDBANK

Although the analysis tools described above suffice for many needsresearchers interested in detailed quantitative or cross-linguistic analysesmay wish to connect directly to the Wordbank database and manipulatethe data directly Making use of the R programming language (RFoundation for Statistical Computing ) we provide the wordbankrpackage to help researchers accomplish this task R is an open-source

Fig A downloaded plot of gender differences in production language forEnglish-speaking children (color online)

WORDBANK

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

extensible statistical computing environment that is rapidly growing inpopularity across fields and is increasing in use in child language research(eg Norrman amp Bylund Song Shattuck-Hufnagel amp Demuth) The wordbankr package abstracts away the details of connectingto the database Users can take advantage of the SQL tools developed inthe popular dplyr package (Wickham amp Francois ) which makemanipulating large datasets quick and easy We describe the commandsthat the package provides and then give a worked example of using thepackage for a simple analysis

Package details

The wordbankr package is easily installed via CRAN the comprehensive Rarchive network To install simply type installpackages(ldquowordbankrrdquo) After installation users can use the three main data loadingfunctions provided by wordbankrget_administration_data toretrieve information about each CDI administration including the childrsquosdemographics and vocabulary sizes get_item_data to retrieve informationabout each CDI item including its text and categories andget_instrument_data to retrieve administration-by-item response valuesEach of these can be run in remote mode which loads data from theWordbank server or in local mode if the user has a copy of the database setup on their local machine For more detailed documentation see the packagerepository (lthttpgithubcomlangcogwordbankrgt)

Worked example part gender differences across languages

We next demonstrate the analytic potential of direct manipulation of theWordbank database using wordbankr by using the package to extend theworked example of gender differences above This section also replicates alarge-scale analysis by Eriksson et al () To perform the analysis wefirst begin by using wordbankr to load the data from Wordbank andconnect to the tables we need

admins lt- get_administration_data()items lt- get_item_data()

We next use a series of dplyr calls to compute the number of words in eachlanguage select the appropriate subset of the data and calculate theproportion of words produced for this data subset

num_words lt- items gtfilter(form == WS type == word) gtgroup_by(language) gtsummarise(n = n())


Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

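# Keep Words & Sentences (WS) administrations with a recorded sex
vocab_admins <- admins %>%
  filter(form == "WS", !is.na(sex)) %>%
  select(data_id, language, form, age, sex, production)

# Convert raw production counts to proportions of the instrument's words,
# then take the median for each language, sex, and age
vocab_data <- vocab_admins %>%
  group_by(language, sex, age) %>%
  left_join(num_words, by = "language") %>%
  mutate(production = production / n) %>%
  summarise(median = median(production))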
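To make the steps above concrete without a database connection, the following sketch runs the same kind of pipeline on two small made-up tables. The tables and their values are invented for illustration and are not Wordbank data; only the column names follow the code above. The .groups argument assumes dplyr 1.0.0 or later. Unlike the code above, the sketch groups after the join, which yields the same medians.

```r
library(dplyr)

# Hypothetical stand-in for get_item_data(): one row per CDI item.
items <- data.frame(
  language = c(rep("English", 5), rep("Danish", 2)),
  form     = "WS",
  type     = c("word", "word", "word", "word", "complexity", "word", "word"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_administration_data(): one row per child.
admins <- data.frame(
  data_id    = 1:7,
  language   = c(rep("English", 4), rep("Danish", 3)),
  form       = "WS",
  age        = c(18, 18, 18, 18, 20, 20, 20),
  sex        = c("Female", "Female", "Male", "Male", "Female", "Male", NA),
  production = c(2, 4, 1, 3, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Number of word items per language (non-word items are excluded).
num_words <- items %>%
  filter(form == "WS", type == "word") %>%
  group_by(language) %>%
  summarise(n = n())

# Drop administrations with missing sex and keep the columns we need.
vocab_admins <- admins %>%
  filter(form == "WS", !is.na(sex)) %>%
  select(data_id, language, form, age, sex, production)

# Proportion of words produced, then the median per language, sex, and age.
vocab_data <- vocab_admins %>%
  left_join(num_words, by = "language") %>%
  mutate(production = production / n) %>%
  group_by(language, sex, age) %>%
  summarise(median = median(production), .groups = "drop")

print(vocab_data)
```

In this toy table the two English-learning girls produce 2 and 4 of the 4 word items, so their median proportion is 0.75; the administration with a missing sex value is excluded before any summary is computed.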

We then plot the vocab_data data frame using the ggplot2 package (Wickham). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocab/gender.html>.
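The plotting step is not reproduced in the text. As a rough sketch of how such a figure might be drawn with ggplot2, the block below plots a small invented table with the same columns as vocab_data. The data values, the one-panel-per-language layout, and the axis labels are assumptions made for illustration; the red and blue colors follow the figure caption. The authors' actual plotting code is at the URL given above.

```r
library(ggplot2)

# Invented summary table with the same columns as vocab_data.
vocab_data <- data.frame(
  language = rep(c("English", "Danish"), each = 6),
  sex      = rep(rep(c("Female", "Male"), each = 3), times = 2),
  age      = rep(c(16, 20, 24), times = 4),
  median   = c(0.10, 0.30, 0.60, 0.07, 0.24, 0.52,
               0.08, 0.25, 0.55, 0.06, 0.20, 0.47),
  stringsAsFactors = FALSE
)

# One line per sex, one panel per language.
p <- ggplot(vocab_data, aes(x = age, y = median, colour = sex)) +
  geom_line() +
  facet_wrap(~ language) +
  scale_colour_manual(values = c(Female = "red", Male = "blue")) +
  labs(x = "Age (months)",
       y = "Median proportion of words produced",
       colour = "Sex")

# print(p) draws the figure on the active graphics device.
```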

The results of this analysis are shown in the accompanying figure. As expected, we replicate the gender differences found in previous work (Eriksson et al.): females showed a small but highly reliable advantage in early production. This effect is consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in all ten of its languages, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al.). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.

CONCLUSION

In this paper we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children’s vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. Language and context: the acquisition of pragmatics. New York, NY: Academic Press.


Bates, E. & Goodman, J. On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language. Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., Hartung, J. Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language.

Bloom, P. How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (eds), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences.

Clark, E. First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html>.

Dale, P. S. & Fenson, L. Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf>.

Dickinson, D. K. & Tabors, P. O. Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., Gallego, C. Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. Concurrent and predictive validity of parent reports of child language at ages and years. Child Development.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. Reply: measuring variability in early child language: don’t shoot the messenger. Child Development.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. MacArthur Communicative Development Inventories: user’s guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., Stiles, J. Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. MacArthur-Bates Communicative Development Inventories: user’s guide and technical manual, 2nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language.


Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. Early vocabulary growth: relation to language input and gender. Developmental Psychology.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. CLEX: a cross-linguistic lexical norms database. Journal of Child Language.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language.

Lieven, E., Salomo, D. & Tomasello, M. Two-year-old children’s production of multiword utterances: a usage-based analysis. Cognitive Linguistics.

MacWhinney, B. The CHILDES Project: tools for analyzing talk, 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research.

Mayor, J. & Plunkett, K. A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics.

Nelson, K. Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development.

Norrman, G. & Bylund, E. The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science.

R Foundation for Statistical Computing. R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. Baby’s first words. Developmental Psychology.

Thal, D., Jackson-Maldonado, D. & Acosta, D. Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research.

Tomasello, M. & Mervis, C. B. The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development.

Wallentin, M. Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language.

Weisleder, A. & Fernald, A. Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science.

Wickham, H. Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. Dplyr: a grammar of data manipulation. R package. Online: <https://cran.r-project.org/web/packages/dplyr>.


Huttenlocher J Haight W Bryk A Seltzer M amp Lyons T () Early vocabularygrowth relation to language input and gender Developmental Psychology ndash

Joslashrgensen R N Dale P S Bleses D amp Fenson L () CLEX a cross-linguistic lexicalnorms database Journal of Child Language ndash

Kristoffersen K E Simonsen H G Bleses D Wehberg S Joslashrgensen R N EieslandE A amp Henriksen L Y () The use of the Internet in collecting CDI data ndash anexample from Norway Journal of Child Language ndash

Lieven E Salomo D amp Tomasello M () Two-year-old childrenrsquos production ofmultiword utterances a usage-based analysis Cognitive Linguistics ndash

MacWhinney B () The CHILDES Project tools for analyzing talk rd ed MahwahNJ Lawrence Erlbaum Associates

Marchman V A amp Martiacutenez-Sussmann C () Concurrent validity of caregiverparentreport measures of language for children who are learning both English and SpanishJournal of Speech Language and Hearing Research ndash

Mayor J amp Plunkett K () A statistical estimate of infant and toddler vocabulary sizefrom CDI analysis Developmental Science ndash

Muggeo V M Sciandra M Tomasello A amp Calvo S () Estimating growth chartsvia nonparametric quantile regression a practical framework with application in ecologyEnvironmental and Ecological Statistics ndash

Nelson K () Structure and strategy in learning to talk Monographs of the Society forResearch in Child Development ndash

Norrman G amp Bylund E () The irreversibility of sensitive period effects in languagedevelopment evidence from second language acquisition in international adopteesDevelopmental Science ndash

R Foundation for Statistical Computing () R a language and environment for statisticalcomputing Software online lthttpwwwr-projectorggt

Rescorla L () The language development survey a screening tool for delayed languagein toddlers Journal of Speech and Hearing Disorders ndash

Roy B C Frank M C DeCamp P Miller M amp Roy D () Predicting the birth of aspoken word Proceedings of the National Academy of Sciences ndash

Song J Y Shattuck-Hufnagel S amp Demuth K () Development of phonetic variants(allophones) in -year-olds learning American English a study of alveolar stop t d codasJournal of Phonetics ndash

Tardif T Fletcher P Liang W Zhang Z Kaciroti N amp Marchman V A ()Babyrsquos first words Developmental Psychology ndash

Thal D Jackson-Maldonado D amp Acosta D () Validity of a parent-report measure ofvocabulary and grammar for Spanish-speaking toddlers Journal of Speech Language andHearing Research ndash

Tomasello M amp Mervis C B () The instrument is great but measuringcomprehension is still a problem Monographs of the Society for Research in ChildDevelopment ndash

Wallentin M () Putative sex differences in verbal abilities and language cortex a criticalreview Brain and Language ndash

Weisleder A amp Fernald A () Talking to children matters early language experiencestrengthens processing and builds vocabulary Psychological Science ndash

Wickham H () Ggplot elegant graphics for data analysis New York NY SpringerScience amp Business Media

Wickham H amp Francois R () Dplyr a grammar of data manipulation R packageversion middotmiddotmiddot Online lthttpscranr-projectorgwebpackagesdplyrgt

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 15: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

Fig. Median production vocabulary as a proportion of total words on an instrument, plotted by age in months. Red and blue lines show females and males, respectively (color online).

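# Keep Words & Sentences (WS) administrations with a recorded sex
vocab_admins <- admins %>%
  filter(form == "WS", !is.na(sex)) %>%
  select(data_id, language, form, age, sex, production)

# Express production as a proportion of the words on each instrument (n),
# then take the median for each language, sex, and age
vocab_data <- vocab_admins %>%
  group_by(language, sex, age) %>%
  left_join(num_words) %>%
  mutate(production = production / n) %>%
  summarise(median = median(production))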
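To make the normalization step concrete, the following is a minimal sketch of the same pipeline run on toy data. The two small tables are hypothetical stand-ins for `admins` and `num_words` (their values are invented, and the word counts are not the real instrument sizes); the join is written with an explicit `by = "language"` and the grouping is moved after the join, which is an equivalent ordering chosen here for readability, not the authors' exact code.

```r
library(dplyr)

# Hypothetical administrations: one row per completed CDI form.
# `production` is the raw number of words the child is reported to say.
admins <- tibble(
  data_id    = 1:6,
  language   = c("English", "English", "English", "English", "Danish", "Danish"),
  form       = c("WS", "WS", "WS", "WG", "WS", "WS"),
  age        = c(18, 18, 18, 18, 18, 18),
  sex        = c("Female", "Female", "Male", "Male", "Female", NA),
  production = c(100, 200, 120, 50, 80, 90)
)

# Hypothetical instrument sizes: `n` is the number of word items per language.
num_words <- tibble(
  language = c("English", "Danish"),
  n        = c(400, 500)
)

vocab_data <- admins %>%
  filter(form == "WS", !is.na(sex)) %>%          # drops the WG row and the NA-sex row
  select(data_id, language, form, age, sex, production) %>%
  left_join(num_words, by = "language") %>%      # attach instrument size
  mutate(production = production / n) %>%        # raw count -> proportion of instrument
  group_by(language, sex, age) %>%
  summarise(median = median(production), .groups = "drop")

print(vocab_data)
```

With these toy values, the two English girls contribute proportions 100/400 and 200/400, so their group median is 0.375. Dividing by `n` is what makes medians comparable across instruments of different lengths.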

We then plot the vocab_data data frame using the ggplot2 package (Wickham, ). Full code for the analysis as a whole (including the plot) is available at <http://mikabr.github.io/demo-vocabgender.html>.
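The authors' plotting code is not reproduced in the text. As a rough sketch only, a plot in the spirit of the figure could be assembled as below. The toy `vocab_data` values, the faceting by language, and the axis labels are assumptions made for illustration; only the red/blue assignment for females/males follows the figure caption.

```r
library(ggplot2)

# Hypothetical summary table in the shape of `vocab_data`
# (median proportion produced, by language, sex, and age in months).
vocab_data <- data.frame(
  language = rep(c("English", "Danish"), each = 6),
  sex      = rep(rep(c("Female", "Male"), each = 3), times = 2),
  age      = rep(c(16, 20, 24), times = 4),
  median   = c(0.10, 0.30, 0.55, 0.07, 0.24, 0.48,
               0.08, 0.25, 0.50, 0.06, 0.20, 0.44)
)

# One panel per language, one line per sex.
p <- ggplot(vocab_data, aes(x = age, y = median, colour = sex)) +
  geom_line() +
  facet_wrap(~ language) +
  scale_colour_manual(values = c(Female = "red", Male = "blue")) +
  labs(x = "Age (months)",
       y = "Median production (proportion of words)",
       colour = "Sex")

print(p)
```

Faceting keeps each language on its own panel, so differences in instrument or sample do not blur the within-language comparison between the two lines.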

The results of this analysis are shown in the figure. As expected, we replicate the gender differences found in previous work (Eriksson et al., ): females showed a small but highly reliable advantage in early production. This effect is highly consistent and clearly visible in eleven out of twelve languages, with Italian being the only exception. For comparison, the previous work found a positive female effect in all ten languages studied, but the size of the effect was close to zero for two of these. Observational data such as those contained in Wordbank allow us only to speculate about the origins of this gender difference or the sources of cross-linguistic variation (for some discussion, see Eriksson et al., ). But the Wordbank platform dramatically facilitates the formulation and testing of analyses of this sort, allowing hypotheses to be tested quickly and easily against large datasets.
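One simple way to tally how many languages show a female advantage is to difference the female and male medians at each age and average within language. The sketch below does this on invented values for three placeholder languages ("A", "B", "C"); it is a crude descriptive summary under those assumptions, not the statistical analysis behind the claims above, and the numbers are not Wordbank results.

```r
library(dplyr)

# Invented medians in the shape of `vocab_data`; languages are placeholders.
vocab_data <- tibble(
  language = rep(c("A", "B", "C"), each = 4),
  sex      = rep(c("Female", "Male"), times = 6),
  age      = rep(rep(c(18, 24), each = 2), times = 3),
  median   = c(0.20, 0.15, 0.60, 0.50,
               0.18, 0.12, 0.55, 0.47,
               0.16, 0.17, 0.50, 0.51)
)

sex_diff <- vocab_data %>%
  group_by(language, age) %>%
  # female minus male median at each age (assumes one row per sex per cell)
  summarise(diff = median[sex == "Female"] - median[sex == "Male"],
            .groups = "drop") %>%
  group_by(language) %>%
  summarise(mean_diff = mean(diff),
            female_advantage = mean_diff > 0)

print(sex_diff)

# Number of languages in which the average difference favours females
n_female_advantage <- sum(sex_diff$female_advantage)
```

A sign count like this says nothing about reliability; a fuller treatment would model the administrations directly rather than the per-cell medians.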

CONCLUSION

In this paper, we have presented Wordbank, an open repository for parent-report vocabulary data from the MacArthur-Bates CDI. The interactive analysis tools available on the Wordbank site allow interested researchers to explore a wide variety of phenomena in vocabulary development quickly and easily, exporting data and downloading presentation-quality graphics that document their analysis. In addition, users can contribute new analyses and data to the site, and connect to it directly using an R package for data loading. These functions all facilitate greater sharing and reuse of existing data on children's vocabulary, enabling new discoveries in the future.

REFERENCES

Bates, E. (). Language and context: the acquisition of pragmatics (Vol. ). New York, NY: Academic Press.

Bates, E. & Goodman, J. (). On the emergence of grammar from the lexicon. In B. MacWhinney (ed.), The emergence of language (pp. –). Mahwah, NJ: Lawrence Erlbaum Associates.

Bates, E., Marchman, V., Thal, D., Fenson, L., Dale, P., Reznick, J. S., … Hartung, J. (). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, –.

Bloom, P. (). How children learn the meanings of words. Cambridge, MA: MIT Press.

Bornstein, M. H. & Haynes, O. M. (). Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Development, –.

Braginsky, M., Yurovsky, D., Marchman, V. A. & Frank, M. C. (). Developmental changes in the relationship between grammar and the lexicon. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings & P. P. Maglio (Eds.), Proceedings of the th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society.

Brown, R. (). A first language: the early stages. Cambridge, MA: Harvard University Press.

Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N. & Trueswell, J. C. (). Quality of early parent input predicts child vocabulary years later. Proceedings of the National Academy of Sciences, –.

Clark, E. (). First language acquisition. Cambridge: Cambridge University Press.

Dale, P. S. (n.d.). Adaptations, not translations. Online: <http://mb-cdi.stanford.edu/adaptations.html> (last accessed ).

Dale, P. S. & Fenson, L. (). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, –.

Dale, P. S. & Penfold, M. (n.d.). Adaptations of the MacArthur-Bates CDI into non-US English languages. Online: <http://mb-cdi.stanford.edu/documents/AdaptationsSurvey--Web.pdf> (last accessed ).

Dickinson, D. K. & Tabors, P. O. (). Beginning literacy with language: young children learning at home and school. Baltimore, MD: Paul H. Brookes Publishing.

Dunn, L. M. & Dunn, L. M. (). Peabody Picture Vocabulary Test, th ed. Parsippany, NJ: AGS Publishing/Pearson Assessments.

Eriksson, M., Marschik, P. B., Tulviste, T., Almgren, M., Pérez Pereira, M., Wehberg, S., … Gallego, C. (). Differences between girls and boys in emerging language skills: evidence from language communities. British Journal of Developmental Psychology, –.

Feldman, H. M., Dale, P. S., Campbell, T. F., Colborn, D. K., Kurs-Lasky, M., Rockette, H. E. & Paradise, J. L. (). Concurrent and predictive validity of parent reports of child language at ages and years. Child Development, –.

Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs-Lasky, M., Janosky, J. E. & Paradise, J. L. (). Measurement properties of the MacArthur Communicative Development Inventories at ages one and two years. Child Development, –.

Fenson, L., Bates, E., Dale, P., Goodman, J., Reznick, J. S. & Thal, D. (). Reply: measuring variability in early child language: don't shoot the messenger. Child Development, –.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Hartung, J. P., Pethick, S. & Reilly, J. (). MacArthur Communicative Development Inventories: user's guide and technical manual. Baltimore, MD: Paul H. Brookes Publishing Co.

Fenson, L., Dale, P., Reznick, J., Bates, E., Thal, D., Pethick, S., … Stiles, J. (). Variability in early communicative development. Monographs of the Society for Research in Child Development.

Fenson, L., Marchman, V. A., Thal, D., Dale, P., Reznick, J. S. & Bates, E. (). MacArthur-Bates Communicative Development Inventories: user's guide and technical manual, nd ed. Baltimore, MD: Brookes Publishing Company.

Hidaka, S. (). Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off. Journal of Child Language, –.

Hills, T. T., Maouene, J., Riordan, B. & Smith, L. B. (). The associative structure of language: contextual diversity in early word learning. Journal of Memory and Language, –.

Hills, T. T., Maouene, M., Maouene, J., Sheya, A. & Smith, L. (). Longitudinal analysis of early semantic networks: preferential attachment or preferential acquisition? Psychological Science, –.

Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M. & Lyons, T. (). Early vocabulary growth: relation to language input and gender. Developmental Psychology, –.

Jørgensen, R. N., Dale, P. S., Bleses, D. & Fenson, L. (). CLEX: a cross-linguistic lexical norms database. Journal of Child Language, –.

Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A. & Henriksen, L. Y. (). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, –.

Lieven, E., Salomo, D. & Tomasello, M. (). Two-year-old children's production of multiword utterances: a usage-based analysis. Cognitive Linguistics, –.

MacWhinney, B. (). The CHILDES Project: tools for analyzing talk, rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

Marchman, V. A. & Martínez-Sussmann, C. (). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, –.

Mayor, J. & Plunkett, K. (). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, –.

Muggeo, V. M., Sciandra, M., Tomasello, A. & Calvo, S. (). Estimating growth charts via nonparametric quantile regression: a practical framework with application in ecology. Environmental and Ecological Statistics, –.

Nelson, K. (). Structure and strategy in learning to talk. Monographs of the Society for Research in Child Development, –.

Norrman, G. & Bylund, E. (). The irreversibility of sensitive period effects in language development: evidence from second language acquisition in international adoptees. Developmental Science, –.

R Foundation for Statistical Computing (). R: a language and environment for statistical computing. Software, online: <http://www.r-project.org>.

Rescorla, L. (). The language development survey: a screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, –.

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, –.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics, –.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby's first words. Developmental Psychology, –.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research, –.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, –.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language, –.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science, –.

Wickham, H. (). Ggplot2: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package version ···. Online: <https://cran.r-project.org/web/packages/dplyr>.

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 16: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

vocab_admins lt- admins gtfilter(form == WS isna(sex)) gtselect(data_id language form age sex production)

vocab_data lt- vocab_admins gtgroup_by(language sex age) gtleft_join(num_words) gtmutate(production = production n) gtsummarise(median = median(production))

We then plot the vocab_data data frame using the ggplot2 package(Wickham ) Full code for the analysis as a whole (including the plot)is available at lthttpmikabrgithubiodemo-vocabgenderhtmlgt

The results of this analysis are shown in Figure As expected wereplicate the gender differences found in previous work (Eriksson et al) females showed a small but highly reliable advantage in earlyproduction This effect is highly consistent and clearly visible in elevenout of twelve languages with Italian being the only exception Forcomparison the previous work found a positive female effect for all tenout of ten languages but the size of the effect was close to zero for two ofthese Observational data such as those contained in Wordbank allow usonly to speculate about the origins of this gender difference or the sourcesof cross-linguistic variation (for some discussion see Eriksson et al) But the Wordbank platform dramatically facilitates the formulationand testing of analyses of this sort allowing hypotheses to be testedquickly and easily against large datasets

CONCLUSION

In this paper we have presented Wordbank an open repository forparent-report vocabulary data from the MacArthur-Bates CDI Theinteractive analysis tools available on the Wordbank site allow interestedresearchers to explore a wide variety of phenomena in vocabularydevelopment quickly and easily exporting data and downloadingpresentation-quality graphics that document their analysis In additionusers can contribute new analyses and data to the site and connect to itdirectly using an R package for data loading These functions all facilitategreater sharing and reuse of existing data on childrenrsquos vocabularyenabling new discoveries in the future

REFERENCES

Bates E () Language and context the acquisition of pragmatics (Vol ) New York NYAcademic Press

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Bates E amp Goodman J () On the emergence of grammar from the lexicon InB MacWhinney (ed) The emergence of language (pp ndash) Mahwah NJ LawrenceErlbaum Associates

Bates E Marchman V Thal D Fenson L Dale P Reznick J S Hartung J() Developmental and stylistic variation in the composition of early vocabularyJournal of Child Language ndash

Bloom P () How children learn the meanings of words Cambridge MA MIT PressBornstein M H amp Haynes O M () Vocabulary competence in early childhoodmeasurement latent construct and predictive validity Child Development ndash

Braginsky M Yurovsky D Marchman V A amp Frank M C () Developmentalchanges in the relationship between grammar and the lexicon In D C Noelle R DaleA S Warlaumont J Yoshimi T Matlock C D Jennings amp P P Maglio (Eds)Proceedings of the th Annual Meeting of the Cognitive Science Society Austin TXCognitive Science Society

Brown R () A first language the early stages Cambridge MA Harvard University PressCartmill E A Armstrong B F Gleitman L R Goldin-Meadow S Medina T N ampTrueswell J C () Quality of early parent input predicts child vocabulary yearslater Proceedings of the National Academy of Sciences ndash

Clark E () First language acquisition Cambridge Cambridge University PressDale P S (nd) Adaptations not translations Online lthttpmb-cdistanfordeduadaptationshtmlgt (last accessed )

Dale P S amp Fenson L () Lexical development norms for young children BehaviorResearch Methods Instruments amp Computers ndash

Dale P S amp PenfoldM (nd) Adaptations of theMacArthur-Bates CDI into non-US Englishlanguages Online lthttpmb-cdistanfordedudocumentsAdaptationsSurvey--Webpdfgt (last accessed )

Dickinson D K amp Tabors P O () Beginning literacy with language young childrenlearning at home and school Baltimore MD Paul H Brookes Publishing

Dunn L M amp Dunn L M () Peabody Picture Vocabulary Test th ed ParsippanyNJ AGS Publishing Pearson Assessments

Eriksson M Marschik P B Tulviste T Almgren M Peacuterez Pereira M Wehberg S Gallego C () Differences between girls and boys in emerging language skills evidencefrom language communities British Journal of Developmental Psychology ndash

Feldman H M Dale P S Campbell T F Colborn D K Kurs-Lasky M RocketteH E amp Paradise J L () Concurrent and predictive validity of parent reports of childlanguage at ages and years Child Development ndash

Feldman H M Dollaghan C A Campbell T F Kurs-Lasky M Janosky J E ampParadise J L () Measurement properties of the MacArthur CommunicativeDevelopment Inventories at ages one and two years Child Development ndash

Fenson L Bates E Dale P Goodman J Reznick J S amp Thal D () Replymeasuring variability in early child language donrsquot shoot the messenger ChildDevelopment ndash

Fenson L Dale P S Reznick J S Bates E Hartung J P Pethick S amp Reilly J() MacArthur Communicative Development Inventories userrsquos guide and technicalmanual Baltimore MD Paul H Brookes Publishing Co

Fenson L Dale P Reznick J Bates E Thal D Pethick S Stiles J ()Variability in early communicative development Monographs of the Society for Researchin Child Development

Fenson L Marchman V A Thal D Dale P Reznick J S amp Bates E ()MacArthur-Bates Communicative Development Inventories userrsquos guide and technicalmanual nd ed Baltimore MD Brookes Publishing Company

Hidaka S () Estimating the latent number of types in growing corpora with reducedcostndashaccuracy trade-off Journal of Child Language ndash

WORDBANK

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Hills T T Maouene J Riordan B amp Smith L B () The associative structure oflanguage contextual diversity in early word learning Journal of Memory and Language ndash

Hills T T Maouene M Maouene J Sheya A amp Smith L () Longitudinal analysisof early semantic networks Preferential attachment or preferential acquisitionPsychological Science ndash

Huttenlocher J Haight W Bryk A Seltzer M amp Lyons T () Early vocabularygrowth relation to language input and gender Developmental Psychology ndash

Joslashrgensen R N Dale P S Bleses D amp Fenson L () CLEX a cross-linguistic lexicalnorms database Journal of Child Language ndash

Kristoffersen K E Simonsen H G Bleses D Wehberg S Joslashrgensen R N EieslandE A amp Henriksen L Y () The use of the Internet in collecting CDI data ndash anexample from Norway Journal of Child Language ndash

Lieven E Salomo D amp Tomasello M () Two-year-old childrenrsquos production ofmultiword utterances a usage-based analysis Cognitive Linguistics ndash

MacWhinney B () The CHILDES Project tools for analyzing talk rd ed MahwahNJ Lawrence Erlbaum Associates

Marchman V A amp Martiacutenez-Sussmann C () Concurrent validity of caregiverparentreport measures of language for children who are learning both English and SpanishJournal of Speech Language and Hearing Research ndash

Mayor J amp Plunkett K () A statistical estimate of infant and toddler vocabulary sizefrom CDI analysis Developmental Science ndash

Muggeo V M Sciandra M Tomasello A amp Calvo S () Estimating growth chartsvia nonparametric quantile regression a practical framework with application in ecologyEnvironmental and Ecological Statistics ndash

Nelson K () Structure and strategy in learning to talk Monographs of the Society forResearch in Child Development ndash

Norrman G amp Bylund E () The irreversibility of sensitive period effects in languagedevelopment evidence from second language acquisition in international adopteesDevelopmental Science ndash

R Foundation for Statistical Computing () R a language and environment for statisticalcomputing Software online lthttpwwwr-projectorggt

Rescorla L () The language development survey a screening tool for delayed languagein toddlers Journal of Speech and Hearing Disorders ndash

Roy B C Frank M C DeCamp P Miller M amp Roy D () Predicting the birth of aspoken word Proceedings of the National Academy of Sciences ndash

Song J Y Shattuck-Hufnagel S amp Demuth K () Development of phonetic variants(allophones) in -year-olds learning American English a study of alveolar stop t d codasJournal of Phonetics ndash

Tardif T Fletcher P Liang W Zhang Z Kaciroti N amp Marchman V A ()Babyrsquos first words Developmental Psychology ndash

Thal D Jackson-Maldonado D amp Acosta D () Validity of a parent-report measure ofvocabulary and grammar for Spanish-speaking toddlers Journal of Speech Language andHearing Research ndash

Tomasello M amp Mervis C B () The instrument is great but measuringcomprehension is still a problem Monographs of the Society for Research in ChildDevelopment ndash

Wallentin M () Putative sex differences in verbal abilities and language cortex a criticalreview Brain and Language ndash

Weisleder A amp Fernald A () Talking to children matters early language experiencestrengthens processing and builds vocabulary Psychological Science ndash

Wickham H () Ggplot elegant graphics for data analysis New York NY SpringerScience amp Business Media

Wickham H amp Francois R () Dplyr a grammar of data manipulation R packageversion middotmiddotmiddot Online lthttpscranr-projectorgwebpackagesdplyrgt

FRANK ET AL

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Page 17: Wordbank: an open repository for developmental vocabulary ...University of Chicago, on 17 May 2017 at 14:45:47, subject to the Cambridge Core terms Communicative Development Inventories

Bates E amp Goodman J () On the emergence of grammar from the lexicon InB MacWhinney (ed) The emergence of language (pp ndash) Mahwah NJ LawrenceErlbaum Associates

Bates E Marchman V Thal D Fenson L Dale P Reznick J S Hartung J() Developmental and stylistic variation in the composition of early vocabularyJournal of Child Language ndash

Bloom P () How children learn the meanings of words Cambridge MA MIT PressBornstein M H amp Haynes O M () Vocabulary competence in early childhoodmeasurement latent construct and predictive validity Child Development ndash

Braginsky M Yurovsky D Marchman V A amp Frank M C () Developmentalchanges in the relationship between grammar and the lexicon In D C Noelle R DaleA S Warlaumont J Yoshimi T Matlock C D Jennings amp P P Maglio (Eds)Proceedings of the th Annual Meeting of the Cognitive Science Society Austin TXCognitive Science Society

Brown R () A first language the early stages Cambridge MA Harvard University PressCartmill E A Armstrong B F Gleitman L R Goldin-Meadow S Medina T N ampTrueswell J C () Quality of early parent input predicts child vocabulary yearslater Proceedings of the National Academy of Sciences ndash

Clark E () First language acquisition Cambridge Cambridge University PressDale P S (nd) Adaptations not translations Online lthttpmb-cdistanfordeduadaptationshtmlgt (last accessed )

Dale P S amp Fenson L () Lexical development norms for young children BehaviorResearch Methods Instruments amp Computers ndash

Dale P S amp PenfoldM (nd) Adaptations of theMacArthur-Bates CDI into non-US Englishlanguages Online lthttpmb-cdistanfordedudocumentsAdaptationsSurvey--Webpdfgt (last accessed )

Dickinson D K amp Tabors P O () Beginning literacy with language young childrenlearning at home and school Baltimore MD Paul H Brookes Publishing

Dunn L M amp Dunn L M () Peabody Picture Vocabulary Test th ed ParsippanyNJ AGS Publishing Pearson Assessments

Eriksson M Marschik P B Tulviste T Almgren M Peacuterez Pereira M Wehberg S Gallego C () Differences between girls and boys in emerging language skills evidencefrom language communities British Journal of Developmental Psychology ndash

Feldman H M Dale P S Campbell T F Colborn D K Kurs-Lasky M RocketteH E amp Paradise J L () Concurrent and predictive validity of parent reports of childlanguage at ages and years Child Development ndash

Feldman H M Dollaghan C A Campbell T F Kurs-Lasky M Janosky J E ampParadise J L () Measurement properties of the MacArthur CommunicativeDevelopment Inventories at ages one and two years Child Development ndash

Fenson L Bates E Dale P Goodman J Reznick J S amp Thal D () Replymeasuring variability in early child language donrsquot shoot the messenger ChildDevelopment ndash

Fenson L Dale P S Reznick J S Bates E Hartung J P Pethick S amp Reilly J() MacArthur Communicative Development Inventories userrsquos guide and technicalmanual Baltimore MD Paul H Brookes Publishing Co

Fenson L Dale P Reznick J Bates E Thal D Pethick S Stiles J ()Variability in early communicative development Monographs of the Society for Researchin Child Development

Fenson L Marchman V A Thal D Dale P Reznick J S amp Bates E ()MacArthur-Bates Communicative Development Inventories userrsquos guide and technicalmanual nd ed Baltimore MD Brookes Publishing Company

Hidaka S () Estimating the latent number of types in growing corpora with reducedcostndashaccuracy trade-off Journal of Child Language ndash

WORDBANK

of use available at httpswwwcambridgeorgcoreterms httpsdoiorg101017S0305000916000209Downloaded from httpswwwcambridgeorgcore University of Chicago on 17 May 2017 at 144547 subject to the Cambridge Core terms

Hills T T Maouene J Riordan B amp Smith L B () The associative structure oflanguage contextual diversity in early word learning Journal of Memory and Language ndash

Hills T T Maouene M Maouene J Sheya A amp Smith L () Longitudinal analysisof early semantic networks Preferential attachment or preferential acquisitionPsychological Science ndash

Huttenlocher J Haight W Bryk A Seltzer M amp Lyons T () Early vocabularygrowth relation to language input and gender Developmental Psychology ndash

Joslashrgensen R N Dale P S Bleses D amp Fenson L () CLEX a cross-linguistic lexicalnorms database Journal of Child Language ndash

Kristoffersen K E Simonsen H G Bleses D Wehberg S Joslashrgensen R N EieslandE A amp Henriksen L Y () The use of the Internet in collecting CDI data ndash anexample from Norway Journal of Child Language ndash

Lieven E Salomo D amp Tomasello M () Two-year-old childrenrsquos production ofmultiword utterances a usage-based analysis Cognitive Linguistics ndash

MacWhinney B () The CHILDES Project tools for analyzing talk rd ed MahwahNJ Lawrence Erlbaum Associates

Marchman V A amp Martiacutenez-Sussmann C () Concurrent validity of caregiverparentreport measures of language for children who are learning both English and SpanishJournal of Speech Language and Hearing Research ndash

Mayor J amp Plunkett K () A statistical estimate of infant and toddler vocabulary sizefrom CDI analysis Developmental Science ndash

Muggeo V M Sciandra M Tomasello A amp Calvo S () Estimating growth chartsvia nonparametric quantile regression a practical framework with application in ecologyEnvironmental and Ecological Statistics ndash

Nelson K () Structure and strategy in learning to talk Monographs of the Society forResearch in Child Development ndash

Norrman G amp Bylund E () The irreversibility of sensitive period effects in languagedevelopment evidence from second language acquisition in international adopteesDevelopmental Science ndash

R Foundation for Statistical Computing () R a language and environment for statisticalcomputing Software online lthttpwwwr-projectorggt

Rescorla L () The language development survey a screening tool for delayed languagein toddlers Journal of Speech and Hearing Disorders ndash

Roy, B. C., Frank, M. C., DeCamp, P., Miller, M. & Roy, D. (). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences.

Song, J. Y., Shattuck-Hufnagel, S. & Demuth, K. (). Development of phonetic variants (allophones) in -year-olds learning American English: a study of alveolar stop /t, d/ codas. Journal of Phonetics.

Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N. & Marchman, V. A. (). Baby's first words. Developmental Psychology.

Thal, D., Jackson-Maldonado, D. & Acosta, D. (). Validity of a parent-report measure of vocabulary and grammar for Spanish-speaking toddlers. Journal of Speech, Language, and Hearing Research.

Tomasello, M. & Mervis, C. B. (). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development.

Wallentin, M. (). Putative sex differences in verbal abilities and language cortex: a critical review. Brain and Language.

Weisleder, A. & Fernald, A. (). Talking to children matters: early language experience strengthens processing and builds vocabulary. Psychological Science.

Wickham, H. (). Ggplot: elegant graphics for data analysis. New York, NY: Springer Science & Business Media.

Wickham, H. & Francois, R. (). Dplyr: a grammar of data manipulation. R package. Online: <https://cran.r-project.org/web/packages/dplyr>.
