1 From CHILDES to TalkBank An International Database of Communicative Interaction.

Date posted: 21-Dec-2015

1

From CHILDES to TalkBank

An International Database of Communicative Interaction

2

TalkBank

• Brian MacWhinney

– Carnegie Mellon University, Psychology

– Child Language Data Exchange System (CHILDES)

• Steven Bird, Mark Liberman

– University of Pennsylvania, Linguistics

– Linguistic Data Consortium (LDC)

• Howard Wactlar

– Carnegie Mellon University, Computer Science

– Informedia Project

3

Basic Premise of TalkBank

• Human Communication is a unified fact,

• but it is studied by 8 disciplines and up to 40 subdisciplines.

• Analysis is important, but so is synthesis.

• We can put the puzzle back together by focusing all the disciplines on the data.

4

Some Examples

• “My Theory”

• Bettino Craxi

• Nixon’s Watergate Tapes

• MacWhinney’s Lectures

• Ross and Mark

• Graphics lesson

• Bilingual Classroom

5

My Theory: An Example

Special Issue of Discourse Processes edited by Tim Koschmann with articles from

• Rogers Hall

• Jay Lemke

• Annemarie Palincsar

• Carl Frederiksen

• Commentary by

– Judith Green & Marleen McClelland

– Jeremy Roschelle

6

TalkBank Areas

• Classroom Discourse - CMU Dec 99

• Conversation Analysis - Odense Oct

• Text and Discourse - Santa Barbara July

• Child Language Disorders - Madison 2002

• Language and Gesture - CMU October

• Child Language Learning - Madison Aug 2002

• Animal Communication - Penn May 2000

7

More areas ...

• Field Linguistics - LSA Dec 99, Penn Dec 2000

• Aphasia

• Corpus Linguistics

• Signed Language

• Second Language Learning

• Anthropological Linguistics

• Cross-cultural studies

8

More areas ...

• Multilingualism, code-switching - LIDES

• Mother-infant interaction

• Psychiatry

• Conflict Resolution

• Management Styles

• Small-group Interaction - soon

• Human-computer Interaction

9

More areas ...

• Speech Technology - ongoing

• Virtual Reality

• Guided Robots, Social Robots

10

Why data-sharing is important

• Increasing the size and reliability of the empirical basis

• Opening science to the community, practitioners, and students

• Opening science to collaborative commentary

• Creating transparency across disciplines

11

Key Features of TalkBank

• Multimodal digitized data

• Internet access

• Defense of confidentiality

• Codon: transcription, coding, viewing, and analysis

• XML standard for underlying representation

• Alliance of databases from many fields

12

Why TalkBank can be built now

• The Internet

• Fast computers, big disks, cheap storage

• Good audio and video digitization

• Advances in web-based database design

• Emergence of annotation standards

• Maturation of the social sciences

13

CHILDES: A Prototype

• Brian MacWhinney - CMU

• Leonid Spektor - CMU

• Catherine Snow - Harvard

• 2000 Members

• 400 Active contributors

14

1850-1950 Darwin and Diaries

• Darwin, Stern, Ament

• Emotion, gesture, language, the soul

• Card files and shoe boxes

15

1950-1984 Tapes

• Nagras and TEAC, VHS and Beta

• Dittos, mimeo, notes in the margins

• Good “raw” data, unclear transcription

16

1984 - 1994 PCs

CHILDES, Concord, Massachusetts, 1984

17

1994 - 2001 childes.psy.cmu.edu

18

2000 - ? TalkBank

19

Universals

• Are there basic patterns to babbling?

• Are early word orders universal?

• Does UG give children a universal set of functional categories?

• Is the vocabulary spurt universal?

The answer requires LOTS of data

20

Particulars

• Do children have individual styles?

– Gestalt vs. Analytic

– Enactive (1S) vs. Depictive (3S)

• Do children respond differentially to parental recasts?

• Do children vary in their match to cue validity?

Again, we need LOTS of data

21

Comparisons

• How should we match SLI children to normal controls: by MLU, morphology, or TTR?

• How should we compare language socialization processes across social classes? Between cultures?

• How should we compare the course of development across languages? The case of Romance.

22

Three Components

• CHAT -- Transcription System

• CLAN -- Programs

• Database

23

CHAT Format

@Begin

@Participants: CHI Target_Child Sid, MOT Mother

*MOT: you want them to go in there?

*CHI: yeah. [+ Q]

*CHI: yeah. [+ SR]

*MOT: okay.

*CHI: okay. [+ I]

*CHI: look at this.

%act: CHI picks up piece of paper

@End
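The layout of the sample above can be read mechanically: lines starting with @ are headers, lines starting with * are speaker utterances, and lines starting with % are dependent tiers attached to the preceding utterance. The following is a minimal Python sketch under those assumptions only. The name parse_chat and the dictionary layout are illustrative and are not part of CLAN; real CHAT files also use tabs and continuation lines, which this sketch ignores.

```python
def parse_chat(text):
    """Split a simplified CHAT transcript into headers and utterances."""
    headers, utterances = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line, e.g. "@Participants: CHI Target_Child Sid, MOT Mother"
            name, _, value = line[1:].partition(":")
            headers.append((name.strip(), value.strip()))
        elif line.startswith("*"):
            # Main tier: speaker code, colon, then the utterance text
            speaker, _, words = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker.strip(), "text": words.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the most recent utterance
            tier, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][tier.strip()] = value.strip()
    return headers, utterances


SAMPLE = """@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
"""

headers, utterances = parse_chat(SAMPLE)
```

Keeping dependent tiers inside the utterance they annotate mirrors how the %act line in the sample belongs to the child's "look at this."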

24

CLAN Programs

25

String Search

• Freq

• KWAL

• Combo

• Gem

• GemFreq, GemList
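As a rough illustration of what a frequency count and a keyword-and-line search do in outline, here is a small Python sketch. It is not CLAN's implementation: the function names freq and kwal are borrowed from the program names only for readability, and the tokenizer (lowercase words, bracketed codes such as [+ Q] skipped) is an assumption made for this example.

```python
from collections import Counter
import re


def words(utterance):
    """Lowercase word tokens, skipping bracketed codes such as [+ Q]."""
    without_codes = re.sub(r"\[[^\]]*\]", " ", utterance)
    return re.findall(r"[a-z']+", without_codes.lower())


def freq(utterances):
    """Count how often each word occurs across all utterances."""
    counts = Counter()
    for utterance in utterances:
        counts.update(words(utterance))
    return counts


def kwal(utterances, keyword):
    """Return (index, utterance) pairs for utterances containing the keyword."""
    return [(i, u) for i, u in enumerate(utterances) if keyword in words(u)]


LINES = [
    "you want them to go in there?",
    "yeah. [+ Q]",
    "yeah. [+ SR]",
    "okay.",
    "okay. [+ I]",
    "look at this.",
]
```

Calling freq(LINES) gives a word-to-count mapping, and kwal(LINES, "look") returns the matching utterance together with its position in the list.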

26

Indexes

• MLU

• MLT

• WdLen, MaxWd

• VOCD

• DSS

• IPSyn (in progress)
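Two of the simplest indexes can be sketched in a few lines of Python. MLU is conventionally counted in morphemes; the version below counts whitespace-separated words as a simpler proxy, and it also computes a type-token ratio (TTR). Both function names and the word-based simplification are assumptions for illustration, not CLAN's procedures.

```python
def mlu_words(utterances):
    """Mean length of utterance in words (a proxy for morpheme-based MLU)."""
    lengths = [len(u.split()) for u in utterances if u.split()]
    return sum(lengths) / len(lengths) if lengths else 0.0


def ttr(utterances):
    """Type-token ratio: distinct words divided by total words."""
    tokens = [w.lower() for u in utterances for w in u.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Child utterances from a transcript, with punctuation and codes already removed
CHILD = ["yeah", "yeah", "okay", "look at this"]
```

With the four utterances in CHILD, the lengths are 1, 1, 1, and 3 words, and there are 5 distinct words among 6 tokens. Because TTR falls as samples get longer, comparisons are only meaningful across samples of similar size.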

27

Profiles

• Chains

• Cooccur

• Dist

• CHIP

• KeyMap

• TimeDur

28

Phonology

• MakeMod

• ModRep

• PhonFreq

• UniCode

• Inventory (in progress, LIPP, CompProf)

• Process Analysis (in progress)

29

Utilities

• Dates

• Rely

• Lines

• SaltIn

• Check

30

The Database

• English - 25 corpora

• Non-English - 18 languages

• Clinical - 14 corpora, aphasia, SLI, Down, autism, Williams, and other groups

• Narrative - Frog stories, Red Balloon

• Childhood Bilingualism

• Adult Second Language Learning

31

Morphology

• MOR

• Post, PostTrain -- Christophe Parisse

• Parse -- Kenji Sagae

• --> revised DSS, LARSP, IPSyn

• MinMor for 14 languages

• MaxMor for English, Spanish, Italian, Hungarian, Dutch, German

32

New Technologies

• Sonic CHAT

• Bullets

• QuickTime Movies

• Sound editor by wave

• Movie editor by dragging

• Fast mode editing

• Web streaming of audio and video

33

Sample Topics

• Past tense debate

• Functional categories, tenseless verbs

• Verb frame generalization

• Fine-tuning of the input

• Theory of mind

• Lexical range and communicative context

• MLU and vocabulary growth in disorders

34

Research based on CHILDES

• Over 1200 published studies

• Syntax

• Morphology

• Discourse

• Lexicon

• Narrative, Literacy

• Language Impairments

• Phonology

35

Allied Efforts

• JCHAT, Chinese, Korean

• Dutch, Nordic, Celtic

• Romance (Italian, Spanish, Portuguese)

• Slavic (Krakow, Vienna)

• Bilingualism -- Catalan, Basque

• Frogs, Disorders, Code-switching

• Classroom discourse

36

37

CHILDES/BIB On-Line

38

Format Babel

Alembic Annotator Archivage CA CHAT

COCOSDA CSAE CSLU DAISY DAMSL

Delta DRI EAGLES Emu Festival

FSA’s GATE HIAT Hyperlex Intex

ISIP LDC MATE MICASE MPEG

MPI Multitext Observer Partitur Praat

SABLE SAMPA SGREP SignStream SIL

SLAM SMDL SNACK StandOff SUSANN

TalkBank TEI Tipster Transcriber TreeBank

TSNLP Unicode UTF

39

Video Tools

Media Tagger, CLAN, Digital Lava, Informedia ...

40

The Script

41

syncWRITER

42

SignStream

43

44

Audio on the Web

45

Anthropology on the Web

Chagnon’s Yanomamö

46

Touch and Click for Audio

47

Pawnee Lexicon

48

Lexicon -> Cultural Encyclopedia

49

Cornell Bioacoustics Laboratory

50

Confidentiality Levels

1 - fully public

2 - copying block

3 - transcripts public, audio/video protected

4 - non-disclosure

5 - non-disclosure, no copying

6 - data-viewing with approval

7 - data-viewing under direct supervision

8 - archived only
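One way to make the eight levels usable in software is to encode them as an ordered enumeration. The Python sketch below assumes that a higher number always means a more restrictive level, which the numbering suggests but the slide does not state. The class name, member names, and the strictest helper are hypothetical.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    """The eight levels from the slide, numbered as given there."""
    FULLY_PUBLIC = 1
    COPYING_BLOCK = 2
    TRANSCRIPTS_PUBLIC_MEDIA_PROTECTED = 3
    NON_DISCLOSURE = 4
    NON_DISCLOSURE_NO_COPYING = 5
    VIEWING_WITH_APPROVAL = 6
    VIEWING_UNDER_SUPERVISION = 7
    ARCHIVED_ONLY = 8


def strictest(levels):
    """Return the most restrictive level, assuming higher means stricter.

    Raises ValueError if levels is empty.
    """
    return max(levels)
```

A corpus that mixes materials could then be labeled with the strictest level among its files, for example strictest([Confidentiality.FULLY_PUBLIC, Confidentiality.NON_DISCLOSURE]).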

51

Conclusions

• Child Language has guided other fields, but now we need to link to these other fields.

• CLAN must give way to more international tools and distributed databases.

• Number counting will give way to reality-linked number counting.

• Lab-based research will have to open up to collaborative annotation.