Linked data in language typology

Linked Data in Language Typology Digital Humanities Research Seminar University of Helsinki February 1, 2018 Kaius Sinnemäki General Linguistics, University of Helsinki


Linked Data in Language Typology

Digital Humanities Research Seminar
University of Helsinki
February 1, 2018

Kaius Sinnemäki
General Linguistics, University of Helsinki


Today’s talk

• My background

• What is language typology?

• Rich data in typology

• Linked data possibilities in typology

• A case study


My background


MA 2004, general linguistics

• Corpus linguistics
  – Unix-based semi-automatic detection of deep embedding
• Syntactic analysis of different genres
  – Incl. stream-of-consciousness


PhD 2011, general linguistics

• Language complexity, a typological viewpoint.
• Testing domain:
  – Case marking, agreement, linear order.
• Statistics and R.

[Figure labels: Case marking, Agreement, Word order]


Recent & current projects
• Post doc, 2013- (Collegium & Academy of Finland)
  – How grammatical categories of the noun (e.g. case, definiteness, gender) interact with each other.
  – How linguistic structures adapt to sociolinguistic context.
  – Combining typological and experimental evidence.
• Digital Humanities at UH (w/ Mikko Tolonen)
  – Conferences in 2014-2015; Fennica-project 2015.
• Sacred in Secular Societies (w/ Janne Saarikivi) 2018
  – How religious concepts have been transformed into secular context and preserved there.


What is language typology?


1950s: Chomsky and Greenberg
• Linguistics dominated by structuralism from late 19th century to mid-20th century. Emphasis on variation: “Languages can differ from each other without limit and in unpredictable ways.” (Martin Joos 1957: 96)
• Chomsky: language is a separate module in the brain.
  → Universal grammar: all languages fundamentally the same.
  → Data from English. (Rationalism, cognitive science).
• Greenberg: is cross-linguistic diversity constrained?
  → Language universals (esp. on word order).
  → Data from many languages. (Empirical, anthropology).


• Language typology = the worldwide comparison of languages to describe and explain differences and similarities across languages.
  – Major research questions: to what extent linguistic structures interact 1) among themselves or 2) with cognitive and cultural patterns (Bickel 2007).
• Well-known for word order correlations:
  – the order of noun (N) and genitive (Gen) correlates with the order of object (O) and verb (V)
    • NGen + VO (Book of John + eat an apple)
    • GenN + OV (John’s book + an apple eat)


Cross-linguistic comparison
• How are linguistic structures compared typologically?
• Key question: are there universal categories?
  – If yes → universal ontological system feasible.
    • Cf. General Ontology for Linguistic Description (GOLD).
  – If no → comparison has to be based on researchers’ tools
    • No right/wrong, just better/worse definitions for comparison
• Pooling/appending data from different sources
  – Are the definitions for a particular structure comparable?
  – If not, the only possibility is to analyse new data.


Rich data / big data in typology


“Big data” in cross-linguistic research

• Big data in linguistics: text corpora (e.g., Language Bank).
• Language typology is about language comparison.
  – How much is there to compare in languages?
  – A finite description of the grammar of even one language is impossible (Moscoso 2010).
• Currently about 7000 languages spoken (Hammarström et al. 2017).
→ There should be many reasons for “big data” research in language typology.

Computer-assisted linguistic and cultural research

• Open access databases since 2005; linked data possibilities since 2008 (World Atlas of Language Structures).

• Availability of new databases that contain linguistic and cultural data on languages and societies all over the world.

• Enable new research questions to be approached by new computer-assisted methods.


CLLD

• CLLD = Cross-Linguistic Linked Data (clld.org)
  – Hosts several large cross-linguistic datasets.
  – All openly available, repositories on GitHub.
  – Including (visit https://github.com/clld)
    • The World Atlas of Language Structures.
    • The Atlas of Pidgin and Creole Language Structures.
    • The World Loanword Database.
    • The South American Indigenous Language Structures.
    • Glottolog: catalogue of all languages, families and dialects (including bibliographic information).
  – Data largely in database format, not many texts.


D-PLACE
• Cultural, linguistic, environmental and geographic data on 1400+ societies (https://d-place.org).
  – Society = “represents a group of people in a particular locality, who often share a language and cultural identity.”
• Cultural descriptions tagged with date and ethnographic sources. Ethnographies largely based on pre-1950s work.
  – Ethnographic Atlas (Murdock 1962-1971). Human Relations Area Files.
• Also phylogenetic trees → phylogenetic comparative methods applicable
• Clone from https://github.com/D-PLACE/dplace-data


A silly exercise using D-PLACE
• Is there any relationship between:
  – Cultural trait “presence of trance”
  – Ecological factor “rain constancy”
  – (coursework w/ Hilde Schneemann, Andrea Bender and Mary Walworth)
• Phylogenetic computational methods:
  – Ancestral state reconstruction
  – Correlated changes.
• Result?
  – Negative coefficient, but non-significant (p = .063)


Data sources
• Typologists’ data sources are reference grammars.
  – Analysis “by hand”, seldom computer-assisted.
  – Time-consuming.
  – Are observations “languages” or “constructions”?
• Samples around 200-300 languages.
  – WALS: data on 2600+ languages, 190 structures. Gaps (Dryer & Haspelmath 2013).
  – Not big from a statistical perspective, but “big” in comparison to the history of language typology.
• Compare with corpus linguistics:
  – Datapoints counted in tens of thousands or more.
  – About 125 000 hits for the verb oleskella ‘stay, dwell’ in the Finnish Korp corpus.

Linked data in typology


What to link?
• Usually languages. Problem: many alternative names.
  – Tenharim (Tupian) has 35 names (AUTOTYP).
  – Different databases name languages differently.
  – See e.g. discussion on http://dlc.hypotheses.org/623
• Solution: standard language identifiers
  – Problem:
    • different databases/catalogues use different identifier systems
    • Ethnologue (ISO 639-3), Glottolog, WALS, AUTOTYP.
  – Not a real problem, but there are still many-to-many mappings.

Tenharim, alternative names
• abahyba, Caripuna, Cauaiua, Cauhib, Cawahib
• Diahoi, Diahói, Diahui, Diarroi, Diarrui, Djahui
• Jahoi, Jahui, Jauareta-Tapiia, Jiahui, Juma, Yuma
• Kagwahiv, Kagwahív, Kagwahiva
• Karipuna, Karipuná, Kawahib, Kawaib
• Paranawat, Parintintim, Parintintin, Parintintín
• Pawaté-Wirafed
• Tenharem, Tenharim, Tenharím, Tenharin
• Tenharín, Tukumanfed, Uru-eu-uau-uau.
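One way to cope with spelling variants like these is to normalize each name before looking it up against a single identifier. The following is only a minimal sketch in plain Python: the identifier `LANG-0001` is a made-up placeholder (not a real Glottocode or ISO 639-3 code), the helper names `normalize` and `lookup` are hypothetical, and only a handful of the names above are included.

```python
import unicodedata

def normalize(name):
    """Lower-case a name and drop diacritics, hyphens and spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch for ch in without_marks.lower() if ch.isalnum())

# Hypothetical identifier: a placeholder, not a real catalogue code.
TENHARIM_ID = "LANG-0001"

# A few of the alternative names from the slide.
ALT_NAMES = [
    "Tenharim", "Tenharím", "Tenharin", "Kagwahiv", "Kagwahív",
    "Parintintin", "Parintintín", "Uru-eu-uau-uau",
]

# Map every normalized spelling to the same identifier.
NAME_TO_ID = {normalize(name): TENHARIM_ID for name in ALT_NAMES}

def lookup(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return NAME_TO_ID.get(normalize(name))
```

Normalization only merges orthographic variants; unrelated names such as Juma or Cawahib still need an explicit entry in the table.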


• Conceptual work to be done:
  – What is a language in a database/catalogue?
  – Doculect, languoid, glossonym
  – Need to formalize the notion language
    • See Cysouw & Good (2013).
    • Also discussion on Diversity Linguistics Comment

A case study


Creoles vs. regular languages

• Languages are transmitted in different social conditions. Usually faithful transmission, with some restructuring.

• Some languages under heavy language contact.
  – Break in normal transmission.
  – Restructuring, structural simplification / complexification.
→ Pidgins, jargons, creoles, mixed languages.

• Creoles share many features
  → A creole typological profile?
  – Used for arguing about language evolution.
• Questions:
  – Do creoles differ from regular languages? (Bakker et al. 2011)
  – Do the contributing languages (mostly Indo-European) differ from other languages of the world? (Cysouw 2009; Blasi et al. 2017)

[Three figure slides from Cysouw (2009)]

• Some of the often cited examples of the creole profile deal with word order and argument marking:
  – Creoles have SVO and very little morphological marking (e.g. no case).

A    boi  lobi  a    umapikin.
DET  boy  love  DET  girl
     S    V          O
'The boy loves the girl.' (Sranan; Winford & Plag 2013)

• But: this correlation between SVO and no case marking (or morphological marking) occurs also in regular languages.
  – Is the correlation stronger in creoles than in regular languages?
  – If yes → evidence for creole profile. If not, then not.


• Data on the case marking of 687 regular languages available in AUTOTYP (Bickel et al. 2017).

• Data on the word order of 1377 regular languages available in WALS (Dryer 2013).

• Data on the word order and case marking of 55 creoles available in APiCS (Michaelis et al. 2013).


• AUTOTYP metadata files:
  – AUTOTYP’s own language identifier (integer)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language
• WALS and APiCS metadata files:
  – WALS code for each language (for WALS languages)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language
→ Should be straightforward to merge or join in R.
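The idea of joining two metadata tables on a shared identifier can be sketched as follows. The talk's own work was done in R; this illustration uses plain Python instead, and every row, code and column name (`lid`, `case_marking`, `word_order`, the Glottocode-like strings) is an invented placeholder rather than real AUTOTYP or WALS content.

```python
# Toy metadata rows; all codes and values are invented placeholders.
autotyp_rows = [
    {"lid": 1, "glottocode": "aaaa0001", "case_marking": True},
    {"lid": 2, "glottocode": "bbbb0002", "case_marking": False},
    {"lid": 3, "glottocode": "cccc0003", "case_marking": True},
]
wals_rows = [
    {"wals_code": "aaa", "glottocode": "aaaa0001", "word_order": "SOV"},
    {"wals_code": "bbb", "glottocode": "bbbb0002", "word_order": "SVO"},
    {"wals_code": "ddd", "glottocode": "dddd0004", "word_order": "VSO"},
]

def inner_join(left, right, key):
    """Merge every pair of left/right rows that share the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            merged.append({**left_row, **right_row})
    return merged

# Only languages present in both tables survive the join.
joined = inner_join(autotyp_rows, wals_rows, "glottocode")
```

In this toy case two of the three languages on each side end up in `joined`; the unmatched rows are silently dropped, which is one reason to inspect identifier coverage before merging.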


• But no: several many-to-many mappings.
  – Not all language identifiers match one-to-one.

[Figure: AUTOTYP / WALS]
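A quick check for identifiers that do not match one-to-one might look like the sketch below. The identifier pairs are invented for illustration, and `ambiguous_mappings` is a hypothetical helper name, not part of any of the databases mentioned.

```python
from collections import defaultdict

# Invented identifier pairs: (id in database A, id in database B).
pairs = [
    ("A1", "B1"),
    ("A2", "B2"),
    ("A2", "B3"),  # one A identifier maps to two B identifiers
    ("A3", "B4"),
    ("A4", "B4"),  # two A identifiers map to one B identifier
]

def ambiguous_mappings(pairs):
    """Return the identifiers on each side that have more than one partner."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    one_to_many = {a: sorted(bs) for a, bs in a_to_b.items() if len(bs) > 1}
    many_to_one = {b: sorted(xs) for b, xs in b_to_a.items() if len(xs) > 1}
    return one_to_many, many_to_one

one_to_many, many_to_one = ambiguous_mappings(pairs)
```

Anything that shows up in either dictionary needs a manual decision before the tables can be joined safely.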


• Observations (lines) in AUTOTYP are constructions in languages

• Observations (lines) in WALS are languages

→ Many researcher-based choices and lots of cleaning necessary before merging is possible.

→ BUT: once the cleaning script is ready, linking should work automatically. Currently work in progress but almost there.
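Because one dataset has a row per construction and the other a row per language, the construction-level rows have to be collapsed before joining. The sketch below shows one possible researcher-based choice, namely treating a language as case-marking if any of its constructions marks case. That rule, the column names and the data are assumptions made for illustration, not the procedure used in the talk.

```python
from collections import defaultdict

# Invented construction-level rows: several rows per language.
construction_rows = [
    {"glottocode": "aaaa0001", "construction": "noun A", "has_case": True},
    {"glottocode": "aaaa0001", "construction": "pronoun A", "has_case": False},
    {"glottocode": "bbbb0002", "construction": "noun A", "has_case": False},
]

def collapse_to_languages(rows):
    """One value per language: True if any construction marks case."""
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["glottocode"]].append(row["has_case"])
    return {code: any(values) for code, values in by_language.items()}

language_level = collapse_to_languages(construction_rows)
```

A different rule (for example, requiring all constructions to mark case) would give different language-level values, which is why such choices need to be documented in the cleaning script.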


And the preliminary result… (Sinnemäki 2017)


Generalized mixed effects modeling

- Data: 55 creoles
- Logit estimate: -4.6 ± 1.7; p < .001 ***
→ Correlation between word order and case marking.

- Data: 333 non-creoles
- Logit estimate: -8.6 ± 3.8; p < .0001 ***
→ Correlation between word order and case marking.

- Interaction case x word_order x lg_type: estimate -2.1 ± 1.6; p = .24
→ The interaction is non-significant: no evidence that the correlation differs between creoles and non-creoles.
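To get a feel for the size of these estimates, a logit can be converted to an odds ratio by exponentiation. The sketch below assumes the reported numbers are log-odds coefficients from a logistic mixed model; the slides do not state the exact model specification or what the ± values represent, so only the point estimates are used.

```python
import math

# Point estimates as reported on the slide (log-odds scale assumed).
estimates = {"creoles": -4.6, "non_creoles": -8.6, "interaction": -2.1}

def odds_ratio(logit):
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(logit)

odds_ratios = {name: odds_ratio(value) for name, value in estimates.items()}
```

All three odds ratios fall below 1, which is what a negative association on the log-odds scale means.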

Conclusions

• More typological data openly accessible

• Universal ontological systems not necessarily feasible for a typologist

• Linking data between datasets possible but requires time-consuming cleaning

• The available datasets enable old questions to be answered in new ways with computational methods.


Thank you!

References

Bakker, P. et al. 2011. Creoles are typologically distinct from non-creoles. Journal of Pidgin and Creole Languages 26(1): 5-42.
Bickel, B. 2007. Typology in the 21st century: Major current developments. Linguistic Typology 11(1): 239-251.
Bickel, B. et al. 2017. The AUTOTYP typological databases. Version 0.1.0. https://github.com/autotyp/autotyp-data.
Blasi, D. et al. 2017. Grammars are robustly transmitted even during the emergence of creole languages. Nature Human Behaviour 1(10): 723-729.
Cysouw, M. 2009. APiCS, WALS, and the creole typological profile (if any). Presentation at the 1st APiCS Conference, 5-8 November 2009, Leipzig.
Cysouw, M. & J. Good 2013. Languoid, Doculect, and Glossonym: Formalizing the Notion 'Language'. Language Documentation and Conservation 7: 331-359.
Dryer, M. 2013. Order of subject, object and verb. In M. Dryer & M. Haspelmath (eds.).
Dryer, M. & M. Haspelmath (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: MPI for Evolutionary Anthropology. http://wals.info.
Hammarström, H. et al. 2017. Glottolog 3.0. Jena: MPI for the Science of Human History.
Joos, M. (ed.) 1957. Readings in Linguistics: The Development of Descriptive Linguistics in America Since 1925. Washington: American Council of Learned Societies.
Michaelis, S. et al. (eds.) 2013. Atlas of Pidgin and Creole Language Structures Online. Leipzig: MPI for Evolutionary Anthropology.
Moscoso del Prado Martín, F. 2010. The effective complexity of language: English requires at least an infinite grammar. Ms. http://www.moscosodelprado.net.
Simons, G. F. & C. D. Fennig (eds.) 2017. Ethnologue: Languages of the World, 20th edn. Dallas, TX: SIL International. http://www.ethnologue.com.
Sinnemäki, K. 2017. How useful are creoles in language evolution research? Evaluating cross-linguistic universals of word order and argument marking. Invited talk at the 30th Annual CUNY Conference on Human Sentence Processing, April 1, 2017, Massachusetts Institute of Technology.
Winford, D. & I. Plag 2013. Sranan structure dataset. In S. Michaelis et al. (eds.).