Using Linked Data to Create a Typological Knowledge Base
Steven Moran
LMU München
March 8, 2012
Steven Moran, LMU Linked Data in Linguistics
Talk Map
• Overview
• Investigation
• Challenges
• Conclusion
Where are we in linguistic structure?
[Figure: levels of phonological structure above a segment string (f o n ə l ɑ dʒ ə k ɫ θ i ə ɹ i), from top to bottom: phonological phrase, phonological words, metrical feet, syllables, moras, segments, distinctive features. Example feature bundles: [-voice +labiodental +continuant], [+voice +back -low -high], [+voice +nasal +coronal], ...]
PHOnetics Information Base and LExicon (PHOIBLE)
• A typological data set of segment inventories with linguistic and non-linguistic information
• Linguistic info
  • segment inventories
  • distinctive features
  • genealogical data (language stock and genus)
• Non-linguistic info
  • population figures
  • geographic location (geo-coordinates, country and region)
  • per-capita GDP, etc.
Segment inventory databases
• Stanford Phonology Archive (SPA; Crothers et al. 1979)
• UCLA Phonological Segment Inventory Database (UPSID; Maddieson 1984; Maddieson & Precoda 1990)
• Systèmes alphabétiques des langues africaines (AA; Hartell 1993; Chanard 2006)
• PHOIBLE data (Moran 2009-2012)
PHOIBLE resources
[Diagram: four sources feed the combined data set]
• SPA (197): phonemes, allophones, phono rules
• UPSID (451): phonemes
• AA (203): phonemes, graphemes
• PHOIBLE (738): phonemes, allophones, phono rules, citations, squibs
• Added to all sources: metadata and Unicode IPA, with Unicode normalization
• Result: PHOIBLE combined inventories (1298/1589)
How was PHOIBLE developed?
[Diagram: PHOIBLE build pipeline]
• Data sources (PHOIBLE, AA, UPSID, SPA) → ETL processes → PHOIBLE relational database
• Data sources (Unicode IPA, WALS, Multitree, Ethnologue, CIA) → flat files
• Data warehouse procedure → PHOIBLE flat files: 1. Aggregated, 2. Phoneme level
• Python script → RDF graph of segments
• Python script → RDF graph of features
• Merge graphs → PHOIBLE RDF graph (segments and features)
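The last steps of the pipeline (one Python script per graph, then a merge) might look roughly like the sketch below. It is an illustration only, not the PHOIBLE scripts: the column names, the sample rows, and the choice to represent a graph as a plain set of (subject, predicate, object) tuples are all assumptions made for the example.

```python
import csv
import io

# Hypothetical phoneme-level flat file: one row per (language, segment).
SEGMENT_ROWS = """iso,segment
ssl,p
ssl,b
ssl,kp
sil,gb
"""

# Hypothetical feature flat file: one row per (segment, feature).
FEATURE_ROWS = """segment,feature
p,labial
b,labial
b,voice
"""


def rows_to_triples(text, subject_col, predicate, object_col):
    """Turn each row of a CSV string into one (subject, predicate, object) triple."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row[subject_col], predicate, row[object_col]) for row in reader}


segment_graph = rows_to_triples(SEGMENT_ROWS, "iso", "hasSegment", "segment")
feature_graph = rows_to_triples(FEATURE_ROWS, "segment", "hasFeature", "feature")

# Merging two RDF graphs amounts to taking the union of their triple sets.
merged_graph = segment_graph | feature_graph
```

Because a graph is just a set of triples, merging never needs a schema change: a new source only contributes more triples.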
PHOIBLE data warehouse flat file table – phoneme level
Building the RDF graph
[Figure, built up over several slides: each statement joins a subject node to an object node with a labeled edge. Language nodes (tha, khm, kor, lbe, kat, bsk) are the subjects, and each is linked by hasSegment to the same segment node (segment symbols garbled in the transcript). In the final step the language nodes also carry hasEthnologueLanguageFamilyStock, hasWalsGenus, hasPopulation (20,200,000), hasLatitude (37:30) and hasLongitude (128:0) edges. Genus and stock nodes shown: Korean, Lak-Dargwa, Kartvelian, Burushaski, Khmer, Kam-Tai; asis, ncau, ausa, taik, kart.]
Knowledge base with segments
[Figure: Sisaala, Western [ssl] linked by three hasSegment edges to the segments p, b and kp]
Knowledge base with segments and features
• Features added to the graph by linking them from each segment
[Figure: ssl linked by hasSegment to p, b and kp; segments linked by hasFeature to voice, plosive and velar]
Knowledge base with segments, features and feature sets
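[Figure: ssl linked by hasSegment to p, b and kp; segments linked by hasFeature to features drawn from two feature sets, Hayes 2009 and Maddieson 1984. Features shown: voice, plosive, velar, bilabial, consonantal, labial, dorsal]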
Mapping distinctive features to segment types
• Features are the atoms that combine compositionally to form a segment
• Query features for natural classes of sounds
• Model in RDF/OWL to hierarchically organize features into a feature geometry
Simple and complex segment feature resolution
Feature specifications for natural classes of sounds
Class of sounds                 Feature specification
Vowels                          [+syllabic] [-consonantal]
Vowels & Syllabic Consonants    [+syllabic]
Glides                          [-syllabic] [-consonantal]
Liquids                         [+consonantal] [+approximant]
Nasals                          [+sonorant] [-approximant]
Fricatives                      [-sonorant] [+continuant]
Affricates                      [-continuant] [+delayed release]
Stops                           [-delayed release]
Stops & Affricates              [-continuant]
Liquids & Glides                [-syllabic] [+approximant]
Liquids, Glides, & Nasals       [-syllabic] [+sonorant]
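To make the table concrete, here is a small sketch of how a feature specification picks out a natural class: a segment belongs to the class when its feature bundle agrees with every value in the specification. The four specifications are copied from the table; the segment bundles and the function names are assumptions chosen for illustration, not PHOIBLE data.

```python
# Feature specifications from the table, written as {feature: "+" or "-"}.
NATURAL_CLASSES = {
    "vowels": {"syllabic": "+", "consonantal": "-"},
    "glides": {"syllabic": "-", "consonantal": "-"},
    "nasals": {"sonorant": "+", "approximant": "-"},
    "fricatives": {"sonorant": "-", "continuant": "+"},
}

# Toy feature bundles; the values are assumptions made for this example.
SEGMENTS = {
    "a": {"syllabic": "+", "consonantal": "-", "sonorant": "+",
          "approximant": "+", "continuant": "+"},
    "j": {"syllabic": "-", "consonantal": "-", "sonorant": "+",
          "approximant": "+", "continuant": "+"},
    "m": {"syllabic": "-", "consonantal": "+", "sonorant": "+",
          "approximant": "-", "continuant": "-"},
    "f": {"syllabic": "-", "consonantal": "+", "sonorant": "-",
          "approximant": "-", "continuant": "+"},
}


def in_class(bundle, spec):
    """True if the bundle has the required value for every feature in spec."""
    return all(bundle.get(feature) == value for feature, value in spec.items())


def members(class_name):
    """Sorted list of toy segments that satisfy the named class specification."""
    spec = NATURAL_CLASSES[class_name]
    return sorted(seg for seg, bundle in SEGMENTS.items() if in_class(bundle, spec))
```

With these toy bundles, `members("nasals")` selects only `m`, because `a` and `j` are [+approximant] and `f` is [-sonorant].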
Feature geometry
[Figure: feature geometry tree. Organizing nodes: root, laryngeal, supralaryngeal, place. Features: approximant, anterior, back, consonantal, constricted glottis, continuant, coronal, delayed release, distributed, dorsal, fortis, front, high, labial, labiodental, lateral, long, low, nasal, round, sonorant, spread glottis, stress, strident, syllabic, tap, tense, tone, trill, voice]
SPARQL
• SPARQL Protocol and RDF Query Language
• RDF query language (Prud'hommeaux and Seaborne, 2006)
• Queries consist of triple patterns that match concepts and their relations by binding variables to match graph patterns
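As a rough illustration of what "binding variables to match graph patterns" means, the sketch below matches one triple pattern against a set of triples. It is not a SPARQL engine, and the function name and the toy graph are assumptions made for the example (the ssl segments come from the slides; the sil entry is assumed).

```python
# Toy graph of (subject, predicate, object) triples.
GRAPH = {
    ("ssl", "hasSegment", "p"),
    ("ssl", "hasSegment", "b"),
    ("ssl", "hasSegment", "kp"),
    ("sil", "hasSegment", "gb"),
}


def match(graph, pattern):
    """Return one binding dict per triple in graph that matches pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# SELECT ?segments WHERE { ssl hasSegment ?segments }
segments = sorted(b["?segments"] for b in match(GRAPH, ("ssl", "hasSegment", "?segments")))

# SELECT ?languages WHERE { ?languages hasSegment gb }
languages = sorted(b["?languages"] for b in match(GRAPH, ("?languages", "hasSegment", "gb")))
```

The same function answers both directions of the question (segments of a language, languages with a segment) because only the position of the variable changes.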
Select segments of a particular language
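SELECT ?segments
WHERE { ssl hasSegment ?segments }

[Figure: graph of language nodes and segment nodes joined by hasSegment edges. Language nodes: Sisaala, Western [ssl]; Sisaala, Tumulung [sil]; English [eng]; German [deu]. Segment nodes: p, b, kp, gb, v, f.]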
Select languages that have a particular segment
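SELECT ?languages
WHERE { ?languages hasSegment gb }

[Figure: graph of language nodes and segment nodes joined by hasSegment edges. Language nodes: Sisaala, Western [ssl]; Sisaala, Tumulung [sil]; English [eng]; German [deu]. Segment nodes: p, b, kp, gb, v, f.]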
Select languages that have a class of segments
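SELECT ?languages
WHERE { ?languages hasSegment ?segments .
        ?segments hasFeature DELAYED_RELEASE }

[Figure: language nodes Sisaala, Western [ssl] and Sisaala, Tumulung [sil] joined by hasSegment edges to segment nodes (p, b, kp, gb, v, f); segment nodes joined by hasFeature edges to features from the Hayes 2009 feature set (labial, delayed release, syllabic).]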
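The query above chains two triple patterns through the shared variable ?segments. A minimal sketch of that join over a set of triples follows; the graph contents and the function name are toy assumptions, not PHOIBLE data.

```python
# Toy graph; the language, segment and feature assignments are assumed.
GRAPH = {
    ("ssl", "hasSegment", "p"),
    ("ssl", "hasSegment", "kp"),
    ("sil", "hasSegment", "gb"),
    ("deu", "hasSegment", "pf"),
    ("pf", "hasFeature", "DELAYED_RELEASE"),
    ("p", "hasFeature", "labial"),
}


def languages_with_feature(graph, feature):
    """Languages that have at least one segment carrying the given feature."""
    # Second pattern: ?segments hasFeature <feature>
    segments = {s for (s, p, o) in graph if p == "hasFeature" and o == feature}
    # First pattern, joined on the shared variable ?segments
    return sorted({s for (s, p, o) in graph if p == "hasSegment" and o in segments})
```

Here `languages_with_feature(GRAPH, "DELAYED_RELEASE")` returns only `deu`, since `pf` is the only toy segment marked with that feature.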
Investigating phonological universals
• Hyman (2008): Every phonological system has...
  • stops
  • at least one unrounded vowel
  • at least one front vowel or the palatal glide /j/
  • coronal phoneme(s)
Every phonological system has stops
SELECT ?languages
WHERE {
  ?languages phoible:hasSegment ?segments .
  ?segments phoible:notHasFeature feature:DELAYED_RELEASE
}
Phonological system with at least one unrounded vowel – and what are those languages and their segments?
SELECT ?languages ?segments
WHERE {
  ?languages phoible:hasSegment ?segments .
  ?segments phoible:hasFeature feature:SYLLABIC .
  ?segments phoible:notHasFeature feature:CONSONANTAL .
  ?segments phoible:notHasFeature feature:ROUND
}
Every phonological system has at least one front vowel or the palatal glide /j/
SELECT ?languages
WHERE {
  { ?languages phoible:hasSegment ?segments .
    ?segments phoible:hasFeature feature:SYLLABIC .
    ?segments phoible:notHasFeature feature:CONSONANTAL .
    ?segments phoible:hasFeature feature:FRONT }
  UNION
  { ?languages phoible:hasSegment segment:j }
}
Every phonological system has coronal phonemes... nope!
SELECT ?languages
WHERE {
  ?languages phoible:hasSegment ?segments .
  ?segments phoible:hasFeature feature:CORONAL
}
• "Another Universal Bites the Dust" (Blevins, 2009)
• Northwest Mekeo [mek] /p, β, m, w, g, ŋ, j, i, e, a, o, u/
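A query of this shape lists the languages that do have a coronal; the universal is tested by asking which languages are missing from that list. The sketch below shows that complement step on invented data. The language names, inventories, feature assignments and the function name are all hypothetical.

```python
def counterexamples(inventories, features, required_feature):
    """Languages in which no segment carries required_feature."""
    return sorted(
        language
        for language, segments in inventories.items()
        if not any(required_feature in features.get(seg, set()) for seg in segments)
    )


# Invented inventories and feature assignments, for illustration only.
INVENTORIES = {
    "lang_a": ["p", "t", "a"],
    "lang_b": ["p", "m", "a"],
}
FEATURES = {
    "t": {"CORONAL"},
    "p": {"LABIAL"},
    "m": {"LABIAL", "NASAL"},
    "a": {"SYLLABIC"},
}

missing_coronal = counterexamples(INVENTORIES, FEATURES, "CORONAL")
```

An empty result would be consistent with the universal for the sample at hand; any language returned is a candidate counterexample that still needs checking against its source description.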
Linguistic challenges
• Which language is this? ("A Grammar of Haida": Northern? Southern?)
• Different theoretical models are used in describing languages
• Diacritic ordering
  • creaky voiced syllabic dental nasal: n̪̰̩
  • labialized aspirated long alveolar plosive: tʷʰː
Different analyses – same language
Using inference
• With an RDF model we can add additional knowledge to the model
• Change the knowledge base without changing our query
• Logically defined properties in OWL; merge OWL and RDF graphs
• Establish relationships between resources as inferred by a reasoner
• The reasoner evaluates logic statements in the graph and adds inferred triples
• Ability to manipulate the ontology, to specify how to derive logical consequences, and to create new entailments
Using OWL logic to extend the knowledge base
[Figure, two steps: language nodes http://phoible.org/id/iso639-3/amp, http://phoible.org/id/iso639-3/ant and http://phoible.org/id/iso639-3/apn are linked by http://phoible.org/hasSegment edges to four segment nodes under http://phoible.org/segment/ (segment symbols garbled in the transcript). In the second step an owl:sameAs edge is added between segment nodes.]
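The effect of an owl:sameAs link can be sketched with a very small forward-chaining step: after the reasoner adds the entailed triples, the same query returns more languages even though the query text has not changed. The code below is an assumption-laden toy, not an OWL reasoner. It handles only symmetry and substitution of equivalent nodes, and the segment identifiers `seg:x1` and `seg:x2` are placeholders.

```python
SAME_AS = "owl:sameAs"

# Toy graph: two languages, two segment nodes declared equivalent.
GRAPH = {
    ("lang:amp", "hasSegment", "seg:x1"),
    ("lang:apn", "hasSegment", "seg:x2"),
    ("seg:x1", SAME_AS, "seg:x2"),
}


def add_same_as_entailments(graph):
    """Return graph plus triples entailed by owl:sameAs symmetry and substitution."""
    inferred = set(graph)
    while True:
        new = set()
        for (a, p, b) in inferred:
            if p != SAME_AS:
                continue
            new.add((b, SAME_AS, a))  # sameAs is symmetric
            for (s, q, o) in inferred:
                if q == SAME_AS:
                    continue
                if s == a:
                    new.add((b, q, o))  # substitute in subject position
                if o == a:
                    new.add((s, q, b))  # substitute in object position
        if new <= inferred:  # fixed point: nothing left to add
            return inferred
        inferred |= new


def languages_with(graph, segment):
    """Languages linked to the given segment node by hasSegment."""
    return sorted(s for (s, p, o) in graph if p == "hasSegment" and o == segment)


before = languages_with(GRAPH, "seg:x2")
after = languages_with(add_same_as_entailments(GRAPH), "seg:x2")
```

Before inference only `lang:apn` matches `seg:x2`; afterwards `lang:amp` matches as well.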
Computational challenges
• Adherence to Unicode IPA:
  • g/g, !/!, a/A, p/p
• Rendering sequences of Unicode characters as the same segment (both sequences below render as an a with a tilde above and a tilde below)
  • U+0061 + U+0330 + U+0303: latin small letter a + combining tilde below + combining tilde
  • U+0061 + U+0303 + U+0330: latin small letter a + combining tilde + combining tilde below
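The two code point sequences can be compared in Python with the standard `unicodedata` module. The sketch uses NFD as the canonical form. That choice is an assumption, since the slides mention Unicode normalization without naming a form.

```python
import unicodedata

# The two sequences from the slide: same glyph, different code point order.
below_then_above = "\u0061\u0330\u0303"  # a + tilde below + tilde
above_then_below = "\u0061\u0303\u0330"  # a + tilde + tilde below


def canonical(segment):
    """Canonical decomposition (NFD) with combining marks in canonical order."""
    return unicodedata.normalize("NFD", segment)


def code_points(segment):
    """Human-readable code point listing, e.g. 'U+0061 + U+0330'."""
    return " + ".join(f"U+{ord(ch):04X}" for ch in segment)


# The raw strings differ, so a naive string comparison treats them as two segments.
raw_equal = below_then_above == above_then_below

# After normalization they compare equal.
normalized_equal = canonical(below_then_above) == canonical(above_then_below)
```

Normalizing every segment string before it is stored or compared keeps the two spellings from being counted as distinct segments.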
Metadata
• DCMI RDF gets you most of OLAC
• Two big things missing: resource type and language identification
  • http://www.sil.org/iso639-3/documentation.asp?id=aar
  • http://www.ethnologue.com/show_language.asp?code=aar
  • http://lexvo.org/id/iso639-3/aar
  • http://wals.info/languoid/by_code/iso_639_3_aar
  • http://phoible.org/id/iso639-3/aar
  • http://resource dc:subject GOLD:Language
  • http://glottolog.livingsources.org/languoid/id/25785.xhtml
Poornima & Good 2010
Other questions moving forward...
• How do we access the Linguistic Linked Open Data cloud?
  • Download the RDF/OWL files and run locally?
  • Point to the files online from within our code?
• What about a publicly accessible interface?
  • SPARQL endpoint to make the LLOD data widely available
Summary
• PHOIBLE provides a large sample of segment inventories and additional phonological information from the world's languages
• It uses RDF and OWL graph data structures to capture knowledge about segments and distinctive features
• Can be used to ask questions of phonological systems... and more
Many thanks to the participants and organizers of LDL
And... Emily Bender, Michael Cysouw, Morgana Davids, Scott Drellishak, Shauna Eggers, David Ellison, Scott Farrar, Christopher Green, Sharon Hargus, Richard John Harvey, Jeff Good, Kelley Kilanski, William Lewis, Michael McAuliffe, Dan McCloy, Brandon Plasters, Tristan Purvis, Cameron Rule, Daniel Smith, Daniel Veja & Richard Wright