Knowledge Base - 北海道大学mhjcc3-ei.eng.hokudai.ac.jp/~yoshioka/kb/kby-4.pdf · •DBPedia...
Knowledge Base
―Semantic Web and Ontology (4)―
Masaharu Yoshioka
Answer of the last lecture question
◼ The following concepts and roles are defined
for the question
– Concepts
• Female, Male, Human ≡ Male ⊔ Female, Animal
– Roles
• has-student, bred-by, eaten-by
◼ Define the following concepts using the given concepts
and roles.
Answer of the last lecture question (cont)
◼ Domestic animal (an animal that is bred by humans to be eaten by humans)
Animal ⊓ ∃eaten-by.Human ⊓ ∃bred-by.Human
◼ Teacher (a human who has a human student)
Human ⊓ ∃has-student.Human
◼ Female teacher
Female ⊓ ∃has-student.Human
◼ Teacher who has only male students
– Human ⊓ ∃has-student.Male ⊓ ∀has-student.Male
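The answers above can be checked against a small finite interpretation. The following is a minimal sketch, assuming the role restrictions are read as existential (∃R.C) and universal (∀R.C) restrictions; the individuals and role pairs are made up for illustration and are not part of the lecture.

```python
# Toy interpretation: every individual and role pair here is hypothetical.
DOMAIN = {"alice", "bob", "carol", "dave", "cow"}
CONCEPTS = {
    "Female": {"alice", "carol"},
    "Male": {"bob", "dave"},
    "Animal": {"cow"},
}
CONCEPTS["Human"] = CONCEPTS["Male"] | CONCEPTS["Female"]  # Human ≡ Male ⊔ Female

ROLES = {
    "has-student": {("alice", "bob"), ("alice", "dave"), ("bob", "carol")},
    "bred-by": {("cow", "dave")},
    "eaten-by": {("cow", "alice")},
}


def exists(role, concept):
    """∃role.concept: individuals with at least one role-successor in concept."""
    return {x for (x, y) in ROLES[role] if y in concept}


def forall(role, concept):
    """∀role.concept: individuals whose role-successors are all in concept."""
    return {
        x for x in DOMAIN
        if all(y in concept for (s, y) in ROLES[role] if s == x)
    }


human, male = CONCEPTS["Human"], CONCEPTS["Male"]
female, animal = CONCEPTS["Female"], CONCEPTS["Animal"]

domestic_animal = animal & exists("eaten-by", human) & exists("bred-by", human)
teacher = human & exists("has-student", human)
female_teacher = female & exists("has-student", human)
teacher_only_male = human & exists("has-student", male) & forall("has-student", male)
```

Note that ∀has-student.Male alone also holds for individuals with no students at all, which is why the last concept combines it with an existential restriction.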
Comments from the last report
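◼ How do you write "a father who has only sons"?
◼ Can I interpret that epistemology is an antonym of
ontology?
◼ How are DL languages and the like applied on computers?
◼ I am perfectly fine with English.
◼ Please increase the proportion of Japanese. I would like Japanese to be the main language and English the secondary one.
◼ I would like to get the slides earlier so that I can prepare for the lecture.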
Web Ontology: Background
◼ Web 2.0
– Large volume of user-generated content
• Wikipedia: Encyclopedia edited by volunteer editors
• GeoNames: Geographical database that can be edited by
volunteer editors
– A variety of knowledge resources is available through
the Web
◼ Basic information about concept hierarchy
– WordNet: Dictionary that represents the senses of
strings (words)
• Defines a concept hierarchy and word-sense relationships
(hypernym, hyponym, …)
Wikipedia
◼ Free Internet-based encyclopedia
– Metadata related to an article
is organized in an Infobox
– Articles are categorized into
multiple categories
DBpedia
◼ Metadata database based on Wikipedia articles
– Core part of Linked Open Data
◼ The quality of DBpedia depends on the quality of Wikipedia
DBpedia Facts & Figures
https://wiki.dbpedia.org/about/facts-figures
◼ Large information resource of things
– 4.58 million things
– 4.22 million are classified in a consistent ontology (DBpedia ontology)
• 1,445,000 persons
• 735,000 places
– 478,000 populated places
• 411,000 creative works
– 123,000 music albums, 87,000 films and 19,000 video games
• 241,000 organizations
– 58,000 companies and 49,000 educational institutions
• 251,000 species
• 6,000 diseases.
DBpedia Ontology
http://mappings.dbpedia.org/server/ontology/classes/
◼ Ontology for classifying the things defined in
DBpedia
Data Example of DBpedia
◼ Add metadata to the corresponding Wikipedia
article in RDF format
<http://dbpedia.org/resource/Aristotle>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Aristotle>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/resource/Aristotle>
<http://dbpedia.org/property/wikiPageUsesTemplate>
<http://dbpedia.org/resource/Template:Persondata> .
<http://dbpedia.org/resource/Aristotle>
<http://dbpedia.org/property/placeOfDeath> "Chalcis"@en .
<http://dbpedia.org/resource/Aristotle>
<http://dbpedia.org/property/dateOfDeath> "322 BC"@en .
<http://dbpedia.org/resource/Aristotle>
<http://dbpedia.org/property/placeOfBirth> "Stageira"@en .
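Statements of this form can be read with a few lines of code. The following is a minimal sketch that uses a simplified regular expression, not a full RDF parser; it assumes IRI subjects and predicates and either an IRI or a plain, optionally language-tagged literal as the object. The function name `parse` is chosen for illustration.

```python
import re
from collections import defaultdict

# Three of the statements from the slide, written one triple per line.
NT = '''
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfDeath> "Chalcis"@en .
<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfBirth> "Stageira"@en .
'''

# Simplified pattern (an assumption, not the full N-Triples grammar):
# <subject> <predicate> then either <object IRI> or "literal" with optional @lang.
TRIPLE = re.compile(
    r'<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)"(?:@(\w+))?)\s*\.'
)


def parse(text):
    """Return {subject: {predicate: [object, ...]}} for the simplified syntax."""
    graph = defaultdict(lambda: defaultdict(list))
    for s, p, iri, literal, _lang in TRIPLE.findall(text):
        graph[s][p].append(iri or literal)
    return graph


graph = parse(NT)
aristotle = graph["http://dbpedia.org/resource/Aristotle"]
for predicate, objects in aristotle.items():
    print(predicate, objects)
```

Because the pattern accepts any whitespace between terms, it also matches triples that are wrapped over several lines, as on the slide.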
Wikidata
◼ Wikidata stores structured
data that cannot be represented
precisely enough in an Infobox
https://www.wikidata.org/wiki/Wikidata:Main_Page
WordNet
https://wordnet.princeton.edu/
◼ WordNet is a large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each
expressing a distinct concept.
– Synsets represent word senses.
◼ Synsets are interlinked by means of conceptual-semantic
and lexical relations.
– Words (strings) with multiple senses (e.g., bridge) have
multiple links with synsets.
– Synsets with synonyms (multiple words for one sense)
have multiple links with words (strings).
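The many-to-many relation between words and synsets can be sketched with two lookup tables. The synset IDs and glosses below are hypothetical and are not real WordNet entries; the helper names `senses` and `synonyms` are chosen for illustration.

```python
from collections import defaultdict

# Hypothetical synsets, made up for illustration (not taken from WordNet).
SYNSETS = {
    "s1": {"words": ["bridge", "span"], "gloss": "structure for crossing an obstacle"},
    "s2": {"words": ["bridge"], "gloss": "card game"},
    "s3": {"words": ["philosopher"], "gloss": "specialist in philosophy"},
}

# Invert the synset -> words table to obtain the word -> synsets direction.
word_to_synsets = defaultdict(list)
for synset_id, entry in SYNSETS.items():
    for word in entry["words"]:
        word_to_synsets[word].append(synset_id)


def senses(word):
    """Synsets linked to a word: one entry per sense of the string."""
    return sorted(word_to_synsets.get(word, []))


def synonyms(word):
    """Other words that share at least one synset with the given word."""
    return {
        other
        for synset_id in senses(word)
        for other in SYNSETS[synset_id]["words"]
        if other != word
    }


print(senses("bridge"))    # a word with multiple senses
print(synonyms("bridge"))  # words sharing a sense with "bridge"
```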
Concept Definition of WordNet
http://wordnetweb.princeton.edu/perl/webwn
◼ Browsing WordNet using WordNet Search
– Search Results for “philosopher”
WordNet Statistics
https://wordnet.princeton.edu/documentation/wnstats7wn
◼ Version 3.0 (Version 3.1 figures in parentheses)
POS         Unique Strings     Synsets            Total Word-Sense Pairs
Noun        117798             82115              146312
Verb        11529              13767              25047
Adjective   21479              18156              30002
Adverb      4481               3621               5580
Totals      155287 (175979)    117659 (155327)    206941 (207016)
◼ There are several other language versions, including a
Japanese version (Wn-Ja 1.1)
http://compling.hss.ntu.edu.sg/wnja/
– 57,238 synsets; 93,834 words; 158,058 word-sense
pairs
– 135,692 definitions; 48,276 example sentences
Linked Open Data
◼ Links open data provided by various knowledge
and information resources so that they can be used as one
large database or knowledge base.
◼ One of the largest ontology-based knowledge bases,
covering both concept definitions and knowledge
about instances.
Linking Open Data Cloud
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
YAGO: Yet Another Great Ontology
◼ Web-based ontology constructed by using
Wikipedia, WordNet, and GeoNames
◼ Class definitions are based on the Wikipedia categories,
and WordNet is used to handle the class hierarchy
– Select the appropriate class for an article by referring
to the article's category list
[Figure labels: Class: Philosopher; Category of Aristotle]
Integration of Geographical Information to
YAGO
◼ GeoNames: Linked Open Data for Geographical
Data
[Figure labels: more than 10,000,000 geographical entries (large geographical database); 19.8 million articles; YAGO2 (Yet Another Great Ontology 2)]
Integration method by YAGO2
[Hoffart et al., 2012]
◼ Integration by using names and coordinates
'Burgos': Province in Spain
http://en.wikipedia.org/wiki/Burgos
'Min River': River in China
http://en.wikipedia.org/wiki/Min_River_(Fujian)
Name matching
Coordinates matching
84,349 corresponding pairs have been found.
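Name matching combined with coordinate matching can be sketched as follows. This is a minimal illustration and not the YAGO2 implementation: the entry IDs, the approximate coordinates, and the 5 km threshold (`max_km`) are all assumptions made for the example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def match(wiki, geo_entries, max_km=5.0):
    """Return ids of entries whose name equals the page name and that lie within max_km."""
    return [
        g["id"]
        for g in geo_entries
        if g["name"].lower() == wiki["name"].lower()
        and haversine_km(wiki["lat"], wiki["lon"], g["lat"], g["lon"]) <= max_km
    ]


# Hypothetical records with rough coordinates, for illustration only.
wiki_page = {"name": "Burgos", "lat": 42.34, "lon": -3.70}
geo_entries = [
    {"id": "g1", "name": "Burgos", "lat": 42.35, "lon": -3.69},   # same name, nearby
    {"id": "g2", "name": "Burgos", "lat": 16.05, "lon": 120.45},  # same name, far away
    {"id": "g3", "name": "Madrid", "lat": 40.42, "lon": -3.70},   # different name
]

matches = match(wiki_page, geo_entries)
print(matches)
```

Requiring both conditions removes namesakes in other regions, at the cost of missing entities whose coordinates differ between the two sources.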
SPARQL Endpoint
◼ Most Linked Open Data sites provide a
SPARQL endpoint.
– DBpedia
http://dbpedia.org/sparql
– DBpedia(Japanese)
http://ja.dbpedia.org/sparql
– Wikidata
https://query.wikidata.org/
– YAGO2
https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface
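A query can be sent to such an endpoint over HTTP. The following is a minimal sketch using only the Python standard library; it assumes the endpoint accepts a GET request with a `query` parameter and returns JSON results when asked via the `Accept` header. The query itself and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia endpoint from the list above

# Illustrative query: list the types asserted for the Aristotle resource.
QUERY = """
SELECT ?type WHERE {
  <http://dbpedia.org/resource/Aristotle>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request with the query URL-encoded in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run(endpoint, query):
    """Send the query and return the result bindings (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
# To execute the query against the live endpoint: rows = run(ENDPOINT, QUERY)
```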
Example of Using SPARQL Endpoint
◼ DBpedia(Japanese)
http://ja.dbpedia.org/sparql
– Example
http://ja.dbpedia.org/
◼ YAGO2
https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/demo/
Supplemental material
◼ Another approach to building a mapping between
GeoNames and Wikipedia
Integration Method based on Wikipedia
Category [Yoshioka et al., 2012]
◼ Wikipedia category for geographical entity
– <class information> (in|of) <location information>
e.g., Populated Place In Spain
◼ GeoNames
– Country and administrative code: location information
– Feature code: class information
'Burgos': Province in Spain
http://en.wikipedia.org/wiki/Burgos
Populated Place In Spain -> Populated Place In Burgos
Integration Method based on Wikipedia Category
◼ Use matching tables between feature code and
Wikipedia category class information
– Distance is not the first-priority information for selecting
the appropriate corresponding entity.
'Narosura' populated place in Kenya
http://en.wikipedia.org/wiki/Narosura
[Figure legend: Distance ○ △]
Algorithms for link discovery
◼ Comparison of names for candidate pairs
Wikipedia: Rome, Iowa
Category: Cities in Iowa
Populated places in Henry County, Iowa
GeoNames (id, name (alternate name), feature
class, country and administrative code)
6459720, Rome, PPL, US:IA:087
3169070, Roma (Rome), PPLC, IT:07:RM,
….
Algorithms for link discovery 2
◼ Extraction of information from category
Wikipedia: Rome, Iowa
Category: Cities in Iowa
Populated places in Henry County, Iowa
Class candidates:
City, Populated places → PPL, PPLC, ADM1, …
Location candidates:
Iowa → US:IA
Henry County, Iowa → US:IA
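The extraction step can be sketched as splitting each category on "in" or "of" and looking up both parts. This is a minimal illustration, not the authors' implementation: the lookup tables contain only the entries shown on the slide (the slide's list of feature codes continues beyond these three), and the function name `extract` is hypothetical.

```python
import re

# Illustrative matching tables with only the entries from the slide;
# the real tables would be much larger.
CLASS_TO_FEATURE_CODES = {
    "cities": ["PPL", "PPLC", "ADM1"],
    "populated places": ["PPL", "PPLC", "ADM1"],
}
LOCATION_TO_ADMIN_CODE = {
    "Iowa": "US:IA",
    "Henry County, Iowa": "US:IA",
}

# <class information> (in|of) <location information>
CATEGORY = re.compile(r"^(?P<cls>.+?)\s+(?:in|of)\s+(?P<loc>.+)$", re.IGNORECASE)


def extract(category):
    """Return (feature-code candidates, administrative-code candidate) or None."""
    m = CATEGORY.match(category)
    if m is None:
        return None
    feature_codes = CLASS_TO_FEATURE_CODES.get(m.group("cls").lower(), [])
    admin_code = LOCATION_TO_ADMIN_CODE.get(m.group("loc"))
    return feature_codes, admin_code


for category in ["Cities in Iowa", "Populated places in Henry County, Iowa"]:
    print(category, "->", extract(category))
```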
Algorithms for link discovery 3
◼ Selection of candidate pair
Wikipedia: Rome, Iowa
Category: Cities in Iowa
Populated places in Henry County, Iowa
GeoNames (id, name (alternate name), feature
class, country and administrative code)
○ 6459720, Rome, PPL, US:IA:087
× 3169070, Roma (Rome), PPLC, IT:07:RM, ….
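The selection step can be sketched as a filter over the name-matched GeoNames candidates. This is a minimal illustration under two assumptions: the page name is the part of the title before the first comma, and an administrative code matches when it equals a location candidate or extends it with a further ":"-separated level. The function name `select` is hypothetical.

```python
# The two GeoNames candidates shown on the slide.
GEONAMES = [
    {"id": 6459720, "names": ["Rome"], "feature": "PPL", "admin": "US:IA:087"},
    {"id": 3169070, "names": ["Roma", "Rome"], "feature": "PPLC", "admin": "IT:07:RM"},
]


def select(title, feature_codes, admin_prefixes, entries):
    """Keep entries that agree on name, feature code, and administrative code."""
    name = title.split(",")[0].strip()  # "Rome, Iowa" -> "Rome" (assumption)
    return [
        e["id"]
        for e in entries
        if name in e["names"]
        and e["feature"] in feature_codes
        and any(
            e["admin"] == prefix or e["admin"].startswith(prefix + ":")
            for prefix in admin_prefixes
        )
    ]


selected = select("Rome, Iowa", {"PPL", "PPLC", "ADM1"}, {"US:IA"}, GEONAMES)
print(selected)
```

Both candidates pass the name and feature-code checks; only the administrative code separates the Iowa entry from the Italian one.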
Algorithms for link discovery 4
◼ Elimination of low-precision data
– 1-to-N mappings (they may include errors)
• Multiple Wikipedia pages for a single GeoNames
entry
• Multiple GeoNames entries for a single Wikipedia
page
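Eliminating 1-to-N mappings amounts to keeping only pairs in which both sides occur exactly once. The following is a minimal sketch; the page titles and IDs in the example are hypothetical, and the function name `one_to_one` is chosen for illustration.

```python
from collections import Counter


def one_to_one(pairs):
    """Keep (wikipedia_page, geonames_id) pairs where each side appears only once."""
    page_count = Counter(page for page, _ in pairs)
    geo_count = Counter(geo for _, geo in pairs)
    return [
        (page, geo)
        for page, geo in pairs
        if page_count[page] == 1 and geo_count[geo] == 1
    ]


# Hypothetical candidate pairs for illustration.
candidate_pairs = [
    ("Page A", 101),  # unique on both sides: kept
    ("Page B", 201),  # one page, two GeoNames entries: dropped
    ("Page B", 202),
    ("Page C", 301),  # two pages, one GeoNames entry: dropped
    ("Page D", 301),
]

kept = one_to_one(candidate_pairs)
print(kept)
```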
Results of Automatic Integration
◼ Classify integration results by using distance
information
– Wikipedia coordinate information is extracted by using
DBpedia and GeoHack
Types of pairs                        Pages    Manual evaluation
Nearby pairs (<= 5km)                 26,047   200/200
Distant pairs (>5km)                  4,333    180/200
Pairs with no distance information    14,200   190/200
Inconsistent Geographical Information
◼ Several appropriate pairs are separated by a long
distance.
Type of inconsistency                                          Cases
Inconsistent geographic information for appropriate pairs
(e.g., large areas such as lakes, streams, …)                  150/200
Errors in Wikipedia and/or GeoNames                            30/200
Errors due to our link detection method                        20/200
Errors in Automatic Integration
◼ Variations in names
– The names of entities might not be represented in
English in GeoNames.
◼ Failure to estimate the appropriate administrative
code
– The Wikipedia category contains administrative information, but
the name of the administrative division differs from the one
used in GeoNames.
Errors in Original Data (Wikipedia and
DBpedia)
◼ Wikipedia infoboxes may include errors
– Wikipedia contains several coordinate errors
• Copy and paste
• Difficulty using the template (hidden parameters for
the type of longitude (E or W))
– DBpedia also contains many errors in coordinate
information
• DBpedia assumes coordinates are represented by 3
integers (degrees, minutes and seconds), but
some coordinates are given as float values.
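A conversion that tolerates both representations can be sketched as below. This is a minimal illustration, not DBpedia's extraction code: each component may be an integer or a float, and the hemisphere letter decides the sign. The function name and parameters are hypothetical.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    Each component may be an int or a float, so both "3 integers" and a
    single float number of degrees are handled the same way.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and West are negative in signed decimal degrees.
    return -value if hemisphere in ("S", "W") else value


print(dms_to_decimal(30, 30, 0, "N"))           # degrees, minutes, seconds
print(dms_to_decimal(3.7, hemisphere="W"))      # float degrees, western longitude
```

Forgetting the hemisphere parameter for a western longitude flips the sign, which is the kind of error the E/W template parameter can cause.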
Errors in Original Data (GeoNames)
◼ Inappropriate pairs between GeoNames and
Wikipedia in the original GeoNames database
– Failure to disambiguate entries with different
feature codes
e.g., a populated place is matched with the train station of the
city.
Other Issues for Linking Wikipedia and
GeoNames
◼ Different granularity levels of geographical
entities
– This makes the use of owl:sameAs links problematic.
◼ Wikipedia issues
– Geographical entities with multiple points
• A geographical entity covering a large area may contain
multiple points.
• Example: river (source, mouth, …)
– Wikipedia pages with multiple geographical entities
• A page about a large area may describe
multiple geographical entities.
• Example: mountain range pages contain
information about several mountains in the range
Other Issues for Linking Wikipedia and
GeoNames (cont.)
◼ GeoNames issues
– Geographical entities with multiple feature classes
• A single GeoNames entry corresponds to one
feature class.
• Example: “Milolii, Hawaii” has two corresponding
GeoNames entities (5851041: administrative
division and 5851402: populated place).
Summary
◼ Semantic Web
– To handle the semantic information of web pages,
the following are necessary:
• Ontology: concept definitions for understanding
differences in meaning across websites.
• Metadata annotation: metadata annotated as
structured data with a schema
definition.