Knowledge Base - 北海道大学 mhjcc3-ei.eng.hokudai.ac.jp/~yoshioka/kb/kby-4.pdf

Knowledge Base ―Semantic Web and Ontology (4)― Masaharu Yoshioka

Page 1:

Knowledge Base

―Semantic Web and Ontology (4)―

Masaharu Yoshioka

Page 2:

Answer to the last lecture question

◼ The following concepts and roles are defined for the question

– Concepts

• Female, Male, Human ≡ Male ⊔ Female, Animal

– Roles

• has-student, bred-by, eaten-by

◼ Define the following concepts using the given concepts and roles.

Page 3:

Answer to the last lecture question (cont.)

◼ Domestic animal (an animal that is bred by humans to be eaten by humans)

Animal ⊓ ∃eaten-by.Human ⊓ ∃bred-by.Human

◼ Teacher (a human who has a human student)

Human ⊓ ∃has-student.Human

◼ Female teacher

Female ⊓ ∃has-student.Human

◼ Teacher who has only male students

– Human ⊓ ∃has-student.Male ⊓ ∀has-student.Male
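The definitions above can be checked against a small interpretation. The sketch below reads a restriction ∃r.C as "has at least one r-filler in C" and ∀r.C as "every r-filler is in C"; all individuals and the has-student pairs are invented for illustration and do not come from the lecture.

```python
# Toy interpretation: every individual and role pair here is hypothetical.
MALE = {"taro", "ken"}
FEMALE = {"hanako", "yuki"}
HUMAN = MALE | FEMALE                 # Human ≡ Male ⊔ Female
DOMAIN = HUMAN | {"pochi"}            # pochi is a non-human individual

# The role has-student as a set of (teacher, student) pairs.
HAS_STUDENT = {
    ("hanako", "taro"),
    ("hanako", "ken"),
    ("taro", "yuki"),
    ("taro", "ken"),
}


def fillers(role, x):
    """Return all y such that (x, y) is in the role."""
    return {o for s, o in role if s == x}


def exists(role, concept):
    """Extension of ∃role.concept: at least one filler is in concept."""
    return {x for x in DOMAIN if fillers(role, x) & concept}


def forall(role, concept):
    """Extension of ∀role.concept: every filler is in concept."""
    return {x for x in DOMAIN if fillers(role, x) <= concept}


teacher = HUMAN & exists(HAS_STUDENT, HUMAN)
female_teacher = FEMALE & exists(HAS_STUDENT, HUMAN)
only_male_students = HUMAN & exists(HAS_STUDENT, MALE) & forall(HAS_STUDENT, MALE)
```

In this toy data, ken has no students, so ∀has-student.Male holds for him vacuously; the ∃has-student.Male conjunct is what keeps him out of the last concept.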

Page 4:

Comments from the last report

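◼ How do we write "a father who has only sons"?

◼ Can epistemology be interpreted as an antonym of ontology?

◼ How are DL languages and the like applied on computers?

◼ I am perfectly fine with English.

◼ Please increase the proportion spoken in Japanese. I would like Japanese to be the main language, with English as a supplement.

◼ I would like to prepare in advance, so please provide the slides earlier.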

Page 5:

Web Ontology: Background

◼ Web 2.0

– Large volume of user-generated content

• Wikipedia: Encyclopedia edited by volunteer editors

• GeoNames: Geographical database that can be edited by volunteer editors

– A variety of knowledge resources are available through the Web

◼ Basic information about concept hierarchy

– WordNet: Dictionary that represents the senses of strings (words)

• Defines the concept hierarchy and word-sense relationships (hypernym, hyponym, …)

Page 6:

Wikipedia

◼ Free Internet-based encyclopedia

– Metadata related to an article is organized in an Infobox

– Articles are categorized into multiple categories

Page 7:

DBpedia

◼ Metadata database based on Wikipedia articles

– Core part of Linked Open Data

◼ The quality of DBpedia depends on the quality of Wikipedia

Page 8:

DBpedia Facts & Figures

https://wiki.dbpedia.org/about/facts-figures

◼ Large information resource of things

– 4.58 million things

– 4.22 million are classified in a consistent ontology (DBpedia ontology)

• 1,445,000 persons

• 735,000 places

– 478,000 populated places

• 411,000 creative works

– 123,000 music albums, 87,000 films and 19,000 video games

• 241,000 organizations

– 58,000 companies and 49,000 educational institutions

• 251,000 species

• 6,000 diseases.

Page 9:

DBpedia Ontology

http://mappings.dbpedia.org/server/ontology/classes/

◼ Ontology for classifying the things defined in DBpedia

Page 10:

Data Example of DBpedia

◼ Metadata is added to the corresponding Wikipedia article in RDF format

<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .

<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .

<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Persondata> .

<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfDeath> "Chalcis"@en .

<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/dateOfDeath> "322 BC"@en .

<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfBirth> "Stageira"@en .
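Each statement above is a subject, predicate, object triple. A minimal sketch of reading such triples with the Python standard library follows; the regular expression is a simplification written for this example (it handles IRIs and language-tagged literals only, not blank nodes, datatypes, or escapes), and the helper names `parse` and `local_name` are hypothetical.

```python
import re

# Three of the statements from the slide, written one triple per line.
NT = '''<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfDeath> "Chalcis"@en .
<http://dbpedia.org/resource/Aristotle> <http://dbpedia.org/property/placeOfBirth> "Stageira"@en .
'''

# Simplified pattern: <subject> <predicate> then either <IRI> or "literal"@lang.
TRIPLE = re.compile(
    r'<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)"(?:@([A-Za-z-]+))?)\s*\.'
)


def parse(text):
    """Return (subject, predicate, object) tuples; literals become (value, lang)."""
    triples = []
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, iri, literal, lang = m.groups()
            obj = iri if iri is not None else (literal, lang)
            triples.append((s, p, obj))
    return triples


def local_name(iri):
    """Last path or fragment component of an IRI, e.g. 'placeOfBirth'."""
    return re.split(r"[/#]", iri)[-1]


triples = parse(NT)
facts = {local_name(p): o for _, p, o in triples}
```

For real data, a dedicated RDF library is the safer choice; this sketch only makes the triple structure visible.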

Page 11:

Wikidata

◼ Wikidata stores structured data that cannot be represented precisely enough with an Infobox

https://www.wikidata.org/wiki/Wikidata:Main_Page

Page 12:

WordNet

https://wordnet.princeton.edu/

◼ WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

– Synsets represent word senses.

◼ Synsets are interlinked by means of conceptual-semantic and lexical relations.

– Words (strings) with multiple senses (e.g., bridge) have multiple links with synsets.

– Synsets with synonyms (multiple words for one sense) have multiple links with words (strings).
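The many-to-many relation between words and synsets described above can be pictured as a list of word-sense pairs indexed in both directions. In the sketch below the synset identifiers are invented labels, not WordNet's own identifiers, and the five pairs are a hand-made sample.

```python
from collections import defaultdict

# Hand-made word-sense pairs; synset labels are hypothetical.
WORD_SENSE_PAIRS = [
    ("bridge", "synset-structure"),
    ("bridge", "synset-card-game"),
    ("span", "synset-structure"),
    ("car", "synset-motor-vehicle"),
    ("automobile", "synset-motor-vehicle"),
]

word_to_synsets = defaultdict(set)   # one word, possibly many senses
synset_to_words = defaultdict(set)   # one sense, possibly many words
for word, synset in WORD_SENSE_PAIRS:
    word_to_synsets[word].add(synset)
    synset_to_words[synset].add(word)


def senses(word):
    """All synsets linked to a word (empty list for unknown words)."""
    return sorted(word_to_synsets.get(word, set()))


def synonyms(word):
    """Other words that share at least one synset with the given word."""
    shared = set()
    for synset in word_to_synsets.get(word, set()):
        shared |= synset_to_words[synset]
    return sorted(shared - {word})
```

The counts "unique strings", "synsets", and "word-sense pairs" in WordNet statistics correspond to the keys of the first index, the keys of the second index, and the length of the pair list.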

Page 13:

Concept Definition of WordNet

http://wordnetweb.princeton.edu/perl/webwn

◼ Browsing WordNet using WordNet Search

– Search Results for “philosopher”

Page 14:

WordNet Statistics

https://wordnet.princeton.edu/documentation/wnstats7wn

◼ Version 3.0 (Version 3.1)

POS         Unique Strings     Synsets            Total Word-Sense Pairs
Noun        117798             82115              146312
Verb        11529              13767              25047
Adjective   21479              18156              30002
Adverb      4481               3621               5580
Totals      155287 (175979)    117659 (155327)    206941 (207016)

◼ There are several other language versions, including a Japanese version (Wn-Ja 1.1)

http://compling.hss.ntu.edu.sg/wnja/

– 57,238 synsets; 93,834 words; 158,058 word-sense pairs

– 135,692 definitions; 48,276 example sentences

Page 15:

Linked Open Data

◼ Links open data provided by various knowledge and information resources so that they can be used as one large database or knowledge base.

◼ One of the largest ontology-based knowledge bases, covering concept definitions and knowledge about instances.

Page 16:

Linking Open Data Cloud

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Page 17:

YAGO: Yet Another Great Ontology

◼ Web-based ontology constructed by using Wikipedia, WordNet, and GeoNames

◼ Class definitions are based on the Wikipedia categories, with WordNet used for handling the class hierarchy

– An appropriate class is selected for an article from the article's category list, with reference to the category

[Figure labels: Class: Philosopher; Category of Aristotle]

Page 18:

Integration of Geographical Information into YAGO

◼ GeoNames: Linked Open Data for Geographical Data

[Figure labels: large geographical database with more than 10,000,000 geographical entries; 19.8 million articles; YAGO2 (Yet Another Great Ontology 2)]

Page 19:

Integration method by YAGO2

[Hoffart et al., 2012]

◼ Integration by using name and coordinates

'Burgos' Province in Spain

http://en.wikipedia.org/wiki/Burgos

'Min River' River in China

http://en.wikipedia.org/wiki/Min_River_(Fujian)

Name matching

Coordinates matching

84,349 corresponding pairs have been found.
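A rough sketch of combining name matching with coordinate matching is shown below. It is not the YAGO2 implementation: the exact-name comparison, the 5 km threshold, and all coordinates are assumptions chosen for illustration. The distance uses the standard haversine formula on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_match(wiki, geo, max_km=5.0):
    """Name matching first, then coordinate matching (max_km is an assumed threshold)."""
    same_name = wiki["name"].casefold() == geo["name"].casefold()
    if not same_name:
        return False
    return haversine_km(wiki["lat"], wiki["lon"], geo["lat"], geo["lon"]) <= max_km


# Hypothetical records with approximate, made-up coordinates.
wiki_page = {"name": "Burgos", "lat": 42.34, "lon": -3.70}
geo_near = {"name": "Burgos", "lat": 42.35, "lon": -3.69}
geo_far = {"name": "Burgos", "lat": 16.05, "lon": 119.86}

near_ok = is_match(wiki_page, geo_near)
far_ok = is_match(wiki_page, geo_far)
```

Name matching alone would accept both GeoNames records here; the coordinate check is what separates the two same-named places.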

Page 20:

SPARQL Endpoint

◼ Most Linked Open Data sites provide a SPARQL endpoint.

– DBpedia

http://dbpedia.org/sparql

– DBpedia(Japanese)

http://ja.dbpedia.org/sparql

– Wikidata

https://query.wikidata.org/

– YAGO2

https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface

Page 21:

Example of Using SPARQL Endpoint

◼ DBpedia(Japanese)

http://ja.dbpedia.org/sparql

– Example

http://ja.dbpedia.org/

◼ YAGO2

https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/demo/
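A SPARQL endpoint is queried over HTTP by passing the query text in a `query` parameter. The sketch below only builds the request URL for the DBpedia endpoint listed earlier and does not contact the server; the example query and the `format` parameter (a convention of some endpoint software, assumed here to be accepted by this endpoint) are illustrative assumptions, and `build_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL from the slide

# Example query: list some properties and values stated about Aristotle.
QUERY = """
SELECT ?property ?value WHERE {
  <http://dbpedia.org/resource/Aristotle> ?property ?value .
} LIMIT 10
"""


def build_request_url(endpoint, query, result_format="application/sparql-results+json"):
    """Encode the query as URL parameters; result_format is an assumed option."""
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
# The URL could then be fetched, for example with urllib.request.urlopen(url),
# and the response parsed as JSON if the endpoint honours the format parameter.
```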

Page 22:

Supplemental material

◼ Another approach to creating a mapping between GeoNames and Wikipedia

Page 23:

Integration Method based on Wikipedia Category [Yoshioka et al., 2012]

◼ Wikipedia category for geographical entity

– <class information> (in|of) <location information>

e.g., Populated Place In Spain

◼ GeoNames

– Country and administrative code: location information

– Feature code: class information

'Burgos' Province in Spain

http://en.wikipedia.org/wiki/Burgos

Populated Place In Spain -> Populated Place In Burgos
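The category pattern <class information> (in|of) <location information> can be split with a regular expression. The sketch below is one possible reading of that pattern, not the authors' code; it splits at the first "in" or "of", which would mis-split class names that themselves contain those words.

```python
import re

# <class information> (in|of) <location information>, case-insensitive.
CATEGORY = re.compile(r"^(?P<cls>.+?)\s+(?:in|of)\s+(?P<loc>.+)$", re.IGNORECASE)


def split_category(name):
    """Return (class information, location information), or None if no match."""
    m = CATEGORY.match(name)
    if m is None:
        return None
    return m.group("cls"), m.group("loc")


examples = [
    "Populated Place In Spain",
    "Cities in Iowa",
    "Populated places in Henry County, Iowa",
]
parsed = [split_category(c) for c in examples]
```

Categories that carry no location part, such as a purely topical category, return None and would simply be skipped.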

Page 24:

Integration Method based on Wikipedia Category

◼ Use matching tables between feature codes and Wikipedia category class information

– Distance is not the first-priority information for selecting the appropriate corresponding entity.

'Narosura' populated place in Kenya

http://en.wikipedia.org/wiki/Narosura

Distance ○

Page 25:

Algorithms for link discovery

◼ Comparison of names for candidate pairs

Wikipedia: Rome, Iowa

Category: Cities in Iowa

Populated places in Henry County, Iowa

GeoNames (id, name (alternate name), feature class, country and administrative code)

6459720, Rome, PPL, US:IA:087

3169070, Roma (Rome), PPLC, IT:07:RM,

….

Page 26:

Algorithms for link discovery 2

◼ Extraction of information from category

Wikipedia: Rome, Iowa

Category: Cities in Iowa

Populated places in Henry County, Iowa

Class candidates:

City, Populated places → PPL, PPLC, ADM1, …

Location candidates:

Iowa → US:IA

Henry County, Iowa → US:IA

Page 27:

Algorithms for link discovery 3

◼ Selection of candidate pair

Wikipedia: Rome, Iowa

Category: Cities in Iowa

Populated places in Henry County, Iowa

GeoNames (id, name (alternate name), feature class, country and administrative code)

○ 6459720, Rome, PPL, US:IA:087

× 3169070, Roma (Rome), PPLC, IT:07:RM, ….
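The selection step can be sketched as a filter over GeoNames candidates: keep an entry when its name matches, its feature code is among the class candidates, and its administrative code falls under one of the location candidates. The two entries are the ones shown on the slide; the `GeoEntry` type, the `select_candidates` function, and the prefix test on administrative codes are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoEntry:
    geoname_id: int
    names: tuple          # name plus alternate names
    feature_code: str     # e.g. PPL, PPLC
    admin_code: str       # country and administrative code, e.g. US:IA:087


def select_candidates(page_name, class_codes, location_codes, entries):
    """Keep entries matching name, class candidates, and location candidates."""
    selected = []
    for entry in entries:
        name_ok = page_name in entry.names
        class_ok = entry.feature_code in class_codes
        location_ok = any(
            entry.admin_code == loc or entry.admin_code.startswith(loc + ":")
            for loc in location_codes
        )
        if name_ok and class_ok and location_ok:
            selected.append(entry)
    return selected


entries = [
    GeoEntry(6459720, ("Rome",), "PPL", "US:IA:087"),
    GeoEntry(3169070, ("Roma", "Rome"), "PPLC", "IT:07:RM"),
]

# Candidates extracted from the categories of the page "Rome, Iowa".
selected = select_candidates("Rome", {"PPL", "PPLC", "ADM1"}, {"US:IA"}, entries)
```

Both entries pass the name and class tests; only the location candidate derived from the categories rules out the Italian entry, without using any distance information.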

Page 28:

Algorithms for link discovery 4

◼ Elimination of low-precision data

– 1-to-N mappings (these may include errors)

• Multiple Wikipedia pages for a single GeoNames entry

• Multiple GeoNames entries for a single Wikipedia page
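One way to read this elimination step is to drop every pair whose Wikipedia page or GeoNames entry takes part in more than one pair. The sketch below implements that reading; the function name and the sample pairs (apart from the GeoNames ids that appear elsewhere in these slides) are hypothetical.

```python
from collections import Counter


def keep_one_to_one(pairs):
    """Keep only pairs whose page and GeoNames id each occur in exactly one pair."""
    unique_pairs = list(dict.fromkeys(pairs))          # drop exact duplicates, keep order
    page_count = Counter(page for page, _ in unique_pairs)
    geo_count = Counter(geo_id for _, geo_id in unique_pairs)
    return [
        (page, geo_id)
        for page, geo_id in unique_pairs
        if page_count[page] == 1 and geo_count[geo_id] == 1
    ]


candidate_pairs = [
    ("Rome, Iowa", 6459720),
    ("Milolii, Hawaii", 5851041),   # one page, two GeoNames entries
    ("Milolii, Hawaii", 5851402),
    ("Page A", 1),                  # two invented pages, one invented entry
    ("Page B", 1),
]
kept = keep_one_to_one(candidate_pairs)
```

This trades recall for precision: some of the discarded 1-to-N pairs are correct, but they cannot be told apart from the erroneous ones without further evidence.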

Page 29:

Results of Automatic Integration

◼ Classify integration results by using distance information

– Wikipedia coordinate information is extracted by using DBpedia and GeoHack

Types of pairs                        Pages     Manual evaluation
Nearby pairs (<= 5 km)                26,047    200/200
Distant pairs (> 5 km)                4,333     180/200
Pairs with no distance information    14,200    190/200

Page 30:

Inconsistent Geographical Information

◼ Several appropriate (correct) pairs are separated by a long distance.

Type of inconsistency                                                                                    Cases
Inconsistent geographic information for appropriate pairs (e.g., large areas such as lakes, streams, …)    150/200
Errors in Wikipedia and/or GeoNames                                                                      30/200
Errors due to our link detection method                                                                  20/200

Page 31:

Errors in Automatic Integration

◼ Variations in names

– The names of entities might not be represented in English in GeoNames.

◼ Failure to estimate the appropriate administrative code

– The Wikipedia category contains administrative information, but the administrative name differs from the one used in GeoNames.

Page 32:

Errors in Original Data (Wikipedia and DBpedia)

◼ Wikipedia infoboxes may include errors

– There are several coordinate errors in Wikipedia

• Copy and paste

• Difficulty in using the template (hidden parameters for the longitude type (E or W))

– DBpedia also contains many errors in coordinate information

• DBpedia assumes coordinates are represented by three integers (degrees, minutes, and seconds), but some coordinates are given as float values.
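The two error sources above (the E/W parameter and float values in place of integer degrees, minutes, and seconds) can be illustrated with a small conversion sketch. The function `integer_only_reading` is a hypothetical illustration of how an integer assumption loses information; it is not DBpedia's extraction code, and the sample values are made up.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South and West are negative; each component may be an int or a float.
    """
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def integer_only_reading(degrees, minutes=0, seconds=0, hemisphere="N"):
    """Hypothetical reader that assumes three integers and truncates floats."""
    return dms_to_decimal(int(degrees), int(minutes), int(seconds), hemisphere)


# A longitude written as a single float value, 3.7 degrees West.
correct = dms_to_decimal(3.7, hemisphere="W")
truncated = integer_only_reading(3.7, hemisphere="W")
wrong_hemisphere = dms_to_decimal(3.7, hemisphere="E")   # E/W parameter set incorrectly

lost_degrees = abs(correct - truncated)
```

A loss of 0.7 degrees of longitude amounts to tens of kilometres at mid latitudes, and a wrong E/W flag mirrors the point across the prime meridian, so either error pushes an otherwise correct pair far beyond a few-kilometre distance check.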

Page 33:

Errors in Original Data (GeoNames)

◼ Inappropriate pairs between GeoNames and Wikipedia in the original GeoNames database

– Failure to disambiguate entries with different feature codes

e.g., a populated place is matched with the train station of the city.

Page 34:

Other Issues in Linking Wikipedia and GeoNames

◼ Different granularity levels of geographical entities

– This makes it problematic to use owl:sameAs links.

◼ Wikipedia issues

– Geographical entities with multiple points

• A geographical entity covering a large area may contain multiple points.

• Example: river (source, mouth, …)

– Wikipedia pages with multiple geographical entities

• A geographical entity covering a large area may contain multiple points.

• Example: mountain range pages contain information about several mountains in the range

Page 35:

Other Issues in Linking Wikipedia and GeoNames (cont.)

◼ GeoNames issues

– Geographical entities with multiple feature classes

• A single GeoNames entry corresponds to one feature class.

• Example: "Milolii, Hawaii" has two corresponding GeoNames entities (5851041: administrative division and 5851402: populated place).

Page 36:

Summary

◼ Semantic Web

– To handle the semantic information of web pages, it is necessary to have

• Ontology: concept definitions for understanding differences in meaning across websites.

• Metadata annotation: metadata annotated as structured data with a schema definition.