807 - TEXT ANALYTICS
Massimo Poesio
Lecture 7 Wikipedia for Text Analytics
WIKIPEDIA
• Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation
• Wikipedia's articles have been written collaboratively by volunteers around the world
• Almost all of its articles can be edited by anyone who can access the Wikipedia website
"The free encyclopedia that anyone can edit"
-- http://en.wikipedia.org/wiki/Wikipedia
WIKIPEDIA
• Wikipedia is
  1. domain independent – it has a large coverage
  2. up-to-date – to process current information
  3. multilingual – to process information in many languages
• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links
• Other languages
• Other wiki pages
• To the web
• Redirects
• Disambiguation pages
WIKIPEDIA FOR TEXT ANALYTICS
• Wikipedia has proven an extremely useful resource for text analytics, being used for
  – Text classification / clustering
  – Enriching documents through 'Wikification'
  – NER
  – Relation extraction
  – …
Wikipedia as Thesaurus for text classification / clustering
• Unlike other standard ontologies, such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus
• However, it is more…
  – Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
  – Accurate: a study by Giles (2005) found Wikipedia can compete with Encyclopædia Britannica in accuracy
  – Up to date: current and emerging concepts are absorbed in a timely manner
Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article describes only a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
    Example: the Wikipedia article that describes the concept "Artificial intelligence"
  – Equivalent concepts are grouped together by redirect links
    Example: "AI" is redirected to its equivalent concept "Artificial Intelligence"
  – It contains a hierarchical categorization system in which each article belongs to at least one category
    Example: the concept "Artificial Intelligence" belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
  – Polysemous concepts are disambiguated by Disambiguation Pages
    Example: the different meanings that "Artificial intelligence" may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g., computer science, psychology)
  – Books
  – Learning objects
"The United States was involved in the Cold War"
Concept / category                          Score
United States                               0.3793
Cold War                                    0.3111
Vietnam War                                 0.0023
World War I                                 0.0023
Communism                                   0.0027
Ronald Reagan                               0.0027
Michail Gorbachev                           0.0023
Cat: Wars Involving the United States       0.00779
Cat: Global Conflicts                       0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
• Text
WIKIFICATION
• Wikipedia
Truc-Vien T. Nguyen, May 2012
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g., "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g., "the", "a" are included)
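The candidate-extraction step can be sketched as n-gram generation followed by a lookup in the surface-form vocabulary. This is only an illustration: the function names and the tiny vocabulary are hypothetical, and a real system would load the surface forms from a Wikipedia dump.

```python
import re


def ngram_candidates(text, max_n=3):
    """Return all word n-grams (n = 1..max_n) of `text`, lowercased."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


def filter_by_vocabulary(candidates, vocabulary):
    """Keep candidates found in the (semi-controlled) vocabulary, without duplicates."""
    seen = set()
    kept = []
    for candidate in candidates:
        if candidate in vocabulary and candidate not in seen:
            seen.add(candidate)
            kept.append(candidate)
    return kept


# Toy vocabulary standing in for Wikipedia titles and anchor texts (assumption).
toy_vocabulary = {"giotto", "padua", "rimini"}
sentence = "Giotto was called to work in Padua and also in Rimini"
print(filter_by_vocabulary(ngram_candidates(sentence), toy_vocabulary))
```

Because the vocabulary is broad, this step over-generates on purpose; the ranking step is what discards weak candidates.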
Keyword Extraction using Wikipedia
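2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$
  where count(D_key) is the number of documents in which the term W was selected as a keyword, and count(D_W) is the number of documents in which it appears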
Our own Approach (Cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• The probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22,631 / 75,000 ≈ 0.3017
  • Cfr. the term "the": keyphraseness = 0.0006
$$\text{Keyphraseness}(t) = \frac{\text{count}([\_ \mid t])}{\text{count}(t)}$$
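As a minimal sketch, the keyphraseness ratio can be computed directly from the two counts. The function name and the zero-count guard are assumptions made for illustration; the numbers are the ones given for "Georgia".

```python
def keyphraseness(link_count, occurrence_count):
    """Fraction of the articles containing a term in which the term is a link."""
    if occurrence_count == 0:
        return 0.0  # unseen term: treat as never being a keyphrase
    return link_count / occurrence_count


# "Georgia": a link in 22,631 of the 75,000 articles in which it appears.
print(round(keyphraseness(22631, 75000), 4))  # 0.3017
```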
COMMONNESS
• The probability that a term t is a link to a SPECIFIC Wikipedia article a
• For example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
  commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783
$$\text{Commonness}(t, a) = \frac{\text{count}([a \mid t])}{\text{count}(t)}$$
  (in the worked example, count(t) is the number of times t occurs as a link)
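The worked example can be reproduced with a small sketch that normalises by the total number of link occurrences of the surface form, as the example does. The function name and the dictionary layout are illustrative assumptions.

```python
def commonness(link_counts, article):
    """Share of a surface form's link occurrences that point to `article`."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    return link_counts.get(article, 0) / total


# Link targets observed for the surface form "Georgia" (figures from the slide).
georgia_links = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}

print(round(commonness(georgia_links, "University_of_Georgia"), 4))  # 0.8783
```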
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
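A rough sketch of the link-extraction step, assuming pages are available as raw wikitext strings. The regular expression and helper names are hypothetical, not the parser used for the figures in this lecture; links with more than one pipe (such as file embeds) are simply skipped.

```python
import re
from collections import Counter, defaultdict

# Matches [[target]] and [[target|surface form]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Yield (target_article, surface_form) for each wiki link in the text."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        yield target, surface


def build_dictionary(pages):
    """Map each surface form to a Counter of the articles it links to."""
    counts = defaultdict(Counter)
    for text in pages:
        for target, surface in extract_links(text):
            counts[surface][target.replace(" ", "_")] += 1
    return counts


print(list(extract_links("[[Pedanius Dioscorides|Dioscorides]] lived in [[Rome]].")))
```

The resulting per-surface-form counters are exactly the input needed to compute commonness.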
The Wikipedia Dump from July 2011
  – 11,459,639 pages in total
  – 12,525,583 links
    • specifying surface word, target, frequency
  – ranked by frequency
• For example, the mention "Georgia" is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms titles articles
Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
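The two steps of the unsupervised approach can be sketched as a filter followed by a ranking. This is an illustration under stated assumptions: the function name and the dictionary-based inputs are hypothetical, and the defaults simply mirror the threshold and cut-off mentioned above.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_k=10):
    """Keep terms above the keyphraseness threshold; rank their targets by commonness."""
    results = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # term is rarely used as a link: not a candidate
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(
            ((count / total, article) for article, count in counts.items()),
            reverse=True,
        )
        results[term] = [(article, score) for score, article in ranked[:top_k]]
    return results


stats = {"Georgia": 0.3017, "the": 0.0006}
targets = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    }
}
print(wikify_unsupervised(["Georgia", "the"], stats, targets))
```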
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\text{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
  where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles
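The link-based measure can be sketched as follows, taking the sets of articles that link to each candidate and the total number of Wikipedia articles. The function name and the handling of the edge cases (no shared in-links, zero denominator) are assumptions; the slide does not specify them.

```python
import math


def normalized_link_distance(links_a1, links_a2, total_articles):
    """Link-based distance between two articles from their incoming-link sets."""
    a1, a2 = set(links_a1), set(links_a2)
    overlap = len(a1 & a2)
    if overlap == 0:
        return math.inf  # no shared in-links: maximally distant (assumption)
    numerator = math.log(max(len(a1), len(a2))) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return 0.0  # both sets cover every article
    return numerator / denominator


# Toy in-link sets identified by integer article ids (hypothetical data).
print(normalized_link_distance({1, 2, 3, 4}, {3, 4, 5}, total_articles=100))
```

Smaller values mean the two articles share more of their incoming links, so a lower score indicates more closely related candidates.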
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
Approach               AQUAINT   Wikipedia
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
Correct candidate is   Result
First candidate        64.77
Among first 2          71.59
Among first 3          75.42
Among first 4          77.18
Among first 5          78.32
Accuracy up by 17 points (36%)
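The "correct candidate is among the first k" figures correspond to an accuracy-at-k measure. A minimal sketch of how such a score could be computed is given below; the function name and the toy data are hypothetical and unrelated to the Bridgeman results.

```python
def accuracy_at_k(ranked_candidates, gold_articles, k):
    """Fraction of queries whose gold article is among the first k candidates."""
    if not gold_articles:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_articles)
        if gold in candidates[:k]
    )
    return hits / len(gold_articles)


# Toy ranked outputs for four queries and their gold articles (made-up data).
ranked = [["A", "B"], ["C", "D"], ["E", "F"], ["G"]]
gold = ["A", "D", "X", "G"]
for k in (1, 2):
    print(k, accuracy_at_k(ranked, gold, k))
```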
Results for the GALATEAS languages and Arabic
Language   Wikipedia size   Results (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: from the mention to the Wikipedia page it links to
  • Negative examples: from similar mentions with other links
• Use positive and negative examples to train a model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
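The Giotto dataset above can be generated mechanically from a linked mention and its list of candidate articles. The sketch below shows one way to do so; the function name and the tuple layout are assumptions, not the format of the actual training files.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the article actually linked as +1 and every other candidate as -1."""
    examples = []
    for article in candidate_articles:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
for example in make_training_examples(
    "Giotto was called to work", "Giotto_di_Bondone", candidates
):
    print(example)
```

No manual annotation is involved: the link chosen by the Wikipedia editors serves as the gold label.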
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
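Since an infobox is a list of "| key = value" fields, its attributes can be pulled out with very little code. The sketch below assumes one field per line, as in raw wikitext; the function name is hypothetical, and real infoboxes with multi-line or nested values would need a proper wikitext parser.

```python
def parse_infobox(wikitext):
    """Return a dict of infobox fields, assuming one '| key = value' per line."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name, closing braces, and blank lines
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| magnum_opus = The Raven
}}"""

print(parse_infobox(sample)["magnum_opus"])  # The Raven
```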
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
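The Calgary example can be pictured as a set of (subject, predicate, object) triples. The sketch below stores them as plain Python tuples and answers simple pattern queries; it is only an illustration of the data model, the "dbpedia:" prefix on the last two predicates is an assumption, and real applications would use an RDF library instead.

```python
# Triples for the Calgary example; literals are strings, resources carry a prefix.
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every component that is not None."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


print(match(TRIPLES, predicate="dbpedia:altitude"))
```

Leaving a component as None plays the role of a variable in a query pattern.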
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
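As an illustration, the "all films by Quentin Tarantino" question could be phrased as a SPARQL query and encoded into a request URL for the endpoint. The property and resource names (dbo:director, dbr:Quentin_Tarantino) and the format parameter are assumptions about the current DBpedia vocabulary and server, not taken from the slides; the sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",  # assumed result format
    }
    return endpoint + "?" + urlencode(params)


print(build_request_url(QUERY))
```

The resulting URL can be fetched with any HTTP client to obtain the bindings for ?film.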
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
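The matching rule can be sketched as a search for the first string both players typed. This is a simplification made for illustration: the function names are hypothetical, guesses are compared after lowercasing and trimming, and the timing of the real game is ignored.

```python
def normalize(label):
    """Lowercase and trim a typed label so trivial variants still match."""
    return label.strip().lower()


def agreed_label(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed, else None."""
    typed_by_b = {normalize(guess) for guess in guesses_b}
    for guess in guesses_a:
        if normalize(guess) in typed_by_b:
            return normalize(guess)
    return None


print(agreed_label(["dog", "Grass", "ball"], ["puppy", "grass "]))  # grass
```

The agreed string becomes a label for the image, which is how playing the game produces annotations as a side effect.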
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea & A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio. 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• L. von Ahn & L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM 51(8): 58–67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
WIKIPEDIA
bullWikipedia is a free multilingual encyclopedia project supported by the non-profit Wikimedia FoundationbullWikipedias articles have been written collaboratively by volunteers around the worldbullAlmost all of its articles can be edited by anyone who can access the Wikipedia website
The free encyclopedia that anyone can edit
----httpenwikipediaorgwikiWikipeida
WIKIPEDIA
bull Wikipedia is
1 domain independentndash it has a large coverage
2 up-to-datendash to process current information
3 multilingualndash to process information in many languages
bullTitle
bullAbstract
bullInfoboxes
bullGeo-coordinates
bullCategories
bullImages
bullLinks
bullOther languages
bullOther wiki pages
bullTo the web
bullRedirects
bullDisambiguates
WIKIPEDIA FOR TEXT ANALYTICS
bull Wikipedia has proven an extremely useful resource for text analytics being used forndash Text classification clusteringndash Enriching documents through lsquoWikificationrsquondash NERndash Relation extraction ndash hellip
Wikipedia as Thesaurus for text classification clusteringbull Unlike other standard ontologies such as WordNet
and Mesh Wikipedia itself is not a structured thesaurus
bull However it is morehellipndash Comprehensive it contains 12 million articles (28
million in the English Wikipedia) ndash Accurate A study by Giles (2005) found Wikipedia can
compete with Encyclopaeligdia Britannica in accuracyndash Up to date Current and emerging concepts are
absorbed timely
Giles J 2005 Internet encyclopaedias go head to head Nature 438 900ndash901
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurus
Wikipedia Article that describes the Concept Artificial intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected links
AI is redirected to its equivalent concept Artificial Intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system
in which each article belongs to at least one category
The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system in
which each article belongs to at least one category ndash Polysemous concepts are disambiguated by
Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
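Figures of this kind ("correct candidate is first", "among first 2", and so on) are accuracy-at-k scores. A minimal sketch of how such a score could be computed is given below; the function name, the query identifiers and the article names are hypothetical and do not come from the Bridgeman data.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold article is among the first k candidates."""
    hits = sum(1 for query, ranking in ranked_candidates.items()
               if gold[query] in ranking[:k])
    return hits / len(ranked_candidates)

# Hypothetical system output: each query maps to its ranked candidate articles.
ranked = {
    "q1": ["A", "B", "C"],
    "q2": ["B", "A", "C"],
    "q3": ["C", "B", "A"],
    "q4": ["D", "E", "F"],
}
gold = {"q1": "A", "q2": "A", "q3": "A", "q4": "Z"}
curve = [accuracy_at_k(ranked, gold, k) for k in (1, 2, 3)]
```

The score can only grow with k, which matches the shape of the table.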
Results for the GALATEAS languages and Arabic
LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: a mention paired with the Wikipedia page it links to
  • Negative examples: similar mentions paired with the other pages they link to
• Use positive and negative examples to train model
The Supervised Approach: using Wikipedia links to generate training data
• Example
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
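The dataset construction above can be sketched in a few lines. The function name and the tuple layout are assumptions made for illustration; in a full system the candidate list would come from the surface-form dictionary extracted from the dump.

```python
def make_training_examples(context, gold_article, candidates):
    """Label the article actually linked +1 and every other candidate -1."""
    return [(+1 if article == gold_article else -1, context, article)
            for article in candidates]

# Candidate articles for the mention "Giotto", as listed on the slide.
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    "Giotto was called to work", "Giotto_di_Bondone", candidates)
```

Each linked mention in Wikipedia yields one positive example and as many negatives as there are competing candidates, so no manual annotation is needed.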
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
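Because infoboxes are semi-structured, attribute-value pairs can be read off with very little machinery. The sketch below assumes one `| key = value` field per line and keeps values as raw wiki markup; the function name is hypothetical, and real extractors such as DBpedia's handle many more cases.

```python
def parse_infobox(markup):
    """Map field names to raw values for lines of the form '| key = value'."""
    fields = {}
    for line in markup.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            # Split on the first '=' only, so values may contain '=' themselves.
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
    return fields

sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""
poe = parse_infobox(sample)
```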
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
The DBpedia Dataset
• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council
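The Calgary example amounts to a small set of subject-predicate-object triples. A minimal in-memory sketch is shown below; it uses plain tuples rather than an RDF library, the `dbpedia:` prefix notation is a reading of the slide, and the helper name `objects` is hypothetical.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples; literals are kept as plain strings here.
triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
]

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]

altitude = objects(triples, CALGARY, "dbpedia:altitude")
```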
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
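One of the example questions, "all films by Quentin Tarantino", could be phrased roughly as follows. This is a sketch only: the property `dbo:director` and the resource `dbr:Quentin_Tarantino` are assumed identifiers that do not appear in the slides, and the code builds the request URL without sending it.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# dbo:director and dbr:Quentin_Tarantino are assumed identifiers, not from the slides.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the other example questions would differ only in the triple patterns inside the WHERE clause.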
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
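Assertions of this form are binary relations over short phrases, so they can be held as simple triples and indexed by concept. The sketch below is an illustration only: the split of each assertion into two arguments is a reading of the table, and the index layout is an assumption rather than ConceptNet's actual storage format.

```python
from collections import defaultdict

# A few assertions from the table, as (relation, first argument, second argument).
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
]

# Index by first argument so that everything known about a concept is one lookup.
index = defaultdict(list)
for relation, first, second in assertions:
    index[first].append((relation, second))

about_apple = index["apple"]
```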
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
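The matching rule can be sketched as a set intersection over the strings each player has typed. The function name is hypothetical, and comparing labels case-insensitively after trimming whitespace is an assumption, since the slides do not say how strings are normalised.

```python
def matching_labels(typed_by_a, typed_by_b):
    """Labels typed by both players, compared case-insensitively (an assumption)."""
    def normalise(label):
        return label.strip().lower()
    return {normalise(x) for x in typed_by_a} & {normalise(x) for x in typed_by_b}

# Hypothetical guesses from two players looking at the same image.
agreed = matching_labels(["Dog", "grass", "ball"], ["puppy", "Ball "])
```

Any non-empty result means the pair has agreed on a label, which is what makes the label trustworthy: two people who cannot communicate produced it independently.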
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen and M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti and M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase and M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user
www.phrasedetectives.com
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• Nguyen, V. & Poesio, M. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M., Poesio, M. 2013. GALATEAS D2W: A multi-lingual Disambiguation to Wikipedia web service. Proc. of ENRICH.
• Nastase, V. & Strube, M. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• von Ahn, L. and Dabbish, L. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links
  – Other languages
  – Other wiki pages
  – To the web
• Redirects
• Disambiguates
WIKIPEDIA FOR TEXT ANALYTICS
• Wikipedia has proven an extremely useful resource for text analytics, being used for
  – Text classification / clustering
  – Enriching documents through 'Wikification'
  – NER
  – Relation extraction
  – …
Wikipedia as Thesaurus for text classification / clustering
• Unlike other standard ontologies, such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus
• However, it is more…
  – Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
  – Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy
  – Up to date: current and emerging concepts are absorbed in a timely manner

Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438, 900–901.
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article describes only a single concept
    (example: the Wikipedia article that describes the concept Artificial intelligence)
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirect links
    (example: AI is redirected to its equivalent concept, Artificial intelligence)
  – It contains a hierarchical categorization system, in which each article belongs to at least one category
    (example: the concept Artificial intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society)
  – Polysemous concepts are disambiguated by Disambiguation Pages
    (example: the different meanings that Artificial intelligence may refer to are listed in its disambiguation page)
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use the Wikipedia category system as category repertoire
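As an illustration of the first possibility, the sketch below computes a cosine similarity between a document and the text of a Wikipedia page for each topic, and uses the resulting values as features. It assumes a plain bag-of-words representation; the page texts are hypothetical stand-ins, and the slides do not say which similarity measure was actually used.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a deliberately simple tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for the text of Wikipedia pages on two topics.
topic_pages = {
    "Cold War": "cold war tension between the united states and the soviet union",
    "Psychology": "psychology is the study of mind and behavior",
}

document = "The United States was involved in the Cold War"
doc_vec = bag_of_words(document)
# One similarity feature per topic page.
features = {
    topic: cosine(doc_vec, bag_of_words(text))
    for topic, text in topic_pages.items()
}
best_topic = max(features, key=features.get)
```

In a classifier these similarity values would be added to the document's feature vector rather than used on their own.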
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects

Example: "The United States was involved in the Cold War"
  United States                            0.3793
  Cold War                                 0.3111
  Vietnam War                              0.0023
  World War I                              0.0023
  Communism                                0.0027
  Ronald Reagan                            0.0027
  Mikhail Gorbachev                        0.0023
  Cat: Wars Involving the United States    0.00779
  Cat: Global Conflicts                    0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents

WIKIFICATION
• Text
• Wikipedia
Example: Giotto was called to work in Padua, and also in Rimini

May 2012, Truc-Vien T. Nguyen
Wikification pipeline
[Figure: pipeline diagram. Raw (hyper)text → Decomposition → Clean text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data-driven: Naive Bayes trained on Wikipedia; Voting) → Annotated text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information-theoretic
      – Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)
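Candidate extraction against a controlled vocabulary can be sketched as an n-gram lookup. The dictionary below is a hypothetical toy stand-in for the set of Wikipedia titles and anchor texts, and the maximum n-gram length of 3 is an arbitrary assumption.

```python
import re

# Hypothetical toy vocabulary; in the setting described above it would hold
# the Wikipedia titles and anchor texts, mapped to article titles.
SURFACE_FORMS = {
    "usa": "United States of America",
    "us": "United States of America",
    "cold war": "Cold War",
    "padua": "Padua",
}


def extract_candidates(text, vocabulary, max_len=3):
    """Return (start_token, n-gram) pairs whose text is a known surface form."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            ngram = " ".join(tokens[start:start + length])
            if ngram in vocabulary:
                candidates.append((start, ngram))
    return candidates


found = extract_candidates("The USA was involved in the Cold War", SURFACE_FORMS)
```

Because the real vocabulary also contains words such as "the" and "a", the candidates produced this way still have to be ranked before any of them is kept.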
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$
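The first ranking option, tf.idf with Wikipedia articles as the document collection, can be sketched as follows. The three-article collection is a hypothetical toy example, and the add-one smoothing in the idf term is an assumption made here to avoid division by zero, not something stated in the slides.

```python
import math
from collections import Counter


def tf_idf_scores(doc_tokens, collection):
    """Score each distinct token of doc_tokens by tf * idf.

    collection is a list of token lists standing in for Wikipedia articles.
    The idf term uses add-one smoothing (an assumption).
    """
    n_docs = len(collection)
    doc_freq = Counter()
    for article in collection:
        doc_freq.update(set(article))
    tf = Counter(doc_tokens)
    return {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in tf.items()
    }


# Hypothetical toy collection of three "articles".
collection = [
    ["the", "painter", "giotto"],
    ["the", "city", "of", "padua"],
    ["the", "cold", "war"],
]
scores = tf_idf_scores(["the", "giotto", "the", "padua"], collection)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A word that occurs in every article, such as "the" here, gets an idf of zero and drops to the bottom of the ranking even though it is the most frequent token in the document.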
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
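The Georgia example can be reproduced directly from the two counts. The function name and the guard against a zero denominator are choices made for this sketch.

```python
def keyphraseness(articles_with_link, articles_with_term):
    """Fraction of the articles containing a term in which it appears as a link."""
    if articles_with_term == 0:
        return 0.0
    return articles_with_link / articles_with_term


# Counts for the term "Georgia", taken from the example above.
georgia = keyphraseness(22631, 75000)
print(round(georgia, 4))  # 0.3017
```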
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
  commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
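A short sketch of the commonness computation, following the worked example, where the denominator is the total number of links recorded for the surface form. The function name is an assumption.

```python
from collections import Counter

# Link counts for the surface form "Georgia", taken from the example above.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})


def commonness(link_counts):
    """Share of a surface form's links that point to each target article."""
    total = sum(link_counts.values())
    return {article: n / total for article, n in link_counts.items()}


scores = commonness(georgia_links)
print(round(scores["University_of_Georgia"], 4))  # 0.8783
```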
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
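Extracting [article|surface_form] pairs from wiki markup can be sketched with a regular expression. This is a simplified illustration, not the parser used for the figures in these slides: it ignores nested markup and skips every link whose target contains a colon, which is an assumption made here to drop File: and Category: links.

```python
import re
from collections import Counter

# Matches [[target]] or [[target|surface form]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Yield (article, surface_form) pairs for every internal link."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        if ":" in target:
            # Simplifying assumption: skip namespaced links such as File:.
            continue
        yield target.replace(" ", "_"), surface


sample = (
    "[[Pedanius Dioscorides|Dioscorides]] wrote about plants. "
    "[[Giotto di Bondone|Giotto]] worked in [[Padua]]. "
    "See [[File:Example.jpg]]."
)
link_counts = Counter(extract_links(sample))
```

Counting the pairs over a whole dump gives the surface form, target and frequency records from which keyphraseness and commonness are computed.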
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  • ranked by frequency
• for example, the mention "Georgia" is linked to
  – University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
Some statistics (all Wikidumps from July 2011)

Page Type         English       Italian      Polish
Redirected        4,465,652     323,591      134,148
List_of           138,581       836          5,021
Disambiguation    176,721       6,193        4,553
Relevant          4,361,020     917,354      920,486
Total             11,459,639    1,654,258    1,200,313
Surface forms, titles, articles

Dictionary       English       Italian      Polish
Titles           4,361,020     917,354      920,486
Surface forms    8,829,624     2,484,045    2,482,104
Files            745,724       72,126       n/a
Links            10,871,741    2,917,235    2,937,981

Files in Polish are arranged in a repository different from English/Italian

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
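The two steps can be combined in a short sketch. The function and table names are hypothetical, the link counts for "the" are invented for illustration, and "retain top 10" is read here as keeping the ten most common target articles per term, which is an interpretation of the slide.

```python
def wikify_unsupervised(candidates, keyphraseness, link_counts,
                        threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold and rank their targets.

    keyphraseness: dict term -> keyphraseness score
    link_counts:   dict term -> dict article -> number of links
    Returns dict term -> list of (article, commonness), best first.
    """
    result = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        result[term] = [(article, n / total) for article, n in ranked[:top_n]]
    return result


keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
link_table = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
    "the": {"The": 3},  # invented counts, for illustration only
}
links = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, link_table)
```

With the 0.01 threshold, "the" is discarded before ranking, and "Georgia" is linked to its most common target first.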
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
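The formula above can be evaluated directly on sets of in-linking articles. In this sketch A1 and A2 are assumed to be the sets of articles that link to a1 and a2, and W the set of all Wikipedia articles, as in Milne & Witten's measure; the function name and the handling of the empty-intersection case are assumptions.

```python
import math


def link_distance(inlinks_a, inlinks_b, n_articles):
    """Evaluate the formula above on two sets of in-linking article ids.

    Returns None when the sets share no article, because log(0) is
    undefined; how to treat that case is left to the caller.
    """
    shared = len(inlinks_a & inlinks_b)
    if shared == 0:
        return None
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    return (math.log(larger) - math.log(shared)) / (
        math.log(n_articles) - math.log(smaller)
    )


# Hypothetical article ids linking to two candidate articles.
a1 = {1, 2, 3, 4}
a2 = {3, 4, 5}
value = link_distance(a1, a2, n_articles=1000)
```

As written, the formula yields 0 when the two sets of in-links are identical and grows as the overlap shrinks, so smaller values indicate more closely related articles.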
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as the source of training materials (see next)
Results on standard datasets

APPROACH                 AQUAINT    WIKIPEDIA
Our approach             85.66      84.37
Milne & Witten 2008      83.61      80.31
Ratinov et al. 2011      84.52      90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32

Accuracy up by 17 points (36%)
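The rows of this table measure how often the correct article appears among the first k ranked candidates. A minimal sketch of that measure follows; the queries, candidates and gold answers are hypothetical toy data, not the Bridgeman set.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold article is among the first k candidates."""
    hits = sum(
        1 for candidates, answer in zip(ranked_candidates, gold)
        if answer in candidates[:k]
    )
    return hits / len(gold)


# Hypothetical toy data: three queries with ranked candidate articles.
ranked = [
    ["Giotto_di_Bondone", "Giotto_Griffiths"],
    ["Republic_of_Georgia", "University_of_Georgia"],
    ["Padua", "Rimini"],
]
gold = ["Giotto_di_Bondone", "University_of_Georgia", "Venice"]
acc1 = accuracy_at_k(ranked, gold, 1)
acc2 = accuracy_at_k(ranked, gold, 2)
```

The value can only stay the same or grow as k increases, which matches the monotone pattern in the table.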
Results for the GALATEAS languages and Arabic

LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: each mention paired with the Wikipedia page it links to
  • Negative examples: the same mention paired with other pages that similar mentions link to
• Use the positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
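The construction of the dataset can be sketched as follows: the article the mention actually links to becomes the positive example, and every other candidate article for the same surface form becomes a negative one. The function name and the tuple layout are assumptions made for illustration.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the linked article +1 and every other candidate -1."""
    return [
        (+1 if article == linked_article else -1, context, article)
        for article in candidate_articles
    ]


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    "Giotto was called to work", "Giotto_di_Bondone", candidates
)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Since every link in Wikipedia yields one positive and several negative examples, no manual annotation is needed to build the training set.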
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data

  {{Infobox Writer
  | bgcolour = silver
  | name = Edgar Allan Poe
  | image = Edgar_Allan_Poe_2.jpg
  | caption = This [[daguerreotype]] of Poe was taken in 1848.
  | birth_date = {{birth date|1809|1|19|mf=y}}
  | birth_place = [[Boston, Massachusetts]], [[United States|US]]
  | death_date = {{death date and age|1849|10|07|1809|01|19}}
  | death_place = [[Baltimore, Maryland]], [[United States|US]]
  | occupation = Poet, short story writer, editor, literary critic
  | movement = [[Romanticism]], [[Dark romanticism]]
  | genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
  | magnum_opus = The Raven
  | spouse = [[Virginia Eliza Clemm Poe]]
  }}
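Because an infobox is a list of key = value fields, its attributes can be read with very little code. The sketch below is a line-based illustration on a shortened copy of the Poe infobox; it assumes one field per line and does not handle nested templates that span several lines, so it is not a substitute for a real wikitext parser.

```python
def parse_infobox(wikitext):
    """Parse '| key = value' lines of an infobox into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' or '|'.
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


infobox = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""
poe = parse_infobox(infobox)
```

Each resulting key and value pair corresponds to one attribute of the entity, which is the kind of information that DBpedia turns into structured data.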
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name         "Calgary"
  dbpedia:altitude            "1048"
  dbpedia:population_city     "988193"
  dbpedia:population_metro    "1079310"
  mayor_name                  dbpedia:Dave_Bronconnier
  governing_body              dbpedia:Calgary_City_Council
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
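One of the example questions, "All films by Quentin Tarantino", might be written as the SPARQL query below and sent to the endpoint as a GET request. This is a hedged sketch: it assumes that the DBpedia ontology property dbo:director links a film to its director and that the resource is named dbr:Quentin_Tarantino, and it only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# "All films by Quentin Tarantino", assuming dbo:director links a film
# to its director in the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query in the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
```

Fetching the URL, for instance with urllib.request.urlopen, would return the bindings for ?film; the response format depends on the endpoint's content negotiation and is not covered here.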
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
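Assertions of this form are naturally represented as (relation, concept, concept) triples. The sketch below stores a handful of the examples from the table and indexes them by their first concept; the data structure and function name are illustrative assumptions and say nothing about how ConceptNet itself is implemented.

```python
from collections import defaultdict

# A few assertions from the table, as (relation, concept1, concept2) triples.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
]

# Index the assertions by their first concept for quick lookup.
by_concept = defaultdict(list)
for relation, left, right in ASSERTIONS:
    by_concept[left].append((relation, right))


def related(concept, relation=None):
    """Return the (relation, concept2) pairs recorded for a concept."""
    pairs = by_concept.get(concept, [])
    if relation is None:
        return list(pairs)
    return [pair for pair in pairs if pair[0] == relation]
```

For example, related("apple") returns the single pair ("IsA", "fruit"), while a concept with no assertions yields an empty list.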
CONCEPT NET
WIKIPEDIA FOR TEXT ANALYTICS
bull Wikipedia has proven an extremely useful resource for text analytics being used forndash Text classification clusteringndash Enriching documents through lsquoWikificationrsquondash NERndash Relation extraction ndash hellip
Wikipedia as Thesaurus for text classification clusteringbull Unlike other standard ontologies such as WordNet
and Mesh Wikipedia itself is not a structured thesaurus
bull However it is morehellipndash Comprehensive it contains 12 million articles (28
million in the English Wikipedia) ndash Accurate A study by Giles (2005) found Wikipedia can
compete with Encyclopaeligdia Britannica in accuracyndash Up to date Current and emerging concepts are
absorbed timely
Giles J 2005 Internet encyclopaedias go head to head Nature 438 900ndash901
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurus
Wikipedia Article that describes the Concept Artificial intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected links
AI is redirected to its equivalent concept Artificial Intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system
in which each article belongs to at least one category
The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system in
which each article belongs to at least one category ndash Polysemous concepts are disambiguated by
Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
  - 11,459,639 pages in total
  - 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• For example, the mention "Georgia" is linked to
  - University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page Type            English     Italian      Polish
Redirected         4,465,652     323,591     134,148
List_of              138,581         836       5,021
Disambiguation       176,721       6,193       4,553
Relevant           4,361,020     917,354     920,486
Total             11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary           English     Italian      Polish
Titles             4,361,020     917,354     920,486
Surface forms      8,829,624   2,484,045   2,482,104
Files                745,724      72,126         n/a
Links             10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian.
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
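The two steps can be sketched as a filter followed by a ranking. This is a toy illustration, not the actual system: the function name, the dictionary layout, and the reading of "top 10" as the ten best target articles per term are assumptions, and the statistics are the Georgia figures quoted earlier.

```python
def wikify_unsupervised(candidates, keyphraseness, link_counts,
                        threshold=0.01, top_n=10):
    """Keep terms with high enough keyphraseness, rank targets by commonness."""
    results = {}
    for term in candidates:
        # Step 1: discard terms that are rarely used as link anchors.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness, best first.
        ranked = sorted(((n / total, article) for article, n in counts.items()),
                        reverse=True)
        results[term] = [(article, round(score, 4))
                         for score, article in ranked[:top_n]]
    return results


keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
link_table = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
}

annotations = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, link_table)
print(annotations)
```

With these numbers "the" falls below the 0.01 threshold and is dropped, while "Georgia" is kept with its three candidate articles in order of commonness.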
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
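A direct transcription of the formula into Python, as a sketch. It assumes, following Milne & Witten's definition, that A1 and A2 are the sets of articles linking to a1 and a2 and that |W| is the total number of articles; the handling of empty intersections (returning infinity) is an assumption of this sketch. As written, the quantity behaves like a distance: it is 0 for identical link sets and grows as the overlap shrinks.

```python
import math


def relatedness(links_a, links_b, total_articles):
    """Normalized link distance between two articles.

    links_a, links_b -- sets of ids of the articles linking to each article
    total_articles   -- |W|, the number of articles in Wikipedia
    """
    common = links_a & links_b
    if not common:
        # log(0) is undefined; treat articles with no shared in-links as unrelated.
        return math.inf
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Both sets cover every article, so they are identical.
        return 0.0
    return (math.log(larger) - math.log(len(common))) / denominator


# Toy in-link sets over a hypothetical 100-article Wikipedia.
a1_links = {1, 2, 3, 4}
a2_links = {3, 4, 5}
print(relatedness(a1_links, a2_links, 100))
```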
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as a source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach             85.66       84.37
Milne & Witten 2008      83.61       80.31
Ratinov et al. 2011      84.52       90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate          64.77
Among first 2            71.59
First 3                  75.42
First 4                  77.18
First 5                  78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ...
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with the other pages that similar mentions link to
• Use positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
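The Giotto dataset can be generated mechanically from a linked sentence and the list of candidate pages. A minimal sketch, assuming the candidate list for the surface form is already available; the function name and the tuple layout are illustrative choices, not the format used by the actual system.

```python
def make_training_examples(context, gold_article, candidate_articles):
    """Label the linked article +1 and every other candidate -1."""
    return [
        (1 if candidate == gold_article else -1, context, candidate)
        for candidate in candidate_articles
    ]


context = "Giotto was called to work in Padua, and also in Rimini"
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

# The link found in the Wikipedia text tells us which candidate is correct.
examples = make_training_examples(context, "Giotto_di_Bondone", candidates)
for label, text, article in examples:
    print(f"{label:+d} {text} -- {article}")
```

Because the labels come from existing Wikipedia links, no manual annotation is needed to obtain the training set.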
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
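Because an infobox is a list of key = value fields, a first approximation of attribute extraction takes only a few lines. This sketch assumes the usual wikitext layout with one "| key = value" field per line; the function name is hypothetical, and multi-line values and nested templates are not handled.

```python
def parse_infobox(wikitext):
    """Collect the `| key = value` fields of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line[1:].partition("=")
        fields[key.strip()] = value.strip()
    return fields


sample_infobox = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

poe = parse_infobox(sample_infobox)
print(poe["name"], "/", poe["magnum_opus"])
```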
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
• dbpedia:native_name "Calgary"
• dbpedia:altitude "1048"
• dbpedia:population_city "988193"
• dbpedia:population_metro "1079310"
• mayor_name dbpedia:Dave_Bronconnier
• governing_body dbpedia:Calgary_City_Council
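The Calgary example can be written down as subject, predicate, object triples and queried by simple pattern matching. This is a toy in-memory sketch in plain Python, not an RDF library; the tuple representation and the `match` helper are assumptions for illustration, and the property names are copied from the slide.

```python
CALGARY = "dbpedia:Calgary"

# (subject, predicate, object) triples taken from the slide.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        (s, p, o) for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


mayor = match(TRIPLES, subject=CALGARY, predicate="mayor_name")
print(mayor)
```

Leaving a position as None plays the role of a variable, which is essentially what a single triple pattern in a query language for RDF does.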
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
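As an illustration, the "all films by Quentin Tarantino" question might be phrased as below and sent to the endpoint as an HTTP GET request with a `query` parameter, as the SPARQL protocol prescribes. The property `dbo:director` and the resource name are assumptions about the DBpedia vocabulary and should be checked against the live dataset; the sketch only builds the request URL and does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; the result format returned by the server depends on content negotiation and is not covered here.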
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - Cycorp
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
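Each assertion is a (relation, concept, concept) triple, so a handful of them can be stored and looked up as follows. A minimal sketch using examples from the table; the index structure and function name are assumptions for illustration and say nothing about how ConceptNet itself is implemented.

```python
from collections import defaultdict

# A few assertions copied from the table above.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def index_by_concept(assertions):
    """Map each first-argument concept to its (relation, other concept) pairs."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


concept_index = index_by_concept(ASSERTIONS)
print(concept_index["apple"])
```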
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
THE TASK
SCORING BY MATCHING
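The matching rule described for the ESP game can be sketched as a set intersection over the strings typed by the two partners. The normalization (lower-casing, collapsing whitespace) and the function names are assumptions of this sketch; the point values of the real game are not modeled.

```python
def normalize(label):
    """Lower-case a typed label and collapse surrounding/internal whitespace."""
    return " ".join(label.lower().split())


def matching_labels(guesses_a, guesses_b):
    """Labels typed by both partners for the same image."""
    return {normalize(g) for g in guesses_a} & {normalize(g) for g in guesses_b}


# Hypothetical guesses from two players looking at the same picture.
player_a = ["Dog", "grass ", "ball"]
player_b = ["puppy", "dog", "park"]

agreed = matching_labels(player_a, player_b)
print(agreed)
```

A label on which two independent players agree is likely to be a good description of the image, which is what makes the matched strings useful as annotations.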
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  - Each article describes only a single concept
  - The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
    (Figure: the Wikipedia article that describes the concept Artificial intelligence)
  - Equivalent concepts are grouped together by redirected links
    (Example: AI is redirected to its equivalent concept, Artificial Intelligence)
  - It contains a hierarchical categorization system in which each article belongs to at least one category
    (Example: the concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences, and Technology in society)
  - Polysemous concepts are disambiguated by Disambiguation Pages
    (Example: the different meanings that Artificial intelligence may refer to are listed in its disambiguation page)
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems
• A number of possibilities
  - Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  - Use WIKIFICATION to enrich documents
  - Use the Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  - Books
  - Learning objects
Example: "The United States was involved in the Cold War"
  United States                                0.3793
  Cold War                                     0.3111
  Vietnam War                                  0.0023
  World War I                                  0.0023
  Communism                                    0.0027
  Ronald Reagan                                0.0027
  Michail Gorbachev                            0.0023
  Cat: Wars Involving the United States        0.00779
  Cat: Global Conflicts                        0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
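The idea of using the documents under a category as training documents can be sketched as follows. This is only an illustration under stated assumptions: the two categories and their texts are invented toy data, and the nearest-centroid bag-of-words classifier is a stand-in chosen for brevity, not the classifier used in the lecture.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(category_docs):
    """category_docs: {category: [text, ...]} -> {category: summed word counts}."""
    centroids = {}
    for category, docs in category_docs.items():
        total = Counter()
        for doc in docs:
            total.update(bag_of_words(doc))
        centroids[category] = total
    return centroids

def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Hypothetical toy stand-ins for articles filed under two Wikipedia categories.
TOY_CATEGORY_DOCS = {
    "Computer science": ["an algorithm is a procedure a computer follows",
                         "software and hardware of a computer"],
    "Psychology": ["the study of mind and behaviour",
                   "memory emotion and cognition in the mind"],
}
centroids = train(TOY_CATEGORY_DOCS)
label = classify("the computer runs an algorithm", centroids)
```

With real data the training texts would be the articles reachable from each category, and any standard text classifier could replace the centroid step.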
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
[Figure: Text -> WIKIFICATION -> Wikipedia]
May 2012, Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
[Figure: system architecture. Raw (hyper)text -> Decomposition -> Clean Text -> Keyword Extraction (Candidate Extraction, Candidate Ranking) -> Text with selected keywords -> Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) -> Annotated Text -> Recomposition -> (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
2. Candidate ranking
• tf-idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$
  where $D_{key}$ are the documents in which the term $W$ was selected as a keyword and $D_{W}$ the documents in which $W$ appears
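As a small illustration of the tf-idf ranking option, the sketch below scores candidate terms of one document against a document collection. The collection, the tokens, and the plain log(N/df) form of idf are assumptions made for the example; the slides do not say which tf-idf variant is used.

```python
import math
from collections import Counter

def rank_by_tf_idf(candidates, document_tokens, collection):
    """Rank candidate terms of one document by tf * idf.

    collection: list of token sets, one per article (toy data below).
    """
    tf = Counter(document_tokens)
    n_docs = len(collection)
    scores = {}
    for term in candidates:
        df = sum(1 for doc in collection if term in doc)
        if df == 0:
            continue  # term unseen in the collection: no idf estimate
        scores[term] = tf[term] * math.log(n_docs / df)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical four-article collection; "the" occurs in every article.
collection = [{"the", "giotto"}, {"the", "padua", "city"}, {"the", "a"}, {"the"}]
tokens = "giotto painted in padua the the the".split()
ranking = rank_by_tf_idf(["giotto", "padua", "the"], tokens, collection)
```

Although "the" is the most frequent token in the document, it appears in every article of the collection, so its idf is zero and it ends up last.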
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al 2011)
• Use a Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  - The term "Georgia"
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  - Cf. the term "the": keyphraseness = 0.0006
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
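A minimal sketch of how keyphraseness could be counted over article wikitext: the number of articles in which the term occurs as a link anchor, divided by the number of articles in which it occurs at all. The four-article corpus and the simple link regex are assumptions for illustration; a real dump needs a proper wikitext parser.

```python
import re

# Matches [[target|anchor]] or [[target]] (simplified; no nested markup).
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def keyphraseness(term, articles):
    """Articles where `term` is a link anchor / articles where it appears."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    linked = appears = 0
    for text in articles:
        anchors = [m.group(2) or m.group(1) for m in LINK.finditer(text)]
        # Replace every link by its visible anchor text.
        plain = LINK.sub(lambda m: m.group(2) or m.group(1), text)
        if pattern.search(plain):
            appears += 1
            if term in anchors:
                linked += 1
    return linked / appears if appears else 0.0

# Hypothetical toy corpus: "Georgia" appears in 3 articles, as a link in 2.
articles = [
    "[[Georgia (U.S. state)|Georgia]] is a state.",
    "He moved to Georgia in 1990.",
    "The [[Georgia (country)|Georgia]] team won.",
    "Nothing relevant here.",
]
kp = keyphraseness("Georgia", articles)

# The slide's own figures: 22631 linking articles out of 75000.
kp_slide = 22631 / 75000
```

On the toy corpus the value is 2/3; with the slide's counts it is about 0.3017.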
COMMONNESS
• The probability that a term t is a link to a SPECIFIC Wikipedia article a
• For example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
  so commonness(t, a1) = 166/(166+18+5) = 0.8783
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
  (in the worked example, the denominator counts only the linked occurrences of t)
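The Georgia example can be reproduced in a few lines. The sketch normalises the link counts of one surface form over all its targets, which is how the worked example computes the denominator; the function name is an illustrative choice.

```python
from collections import Counter

def commonness(link_counts):
    """link_counts: {target_article: times the surface form links to it}.

    Returns {target_article: share of the surface form's links}.
    """
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: n / total for article, n in link_counts.items()}

# Counts taken from the slide for the surface form "Georgia".
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
scores = commonness(georgia_links)
```

Here `scores["University_of_Georgia"]` is 166/189, which rounds to 0.8783 as on the slide.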
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
• In three phases
  - Identify articles of relevance
  - Extract (among other things)
    • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
    • Set of LINKS [article|surface_form]
      - e.g. [[Pedanius Dioscorides|Dioscorides]]
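The extraction of surface forms and links can be sketched with a regular expression over wikitext. This is a simplification under stated assumptions: the regex ignores nested templates, file and category links, and other cases a real dump parser has to handle, and the function names are illustrative.

```python
import re
from collections import Counter, defaultdict

# [[target]], [[target|surface form]], optionally with a #section part.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]+))?\]\]")

def extract_links(wikitext):
    """Yield (target_article, surface_form) pairs found in the wikitext."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        yield target.replace(" ", "_"), surface

def build_dictionary(pages):
    """Map each surface form to a Counter of the articles it links to."""
    table = defaultdict(Counter)
    for text in pages:
        for target, surface in extract_links(text):
            table[surface][target] += 1
    return table

pages = ["[[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]]."]
table = build_dictionary(pages)
```

The resulting table holds exactly the counts needed for commonness: for every surface form, how often it links to each target article.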
The Wikipedia Dump from July 2011
• 11,459,639 pages in total
• 12,525,583 links
  - specifying surface word, target, frequency
  - ranked by frequency
• For example, the mention "Georgia" is linked to
  - University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
  Page Type        English      Italian     Polish
  Redirected       4,465,652    323,591     134,148
  List_of          138,581      836         5,021
  Disambiguation   176,721      6,193       4,553
  Relevant         4,361,020    917,354     920,486
  Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
  Dictionary       English      Italian     Polish
  Titles           4,361,020    917,354     920,486
  Surface forms    8,829,624    2,484,045   2,482,104
  Files            745,724      72,126      n/a
  Links            10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
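The two steps above can be put together as a short sketch. It assumes the statistics are already available as plain dictionaries and reads "retain top 10" as keeping the ten highest-commonness target articles per term; the function and parameter names are hypothetical.

```python
def wikify_unsupervised(candidates, keyphraseness, commonness,
                        kp_threshold=0.01, top_n=10):
    """Filter terms by keyphraseness, then rank their targets by commonness.

    keyphraseness: {term: score}
    commonness:    {term: {target_article: score}}
    Returns {term: [(target_article, score), ...]} with at most top_n targets.
    """
    result = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= kp_threshold:
            continue  # too rarely used as a link: not a candidate
        targets = commonness.get(term, {})
        ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
        result[term] = ranked[:top_n]
    return result

# Scores from the slides' Georgia example; "the" falls below the threshold.
kp = {"Georgia": 0.3017, "the": 0.0006}
cm = {"Georgia": {"University_of_Georgia": 166 / 189,
                  "Republic_of_Georgia": 18 / 189,
                  "Georgia_(United_States)": 5 / 189}}
links = wikify_unsupervised(["the", "Georgia"], kp, cm)
```

No training data is involved: both the filter and the ranking rely only on counts collected from the dump.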
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
  where $A_1$ and $A_2$ are the sets of articles that link to $a_1$ and $a_2$, and $W$ is the set of all Wikipedia articles
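The relatedness formula translates directly into code. The sketch below takes the two sets of in-linking articles and the total number of articles; returning infinity when the sets do not overlap is an assumption made here, since the formula itself is undefined in that case. As written, the measure is a distance: identical link sets give 0.

```python
import math

def relatedness(links_a1, links_a2, total_articles):
    """Normalized link distance between two articles.

    links_a1, links_a2: sets of articles linking to a1 and a2 (non-empty).
    total_articles:     |W|, the number of articles in Wikipedia.
    """
    if not links_a1 or not links_a2:
        raise ValueError("both articles need at least one incoming link")
    overlap = len(links_a1 & links_a2)
    if overlap == 0:
        return math.inf  # assumption: no shared links means maximally distant
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    return ((math.log(larger) - math.log(overlap))
            / (math.log(total_articles) - math.log(smaller)))

# Toy in-link sets (article ids are arbitrary).
a1_links = {1, 2, 3, 4}
a2_links = {3, 4, 5}
d = relatedness(a1_links, a2_links, total_articles=1000)
```

For the toy sets this is log(4/2) / log(1000/3), roughly 0.119.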
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as a source of training materials (see next)
Results on standard datasets
  APPROACH              AQUAINT   WIKIPEDIA
  Our approach          85.66     84.37
  Milne & Witten 2008   83.61     80.31
  Ratinov et al 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
  CORRECT CANDIDATE IS   RESULTS
  First candidate        64.77
  Among first 2          71.59
  First 3                75.42
  First 4                77.18
  First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
  LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
  English    4M articles      84.37
  Italian    1M               79.64
  French     1.4M             76-77
  German     1.6M             72-73
  Dutch      1.6M             70-71
  Polish     900K             60.81
  Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal ...
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging, towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with the other pages that similar mentions link to
• Use positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
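The construction of the Giotto dataset can be sketched as a single function: the article the mention actually links to yields the positive example, and every other candidate for the same surface form yields a negative one. The function name and the tuple layout are illustrative choices.

```python
def make_training_examples(context, gold_target, candidates):
    """Label each (context, candidate) pair: +1 for the linked article, -1 otherwise."""
    return [(+1 if candidate == gold_target else -1, context, candidate)
            for candidate in candidates]

context = "Giotto was called to work in Padua, and also in Rimini"
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(context, "Giotto_di_Bondone", candidates)
```

The candidate list would come from the surface-form dictionary extracted from the dump, so no manual annotation is needed.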
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
  {{Infobox Writer
  | bgcolour = silver
  | name = Edgar Allan Poe
  | image = Edgar_Allan_Poe_2.jpg
  | caption = This [[daguerreotype]] of Poe was taken in 1848.
  | birth_date = {{birth date|1809|1|19|mf=y}}
  | birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
  | death_date = {{death date and age|1849|10|07|1809|01|19}}
  | death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
  | occupation = Poet, short story writer, editor, literary critic
  | movement = [[Romanticism]], [[Dark romanticism]]
  | genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
  | magnum_opus = The Raven
  | spouse = [[Virginia Eliza Clemm Poe]]
  }}
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
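One of the example questions, all films by Quentin Tarantino, might be expressed as the query below. This is a sketch under assumptions: the `dbo:director` property and the `dbr:Quentin_Tarantino` resource name follow current DBpedia naming and may differ from the dataset version described in the slides, and the `format` parameter is a convention of the Virtuoso endpoint rather than part of SPARQL itself. The code only builds the request URL and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as an HTTP GET request URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(QUERY)
```

Fetching `url` with any HTTP client would return the matching films as variable bindings, provided the endpoint and vocabulary are as assumed.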
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA (IsA apple fruit); PartOf (PartOf CPU computer); PropertyOf (PropertyOf coffee wet); MadeOf (MadeOf bread flour); DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf read letter open envelope); SubeventOf (SubeventOf play sport score goal); FirstSubeventOf (FirstSubeventOf start fire light match); LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104,000 assertions)
  CapableOf (CapableOf dentist pull tooth)
SPATIAL (36,000 assertions)
  LocationOf (LocationOf army in war)
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf (EffectOf view video entertainment); DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf person not be depressed); MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115,000 assertions)
  IsUsedFor (UsedFor fireplace burn wood); CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION: K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine western civilization civilization); ThematicKLine (ThematicKLine wedding dress veil); ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP: THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
THE TASK
SCORING BY MATCHING
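The matching rule of the ESP game can be sketched in a few lines: a label is agreed on as soon as a string typed by one player equals a string typed by the other. Normalising case and whitespace before comparing is an assumption added here, not something the slides specify.

```python
def normalise(label):
    """Lower-case a typed string and collapse its whitespace."""
    return " ".join(label.lower().split())

def matched_labels(guesses_a, guesses_b):
    """Labels typed by both players, in the order player A typed them."""
    seen_b = {normalise(g) for g in guesses_b}
    matches = []
    for guess in guesses_a:
        label = normalise(guess)
        if label in seen_b and label not in matches:
            matches.append(label)
    return matches

player_a = ["Dog", "grass ", "ball"]
player_b = ["puppy", "ball", "dog"]
agreed = matched_labels(player_a, player_b)
```

Because the two players cannot communicate, a match is evidence that the label really describes the image, which is what makes the game output usable as annotation.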
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1,000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence
• L. von Ahn and L. Dabbish 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems
Wikipedia Article that describes the Concept Artificial intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected links
AI is redirected to its equivalent concept Artificial Intelligence
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system
in which each article belongs to at least one category
The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system in
which each article belongs to at least one category ndash Polysemous concepts are disambiguated by
Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
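The two steps of the unsupervised approach can be sketched as follows. The threshold (0.01) and the cut-off (top 10) come from the slide; the function name, the table layout and the toy data are assumptions for illustration only.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts,
                        threshold=0.01, top_n=10):
    """Return {term: candidate articles ranked by commonness}."""
    result = {}
    for term in terms:
        # Step 1: keep only terms that are used as links often enough.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness and keep the best ones.
        ranked = sorted(counts, key=lambda a: counts[a] / total, reverse=True)
        result[term] = ranked[:top_n]
    return result


keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
link_table = {
    "Georgia": {
        "Republic_of_Georgia": 18,
        "University_of_Georgia": 166,
        "Georgia_(United_States)": 5,
    }
}
annotations = wikify_unsupervised(["the", "Georgia"], keyphraseness_table, link_table)
print(annotations)
```

With these toy tables the term "the" is filtered out by the threshold, and "Georgia" keeps its three candidates in order of decreasing commonness.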
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
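The link-based measure can be evaluated directly from sets of links. In the sketch below, A1 and A2 are taken to be the sets of articles that link to a1 and a2, and W the total number of Wikipedia articles, following Milne & Witten; this reading, the function name and the toy sets are assumptions, since the slide does not define the symbols. As written, the quantity behaves like a distance: identical link sets give 0.

```python
import math


def link_distance(links_a1: set, links_a2: set, wikipedia_size: int) -> float:
    """Evaluate the formula above; returns inf when no links are shared."""
    overlap = len(links_a1 & links_a2)
    if overlap == 0:
        return math.inf
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    if wikipedia_size <= smaller:
        raise ValueError("wikipedia_size must exceed the size of both link sets")
    return (math.log(larger) - math.log(overlap)) / (
        math.log(wikipedia_size) - math.log(smaller))


# Toy link sets (article ids are arbitrary integers).
d = link_distance({1, 2, 3, 4}, {3, 4, 5}, wikipedia_size=1000)
print(round(d, 4))  # prints 0.1193
```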
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as the source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20

Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)
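Figures of the kind "correct candidate is among the first k" are accuracy-at-k scores. The sketch below shows one way such a score could be computed; the function name and the toy data are assumptions and have nothing to do with the Bridgeman queries themselves.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold article is among the first k candidates."""
    if not gold:
        raise ValueError("gold must not be empty")
    hits = sum(1 for candidates, answer in zip(ranked_candidates, gold)
               if answer in candidates[:k])
    return hits / len(gold)


# Four toy queries, each with a ranked candidate list and one gold answer.
ranked = [["A", "B"], ["C", "D"], ["E", "F"], ["G", "H"]]
gold = ["A", "D", "X", "G"]
print(accuracy_at_k(ranked, gold, 1))  # prints 0.5
print(accuracy_at_k(ranked, gold, 2))  # prints 0.75
```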
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …

http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action

• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: the Wikipedia page that a mention is actually linked to
  • Negative examples: the pages that similar mentions are linked to instead
• Use positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
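Generating the +1 / -1 dataset from a linked sentence can be sketched in a few lines. The function name and the tuple layout are assumptions for illustration; the sentence and the three candidate pages are the ones from the Giotto example.

```python
def training_examples(sentence, mention, linked_article, candidates):
    """Label the article the mention links to as +1, every other candidate as -1."""
    examples = []
    for article in candidates:
        label = 1 if article == linked_article else -1
        examples.append((label, sentence, mention, article))
    return examples


examples = training_examples(
    sentence="Giotto was called to work in Padua, and also in Rimini",
    mention="Giotto",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, _, _, article in examples:
    print(f"{label:+d} {article}")
```

Because the label comes from the link already present in Wikipedia, no manual annotation is needed, which is the point of distant learning.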
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
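Because an infobox is semi-structured, its attribute/value pairs can be pulled out with very little code. The sketch below assumes the usual wikitext layout with one `| key = value` field per line; the function name and the shortened sample are illustrative, and nested templates or multi-line values are not handled.

```python
import re

FIELD = re.compile(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)$")


def parse_infobox(wikitext: str) -> dict:
    """Collect '| key = value' lines of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


# Shortened version of the Edgar Allan Poe infobox.
sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]]
| magnum_opus = The Raven
}}"""
poe = parse_infobox(sample)
print(poe["magnum_opus"])  # prints The Raven
```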
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset
• 1600000 concepts
• including
  – 58000 persons
  – 70000 places
  – 35000 music albums
  – 12000 films
• described by 91 million triples
• using 8141 different properties
• 557000 links to pictures
• 1300000 links to external web pages
• 207000 Wikipedia categories
• 75000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
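As an illustration, the "all films by Quentin Tarantino" question might be phrased as the SPARQL query below and sent to the endpoint as a GET request. This is a sketch under assumptions: the `dbo:director` property and the `dbr:` resource prefix are guesses about DBpedia's current vocabulary rather than something stated on the slides, and the code only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def sparql_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_url(QUERY)
print(url)
```

The URL could then be fetched with any HTTP client; the result format and any extra parameters depend on the server and are left out here.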
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
www.phrasedetectives.com

WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52000 assertions)
• IsA: (IsA "apple" "fruit")
• PartOf: (PartOf "CPU" "computer")
• PropertyOf: (PropertyOf "coffee" "wet")
• MadeOf: (MadeOf "bread" "flour")
• DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38000 assertions)
• PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
• SubeventOf: (SubeventOf "play sport" "score goal")
• FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
• LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104000 assertions)
• CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36000 assertions)
• LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17000 assertions)
• EffectOf: (EffectOf "view video" "entertainment")
• DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34000 assertions)
• DesireOf: (DesireOf "person" "not be depressed")
• MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115000 assertions)
• IsUsedFor: (UsedFor "fireplace" "burn wood")
• CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
• SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
• ThematicKLine: (ThematicKLine "wedding dress" "veil")
• ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP: THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
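The matching rule can be sketched as follows: the label agreed on is the first string that both players have typed. The function name, the case-insensitive comparison and the assumption that the two players' guesses arrive alternately are all illustrative choices, not details given on the slides.

```python
from itertools import zip_longest


def first_match(guesses_a, guesses_b):
    """Return the first string typed by both players, or None if there is none."""
    seen_a, seen_b = set(), set()
    # Assumption: guesses alternate, player A typing first in each round.
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = a.strip().lower()
            seen_a.add(a)
            if a in seen_b:
                return a
        if b is not None:
            b = b.strip().lower()
            seen_b.add(b)
            if b in seen_a:
                return b
    return None


label = first_match(["dog", "grass", "puppy"], ["animal", "puppy", "dog"])
print(label)  # prints puppy
```

Neither player needs to type the matching string at the same moment: a guess matches if the partner has typed it at any earlier point for the same image.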
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13630 players
  – 1.2 million labels for 293760 images
  – 80% of players played more than once
• By 2008
  – 200000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• Nguyen, V. and Poesio, M. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M. and Poesio, M. 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• Nastase, V. and Strube, M. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• von Ahn, L. and Dabbish, L. 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirected links
    • AI is redirected to its equivalent concept, Artificial Intelligence
  – It contains a hierarchical categorization system, in which each article belongs to at least one category
    • The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
  – Polysemous concepts are disambiguated by Disambiguation Pages
    • The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
• Objective: use information in Wikipedia to improve the performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use the Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects
“The United States was involved in the Cold War”
  United States: 0.3793
  Cold War: 0.3111
  Vietnam War: 0.0023
  World War I: 0.0023
  Communism: 0.0027
  Ronald Reagan: 0.0027
  Michail Gorbachev: 0.0023
  Cat: Wars Involving the United States: 0.00779
  Cat: Global Conflicts: 0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
• Text
WIKIFICATION
• Wikipedia
Giotto was called to work in Padua, and also in Rimini
Wikification pipeline
[Figure: Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
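The candidate extraction stage with n-grams can be sketched as below. The tokenizer, the lower-casing and the default maximum length are assumptions for illustration; noun-phrase extraction would need a parser or chunker and is not shown.

```python
import re


def ngram_candidates(text: str, max_n: int = 3):
    """Return all word n-grams of length 1..max_n as candidate keywords."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


cands = ngram_candidates("The United States was involved", max_n=2)
print(cands)
```

For the five-word sample this gives five unigrams and four bigrams, including "united states"; the ranking stage then decides which candidates are worth keeping.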
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. “USA”, “US” = “United States of America”
  – More than 2000000 terms/phrases
  – Vocabulary is broad (e.g. “the”, “a” are included)
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term Georgia
    – is found as a link in 22631 Wikipedia articles
    – appears in 75000 Wikipedia articles
    – keyphraseness = 22631 / 75000 = 0.3017466
  • Cf. the term “the”: keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
wwwphrasedetectivesorg
bull 2 tasks
ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric
ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user
wwwphrasedetectivescom
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal
bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data
bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH
bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
AI is redirected to its equivalent concept Artificial Intelligence
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirected links
  – It contains a hierarchical categorization system in which each article belongs to at least one category
The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirected links
  – It contains a hierarchical categorization system in which each article belongs to at least one category
  – Polysemous concepts are disambiguated by Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire
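The first possibility can be sketched in a few lines. This is only an illustration, assuming a plain bag-of-words cosine similarity; the slides do not say which similarity measure was used, and the two page texts below are made-up stand-ins for real Wikipedia pages.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for the text of two Wikipedia pages.
wiki_pages = {
    "Cold War": "the cold war was a period of tension between "
                "the united states and the soviet union",
    "Psychology": "psychology is the study of mind and behavior",
}

document = "The United States was involved in the Cold War"
doc_vector = bag_of_words(document)

# One similarity feature per topic page, ready to feed to a classifier.
features = {
    title: cosine_similarity(doc_vector, bag_of_words(text))
    for title, text in wiki_pages.items()
}
```

Each entry of `features` can then be appended to the document's feature vector; a real system would more likely use tf.idf weights than raw counts.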
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects
"The United States was involved in the Cold War"
United States                          0.3793
Cold War                               0.3111
Vietnam War                            0.0023
World War I                            0.0023
Communism                              0.0027
Ronald Reagan                          0.0027
Michail Gorbachev                      0.0023
Cat: Wars Involving the United States  0.00779
Cat: Global Conflicts                  0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
• Text
WIKIFICATION
• Wikipedia
May 2012, Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2000000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)
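Candidate extraction against such a vocabulary can be sketched as an n-gram lookup. The vocabulary below is a tiny hypothetical sample, matching is case-sensitive, tokenization is plain whitespace splitting, and the `max_len` limit is an assumption made for the illustration.

```python
def extract_candidates(text, vocabulary, max_len=3):
    """Return (token position, phrase) for every n-gram of up to
    max_len tokens that is a known title or anchor text."""
    tokens = text.split()
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            phrase = " ".join(tokens[start:end])
            if phrase in vocabulary:
                candidates.append((start, phrase))
    return candidates


# Hypothetical miniature vocabulary of article titles and anchor texts.
vocabulary = {"Giotto", "Padua", "Rimini", "United States", "Cold War"}

sentence = "Giotto was called to work in Padua and also in Rimini"
candidates = extract_candidates(sentence, vocabulary)
```

Because the real vocabulary also contains words such as "the" and "a", the extracted candidates still have to be ranked or filtered, which is the second stage.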
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term Georgia
    – is found as a link in 22631 Wikipedia articles
    – appears in 75000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\,\_ \mid t\,])}{\mathrm{count}(t)}$$
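As a quick check of the Georgia example, keyphraseness is just a ratio of two article counts. The function name and the zero-count guard are choices made for this sketch.

```python
def keyphraseness(link_count, occurrence_count):
    """Share of the articles containing a term in which the term is a link."""
    if occurrence_count == 0:
        return 0.0
    return link_count / occurrence_count


# Counts for the term "Georgia" taken from the example above.
georgia = keyphraseness(22631, 75000)
```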
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form Georgia was found to be linked to
  – a1 = University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
  commonness(t, a1) = 166/(166+18+5) = 0.8783
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([\,a \mid t\,])}{\mathrm{count}([\,\_ \mid t\,])}$$
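The worked example divides the number of links to one article by the number of times the surface form is used as a link at all. A minimal sketch with the three counts given for Georgia (the function name is chosen for this illustration):

```python
from collections import Counter

# Link counts for the surface form "Georgia", as given in the example.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})


def commonness(link_counts, article):
    """Share of the links with this surface form that point to `article`."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    return link_counts[article] / total


a1 = commonness(georgia_links, "University_of_Georgia")
```

The commonness values of all targets of a surface form sum to 1, so they can be read as a probability distribution over candidate articles.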
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
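The link extraction step can be approximated with a regular expression over the wikitext. This is a simplified sketch rather than the parser used for the dump: it assumes well-formed `[[article|surface_form]]` or `[[article]]` links and ignores nested links, file links and templates. The sample sentence is invented.

```python
import re
from collections import Counter

# Matches [[article|surface form]] and [[article]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs for every link in the text."""
    links = []
    for match in LINK_PATTERN.finditer(wikitext):
        article = match.group(1).strip()
        # Without an explicit anchor text, the title is the surface form.
        surface = (match.group(2) or article).strip()
        links.append((article, surface))
    return links


# Invented sample sentence containing the link shown above.
sample = ("The physician [[Pedanius Dioscorides|Dioscorides]] wrote in "
          "[[Greek language|Greek]] about [[botany]].")
links = extract_links(sample)

# Frequency of each (article, surface form) pair across the parsed text.
link_frequency = Counter(links)
```

Counting these pairs over the whole dump gives the surface-form dictionary and the link frequencies that keyphraseness and commonness are computed from.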
The Wikipedia Dump from July 2011
– 11459639 pages in total
– 12525583 links
  • specifying surface word, target, frequency
– ranked by frequency
• for example, the mention Georgia is linked to
  – University_of_Georgia 166 times
  – Republic_of_Georgia 18 times
  – Georgia_(United_States) 5 times
Some statistics (all Wikidumps from July 2011)
Page Type        English    Italian   Polish
Redirected       4465652    323591    134148
List_of          138581     836       5021
Disambiguation   176721     6193      4553
Relevant         4361020    917354    920486
Total            11459639   1654258   1200313
Surface forms titles articles
Dictionary      English    Italian   Polish
Titles          4361020    917354    920486
Surface forms   8829624    2484045   2482104
Files           745724     72126     n/a
Links           10871741   2917235   2937981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
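The two steps above can be put together as follows. This is a minimal sketch under stated assumptions: both statistics are precomputed dictionaries, commonness is computed from link counts, and the function and parameter names are chosen for the illustration. The Georgia figures come from the earlier slides.

```python
def rank_candidates(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold and rank each term's
    candidate articles by commonness, retaining the top_n."""
    results = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # not link-worthy enough
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue  # never observed as a link
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        results[term] = [(article, count / total) for article, count in ranked[:top_n]]
    return results


keyphraseness_table = {"Georgia": 22631 / 75000, "the": 0.0006}
link_count_table = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
}

ranking = rank_candidates(["the", "Georgia"], keyphraseness_table, link_count_table)
```

Here "the" is dropped by the threshold and "Georgia" keeps its three candidates in order of commonness.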
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
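The formula can be computed directly from sets of incoming links. In Milne & Witten's measure, A1 and A2 are the sets of articles that link to a1 and a2, and W is the set of all Wikipedia articles. The handling of the two edge cases (no shared links, zero denominator) is an assumption of this sketch, and the article ids are invented.

```python
import math


def relatedness(inlinks_a1, inlinks_a2, total_articles):
    """Normalized link distance between two articles, computed from the
    sets of articles linking to each. 0 means identical link sets."""
    a1, a2 = set(inlinks_a1), set(inlinks_a2)
    overlap = len(a1 & a2)
    if overlap == 0:
        return math.inf  # assumption: no shared in-links means maximally distant
    numerator = math.log(max(len(a1), len(a2))) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return 0.0  # assumption: both sets cover the whole collection
    return numerator / denominator


# Invented article ids: two articles sharing two of their in-links.
distance = relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=100)
```

As written, lower values mean the two articles are more closely related, so a feature built on it may need to be inverted.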
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples from mention-to-link Wikipedia page
  • Negative examples from similar mentions with other links
• Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
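Generating such a dataset from one linked mention can be sketched as below. The function name and the tuple layout are illustrative choices; the candidate list is the one from the example.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the article the mention is actually linked to as +1
    and every other candidate for the same surface form as -1."""
    examples = []
    for candidate in candidate_articles:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples


candidates_for_giotto = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
dataset = make_training_examples(
    "Giotto was called to work",
    "Giotto_di_Bondone",
    candidates_for_giotto,
)
```

Every link in Wikipedia yields one positive and several negative examples this way, without manual annotation.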
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
  • DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
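Because infoboxes are semi-structured, their attribute-value pairs can be pulled out with very little code. The sketch below assumes one `| key = value` field per line, which real infoboxes do not always respect; the sample reuses a few fields of the Poe infobox.

```python
def parse_infobox_fields(wikitext):
    """Collect '| key = value' lines of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


sample_infobox = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]]
| magnum_opus = The Raven
}}"""

fields = parse_infobox_fields(sample_infobox)
```

The values are still raw wikitext; turning `{{birth date|1809|1|19|mf=y}}` into a date or `[[Boston, Massachusetts]]` into an entity reference is a separate step.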
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1600000 concepts
• including
  – 58000 persons
  – 70000 places
  – 35000 music albums
  – 12000 films
• described by 91 million triples
• using 8141 different properties
• 557000 links to pictures
• 1300000 links to external web pages
• 207000 Wikipedia categories
• 75000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
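The extracted facts form subject-predicate-object triples. As a rough illustration of the data model, and not of how DBpedia itself stores or queries them, the Calgary facts can be held as tuples and filtered with a wildcard pattern; the `match` helper is hypothetical.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# The Calgary facts from the slide as (subject, predicate, object) triples.
triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every given component;
    None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


altitude = match(triples, subject=CALGARY, predicate="dbpedia:altitude")
```

A query language for RDF generalizes this idea: patterns with variables are matched against the graph, and several patterns can be joined on shared variables.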
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all Sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52000 assertions)
  • IsA (IsA "apple" "fruit")
  • PartOf (PartOf "CPU" "computer")
  • PropertyOf (PropertyOf "coffee" "wet")
  • MadeOf (MadeOf "bread" "flour")
  • DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38000 assertions)
  • PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  • SubeventOf (SubeventOf "play sport" "score goal")
  • FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  • LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104000 assertions)
  • CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36000 assertions)
  • LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17000 assertions)
  • EffectOf (EffectOf "view video" "entertainment")
  • DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34000 assertions)
  • DesireOf (DesireOf "person" "not be depressed")
  • MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115000 assertions)
  • IsUsedFor (UsedFor "fireplace" "burn wood")
  • CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  • SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  • ThematicKLine (ThematicKLine "wedding dress" "veil")
  • ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
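The matching rule in the last bullet can be sketched as a comparison of the two players' guesses. The normalization (lowercasing, collapsing whitespace) is an assumption of this sketch, not a documented detail of the game, and the guesses are invented.

```python
def normalize(label):
    """Lowercase a typed string and collapse its whitespace."""
    return " ".join(label.lower().split())


def agreed_labels(player_a, player_b):
    """Return the strings typed by player A that player B also typed,
    in the order A typed them, without repeats."""
    typed_by_b = {normalize(label) for label in player_b}
    agreed = []
    for label in player_a:
        candidate = normalize(label)
        if candidate in typed_by_b and candidate not in agreed:
            agreed.append(candidate)
    return agreed


# Invented guesses for one image.
matches = agreed_labels(["Dog", "grass", "ball"], ["park", "ball ", "dog"])
players_score = len(matches) > 0
```

A string that two independent players both type is likely to be a sensible description of the image, which is what makes the agreed strings usable as labels.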
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system
in which each article belongs to at least one category
The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system in
which each article belongs to at least one category ndash Polysemous concepts are disambiguated by
Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
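The Giotto dataset above can be generated mechanically from a single Wikipedia link. The following is a minimal sketch, not the authors' implementation: the function name `make_training_examples` and the `candidates` dictionary (surface form to possible target articles) are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple


def make_training_examples(
    context: str,
    surface_form: str,
    linked_article: str,
    candidates: Dict[str, List[str]],
) -> List[Tuple[int, str, str]]:
    """Turn one Wikipedia link into labelled (label, context, article) examples.

    The article the editor actually linked to is the positive example (+1);
    every other candidate article for the same surface form is negative (-1).
    """
    examples = []
    for article in candidates.get(surface_form, []):
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


# Hypothetical candidate dictionary, using the three senses from the slide.
candidates = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}
examples = make_training_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    surface_form="Giotto",
    linked_article="Giotto_di_Bondone",
    candidates=candidates,
)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Because the labels come from links that Wikipedia editors already wrote, no manual annotation is needed to build the training set.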
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
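To make "semi-structured data" concrete, here is a rough sketch of how the `| key = value` fields of an infobox could be read into a dictionary. It assumes standard MediaWiki template syntax (`{{...}}` templates, `[[...]]` links); the helper names `split_top_level` and `parse_infobox` are made up for this example, and real infoboxes need far more robust handling.

```python
from typing import Dict, List, Tuple


def split_top_level(body: str) -> List[str]:
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext: str) -> Tuple[str, Dict[str, str]]:
    """Return (template name, {field: raw value}) for one infobox template."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a single {{...}} template")
    parts = split_top_level(text[2:-2])
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # ignore positional parameters without '='
            fields[key.strip()] = value.strip()
    return parts[0].strip(), fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""
name, fields = parse_infobox(sample)
print(name, fields["birth_date"])
```

Tracking nesting depth matters because values such as `{{birth date|1809|1|19|mf=y}}` contain their own `|` and `=` characters.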
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts, including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
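The Calgary example can be read as a set of (subject, predicate, object) triples. The sketch below stores them as plain Python tuples and adds a tiny pattern matcher; it is an illustration of the data model only, not how DBpedia stores data, and the `match` helper is an assumption of this example. Predicate names are copied from the slide.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

CALGARY = "http://dbpedia.org/resource/Calgary"

# Triples from the slide: literals keep their quotes, resources use a prefix.
triples: List[Triple] = [
    (CALGARY, "dbpedia:native_name", '"Calgary"'),
    (CALGARY, "dbpedia:altitude", '"1048"'),
    (CALGARY, "dbpedia:population_city", '"988193"'),
    (CALGARY, "dbpedia:population_metro", '"1079310"'),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(
    data: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return all triples that agree with the given pattern; None is a wildcard."""
    return [
        t for t in data
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(triples, p="dbpedia:altitude"))
print(len(match(triples, s=CALGARY)))
```

A pattern with wildcards is the basic building block that a query language over such a graph combines into larger queries.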
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
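As a hedged illustration of the last question, the sketch below builds (but does not send) a request URL for the endpoint. The class and property names `dbo:MusicalArtist`, `dbo:birthPlace` and `dbo:birthDate`, and the `format` request parameter, are assumptions about the current DBpedia ontology and Virtuoso setup rather than something stated on the slides; the query also ignores the "German" constraint for brevity.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://dbpedia.org/sparql"

# Musicians born in Berlin between 1801 and 1900 (property names are assumptions).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?musician WHERE {
  ?musician a dbo:MusicalArtist ;
            dbo:birthPlace dbr:Berlin ;
            dbo:birthDate ?born .
  FILTER (?born >= "1801-01-01"^^xsd:date && ?born < "1901-01-01"^^xsd:date)
}
LIMIT 20
""".strip()


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_query_url(QUERY)
# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY)
```

Fetching `url` with any HTTP client would then return the bindings for `?musician`, provided the endpoint and vocabulary match these assumptions.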
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
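The matching rule in the last bullet can be sketched as follows. This is a simplified model, not the actual game code: it assumes the two players' guesses arrive in alternating turns and that matching is case-insensitive, and the function name `first_agreed_label` is made up for the example.

```python
from itertools import zip_longest
from typing import List, Optional


def first_agreed_label(guesses_a: List[str], guesses_b: List[str]) -> Optional[str]:
    """Return the first string typed by both players, or None if they never agree.

    A guess matches if the partner has typed the same string at any earlier
    point, not only in the same turn.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = a.strip().lower()
            if a in seen_b:
                return a
            seen_a.add(a)
        if b is not None:
            b = b.strip().lower()
            if b in seen_a:
                return b
            seen_b.add(b)
    return None


label = first_agreed_label(["dog", "grass", "ball"], ["puppy", "Ball", "dog"])
print(label)
```

The agreed string becomes a label for the image; agreement between two players who cannot communicate is what makes the label trustworthy.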
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• L. von Ahn and L. Dabbish, 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Intelligent Interactive Systems.
The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
Wikipedia as Thesaurus
• Moreover, Wikipedia has a well-formed structure
  – Each article only describes a single concept
  – The title of the article is a short and well-formed phrase, like a term in a traditional thesaurus
  – Equivalent concepts are grouped together by redirected links
  – It contains a hierarchical categorization system, in which each article belongs to at least one category
  – Polysemous concepts are disambiguated by Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  – Books
  – Learning objects
"The United States was involved in the Cold War"
  United States                           0.3793
  Cold War                                0.3111
  Vietnam War                             0.0023
  World War I                             0.0023
  Communism                               0.0027
  Ronald Reagan                           0.0027
  Michail Gorbachev                       0.0023
  Cat: Wars Involving the United States   0.00779
  Cat: Global Conflicts                   0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
WIKIFICATION
• Text: Giotto was called to work in Padua, and also in Rimini
• Wikipedia
Wikification pipeline
[Figure: Raw (hyper)text, Decomposition, Clean Text, Keyword Extraction (Candidate Extraction, Candidate Ranking), Text with selected keywords, Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting), Annotated Text, Recomposition, (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
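The two stages can be sketched in a few lines. This is a toy illustration under stated assumptions: candidates are all word n-grams up to length 3, ranking uses a simple tf·idf weight, and the document frequencies in `doc_freq` are invented numbers rather than counts from a real collection.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def ngram_candidates(text: str, max_n: int = 3) -> List[str]:
    """Stage 1: every word n-gram (n = 1..max_n) is a keyword candidate."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def rank_candidates(
    text: str, doc_freq: Dict[str, int], n_docs: int, max_n: int = 3
) -> List[Tuple[str, float]]:
    """Stage 2: rank candidates by term frequency times inverse document frequency."""
    tf = Counter(ngram_candidates(text, max_n))
    scores = {
        cand: freq * math.log(n_docs / (1 + doc_freq.get(cand, 0)))
        for cand, freq in tf.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical document frequencies in a collection of 1000 documents.
doc_freq = {"the": 990, "was": 900, "in": 950, "united states": 40, "cold war": 12}
ranked = rank_candidates(
    "The United States was involved in the Cold War", doc_freq, n_docs=1000
)
print(ranked[-1][0])  # the least informative candidate
```

In this toy setting n-grams that never occur in the collection receive the highest weight, which is one reason to restrict candidates to a known vocabulary.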
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g. "the", "a" are included)
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$
Our own Approach (Cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that phrase is used to refer to specific Wikipedia article
• Two versions of system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
• Examples
  • The term Georgia
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cfr. the term "the": keyphraseness = 0.0006
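A small sketch of how keyphraseness could be counted over article texts, following the example above (articles where the term is a link anchor, divided by articles where the term appears at all). The regular expression, the function name and the four toy articles are assumptions for illustration; substring matching is used instead of proper tokenisation.

```python
import re
from typing import List

# Matches [[target|anchor]] and [[anchor]]; group 1 is the visible anchor text.
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]+)\]\]")


def keyphraseness(term: str, articles: List[str]) -> float:
    """Fraction of articles mentioning `term` in which it occurs as a link anchor."""
    linked = appears = 0
    for text in articles:
        anchors = {m.group(1) for m in LINK.finditer(text)}
        plain = LINK.sub(lambda m: m.group(1), text)  # strip link markup
        if term in plain:
            appears += 1
            if term in anchors:
                linked += 1
    return linked / appears if appears else 0.0


articles = [
    "He studied in [[Georgia (U.S. state)|Georgia]].",
    "Georgia is mentioned here without a link.",
    "[[Georgia]] borders Russia.",
    "Nothing relevant in this article.",
]
toy_value = keyphraseness("Georgia", articles)   # linked in 2 of 3 articles
slide_value = 22631 / 75000                      # the counts quoted on the slide
print(round(toy_value, 4), round(slide_value, 4))
```

Function words such as "the" appear in almost every article but are almost never link anchors, which is why their keyphraseness is close to zero.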
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
• for example, the surface form Georgia was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
  – commonness(t, a1) = 166/(166+18+5) = 0.8783
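The worked Georgia example can be reproduced directly from the link counts. This sketch follows the arithmetic of the example, where the denominator is the total number of times the surface form is used as a link; the function name is an assumption.

```python
from typing import Dict


def commonness(link_counts: Dict[str, int]) -> Dict[str, float]:
    """Share of a surface form's links that point to each target article."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in link_counts.items()}


# Link counts for the surface form "Georgia", as quoted on the slide.
georgia_links = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}
scores = commonness(georgia_links)
print(round(scores["University_of_Georgia"], 4))  # 0.8783
```

The commonness values of all targets of one surface form sum to 1, so they can be used directly to rank candidate articles.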
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
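A minimal sketch of the link extraction step, assuming standard `[[article|surface_form]]` wiki markup. The regular expression, the skipped namespace prefixes and the function names are assumptions of this example; a production parser would work on the XML dump and handle many more cases.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# [[target]], [[target|surface]] or [[target#section|surface]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")


def extract_links(wikitext: str) -> List[Tuple[str, str]]:
    """Return (target article, surface form) pairs found in one page."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target.startswith(SKIPPED_PREFIXES):
            continue
        # Without a pipe, the surface form is the article title itself.
        surface = (match.group(2) or target).strip()
        links.append((target.replace(" ", "_"), surface))
    return links


def count_links(pages: Iterable[str]) -> DefaultDict[str, Counter]:
    """Build the dictionary surface form -> Counter of target articles."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for text in pages:
        for target, surface in extract_links(text):
            counts[surface][target] += 1
    return counts


pages = [
    "[[Pedanius Dioscorides|Dioscorides]] wrote about [[herbs]].",
    "The physician [[Pedanius Dioscorides|Dioscorides]] is cited. [[Category:Physicians]]",
]
counts = count_links(pages)
print(counts["Dioscorides"].most_common(1))
```

The resulting per-surface-form counters are exactly the frequencies needed for keyphraseness and commonness.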
The Wikipedia Dump from July 2011
  – 11,459,639 pages in total
  – 12,525,583 links
    • specifying surface word, target, frequency
    • ranked by frequency
• for example, the mention Georgia is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page Type         English       Italian      Polish
Redirected        4,465,652     323,591      134,148
List_of           138,581       836          5,021
Disambiguation    176,721       6,193        4,553
Relevant          4,361,020     917,354      920,486
Total             11,459,639    1,654,258    1,200,313
Surface forms, titles, articles
Dictionary       English       Italian      Polish
Titles           4,361,020     917,354      920,486
Surface forms    8,829,624     2,484,045    2,482,104
Files            745,724       72,126       n/a
Links            10,871,741    2,917,235    2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
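The two steps of the unsupervised approach can be combined as in the sketch below. It is an illustration under assumptions, not the GALATEAS implementation: the input dictionaries are presumed to have been computed from a dump beforehand, the threshold value follows the slide as read here (0.01), and the target article given for "the" is invented.

```python
from typing import Dict, List, Tuple


def unsupervised_wikify(
    candidates: List[str],
    keyphraseness: Dict[str, float],
    link_counts: Dict[str, Dict[str, int]],
    threshold: float = 0.01,
    top_k: int = 10,
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep terms above the keyphraseness threshold, rank targets by commonness."""
    results = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link at all
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(
            ((count / total, article) for article, count in counts.items()),
            reverse=True,
        )
        results[term] = [(article, score) for score, article in ranked[:top_k]]
    return results


keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
link_table = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
    "the": {"Article_(grammar)": 1},  # hypothetical target, for illustration
}
result = unsupervised_wikify(["the", "Georgia"], keyphraseness_table, link_table)
print(result["Georgia"][0][0])
```

No training data is needed here: both statistics are read off the dump, and the threshold and cut-off are the only parameters.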
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
where $A_1$ and $A_2$ are the sets of articles that link to $a_1$ and $a_2$, and $W$ is the set of all Wikipedia articles
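The formula can be computed from sets of incoming links as sketched below. Note that, as written on the slide, the value behaves like a distance: it is 0 for articles with identical in-links and grows as they share fewer. The function name, the treatment of the no-overlap case (returning infinity) and the toy sets are assumptions of this example.

```python
import math
from typing import Set


def link_distance(inlinks_a: Set[int], inlinks_b: Set[int], n_articles: int) -> float:
    """Normalized link distance between two articles, from their in-link sets."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return math.inf  # log(0) is undefined: treat as maximally unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    denominator = math.log(n_articles) - math.log(smaller)
    if denominator == 0:
        return 0.0  # both sets cover the whole collection
    return (math.log(larger) - math.log(common)) / denominator


# Toy in-link sets: integers stand for the IDs of linking articles.
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5}
d = link_distance(a1_inlinks, a2_inlinks, n_articles=100)
print(d)
```

In a supervised wikifier such a value would be one feature among several, computed between a candidate article and the articles chosen for the surrounding terms.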
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
Wikipedia as Thesaurus
bull Moreover Wikipedia has a well-formed structurendash Each article only describes a single conceptndash The title of the article is a short and well-formed
phrase like a term in a traditional thesaurusndash Equivalent concepts are grouped together by
redirected linksndash It contains a hierarchical categorization system in
which each article belongs to at least one category ndash Polysemous concepts are disambiguated by
Disambiguation Pages
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g., computer science, psychology)
  – Books
  – Learning objects
"The United States was involved in the Cold War"
United States: 0.3793
Cold War: 0.3111
Vietnam War: 0.0023
World War I: 0.0023
Communism: 0.0027
Ronald Reagan: 0.0027
Michail Gorbachev: 0.0023
Cat: Wars Involving the United States: 0.00779
Cat: Global Conflicts: 0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
• Text
WIKIFICATION
• Wikipedia
May 2012, Truc-Vien T. Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
[Figure: Wikification pipeline. Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, then Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data-driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information-theoretic
      – Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g., "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g., "the" and "a" are included)
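The candidate-extraction step can be sketched as an n-gram lookup against the vocabulary. This is a minimal illustration only: the function name `extract_candidates`, the whitespace tokenizer, the `max_n` limit and the four-entry toy vocabulary are all assumptions, standing in for the real list of Wikipedia titles and anchor texts.

```python
def extract_candidates(text, vocabulary, max_n=3):
    """Return (start, end, phrase) for every n-gram of `text` found in `vocabulary`."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                candidates.append((i, i + n, phrase))
    return candidates


# Toy vocabulary standing in for Wikipedia titles and anchor texts (hypothetical).
VOCABULARY = {"united states", "cold war", "the", "war"}

candidates = extract_candidates(
    "The United States was involved in the Cold War", VOCABULARY
)
```

Because the vocabulary is broad, function words such as "the" come back as candidates too, which is why a ranking step is needed afterwards.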
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006
$$\text{Keyphraseness}(t) = \frac{\text{count}([\_ \mid t])}{\text{count}(t)}$$
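As a rough sketch of the statistic, keyphraseness can be computed by counting the articles in which a term appears and those in which it appears as link text. The `(plain_text, linked_surface_forms)` layout and the three toy articles are assumptions made for illustration; only the counts 22631 and 75000 come from the slide.

```python
def keyphraseness(term, articles):
    """Fraction of the articles mentioning `term` in which it is used as link text.

    `articles` is a list of (plain_text, linked_surface_forms) pairs; this
    in-memory layout is an assumption made for the sketch. Substring matching
    is a simplification of real term matching.
    """
    appears = sum(1 for text, _ in articles if term in text)
    as_link = sum(1 for _, linked in articles if term in linked)
    return as_link / appears if appears else 0.0


# Hypothetical toy corpus: "Georgia" occurs in three articles and is a link in one.
toy_articles = [
    ("Georgia joined the union in 1788.", {"Georgia"}),
    ("She studied in Georgia for a year.", set()),
    ("The climate of Georgia is humid.", set()),
]
toy_score = keyphraseness("Georgia", toy_articles)

# The slide's own counts for "Georgia".
slide_score = 22631 / 75000
```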
COMMONNESS
• The probability that a term t is a link to a SPECIFIC Wikipedia article a
• For example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
  – commonness(t, a1) = 166/(166+18+5) = 0.8783
$$\text{Commonness}(t, a) = \frac{\text{count}([a \mid t])}{\text{count}(t)}$$
(In the worked example, count(t) is the number of times t occurs as a link.)
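The worked example can be reproduced in a few lines. This sketch normalizes by the total number of links observed for the surface form, as the example does; the function name `commonness` and the dictionary layout are illustrative choices, not part of the original system.

```python
from collections import Counter


def commonness(link_counts):
    """Map each target article to its share of the links made from one surface form."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in link_counts.items()}


# Link counts for the surface form "Georgia", taken from the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
scores = commonness(georgia_links)
```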
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
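Links of the form [[article|surface_form]] can be pulled out of wiki markup with a regular expression. The sketch below is a simplification under stated assumptions: the pattern `LINK_RE` and the helper `extract_links` are hypothetical, and a real dump parser would also have to handle namespaces, file links and nested templates.

```python
import re
from collections import Counter

# [[target]] or [[target|surface form]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (target_article, surface_form) pairs found in a piece of wiki markup."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit surface form, the target itself is the anchor text.
        surface = (match.group(2) or target).strip()
        links.append((target, surface))
    return links


sample = "[[Pedanius Dioscorides|Dioscorides]] wrote about [[medicine]]."
links = extract_links(sample)
link_frequencies = Counter(links)  # (target, surface form) -> frequency
```

Counting the resulting pairs over a whole dump gives the surface-form, target and frequency records described in the dump statistics.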
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
  – ranked by frequency
• For example, the mention "Georgia" is linked to
  – University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
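The two steps of the unsupervised approach can be sketched as a filter followed by a ranking. The threshold 0.01 and the cutoff of 10 come from the slide; the function name `wikify_unsupervised`, the dictionary-based inputs and the toy statistics are assumptions for illustration.

```python
KEYPHRASENESS_THRESHOLD = 0.01  # value quoted on the slide
TOP_K = 10                      # "retain top 10"


def wikify_unsupervised(terms, keyphraseness, link_counts,
                        threshold=KEYPHRASENESS_THRESHOLD, top_k=TOP_K):
    """Keep terms above the keyphraseness threshold; rank their targets by commonness."""
    annotations = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link to be a candidate
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(
            ((count / total, article) for article, count in counts.items()),
            reverse=True,
        )
        annotations[term] = [(article, score) for score, article in ranked[:top_k]]
    return annotations


# Toy statistics (hypothetical layout; the numbers are those quoted on the slides).
toy_keyphraseness = {"Georgia": 0.3017466, "the": 0.0006}
toy_link_counts = {
    "Georgia": {"University_of_Georgia": 166,
                "Republic_of_Georgia": 18,
                "Georgia_(United_States)": 5},
    "the": {"The": 3},
}
result = wikify_unsupervised(["the", "Georgia"], toy_keyphraseness, toy_link_counts)
```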
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\text{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
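The formula can be evaluated directly on sets of links. Following Milne & Witten, the sketch below takes A1 and A2 to be the sets of articles that link to a1 and a2, and |W| to be the total number of articles; that reading, the function name and the toy sets are assumptions, since the slide does not define the symbols. As written, the value is 0 when the two link sets coincide and grows as their overlap shrinks.

```python
import math


def relatedness(links_a1, links_a2, total_articles):
    """Evaluate the slide's formula on two sets of linking articles.

    Returns None when the sets share no article, because log(0) is undefined.
    """
    common = links_a1 & links_a2
    if not common:
        return None
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        return None
    return (math.log(larger) - math.log(len(common))) / denominator


# Hypothetical toy data: article identifiers are arbitrary strings.
a1_links = {"p1", "p2", "p3", "p4"}
a2_links = {"p3", "p4", "p5"}
value = relatedness(a1_links, a2_links, total_articles=100)
```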
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: a mention together with the Wikipedia page it is linked to
  • Negative examples: similar mentions with other links
• Use positive and negative examples to train a model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
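The Giotto dataset can be generated mechanically from a linked sentence. The following is a minimal sketch: the function name `make_training_examples` and the tuple layout are hypothetical, and the candidate list is assumed to come from the surface-form dictionary.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the article actually linked in Wikipedia +1 and every other candidate -1."""
    examples = []
    for article in candidate_articles:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidate_articles=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
```

Each tuple can then be turned into a feature vector (for instance commonness plus similarity measures) before training a classifier.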
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
The DBpedia Dataset
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. In the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
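The mapping from infobox fields to triples can be illustrated with a toy extractor. This is only a sketch of the idea, not DBpedia's actual extraction framework: the function `infobox_to_triples` is hypothetical, the infobox snippet is made up from the property names on the slide, and values are kept as plain strings.

```python
def infobox_to_triples(subject, infobox_text):
    """Turn '| key = value' infobox lines into (subject, predicate, object) triples."""
    triples = []
    for line in infobox_text.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name, closing braces and blank lines
        key, value = line[1:].split("=", 1)
        triples.append((subject, key.strip(), value.strip()))
    return triples


# Hypothetical infobox fragment using property names shown on the slide.
calgary_infobox = """\
{{Infobox Settlement
| native_name = Calgary
| altitude = 1048
| population_city = 988193
}}
"""
triples = infobox_to_triples("http://dbpedia.org/resource/Calgary", calgary_infobox)
```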
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
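A query such as "all tennis players from Moscow" might be written roughly as follows. This is a sketch under assumptions: the class `dbo:TennisPlayer`, the property `dbo:birthPlace` and the resource `dbr:Moscow` are taken to exist in the current DBpedia ontology, and "from Moscow" is read as "born in Moscow". The code only builds the request URL using the standard `query` parameter of the SPARQL protocol; it does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:TennisPlayer, dbo:birthPlace, dbr:Moscow.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL for the given endpoint."""
    return endpoint + "?" + urlencode({"query": query})


request_url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL can be fetched with any HTTP client; the result format returned depends on the server configuration and the Accept header sent.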
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
www.phrasedetectives.com
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
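The matching rule can be sketched as a comparison of the two players' guesses. The normalization (trimming and lowercasing) and the function name `agreed_labels` are assumptions; the actual game's matching and point values are not given in the slides.

```python
def agreed_labels(guesses_a, guesses_b):
    """Return the labels typed by player A that player B also typed, in A's order."""
    def normalise(label):
        return label.strip().lower()  # assumed normalization

    typed_by_b = {normalise(g) for g in guesses_b}
    return [normalise(g) for g in guesses_a if normalise(g) in typed_by_b]


# Hypothetical round: both players see the same image and type guesses independently.
matches = agreed_labels(["Dog", "grass", "ball"], ["puppy", "ball ", "dog"])
players_score = bool(matches)  # any match means the pair scores points
```

A label that two independent players agree on is likely to describe the image, which is what makes the matched strings usable as image labels.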
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
bull Objective use information in Wikipedia to improve performance of text classifiers clustering systems
bull A number of possibilitiesndash Use similarity between documents and Wikipedia
pages on a given topic as a feature for text classification
ndash Use WIKIFICATION to enrich documentsndash Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
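The assertions in the table above are simple relation/argument triples, so one plausible way to hold them in a program is a list of tuples with an index by concept. This is only a sketch: the data structure, the function names `index_by_concept` and `related`, and the exact way each assertion is split into two arguments are assumptions, not ConceptNet's actual storage format.

```python
from collections import defaultdict

# A few assertions from the table, as (relation, first argument, second argument).
# The argument boundaries are an assumption made for this sketch.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]


def index_by_concept(assertions):
    """Map every concept to the assertions it takes part in."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, first, second))
        index[second].append((relation, first, second))
    return index


def related(index, concept, relation=None):
    """Return the assertions mentioning `concept`, optionally for one relation only."""
    return [a for a in index.get(concept, []) if relation is None or a[0] == relation]


index = index_by_concept(ASSERTIONS)
print(related(index, "apple"))
print(related(index, "computer", relation="PartOf"))
```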
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAPs do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn's group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
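The matching rule in the last bullet can be sketched as a comparison between the two players' typed strings. The normalisation step (lower-casing, collapsing whitespace) and the `points_per_match` value are assumptions for illustration; the slides do not say how the real game compares strings or how many points it awards.

```python
def normalise(label):
    """Lower-case a label and collapse runs of whitespace (an assumed normalisation)."""
    return " ".join(label.lower().split())


def agreed_labels(typed_by_a, typed_by_b):
    """Labels typed by both players, in the order player A typed them."""
    typed_b = {normalise(label) for label in typed_by_b}
    agreed = []
    for label in typed_by_a:
        norm = normalise(label)
        if norm in typed_b and norm not in agreed:
            agreed.append(norm)
    return agreed


def score(typed_by_a, typed_by_b, points_per_match=100):
    """Award points if any string matches; the point value is hypothetical."""
    return points_per_match if agreed_labels(typed_by_a, typed_by_b) else 0


print(agreed_labels(["Dog", "grass ", "ball"], ["puppy", "ball", "dog"]))
print(score(["cat"], ["dog"]))
```

An agreed label is the interesting output here: a string that two independent players both produced for the same image is likely to be a reasonable description of it.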
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th, 2003 and December 10th, 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea & A. Csomai, 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti & M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence
READINGS
• L. von Ahn & L. Dabbish, 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58-67
• M. Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING
• Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems
• A number of possibilities
  – Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification
  – Use WIKIFICATION to enrich documents
  – Use Wikipedia category system as category repertoire
Using Wikipedia Categories for text classification
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g., computer science, psychology)
  – Books
  – Learning objects
[Figure: the text "The United States was involved in the Cold War" mapped to weighted Wikipedia concepts and categories]
  United States: 0.3793
  Cold War: 0.3111
  Vietnam War: 0.0023
  World War I: 0.0023
  Communism: 0.0027
  Ronald Reagan: 0.0027
  Mikhail Gorbachev: 0.0023
  Cat: Wars Involving the United States: 0.00779
  Cat: Global Conflicts: 0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
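As a rough illustration of the second bullet, the sketch below treats the articles of each category as training documents and assigns a new text to the category whose summed bag of words is most similar. The classifier (cosine similarity to a category centroid) is a stand-in chosen for brevity, not the method used in the lecture, and the two-category `CATEGORY_DOCS` data is invented.

```python
import math
from collections import Counter

# Invented stand-ins for the text of Wikipedia articles grouped by category.
CATEGORY_DOCS = {
    "Computer science": [
        "an algorithm is a sequence of instructions run by a computer",
        "a compiler translates source code into machine code",
    ],
    "Psychology": [
        "memory and attention are studied in cognitive psychology",
        "behaviour therapy treats anxiety and depression",
    ],
}


def bag_of_words(text):
    """Count whitespace-separated, lower-cased tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(category_docs):
    """Sum the bags of words of each category's documents into one centroid."""
    return {
        category: sum((bag_of_words(doc) for doc in docs), Counter())
        for category, docs in category_docs.items()
    }


def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vector = bag_of_words(text)
    return max(centroids, key=lambda category: cosine(vector, centroids[category]))


centroids = train(CATEGORY_DOCS)
print(classify("the compiler generates machine code", centroids))
```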
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
WIKIFICATION
[Figure: a Text linked to Wikipedia. Example text: "Giotto was called to work in Padua, and also in Rimini"]
May 2012, Truc-Vien T. Nguyen
Wikification pipeline
[Figure: Raw (hyper)text → Decomposition → Clean Text → Keyword Extraction (Candidate Extraction, Candidate Ranking) → Text with selected keywords → Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) → Annotated Text → Recomposition → (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  – Candidate extraction
    • Typical methods: n-grams, noun phrases
  – Candidate ranking
    • Rank the candidates by importance
    • Typical methods
      – Unsupervised: information theoretic
      – Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  – Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  – More than 2,000,000 terms/phrases
  – Vocabulary is broad (e.g., "the" and "a" are included)
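Candidate extraction against a controlled vocabulary can be sketched as an n-gram lookup: every word sequence up to some length is checked against the set of known surface forms. The tiny `SURFACE_FORMS` dictionary, the whitespace tokenisation and the `max_n` limit are assumptions for this example; a real vocabulary would hold the two million or so titles and anchor texts mentioned above.

```python
# Hypothetical miniature surface-form dictionary: surface form -> article title.
SURFACE_FORMS = {
    "usa": "United States of America",
    "us": "United States of America",
    "united states": "United States of America",
    "cold war": "Cold War",
}


def ngrams(tokens, max_n):
    """Yield (start index, phrase) for every n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield start, " ".join(tokens[start:start + n])


def extract_candidates(text, surface_forms, max_n=3):
    """Return the n-grams of the text that are known surface forms."""
    tokens = text.lower().split()
    return [
        (start, phrase)
        for start, phrase in ngrams(tokens, max_n)
        if phrase in surface_forms
    ]


sentence = "The United States was involved in the Cold War"
print(extract_candidates(sentence, SURFACE_FORMS))
```

Because the vocabulary is so broad, a lookup like this over-generates; that is why the ranking step that follows is needed.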
Keyword Extraction using Wikipedia
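2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\text{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$$
  where count(D_key) is the number of documents in which the term W was selected as a keyword (i.e., used as a link), and count(D_W) is the number of documents in which it appears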
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006
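The Georgia example can be checked with a few lines of code. This is a minimal sketch: the function name and its two count arguments are chosen for this example, and the counts themselves are the ones quoted on the slide.

```python
def keyphraseness(articles_with_link, articles_with_term):
    """Share of the articles containing a term in which the term is used as a link."""
    if articles_with_term == 0:
        return 0.0
    return articles_with_link / articles_with_term


# Counts for "Georgia" as quoted on the slide.
georgia = keyphraseness(22631, 75000)
print(round(georgia, 7))
```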
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
• for example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783 (in this example the denominator counts the occurrences of t as a link)
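The same calculation for every target of a surface form, as a small sketch. It follows the worked example, where the denominator is the total number of times the surface form is used as a link; the function name and the use of a `Counter` are assumptions.

```python
from collections import Counter

# Link counts for the surface form "Georgia", as quoted on the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})


def commonness(target_counts):
    """Return, for each target article, its share of the links made with this surface form."""
    total = sum(target_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in target_counts.items()}


scores = commonness(georgia_links)
print(round(scores["University_of_Georgia"], 4))
```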
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
• In three phases
  • Identify articles of relevance
  • Extract (among other things)
    • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
    • Set of LINKS [article|surface_form]
      • [[Pedanius Dioscorides|Dioscorides]]
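Links in the `[[article|surface_form]]` notation can be pulled out of wiki markup with a regular expression. The sketch below is an assumption about how such an extractor might look, not the parser used for the lecture's system; it ignores nested links, templates and namespace prefixes, and the sample sentence is invented around the slide's example link.

```python
import re
from collections import Counter

# [[Article]] or [[Article|surface form]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs; a bare link is its own surface form."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        article = match.group(1).strip()
        surface = (match.group(2) or article).strip()
        links.append((article, surface))
    return links


# Invented sample text built around the example link from the slide.
sample = "The herbal of [[Pedanius Dioscorides|Dioscorides]] was copied in [[Constantinople]]."
links = extract_links(sample)
print(links)

# Counting (article, surface form) pairs gives the frequencies needed for commonness.
link_counts = Counter(links)
```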
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
– ranked by frequency
  • for example, the mention "Georgia" is linked to
    – University_of_Georgia: 166 times
    – Republic_of_Georgia: 18 times
    – Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
  surface form: the occurrence of a mention inside an article
  target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
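The two steps of the unsupervised approach can be combined as follows. This is a sketch under stated assumptions: the statistics are passed in as plain dictionaries, the tables below contain only the figures quoted earlier in the slides, and the function name `wikify_unsupervised` is hypothetical.

```python
# Statistics quoted earlier in the slides; real tables would come from the dump.
KEYPHRASENESS = {"georgia": 0.3017, "the": 0.0006}
LINK_COUNTS = {
    "georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    },
}


def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_k=10):
    """Keep terms above the keyphraseness threshold and rank their targets by commonness."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link at all
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(((count / total, article) for article, count in counts.items()),
                        reverse=True)
        result[term] = [(article, round(score, 4)) for score, article in ranked[:top_k]]
    return result


annotations = wikify_unsupervised(["the", "georgia"], KEYPHRASENESS, LINK_COUNTS)
print(annotations)
```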
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
  where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles
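The formula translates directly into code. In this sketch the link sets are passed in as Python sets of article identifiers and `wikipedia_size` stands for |W|; both the argument names and the choice to return `None` when the value is undefined are assumptions. Note that, as written, the expression gives 0 for identical link sets, so smaller values mean more closely related articles.

```python
import math


def relatedness(links_to_a1, links_to_a2, wikipedia_size):
    """Evaluate the formula above on two sets of in-linking articles.

    Returns None when the articles share no in-links (log of zero)
    or when the denominator is zero.
    """
    a1, a2 = set(links_to_a1), set(links_to_a2)
    common = a1 & a2
    if not common:
        return None
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(common))
    denominator = math.log(wikipedia_size) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return None
    return numerator / denominator


# Toy link sets: integers stand in for article identifiers.
print(relatedness({1, 2, 3, 4}, {3, 4}, wikipedia_size=100))
print(relatedness({1, 2, 3}, {1, 2, 3}, wikipedia_size=1000))
```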
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging, towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: a mention paired with the Wikipedia page it links to
  • Negative examples: similar mentions with other links
• Use positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
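The dataset above can be generated mechanically from a linked sentence: the article the mention actually links to yields a positive example, and every other candidate for the same surface form yields a negative one. The function name and the tuple layout are assumptions for this sketch; the candidates and the context are those shown on the slide.

```python
def make_training_examples(context, linked_article, candidates):
    """Label each candidate +1 if it is the article actually linked, else -1."""
    return [
        (1 if candidate == linked_article else -1, context, candidate)
        for candidate in candidates
    ]


examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is needed: the labels come for free from links that Wikipedia editors have already made.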
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
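Because an infobox is a list of `| key = value` fields, attribute-value pairs can be read from it with very little code. The sketch below assumes one field per line and the usual `{{Infobox ...}}` wiki syntax; it is a toy reader, not the DBpedia extraction framework, and the shortened infobox is adapted from the Poe example.

```python
import re

# Shortened version of the Poe infobox, assuming one "| key = value" field per line.
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]] [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""

FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]|]+)?\]\]")


def parse_infobox(wikitext):
    """Return a dict of field name -> raw value for each '| key = value' line."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def link_targets(value):
    """Return the article titles linked from a field value."""
    return [target.strip() for target in LINK.findall(value)]


fields = parse_infobox(INFOBOX)
print(fields["name"])
print(link_targets(fields["birth_place"]))
```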
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name → dbpedia:Dave_Bronconnier
  governing_body → dbpedia:Calgary_City_Council
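The Calgary example is a small set of subject-predicate-object triples, which can be mimicked with plain tuples and a pattern-matching helper. This is a toy in-memory illustration, not an RDF library: the `dbpedia:` prefix on `mayor_name` and `governing_body` is an assumption (the slide shows those two labels without a prefix), and `match` is a hypothetical helper.

```python
# Triples from the Calgary example, as (subject, predicate, object).
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    # Prefix on the next two predicates is assumed for consistency.
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


print(match(TRIPLES, predicate="dbpedia:altitude"))
print(match(TRIPLES, obj="dbpedia:Dave_Bronconnier"))
```

A wildcard pattern like this is the basic building block that a graph query language combines into larger queries.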
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
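As an illustration, the "all films by Quentin Tarantino" question might be phrased as the query below and sent to the endpoint as a `query` URL parameter, as the SPARQL protocol allows. The property `dbo:director`, the resource name `dbr:Quentin_Tarantino` and the two namespace prefixes are assumptions about the DBpedia vocabulary and may not match the dataset version described in these slides. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY)
```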
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
wwwphrasedetectivesorg
bull 2 tasks
ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric
ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user
wwwphrasedetectivescom
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal
bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data
bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH
bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012
READINGS
bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67
bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
Using Wikipedia Categories for text classification
17
WIKIPEDIA FOR TEXT CLASSIFICATIONbull Automatic identification of the topiccategory of a text (eg computer science
psychology)ndash Booksndash Learning objects
ldquoThe United States was involved in the Cold Warrdquo
United States03793
Cold War03111
Vietnam War00023
World War I00023
Communism00027
Ronald Reagan00027
Michail Gorbachev00023
Cat Wars Involvingthe United States000779
Cat Global Conflicts000779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ...
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with other pages that similar mentions link to
• Use positive and negative examples to train model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
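The dataset construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_training_examples` and the tuple layout are hypothetical, and the candidate list is taken as given rather than looked up in a surface-form dictionary.

```python
from typing import List, Tuple

def make_training_examples(
    context: str, linked_article: str, candidates: List[str]
) -> List[Tuple[int, str, str]]:
    """Label each candidate article +1 if it is the article the mention
    actually links to in Wikipedia, and -1 otherwise."""
    examples = []
    for candidate in candidates:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples

# The Giotto example from the slide.
examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, context, candidate in examples:
    print(f"{label:+d} {context} -- {candidate}")
```

Because the label comes from an existing Wikipedia link, no manual annotation is needed, which is the point of distant learning.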
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
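To illustrate why infoboxes count as semi-structured data, here is a minimal Python sketch that reads attribute-value pairs out of infobox markup. It assumes one `| key = value` field per line, which is how infoboxes are usually written but is not guaranteed. The helper names `parse_infobox` and `link_targets` are hypothetical, and this is not how DBpedia's own extractors are implemented.

```python
import re

def parse_infobox(wikitext: str) -> dict:
    """Collect '| key = value' fields, assuming one field per line."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skips the '{{Infobox ...' and '}}' lines
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields

# [[Target]] or [[Target|displayed text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def link_targets(value: str) -> list:
    """Return the article titles that a field value links to."""
    return [m.group(1).strip() for m in WIKILINK.finditer(value)]

sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""

fields = parse_infobox(sample)
print(fields["name"])                       # Edgar Allan Poe
print(link_targets(fields["birth_place"]))  # ['Boston, Massachusetts', 'United States']
```

Plain values such as `name` come out as strings, while linked values such as `birth_place` point at other articles, so they can become edges between entities.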
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
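The Calgary example can be read as a set of (subject, predicate, object) triples. The Python sketch below holds them in a plain list and looks up values by subject and predicate. It is illustrative only: a real application would use an RDF library, the `dbpedia:` prefix on `mayor_name` and `governing_body` is an assumption (the slide shows those two labels without a prefix), and `objects` is a hypothetical helper.

```python
# Each triple is (subject, predicate, object). Literals are kept as strings,
# as on the slide; the last two objects are themselves resources.
triples = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]

def objects(subject: str, predicate: str) -> list:
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("dbpedia:Calgary", "dbpedia:altitude"))    # ['1048']
print(objects("dbpedia:Calgary", "dbpedia:mayor_name"))  # ['dbpedia:Dave_Bronconnier']
```

Because an object can be another resource, following it as a subject walks the graph from one entity to the next.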
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
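As a hedged illustration of one of these questions, the Python sketch below builds a request URL for "All films by Quentin Tarantino" without sending it. The class `dbo:Film`, the property `dbo:director` and the resource name `dbr:Quentin_Tarantino` are assumptions about the DBpedia vocabulary and may differ between releases. `query` is the request parameter defined by the SPARQL Protocol, while the `format` parameter is assumed to be accepted by the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film, dbo:director, dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as an HTTP GET request URL; no request is sent here."""
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client; the response format depends on what the endpoint supports.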
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit")
  PartOf (PartOf "CPU" "computer")
  PropertyOf (PropertyOf "coffee" "wet")
  MadeOf (MadeOf "bread" "flour")
  DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf (SubeventOf "play sport" "score goal")
  FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment")
  DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed")
  MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
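The matching rule in the last bullet can be sketched as follows. This is a simplification under assumptions not stated on the slide: guesses are compared case-insensitively after trimming whitespace, the two players' guesses are interleaved one at a time, and the function name `first_agreement` is hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Optional

def first_agreement(guesses_a: Iterable[str], guesses_b: Iterable[str]) -> Optional[str]:
    """Return the first string that both players have typed, or None.

    A match does not need to happen on the same turn: a guess matches if
    the partner typed the same string at any earlier point in the round.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = a.strip().lower()
            seen_a.add(a)
            if a in seen_b:
                return a
        if b is not None:
            b = b.strip().lower()
            seen_b.add(b)
            if b in seen_a:
                return b
    return None

label = first_agreement(["dog", "grass", "ball"], ["puppy", "ball", "dog"])
print(label)  # ball
```

The agreed string becomes a label for the image: two players who cannot communicate are unlikely to type the same word unless it describes what they both see.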
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced, and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
WIKIPEDIA FOR TEXT CLASSIFICATION
• Automatic identification of the topic/category of a text (e.g. computer science, psychology)
  - Books
  - Learning objects
"The United States was involved in the Cold War"
United States                               0.3793
Cold War                                    0.3111
Vietnam War                                 0.0023
World War I                                 0.0023
Communism                                   0.0027
Ronald Reagan                               0.0027
Mikhail Gorbachev                           0.0023
Cat: Wars Involving the United States       0.00779
Cat: Global Conflicts                       0.00779
USING WIKIPEDIA FOR TEXT CLASSIFICATION
• Either directly use Wikipedia categories, or map one's categories to Wikipedia categories
• Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
• Text
WIKIFICATION
• Wikipedia
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
[Figure: Raw (hyper)text -> Decomposition -> Clean Text -> Keyword Extraction (Candidate Extraction, Candidate Ranking) -> Text with selected keywords -> Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) -> Annotated Text -> Recomposition -> (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
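Candidate extraction against a controlled vocabulary can be sketched as an n-gram lookup. In this minimal Python illustration the vocabulary is a toy set, tokenisation is plain whitespace splitting, and the three-token limit and the name `extract_candidates` are assumptions made for the example.

```python
from typing import List, Set, Tuple

def extract_candidates(
    tokens: List[str], vocabulary: Set[str], max_len: int = 3
) -> List[Tuple[int, str]]:
    """Return (start position, phrase) for every n-gram of up to
    max_len tokens that appears in the vocabulary."""
    found = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n > len(tokens):
                break
            phrase = " ".join(tokens[start:start + n])
            if phrase in vocabulary:
                found.append((start, phrase))
    return found

# Toy stand-in for the titles and surface forms taken from Wikipedia.
vocabulary = {"Giotto", "Padua", "Rimini", "work"}
tokens = "Giotto was called to work in Padua".split()
candidates = extract_candidates(tokens, vocabulary)
print(candidates)  # [(0, 'Giotto'), (4, 'work'), (6, 'Padua')]
```

Because the vocabulary is broad, common words such as "work" are extracted too, which is why the ranking step is needed.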
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
$$\text{Keyphraseness}(t) = \frac{\text{count}([\_ \mid t])}{\text{count}(t)}$$
• Examples
  - The term "Georgia"
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  - Cf. the term "the": keyphraseness = 0.0006
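The keyphraseness ratio is simple enough to check by hand. A minimal Python sketch, using the counts quoted for "Georgia" and a hypothetical function name:

```python
def keyphraseness(link_count: int, total_count: int) -> float:
    """Fraction of the articles containing a term in which the term is a link."""
    if total_count == 0:
        return 0.0  # a term that never occurs cannot be a keyphrase
    return link_count / total_count

georgia = keyphraseness(22631, 75000)
print(f"{georgia:.4f}")  # 0.3017
```

A function word such as "the" occurs in nearly every article but is almost never a link, so its ratio is close to zero.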
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
$$\text{Commonness}(t, a) = \frac{\text{count}([a \mid t])}{\text{count}(t)}$$
• for example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
  - commonness(t, a1) = 166/(166+18+5) = 0.8783
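The worked example for "Georgia" can be reproduced with a few lines of Python. The sketch assumes, as the example's arithmetic does, that the denominator is the total number of times the surface form occurs as a link. The function name is hypothetical.

```python
from typing import Dict

def commonness(link_counts: Dict[str, int]) -> Dict[str, float]:
    """Turn per-article link counts for one surface form into probabilities."""
    total = sum(link_counts.values())
    if total == 0:
        return {}
    return {article: count / total for article, count in link_counts.items()}

georgia_links = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}
scores = commonness(georgia_links)
print(round(scores["University_of_Georgia"], 4))  # 0.8783
```

The scores for one surface form sum to 1, so they can be read as a prior over its candidate articles before any context is taken into account.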
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
• In three phases
  - Identify articles of relevance
  - Extract (among other things)
    • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
    • Set of LINKS [article|surface_form]
      - [[Pedanius Dioscorides|Dioscorides]]
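Link extraction of this kind can be sketched with a regular expression over wikitext. The Python below is a simplified illustration, not the parser used for the reported statistics: it ignores nested templates, file links and section anchors, and the sample text is invented for the example.

```python
import re
from collections import Counter

# [[article]] or [[article|surface form]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def extract_links(wikitext: str) -> Counter:
    """Count (surface form, target article) pairs found in wikitext.

    When a link has no '|', the surface form is the article title itself.
    """
    counts = Counter()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        counts[(surface, target)] += 1
    return counts

sample = (
    "See [[Pedanius Dioscorides|Dioscorides]] and [[Padua]]; "
    "[[Pedanius Dioscorides|Dioscorides]] again."
)
link_counts = extract_links(sample)
for (surface, target), n in link_counts.most_common():
    print(surface, "->", target, n)
```

Summing these counts per surface form gives the frequencies from which keyphraseness and commonness are computed.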
The Wikipedia Dump from July 2011
- 11,459,639 pages in total
- 12,525,583 links
  • specifying surface word, target, frequency
- ranked by frequency
• for example, the mention "Georgia" is linked to
  - University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary      English      Italian     Polish
Titles          4,361,020    917,354     920,486
Surface forms   8,829,624    2,484,045   2,482,104
Files           745,724      72,126      n/a
Links           10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
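The two steps of the unsupervised approach can be sketched as a filter followed by a sort. This is a minimal Python illustration under assumptions: the statistics are passed in as precomputed dictionaries, "top 10" is read as the ten highest-ranked candidate articles per term, and the function name `rank_candidates` is hypothetical.

```python
from typing import Dict, List, Tuple

def rank_candidates(
    terms: List[str],
    keyphraseness: Dict[str, float],
    commonness: Dict[str, Dict[str, float]],
    threshold: float = 0.01,
    top_n: int = 10,
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep terms whose keyphraseness is above the threshold, then rank
    each term's candidate articles by commonness and keep the top_n."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link at all, e.g. "the"
        ranked = sorted(
            commonness.get(term, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        result[term] = ranked[:top_n]
    return result

ranking = rank_candidates(
    terms=["Georgia", "the"],
    keyphraseness={"Georgia": 0.3017, "the": 0.0006},
    commonness={"Georgia": {"Republic_of_Georgia": 0.0952,
                            "University_of_Georgia": 0.8783}},
)
print(ranking)
```

No training data is involved; the behaviour is controlled entirely by the threshold and the cut-off.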
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\text{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
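The formula can be computed directly from two sets of incoming links. The Python sketch below implements it as written, so that, as with a distance, lower values mean more closely related articles. Returning None when the two articles share no incoming links is a choice made for this example, since the logarithm of zero is undefined, and the function name is hypothetical.

```python
import math
from typing import Optional, Set

def link_distance(links_a: Set[str], links_b: Set[str], n_articles: int) -> Optional[float]:
    """Normalized link distance between two articles.

    links_a, links_b: sets of articles that link to each article.
    n_articles: total number of articles in Wikipedia (|W|); it is
    assumed to be larger than the smaller of the two link sets.
    """
    common = links_a & links_b
    if not common:
        return None  # log(0) is undefined when there are no shared in-links
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    numerator = math.log(larger) - math.log(len(common))
    denominator = math.log(n_articles) - math.log(smaller)
    return numerator / denominator

a = {"p1", "p2", "p3", "p4"}
b = {"p3", "p4"}
print(link_distance(a, b, n_articles=16))  # (log 4 - log 2) / (log 16 - log 2), about 0.333
print(link_distance(a, a, n_articles=16))  # 0.0: identical in-link sets
```

In a supervised wikifier this value would be one feature among several, alongside commonness.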
USING WIKIPEDIA FOR TEXT CLASSIFICATION
bull Either directly use Wikipedia categories or map onersquos categories to Wikipedia categories
bull Use the documents associated with those categories as training documents
TEXT WIKIFICATION
Wikification = adding links to Wikipedia pages to documents
bull Text
WIKIFICATION
bull Wikipedia
20May 2012 Truc-Vien T Nguyen
Giotto was called to work in Padua and also in Rimini
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use a Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• The probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term "Georgia"
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22,631 / 75,000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
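The keyphraseness ratio can be sketched in a few lines, assuming the two article counts have already been collected from the dump. The function and argument names are illustrative, not taken from the system described in the slides.

```python
def keyphraseness(linked_article_count: int, article_count: int) -> float:
    """Fraction of the articles containing a term in which the term is a link."""
    if article_count == 0:
        return 0.0  # term never seen: treat as never linked (assumption)
    return linked_article_count / article_count

# Counts for "Georgia" as given on the slide.
georgia = keyphraseness(22631, 75000)
print(f"keyphraseness(Georgia) = {georgia:.4f}")
```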
COMMONNESS
• The probability that a term t is a link to a SPECIFIC Wikipedia article a
• For example, the surface form "Georgia" was found to be linked to
  – a1 = University_of_Georgia: 166 times
  – Republic_of_Georgia: 18 times
  – Georgia_(United_States): 5 times
• commonness(t, a1) = 166 / (166 + 18 + 5) = 0.8783
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
(in the worked example, count(t) is the number of times t occurs as a link, summed over all its targets)
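A short sketch of commonness over the link counts for one surface form, assuming the counts are available as a mapping from target article to frequency. The name `commonness` and the use of `Counter` are illustrative choices.

```python
from collections import Counter

def commonness(target_counts: Counter) -> dict:
    """Normalise the link counts of one surface form into probabilities
    over its target articles."""
    total = sum(target_counts.values())
    if total == 0:
        return {}
    return {article: n / total for article, n in target_counts.items()}

# Link counts for the surface form "Georgia" as given on the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})
georgia_commonness = commonness(georgia_links)
```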
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
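As a rough illustration of the link-extraction step, the sketch below pulls (article, surface form) pairs out of wiki markup with a regular expression. It is a simplification under stated assumptions: the pattern and the function `extract_links` are hypothetical, and real dumps also contain file links, category links, and section anchors that would need separate handling.

```python
import re
from collections import Counter

# [[Target]] or [[Target|surface form]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(wikitext):
    """Return (target_article, surface_form) pairs found in the markup."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface form is the article title itself.
        surface = (match.group(2) or target).strip()
        links.append((target, surface))
    return links

sample = "Described by [[Pedanius Dioscorides|Dioscorides]], who lived in [[Rome]]."
links = extract_links(sample)

# Frequency table of (surface form, target) pairs, as needed for commonness.
link_counts = Counter((surface, target) for target, surface in links)
```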
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
– ranked by frequency
  • for example, the mention "Georgia" is linked to
    – University_of_Georgia: 166 times
    – Republic_of_Georgia: 18 times
    – Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian.
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the Wikipedia article that a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
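The two steps above can be sketched as follows, assuming keyphraseness and commonness have already been computed and stored in dictionaries. The function name, the data layout, and the rounded example values are assumptions made for illustration; only the 0.01 threshold and the top-10 cut-off come from the slide.

```python
def wikify_unsupervised(terms, keyphraseness, commonness,
                        threshold=0.01, top_k=10):
    """Keep terms whose keyphraseness exceeds the threshold, then rank
    each term's candidate articles by commonness and keep the top_k."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # unlikely to be a link: not a candidate
        candidates = commonness.get(term, {})
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        result[term] = ranked[:top_k]
    return result

# Statistics rounded from the examples on the earlier slides.
keyphraseness_table = {"Georgia": 0.3017, "the": 0.0006}
commonness_table = {
    "Georgia": {
        "Republic_of_Georgia": 0.0952,
        "University_of_Georgia": 0.8783,
        "Georgia_(United_States)": 0.0265,
    }
}

annotations = wikify_unsupervised(["the", "Georgia"],
                                  keyphraseness_table, commonness_table)
```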
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles
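A sketch of the relatedness formula, computed exactly as written on the slide and assuming each article's incoming links are available as a set of article identifiers. Read literally, the value behaves like a distance (0 for identical link sets, larger for less related pages); returning infinity when the sets do not overlap is a choice made here, not something stated in the slides.

```python
import math

def relatedness(inlinks_a1: set, inlinks_a2: set, total_articles: int) -> float:
    """Normalised link distance between two articles, from the sets of
    articles linking to each of them."""
    overlap = len(inlinks_a1 & inlinks_a2)
    if overlap == 0:
        return math.inf  # log(0) is undefined; treat as maximally distant
    larger = max(len(inlinks_a1), len(inlinks_a2))
    smaller = min(len(inlinks_a1), len(inlinks_a2))
    return (math.log(larger) - math.log(overlap)) / (
        math.log(total_articles) - math.log(smaller)
    )

# Toy link sets (hypothetical article ids) in a 1000-article collection.
a1_inlinks = {1, 2, 3, 4}
a2_inlinks = {3, 4, 5}
distance = relatedness(a1_inlinks, a2_inlinks, 1000)
```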
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH              AQUAINT   WIKIPEDIA
Our approach          85.66     84.37
Milne & Witten 2008   83.61     80.31
Ratinov et al. 2011   84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: a mention paired with the Wikipedia page it is linked to
  • Negative examples: the same mention paired with other pages that similar mentions link to
• Use the positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work ... -- Giotto_di_Bondone
  – -1 Giotto was called to work ... -- Giotto_Griffiths
  – -1 Giotto was called to work ... -- Giotto_Bizzarrini
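The dataset construction above can be sketched as a small function, assuming the candidate articles for a surface form are already known from the link dictionary. The function name and the tuple layout (label, context, candidate) are illustrative.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the article actually linked in Wikipedia as +1 and every
    other candidate for the same surface form as -1."""
    return [
        (+1 if candidate == linked_article else -1, context, candidate)
        for candidate in candidate_articles
    ]

context = "Giotto was called to work in Padua, and also in Rimini"
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(context, "Giotto_di_Bondone", candidates)

for label, _, article in examples:
    print(f"{label:+d} {article}")
```

Each Wikipedia link therefore yields one positive example and as many negative examples as there are competing candidates, without any manual annotation.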
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
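To illustrate why infoboxes count as semi-structured data, the sketch below reads attribute-value pairs from infobox markup, assuming one `| key = value` pair per line. The parser `parse_infobox` is hypothetical and deliberately naive: it does not resolve nested templates or links inside the values.

```python
def parse_infobox(wikitext):
    """Collect `| key = value` lines of an infobox into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            # Split on the first "=" only, so values may contain "=".
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

infobox = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}"""

poe = parse_infobox(infobox)
```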
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name → dbpedia:Dave_Bronconnier
  governing_body → dbpedia:Calgary_City_Council
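The Calgary example can be written down as subject-predicate-object triples. The sketch below stores them as plain Python tuples and answers simple pattern queries; it is a stand-in for illustration only, with prefixed names copied from the slide, and the helper `match` is hypothetical rather than part of any RDF library.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples from the slide.
triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples agreeing with every position that is not None."""
    return [
        (s, p, o)
        for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

altitude = match(triples, subject=CALGARY, predicate="dbpedia:altitude")
```

Leaving a position unspecified plays the role of a variable in a query pattern, which is the basic idea behind SPARQL triple patterns.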
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
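As an illustration, one of the example questions ("all films by Quentin Tarantino") might be phrased as the SPARQL query below. This is a sketch under assumptions: the class `dbo:Film`, the property `dbo:director`, and the resource name are taken from the present-day DBpedia ontology rather than from the slides, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""

def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a SPARQL GET request."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL can be fetched with any HTTP client; the result format is negotiated with the server and is not fixed by this sketch.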
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  – IsA: (IsA "apple" "fruit")
  – PartOf: (PartOf "CPU" "computer")
  – PropertyOf: (PropertyOf "coffee" "wet")
  – MadeOf: (MadeOf "bread" "flour")
  – DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  – PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf: (SubeventOf "play sport" "score goal")
  – FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  – CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  – LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  – EffectOf: (EffectOf "view video" "entertainment")
  – DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf: (DesireOf "person" "not be depressed")
  – MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  – UsedFor: (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine: (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
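The matching rule can be sketched as follows, assuming each player's guesses are available as a list of strings. This is a simplification for illustration: the function `esp_match` is hypothetical, it compares complete guess lists rather than guesses arriving in real time, and lower-casing the strings before comparison is an assumption.

```python
def esp_match(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed,
    or None if the two players never agree."""
    typed_by_b = {guess.strip().lower() for guess in guesses_b}
    for guess in guesses_a:
        label = guess.strip().lower()
        if label in typed_by_b:
            return label  # agreed label: both players score points
    return None

player_a = ["dog", "grass", "Puppy"]
player_b = ["animal", "puppy", "ball"]
label = esp_match(player_a, player_b)
```

Because the players cannot communicate, a label that both of them type independently is likely to be a reasonable description of the image.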
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A multi-lingual Disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
WIKIFICATION
• Text: Giotto was called to work in Padua and also in Rimini
• Wikipedia
May 2012, Truc-Vien T Nguyen
Wikification pipeline
[Figure: Raw (hyper)text -> Decomposition -> Clean Text -> Keyword Extraction (Candidate Extraction, Candidate Ranking) -> Text with selected keywords -> Word Sense Disambiguation (Extract Sense Definitions from Sense Inventory; Knowledge-based: Lesk-like Definition Overlap; Data Driven: Naive Bayes trained on Wikipedia; Voting) -> Annotated Text -> Recomposition -> (Hyper)text with linked keywords]
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
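The candidate extraction step can be illustrated with a minimal Python sketch. It assumes the vocabulary of titles and anchor texts is already loaded into a set; the function name `extract_candidates`, the whitespace tokenisation, and the toy vocabulary `VOCAB` are hypothetical choices made for illustration, not the method used in the lecture.

```python
def extract_candidates(text, vocabulary, max_len=4):
    """Return (start, end, phrase) for every n-gram of the text found in the vocabulary."""
    tokens = text.split()  # naive whitespace tokenisation (an assumption)
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                candidates.append((i, i + n, phrase))
    return candidates


# Toy vocabulary standing in for Wikipedia titles and anchor texts.
VOCAB = {"Giotto", "Padua", "Rimini", "the", "USA", "United States of America"}

sentence = "Giotto was called to work in Padua and also in Rimini"
found = extract_candidates(sentence, VOCAB)
print(found)
```

Because the vocabulary is broad, this step deliberately over-generates; the ranking step is what filters out terms such as "the".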
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness

$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$
Our own Approach (cfr. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use the Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten's prior link probability)
• Examples
  - The term "Georgia"
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  - Cfr. the term "the": keyphraseness = 0.0006

$$\text{Keyphraseness}(t) = \frac{\text{count}([\_ \mid t])}{\text{count}(t)}$$
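As a worked check of the Georgia example, here is a short Python sketch of the keyphraseness ratio. The function name and the zero-count guard are illustrative assumptions; only the two counts come from the slide.

```python
def keyphraseness(link_count, total_count):
    """Fraction of the articles containing a term in which the term appears as a link."""
    if total_count == 0:
        return 0.0  # assumption: an unseen term gets keyphraseness 0
    return link_count / total_count


# Counts for the term "Georgia" as given on the slide.
georgia = keyphraseness(22631, 75000)
print(f"{georgia:.4f}")
```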
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form "Georgia" was found to be linked to
  - a1 = University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
• commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\text{Commonness}(t, a) = \frac{\text{count}([a \mid t])}{\text{count}(t)}$$
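The commonness example can be reproduced with a few lines of Python. Following the worked example, the denominator here is the number of times the surface form occurs as a link (166+18+5); the function name and the `GEORGIA_LINKS` dictionary are hypothetical names introduced for this sketch.

```python
def commonness(link_counts, article):
    """Share of a surface form's links that point to the given article."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0  # assumption: a surface form with no links has commonness 0
    return link_counts.get(article, 0) / total


# Link targets of the surface form "Georgia", as given on the slide.
GEORGIA_LINKS = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}

top = commonness(GEORGIA_LINKS, "University_of_Georgia")
print(round(top, 4))
```

By construction, the commonness values of all targets of one surface form sum to 1, so they can be read as a probability distribution over candidate articles.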
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  - In three phases
• Identify articles of relevance
• Extract (among other things)
  - Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  - Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
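A rough Python sketch of how [article|surface_form] links might be pulled out of wiki markup is shown below. The regular expression is a simplification assumed for illustration (it ignores templates, file links, and section anchors), and `extract_links` and `count_links` are hypothetical helper names, not part of the system described here.

```python
import re
from collections import Counter, defaultdict

# [[target|surface]] or [[target]]; nested brackets are not handled (simplifying assumption).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Yield (target_article, surface_form) pairs found in the markup."""
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        surface = (m.group(2) or target).strip()  # a bare [[Rome]] links to itself
        yield target, surface


def count_links(wikitext):
    """Map each surface form to a Counter of the articles it links to."""
    table = defaultdict(Counter)
    for target, surface in extract_links(wikitext):
        table[surface][target] += 1
    return table


sample = "[[Pedanius Dioscorides|Dioscorides]] wrote in [[Greek language|Greek]] while in [[Rome]]."
pairs = list(extract_links(sample))
print(pairs)
```

The per-surface-form counters produced by `count_links` are the raw material for the commonness statistic.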
The Wikipedia Dump from July 2011
• 11,459,639 pages in total
• 12,525,583 links
  - specifying surface word, target, frequency
  - ranked by frequency
• for example, the mention "Georgia" is linked to
  - University_of_Georgia: 166 times
  - Republic_of_Georgia: 18 times
  - Georgia_(United_States): 5 times
Some statistics (all Wikidumps from July 2011)

Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms, titles, articles

Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981

Files in Polish are arranged in a repository different from English/Italian.

Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wiki article that a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  - Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  - Retain top 10
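The two steps of the unsupervised approach can be sketched in Python as follows. This is an illustrative reading of the slide, not the actual implementation: the function name `wikify_unsupervised`, the dictionary-based inputs, and the strict "greater than" comparison against the threshold are all assumptions.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_k=10):
    """Keep terms above the keyphraseness threshold, then rank their targets by commonness."""
    results = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # too rarely used as a link, e.g. "the"
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        results[term] = [(article, n / total) for article, n in ranked]
    return results


# Statistics taken from the Georgia and "the" examples in the slides.
KEYPHRASENESS = {"Georgia": 22631 / 75000, "the": 0.0006}
LINK_COUNTS = {
    "Georgia": {
        "University_of_Georgia": 166,
        "Republic_of_Georgia": 18,
        "Georgia_(United_States)": 5,
    }
}

ranking = wikify_unsupervised(["the", "Georgia"], KEYPHRASENESS, LINK_COUNTS)
print(ranking)
```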
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\text{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
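The relatedness formula can be evaluated directly, as in this small Python sketch. It assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles; the function name and the handling of the two edge cases (no shared links, zero denominator) are choices made for this illustration. Note that, as written, the formula behaves like a distance: 0 means identical link sets, and larger values mean fewer shared links.

```python
import math


def relatedness(links_a1, links_a2, num_articles):
    """Evaluate the normalized-link-distance formula for two sets of incoming links."""
    common = len(links_a1 & links_a2)
    if common == 0:
        return math.inf  # assumption: no shared links means maximally distant
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    denominator = math.log(num_articles) - math.log(smaller)
    if denominator == 0:
        return 0.0  # both sets cover every article
    return (math.log(larger) - math.log(common)) / denominator


# Toy link sets: article IDs are arbitrary integers.
A1 = {1, 2, 3, 4}
A2 = {3, 4, 5, 6, 7, 8, 9, 10}
score = relatedness(A1, A2, num_articles=100)
print(score)
```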
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)

Results on standard datasets

APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3

CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32

Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic

LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ...
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples from mention-to-link Wikipedia page
  - Negative examples from similar mentions with other links
• Use positive and negative examples to train model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
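The Giotto dataset above can be generated mechanically, as the following Python sketch suggests. It assumes the linked article (the gold target) and the list of candidate articles for the surface form are already known from the dump; `make_training_examples` and `CANDIDATES` are hypothetical names, and a real system would attach feature vectors rather than raw strings.

```python
def make_training_examples(context, gold_article, candidates):
    """Label the linked article +1 and every other candidate for the same mention -1."""
    return [(+1 if article == gold_article else -1, context, article)
            for article in candidates]


# Candidate articles for the surface form "Giotto", as listed on the slide.
CANDIDATES = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

examples = make_training_examples(
    context="Giotto was called to work",
    gold_article="Giotto_di_Bondone",  # the target of the link in the Wikipedia text
    candidates=CANDIDATES,
)
for label, context, article in examples:
    print(f"{label:+d} {context} -- {article}")
```

Since every link in Wikipedia yields one positive and several negative examples, no manual annotation is needed to build the training set.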
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data

{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At "Developers Guide to Semantic Web Toolkits" you can find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
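To make the triple representation concrete, the Calgary example can be written as plain (subject, predicate, object) tuples in Python. This is only a toy in-memory sketch under the assumption that each infobox attribute becomes one triple; the `match` helper is hypothetical and is not part of DBpedia or of any RDF library.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# One triple per infobox attribute, with predicate names as they appear on the slide.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every component that is not None."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


print(match(TRIPLES, p="dbpedia:altitude"))
```

Leaving a component as `None` plays the role of a variable in a query pattern, which is the basic idea behind SPARQL triple patterns.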
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware

The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
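As an illustration of one of these queries, the Python sketch below builds a request URL for "all tennis players from Moscow". The class and property names (`dbo:TennisPlayer`, `dbo:birthPlace`, `dbr:Moscow`) are assumptions based on the current DBpedia ontology and may not match the vocabulary in use when the slides were written; the `format` parameter is likewise an assumption about the Virtuoso endpoint. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# "All tennis players from Moscow", read here as "born in Moscow" (an assumption).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 20
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request against a SPARQL endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url[:60] + "...")
```

Fetching `url` with any HTTP client should return the bindings for `?player`, provided the endpoint is reachable and the assumed vocabulary is still valid.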
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)

THINGS (52,000 assertions)
  - IsA: (IsA "apple" "fruit")
  - PartOf: (PartOf "CPU" "computer")
  - PropertyOf: (PropertyOf "coffee" "wet")
  - MadeOf: (MadeOf "bread" "flour")
  - DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  - PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  - SubeventOf: (SubeventOf "play sport" "score goal")
  - FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  - LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  - CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  - LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  - EffectOf: (EffectOf "view video" "entertainment")
  - DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf: (DesireOf "person" "not be depressed")
  - MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  - IsUsedFor: (UsedFor "fireplace" "burn wood")
  - CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  - ThematicKLine: (ThematicKLine "wedding dress" "veil")
  - ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks "computers are unable to perform" (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP: THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence "the ESP game"
• If any of the strings typed by one player matches a string typed by the other player, they score points
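The matching rule of the game can be sketched in Python as follows. This is a simplification made for illustration: it compares two complete lists of guesses rather than handling guesses as they arrive in real time, and the case-insensitive comparison and the name `agreed_label` are assumptions, not details given in the slides.

```python
def agreed_label(guesses_a, guesses_b):
    """Return the first of player A's guesses that player B also typed, or None."""
    typed_by_b = {g.strip().lower() for g in guesses_b}
    for guess in guesses_a:
        normalised = guess.strip().lower()
        if normalised in typed_by_b:
            return normalised  # this string becomes a label for the image
    return None


# Two players describing the same image without seeing each other's input.
label = agreed_label(["dog", "grass", "Ball"], ["puppy", "ball", "park"])
print(label)
```

The agreed string is useful as an image label precisely because two independent players produced it, which is what makes the output of the game trustworthy without a separate validation step.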
Wikification pipeline
Candidate
Extraction
Candidate
Ranking
Extract Sense Definitions
from Sense Inventory
Knowledge- based
Lesk- like Definition
Overlap
Data Driven
Naive Bayes
trained on Wikipedia
Voting
Tex
t w
ith
sel
ecte
d k
eyw
ord
s
Dec
om
po
siti
on
Raw
(h
yper
)tex
t
Cle
an T
ext
Rec
om
posi
tion
(Hyp
er)t
ext
wit
h
linked
key
wo
rds
Annotated Text
Word Sense DisambiguationKeyword Extraction
Keyword Extraction
bull Finding important wordsphrases in raw textbull Two-stage process
ndash Candidate extractionbull Typical methods n-grams noun phrases
ndash Candidate rankingbull Rank the candidates by importancebull Typical methods
ndash Unsupervised information theoretic ndash Supervised machine learning using positional and linguistic
features
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  - IsA (IsA "apple" "fruit")
  - PartOf (PartOf "CPU" "computer")
  - PropertyOf (PropertyOf "coffee" "wet")
  - MadeOf (MadeOf "bread" "flour")
  - DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  - PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  - SubeventOf (SubeventOf "play sport" "score goal")
  - FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  - LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  - CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  - LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  - EffectOf (EffectOf "view video" "entertainment")
  - DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf (DesireOf "person" "not be depressed")
  - MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  - UsedFor (UsedFor "fireplace" "burn wood")
  - CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  - ThematicKLine (ThematicKLine "wedding dress" "veil")
  - ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
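The matching rule in the last bullet can be sketched as a simple comparison of the two players' label lists. This is an illustration only: the function names, the case-insensitive comparison and the `points_per_match` value are assumptions, not details of the actual game.

```python
def matching_labels(labels_a, labels_b):
    """Return the labels typed by both players, in player A's typing order.

    Comparison ignores case and surrounding whitespace (an assumption).
    """
    normalised_b = {label.strip().lower() for label in labels_b}
    seen = set()
    matches = []
    for label in labels_a:
        key = label.strip().lower()
        if key in normalised_b and key not in seen:
            seen.add(key)
            matches.append(key)
    return matches


def score(labels_a, labels_b, points_per_match=100):
    """Hypothetical scoring: a fixed number of points per agreed label."""
    return points_per_match * len(matching_labels(labels_a, labels_b))


agreed = matching_labels(["Dog", "grass", "ball"], ["ball ", "puppy", "dog"])
```

Because neither player sees the other's input, a label that both type independently is likely to describe the image well, which is what makes the agreed labels usable as annotations.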
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai (2007). Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
Keyword Extraction
• Finding important words/phrases in raw text
• Two-stage process
  - Candidate extraction
    • Typical methods: n-grams, noun phrases
  - Candidate ranking
    • Rank the candidates by importance
    • Typical methods:
      - Unsupervised: information theoretic
      - Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia
1. Candidate extraction
• Semi-controlled vocabulary
  - Wikipedia article titles and anchor texts (surface forms)
    • E.g. "USA", "US" = "United States of America"
  - More than 2,000,000 terms/phrases
  - Vocabulary is broad (e.g. "the", "a" are included)
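Candidate extraction against such a vocabulary can be sketched as an n-gram lookup. This is a minimal illustration that assumes whitespace tokenisation and a lower-cased in-memory set of surface forms; the function name `extract_candidates` and the toy vocabulary are hypothetical.

```python
def extract_candidates(text, vocabulary, max_len=4):
    """Return (token_index, n-gram) pairs whose lower-cased form is in the vocabulary."""
    tokens = text.split()
    found = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n > len(tokens):
                break
            ngram = " ".join(tokens[start:start + n])
            if ngram.lower() in vocabulary:
                found.append((start, ngram))
    return found


# Toy vocabulary of surface forms; the real one has more than 2,000,000 entries.
vocabulary = {"usa", "us", "united states of america", "the"}

candidates = extract_candidates("The USA team won", vocabulary)
```

Note that "The" is returned as a candidate: because the vocabulary is broad, a ranking step is needed afterwards to discard such terms.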
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  - Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  - The degree to which it appeared more times than expected by chance
• Keyphraseness
$$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_{W})}$$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  - KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  - COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of system
  - UNSUPERVISED: use statistics only
  - SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  - The term Georgia
    • is found as a link in 22,631 Wikipedia articles
    • appears in 75,000 Wikipedia articles
    • keyphraseness = 22631/75000 = 0.3017466
  - Cf. the term "the": keyphraseness = 0.0006
$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
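The keyphraseness ratio can be checked against the Georgia figures with a few lines of code. This is a minimal sketch; the function name and the zero-count guard are assumptions.

```python
def keyphraseness(link_count: int, total_count: int) -> float:
    """Fraction of the articles containing a term in which the term is a link."""
    if total_count == 0:
        return 0.0  # assumption: a term never seen has keyphraseness 0
    return link_count / total_count


# Figures from the slide: "Georgia" is a link in 22,631 of the 75,000 articles it appears in.
georgia = keyphraseness(22631, 75000)
```

A function word such as "the" appears in almost every article but is almost never linked, which is why its value is close to zero.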
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form Georgia was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
  commonness(t, a1) = 166/(166+18+5) = 0.8783
$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
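The worked Georgia example can be reproduced as follows. In this sketch the denominator is the total number of links made from the surface form, as in the arithmetic of the example; the function name and the use of a `Counter` are assumptions.

```python
from collections import Counter


def commonness(link_targets: Counter, article: str) -> float:
    """Share of a surface form's links that point to the given article."""
    total = sum(link_targets.values())
    if total == 0:
        return 0.0  # assumption: a surface form with no links has commonness 0
    return link_targets[article] / total


# Link counts for the surface form "Georgia", taken from the slide.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})

c_university = commonness(georgia_links, "University_of_Georgia")
```

The commonness values of all targets of one surface form sum to 1, so they can be read as a prior distribution over candidate articles.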
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
• In three phases
  - Identify articles of relevance
  - Extract (among other things)
    • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
    • Set of LINKS [article|surface_form]
    • E.g. [[Pedanius Dioscorides|Dioscorides]]
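Extracting [article|surface_form] pairs from wikitext can be sketched with a regular expression. This is a simplified illustration rather than the parser used for the dump: the pattern, the crude namespace filter and the function names are all assumptions.

```python
import re
from collections import Counter, defaultdict

# Matches [[target|surface]] and [[target]]; nested or multi-pipe links are ignored.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Yield (article, surface_form) pairs for each internal link."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        if ":" in target:  # crude filter for File:, Category: and similar targets
            continue
        yield target, surface


def build_link_counts(pages):
    """Map each surface form to a Counter of the articles it links to."""
    counts = defaultdict(Counter)
    for text in pages:
        for article, surface in extract_links(text):
            counts[surface][article] += 1
    return counts


sample = "[[Pedanius Dioscorides|Dioscorides]] wrote about [[botany]] [[File:x.jpg]]"
links = list(extract_links(sample))
link_counts = build_link_counts([sample, "See [[Pedanius Dioscorides|Dioscorides]]."])
```

The resulting per-surface-form counts are the raw material for the commonness statistic.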
The Wikipedia Dump from July 2011
  - 11,459,639 pages in total
  - 12,525,583 links
    • specifying surface word, target, frequency
  - ranked by frequency
    • for example, the mention Georgia is linked to
      - University_of_Georgia 166 times
      - Republic_of_Georgia 18 times
      - Georgia_(United_States) 5 times
Truc-Vien T. Nguyen, May 2012
Some statistics (all Wikidumps from July 2011)
Page Type          English     Italian      Polish
Redirected       4,465,652     323,591     134,148
List_of            138,581         836       5,021
Disambiguation     176,721       6,193       4,553
Relevant         4,361,020     917,354     920,486
Total           11,459,639   1,654,258   1,200,313
Surface forms, titles, articles
Dictionary         English     Italian      Polish
Titles           4,361,020     917,354     920,486
Surface forms    8,829,624   2,484,045   2,482,104
Files              745,724      72,126         n/a
Links           10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
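The four steps of the unsupervised approach can be sketched as one function. This is a minimal illustration, assuming the two statistics are already available as plain dictionaries; the function name, the argument layout and the reading of the threshold as 0.01 are assumptions.

```python
def unsupervised_wikify(candidates, keyphraseness, link_counts,
                        threshold=0.01, top_k=10):
    """Filter terms by keyphraseness, then rank their target articles by commonness.

    keyphraseness: dict term -> float
    link_counts:   dict term -> {article: number of links}
    Returns dict term -> list of (article, commonness), best first.
    """
    results = {}
    for term in candidates:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # not link-worthy enough
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue
        ranked = sorted(
            ((count / total, article) for article, count in targets.items()),
            reverse=True,
        )
        results[term] = [(article, score) for score, article in ranked[:top_k]]
    return results


kp = {"Georgia": 0.3017466, "the": 0.0006}
lc = {
    "Georgia": {"University_of_Georgia": 166, "Republic_of_Georgia": 18,
                "Georgia_(United_States)": 5},
    "the": {"The": 3},
}
wikified = unsupervised_wikify(["the", "Georgia"], kp, lc)
```

With these figures "the" is filtered out by the threshold and "Georgia" keeps its three candidate articles in commonness order.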
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$
where A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
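The relatedness formula can be computed directly from the two sets of incoming links. The sketch below follows the formula as written on the slide, so identical link sets give 0 and the value grows as the overlap shrinks. The function name and the handling of an empty overlap are assumptions.

```python
import math


def relatedness(links_a: set, links_b: set, total_articles: int) -> float:
    """Link-based measure from the slide, computed from incoming-link sets.

    links_a, links_b: sets of articles linking to a1 and a2
    total_articles:   |W|, assumed larger than either link set
    """
    overlap = len(links_a & links_b)
    if overlap == 0:
        return math.inf  # assumption: log(0) is undefined, so treat as maximally distant
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    numerator = math.log(larger) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(smaller)
    return numerator / denominator


# Toy example: 4 and 2 incoming links, 2 shared, 16 articles in total.
value = relatedness({1, 2, 3, 4}, {3, 4}, 16)
```

In the toy example the numerator is log 2 and the denominator is log 8, so the result is 1/3.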
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH              AQUAINT   WIKIPEDIA
Our approach            85.66       84.37
Milne & Witten 2008     83.61       80.31
Ratinov et al. 2011     84.52       90.20
Wikifying queries: the Bridgeman datasets
• BAL Data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate          64.77
Among first 2            71.59
First 3                  75.42
First 4                  77.18
First 5                  78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries: achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ...
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: from similar mentions with other links
• Use positive and negative examples to train the model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
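Generating the +1/-1 dataset from a linked sentence can be sketched as below. This is an illustration of the labelling step only: the function name and the tuple layout are assumptions, and the candidate list is taken to be given.

```python
def make_training_examples(context, linked_article, candidate_articles):
    """Label the article actually linked in Wikipedia +1 and every other candidate -1."""
    examples = []
    for article in candidate_articles:
        label = 1 if article == linked_article else -1
        examples.append((label, context, article))
    return examples


examples = make_training_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    linked_article="Giotto_di_Bondone",
    candidate_articles=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
```

No manual annotation is needed: the link that a Wikipedia editor chose for the mention serves as the gold label.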
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
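Because an infobox is a list of `| key = value` fields, its attributes can be read with very little code. The sketch below assumes one field per line, as in the usual wikitext layout; the function name is hypothetical and nested templates are kept as raw strings rather than interpreted.

```python
def parse_infobox(wikitext):
    """Parse '| key = value' lines of an infobox into a dict of raw strings."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the '{{Infobox ...' header and the closing '}}'
        key, _, value = line[1:].partition("=")  # split at the first '=' only
        fields[key.strip()] = value.strip()
    return fields


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

poe = parse_infobox(sample)
```

Splitting at the first `=` keeps values such as `{{birth date|1809|1|19|mf=y}}` intact even though they contain `=` themselves.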
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
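The Calgary example can be written down as subject-predicate-object triples, which is the shape RDF gives to infobox data. The sketch below uses plain tuples and a tiny pattern matcher purely for illustration; it is not an RDF library, and the `match` helper is hypothetical.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# (subject, predicate, object) triples transcribed from the slide.
triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every given (non-None) component."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


altitude = match(triples, s=CALGARY, p="dbpedia:altitude")
```

Leaving a component as `None` plays the role of a variable in a query pattern, which is the basic idea behind a SPARQL triple pattern.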
Keyword Extraction using Wikipedia
1 Candidate extractionbull Semi-controlled vocabulary
ndash Wikipedia article titles and anchor texts (surface forms)
bull Eg ldquoUSArdquo ldquoUSrdquo = ldquoUnited States of Americardquo
ndash More than 2000000 termsphrasesndash Vocabulary is broad (eg the a are included)
Keyword Extraction using Wikipedia
2 Candidate rankingbull tf idf
ndash Wikipedia articles as document collection
bull Chi-squared independence of phrase and textndash The degree to which it appeared more times than
expected by chance
bull Keyphraseness
)(
)()|(
W
key
Dcount
DcountWkeywordP
Our own Approach(Cfr Milne amp Witten 2008 2012 Ratinov et al 2011)
bull Use Wikipedia dump to compute two statistics
bull KEYPHRASENESS prior probability that a term is used to refer to a Wikipedia article
bull COMMONNESS probability that phrase is used to refer to specific Wikipedia article
bull Two versions of system
bull UNSUPERVISED use statistics only
bull SUPERVISED use distant learning to create training data
KEYPHRASENESS
bull the probability that a term t is a link to a Wikipedia article
(cfr Milne amp Wittenrsquos prior link probability)
bull Examplesbull The term Georgia
ndash Is found as a link in 22631 Wikipedia articlesndash appears in 75000 Wikipedia articles keyphraseness = 2263175000 = 03017466
bull Cfr the term ldquotherdquo keyphraseness = 00006
euro
Keyphraseness(t) =count([_ | t])
count(t)
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name: dbpedia:Dave_Bronconnier
governing_body: dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
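To make the RDF representation concrete, here is a small sketch that stores the Calgary facts as (subject, predicate, object) triples and looks one up. Plain tuples stand in for a real RDF library. The "dbpedia:" prefix on mayor_name and governing_body and the helper name objects are assumptions made for illustration.

```python
# Each fact is one (subject, predicate, object) triple, as in RDF.
CALGARY = "dbpedia:Calgary"

triples = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    # Objects can be other resources, which is what makes the data a graph.
    (CALGARY, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is a stored triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(CALGARY, "dbpedia:population_city"))
```

Literal values (the population) and resource values (the mayor) sit in the same position of the triple, which is why a single lookup function covers both.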
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
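As an illustration of one of the example questions, the sketch below writes "all films by Quentin Tarantino" as a SPARQL query and encodes it as a GET request URL for the endpoint. It only builds the URL and sends nothing. The property dbo:director and the resource dbr:Quentin_Tarantino are assumed to be the relevant DBpedia names; check them against the live dataset before relying on them.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The triple pattern in the WHERE clause has the same subject, predicate, object shape as the stored data, with ?film as the unknown.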
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
IsA: (IsA "apple" "fruit"); PartOf: (PartOf "CPU" "computer"); PropertyOf: (PropertyOf "coffee" "wet"); MadeOf: (MadeOf "bread" "flour"); DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope"); SubeventOf: (SubeventOf "play sport" "score goal"); FirstSubeventOf: (FirstSubeventOf "start fire" "light match"); LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
EffectOf: (EffectOf "view video" "entertainment"); DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf: (DesireOf "person" "not be depressed"); MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
IsUsedFor: (UsedFor "fireplace" "burn wood"); CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization"); ThematicKLine: (ThematicKLine "wedding dress" "veil"); ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
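ConceptNet assertions are binary relations between short phrases, so a handful of them can be held as (relation, concept, concept) tuples and indexed by the first concept. The sketch below does this for a few of the examples listed above. The index layout and the helper name related are illustrative choices, not ConceptNet's actual storage format.

```python
from collections import defaultdict

# A few assertions from the relation table, as (relation, left, right).
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]

# Index the assertions by their left-hand concept for quick lookup.
by_concept = defaultdict(list)
for relation, left, right in assertions:
    by_concept[left].append((relation, right))


def related(concept):
    """Return the (relation, other concept) pairs recorded for a concept."""
    return list(by_concept.get(concept, []))


print(related("dentist"))
```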
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
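The matching rule can be sketched as a comparison of the two players' guess lists. This is a simplified illustration: it ignores timing and scoring, and the case and whitespace normalisation and the function name first_agreed_label are assumptions rather than details of the actual game.

```python
def first_agreed_label(guesses_a, guesses_b):
    """Return the first string typed by player A that player B also typed, or None."""

    def normalise(text):
        # Assumption: matching ignores case and surrounding whitespace.
        return text.strip().lower()

    typed_by_b = {normalise(guess) for guess in guesses_b}
    for guess in guesses_a:
        label = normalise(guess)
        if label in typed_by_b:
            return label  # agreement reached: this becomes the image label
    return None


label = first_agreed_label(["Dog", "grass", "ball"], ["puppy", "ball ", "GRASS"])
print(label)
```

Because neither player can see the other's guesses, an agreed string is likely to be a sensible description of the image rather than a coincidence.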
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
  – Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user
PHRASE DETECTIVES: THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.
• V. Nguyen and M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti and M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase and M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58–67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
Keyword Extraction using Wikipedia
2. Candidate ranking
• tf.idf
  – Wikipedia articles as document collection
• Chi-squared independence of phrase and text
  – The degree to which it appeared more times than expected by chance
• Keyphraseness
$P(\mathrm{keyword} \mid W) \approx \frac{\mathrm{count}(D_{key})}{\mathrm{count}(D_W)}$
Our own Approach (cf. Milne & Witten 2008, 2012; Ratinov et al. 2011)
• Use Wikipedia dump to compute two statistics
  • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article
  • COMMONNESS: probability that a phrase is used to refer to a specific Wikipedia article
• Two versions of the system
  • UNSUPERVISED: use statistics only
  • SUPERVISED: use distant learning to create training data
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten’s prior link probability)
• Examples
  • The term Georgia
    – is found as a link in 22,631 Wikipedia articles
    – appears in 75,000 Wikipedia articles
    – keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term “the”: keyphraseness = 0.0006
$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$
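The keyphraseness ratio can be checked with a few lines of code. The sketch below recomputes the Georgia figure from the two counts quoted on the slide. The function name and the error handling are illustrative additions.

```python
def keyphraseness(link_count, total_count):
    """Fraction of the articles containing a term in which the term is a link."""
    if total_count <= 0:
        raise ValueError("the term must occur in at least one article")
    return link_count / total_count


# Counts quoted for the term "Georgia": linked in 22,631 of 75,000 articles.
georgia = keyphraseness(22631, 75000)
print(f"{georgia:.7f}")
```

A function word such as “the” occurs almost everywhere but is almost never a link, which is why its keyphraseness is close to zero.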
COMMONNESS
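• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form Georgia was found to be linked to
  – a1 = University_of_Georgia, 166 times
  – Republic_of_Georgia, 18 times
  – Georgia_(United_States), 5 times
  – commonness(t, a1) = 166/(166+18+5) = 0.8783
$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$
In the worked example the denominator is the number of times t occurs as a link (166 + 18 + 5 = 189), not the number of articles in which t appears.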
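The commonness of each candidate article can be computed directly from the link counts. This sketch uses the three counts given for the surface form Georgia and, following the worked example, divides by the total number of links for that surface form. The variable and function names are illustrative.

```python
from collections import Counter

# How often the surface form "Georgia" links to each target article.
georgia_links = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})


def commonness(link_counts, article):
    """Share of a surface form's links that point to the given article."""
    total = sum(link_counts.values())
    if total == 0:
        return 0.0
    return link_counts[article] / total


for article in georgia_links:
    print(article, round(commonness(georgia_links, article), 4))
```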
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
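Links of the form [[article|surface_form]] can be pulled out of wikitext with a regular expression. The sketch below is a simplified illustration and not the parser used for the dump: it ignores namespaces such as files and categories, and the pattern and function name are assumptions.

```python
import re
from collections import Counter

# [[target]] or [[target|surface form]]; nested brackets are not handled.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Return (target article, surface form) pairs found in the text."""
    pairs = []
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface form is the target title itself.
        surface = (match.group(2) or target).strip()
        pairs.append((target, surface))
    return pairs


text = "[[Pedanius Dioscorides|Dioscorides]] wrote about [[medicine]]."
links = extract_links(text)

# Counting (surface form, target) pairs gives the frequencies used later.
frequency = Counter((surface, target) for target, surface in links)
print(links)
```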
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
  • specifying surface word, target, frequency
– ranked by frequency
• for example, the mention Georgia is linked to
  – University_of_Georgia, 166 times
  – Republic_of_Georgia, 18 times
  – Georgia_(United_States), 5 times
May 2012, Truc-Vien T. Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type         English       Italian      Polish
Redirected        4,465,652     323,591      134,148
List_of           138,581       836          5,021
Disambiguation    176,721       6,193        4,553
Relevant          4,361,020     917,354      920,486
Total             11,459,639    1,654,258    1,200,313
Surface forms titles articles
Dictionary        English       Italian      Polish
Titles            4,361,020     917,354      920,486
Surface forms     8,829,624     2,484,045    2,482,104
Files             745,724       72,126       n/a
Links             10,871,741    2,917,235    2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
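The two steps of the unsupervised approach can be sketched as a filter followed by a ranking. This is a hedged illustration under stated assumptions: "top 10" is read as the ten best candidate articles per term, the input dictionaries are toy data, and the function name is hypothetical.

```python
def wikify_unsupervised(terms, keyphraseness, link_counts, threshold=0.01, top_n=10):
    """Keep terms above the keyphraseness threshold; rank their targets by commonness."""
    result = {}
    for term in terms:
        # Step 1: discard terms that are rarely used as links.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        counts = link_counts.get(term, {})
        total = sum(counts.values())
        if total == 0:
            continue
        # Step 2: rank candidate articles by commonness and keep the best ones.
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        result[term] = [(article, n / total) for article, n in ranked[:top_n]]
    return result


# Toy statistics using the figures quoted for "Georgia" and "the".
sample_keyphraseness = {"Georgia": 22631 / 75000, "the": 0.0006}
sample_links = {
    "Georgia": {
        "Republic_of_Georgia": 18,
        "University_of_Georgia": 166,
        "Georgia_(United_States)": 5,
    },
}

candidates = wikify_unsupervised(["the", "Georgia"], sample_keyphraseness, sample_links)
print(candidates)
```

With these figures “the” is filtered out in the first step, and the candidates for Georgia come back ordered by how often each article is the link target.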
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)
$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$
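The link-based formula can be evaluated on small sets to see how it behaves. In this sketch A1 and A2 are taken to be the sets of articles that link to a1 and a2, and W the total number of Wikipedia articles, as in Milne & Witten. Returning infinity when the sets share no links is an assumption, since the logarithm is undefined there. Note that the expression as written is 0 for identical link sets and grows as the overlap shrinks, so it behaves like a distance.

```python
import math


def link_distance(links_a1, links_a2, wikipedia_size):
    """Evaluate the link-based formula for two articles' sets of linking pages."""
    a1, a2 = set(links_a1), set(links_a2)
    shared = a1 & a2
    if not shared:
        return math.inf  # assumption: no shared links means maximally distant
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(shared))
    denominator = math.log(wikipedia_size) - math.log(min(len(a1), len(a2)))
    return numerator / denominator


# Toy data: page ids of articles linking to a1 and to a2, in a 100-article wiki.
value = link_distance({1, 2, 3, 4}, {3, 4, 5}, wikipedia_size=100)
print(round(value, 4))
```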
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20
• BAL data sets
  – 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100-query set
    • 3 annotators, each up to 3 manual annotations
Wikifying queries: the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina’s demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  • Positive examples: a mention paired with the Wikipedia page it links to
  • Negative examples: the same or similar mentions paired with other candidate pages
• Use positive and negative examples to train the model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
http://en.wikipedia.org/wiki/Giotto_di_Bondone
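The Giotto dataset can be generated mechanically once the linked article and the competing candidates are known. The sketch below labels the article actually linked in the Wikipedia sentence as +1 and every other candidate as -1. The function name and the tuple layout are illustrative, and no features or learner are included.

```python
def make_training_examples(context, linked_article, candidates):
    """Label the linked article +1 and every other candidate -1 for one mention."""
    examples = []
    for candidate in candidates:
        label = 1 if candidate == linked_article else -1
        examples.append((label, context, candidate))
    return examples


giotto_examples = make_training_examples(
    context="Giotto was called to work",
    linked_article="Giotto_di_Bondone",
    candidates=["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)

for label, context, candidate in giotto_examples:
    print(f"{label:+d} {context} -- {candidate}")
```

No manual annotation is needed because the existing link in the Wikipedia text supplies the positive label.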
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits, by programming language, that can be used to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
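The Calgary slide is a small RDF graph: every fact is a (subject, property, value) triple. A minimal Python sketch of that idea follows. It uses plain tuples instead of an RDF library, and the helper `match` is hypothetical; it only imitates the kind of pattern matching a query language performs over triples.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)

CALGARY = "http://dbpedia.org/resource/Calgary"

# The facts shown on the slide, one triple per infobox attribute.
triples: List[Triple] = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(store: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every component that is not None."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


print(match(triples, p="dbpedia:altitude"))
```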
SPARQL
• SPARQL is a query language for RDF
  • RDF is a directed, labeled graph data format for representing information in the Web
  • This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
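As an illustration, "all films by Quentin Tarantino" might be written as the SPARQL query below, wrapped in Python so that a request URL for the endpoint can be built. Several things here are assumptions rather than facts from the slides: the class and property names (`dbo:Film`, `dbo:director`) follow the current DBpedia ontology and may differ from the dataset version described above, and the `format` parameter is the one commonly accepted by Virtuoso endpoints. The sketch only builds the URL; it does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as URL parameters for an HTTP GET request."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; the answer is a table of bindings for `?film`.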
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - Cycorp
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
IsA (IsA "apple" "fruit"), PartOf (PartOf "CPU" "computer"), PropertyOf (PropertyOf "coffee" "wet"), MadeOf (MadeOf "bread" "flour"), DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope"), SubeventOf (SubeventOf "play sport" "score goal"), FirstSubeventOf (FirstSubeventOf "start fire" "light match"), LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
EffectOf (EffectOf "view video" "entertainment"), DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf (DesireOf "person" "not be depressed"), MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
UsedFor (UsedFor "fireplace" "burn wood"), CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
SuperThematicKLine (SuperThematicKLine "western civilization" "civilization"), ThematicKLine (ThematicKLine "wedding dress" "veil"), ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
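The matching rule can be sketched as a set lookup. This is a simplification, not the game's actual implementation: the function name `first_match` is hypothetical, the timing of guesses is ignored, and the normalisation (lower-casing and trimming spaces) is an assumption.

```python
from typing import List, Optional


def first_match(guesses_a: List[str], guesses_b: List[str]) -> Optional[str]:
    """Return the first string typed by player A that player B also typed, else None."""
    typed_by_b = {g.strip().lower() for g in guesses_b}
    for guess in guesses_a:
        normalised = guess.strip().lower()
        if normalised in typed_by_b:
            return normalised  # the agreed string becomes a label for the image
    return None


label = first_match(["Dog", "grass", "ball"], ["puppy", "ball", "park"])
print(label)
```

Since the two players cannot communicate, a string they both type is likely to describe the image itself, which is what makes the agreed string usable as a label.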
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
KEYPHRASENESS
• the probability that a term t is a link to a Wikipedia article (cf. Milne & Witten's prior link probability)
• Examples
  • The term Georgia
    - is found as a link in 22,631 Wikipedia articles
    - appears in 75,000 Wikipedia articles
    - keyphraseness = 22631/75000 = 0.3017466
  • Cf. the term "the": keyphraseness = 0.0006

$$\mathrm{Keyphraseness}(t) = \frac{\mathrm{count}([\_ \mid t])}{\mathrm{count}(t)}$$
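The Georgia figure can be checked with a one-line computation. The function below is a minimal sketch; its name and the guard for unseen terms are assumptions, and the two counts are the ones quoted on the slide.

```python
def keyphraseness(link_count: int, total_count: int) -> float:
    """Fraction of the articles containing a term in which the term is a link."""
    if total_count == 0:
        return 0.0  # assumption: a term that never occurs gets keyphraseness 0
    return link_count / total_count


# "Georgia": a link in 22,631 of the 75,000 articles in which it appears.
georgia = keyphraseness(22631, 75000)
print(round(georgia, 4))
```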
COMMONNESS
• the probability that a term t is a link to a SPECIFIC Wikipedia article a
• for example, the surface form Georgia was found to be linked to
  - a1 = University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
  so commonness(t, a1) = 166/(166+18+5) = 0.8783

$$\mathrm{Commonness}(t, a) = \frac{\mathrm{count}([a \mid t])}{\mathrm{count}(t)}$$
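The same arithmetic in Python, as a small sketch. The counts are those given for the surface form Georgia; the function name and the use of `collections.Counter` are illustrative choices, and the denominator is taken to be the total number of links made with that surface form, as in the worked example.

```python
from collections import Counter

# How often the surface form "Georgia" links to each target article.
link_targets = Counter({
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
})


def commonness(targets: Counter, article: str) -> float:
    """Share of a surface form's links that point to the given article."""
    total = sum(targets.values())
    return targets[article] / total if total else 0.0


print(round(commonness(link_targets, "University_of_Georgia"), 4))
```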
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
  • In three phases
• Identify articles of relevance
• Extract (among other things)
  • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
  • Set of LINKS [article|surface_form]
    • [[Pedanius Dioscorides|Dioscorides]]
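Links of the form [[article|surface_form]] can be pulled out of wikitext with a regular expression. The sketch below is an assumption about how such an extractor might look, not the parser used for the figures in these slides; it ignores file links, section anchors and nested markup.

```python
import re
from collections import Counter
from typing import List, Tuple

# [[article|surface form]] or [[article]]; in the second case the title is the surface form.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext: str) -> List[Tuple[str, str]]:
    """Return (article, surface_form) pairs for every internal link."""
    links = []
    for m in LINK_RE.finditer(wikitext):
        article = m.group(1).strip()
        surface = (m.group(2) or m.group(1)).strip()
        links.append((article, surface))
    return links


text = "The herbal of [[Pedanius Dioscorides|Dioscorides]] was read in [[Rome]]."
links = extract_links(text)
counts = Counter(links)  # frequency of each (article, surface_form) pair
print(links)
```

Counting the pairs over a whole dump gives the surface-form dictionary and the link frequencies that keyphraseness and commonness are computed from.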
The Wikipedia Dump from July 2011
  - 11,459,639 pages in total
  - 12,525,583 links
• specifying surface word, target, frequency
  - ranked by frequency
• for example, the mention Georgia is linked to
  - University_of_Georgia 166 times
  - Republic_of_Georgia 18 times
  - Georgia_(United_States) 5 times
Some statistics (all Wikidumps from July 2011)
Page Type         English       Italian      Polish
Redirected        4,465,652     323,591      134,148
List_of           138,581       836          5,021
Disambiguation    176,721       6,193        4,553
Relevant          4,361,020     917,354      920,486
Total             11,459,639    1,654,258    1,200,313
Surface forms titles articles
Dictionary       English       Italian      Polish
Titles           4,361,020     917,354      920,486
Surface forms    8,829,624     2,484,045    2,482,104
Files            745,724       72,126       n/a
Links            10,871,741    2,917,235    2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
  • Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
  • Retain top 10
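The two steps can be put together in a short sketch. Everything here is illustrative: the function name, the dictionary layout and the toy counts for "the" are assumptions, while the threshold of 0.01 and the cut-off of 10 candidates are the values quoted above.

```python
from typing import Dict, List, Tuple


def wikify_unsupervised(terms: List[str],
                        keyphraseness: Dict[str, float],
                        link_counts: Dict[str, Dict[str, int]],
                        threshold: float = 0.01,
                        top_n: int = 10) -> Dict[str, List[Tuple[str, float]]]:
    """Keep terms above the keyphraseness threshold; rank their targets by commonness."""
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # not link-worthy enough
        targets = link_counts.get(term, {})
        total = sum(targets.values())
        if total == 0:
            continue  # no known target article for this term
        ranked = sorted(((count / total, article) for article, count in targets.items()),
                        reverse=True)
        result[term] = [(article, score) for score, article in ranked[:top_n]]
    return result


kp = {"Georgia": 0.3017, "the": 0.0006}
counts = {
    "Georgia": {"University_of_Georgia": 166, "Republic_of_Georgia": 18,
                "Georgia_(United_States)": 5},
    "the": {"English_articles": 3},  # toy value, for illustration only
}
result = wikify_unsupervised(["Georgia", "the"], kp, counts)
print(result)
```

With these inputs "the" is dropped by the threshold, and "Georgia" keeps its three candidates ordered by commonness.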
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)

$$\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}$$

where, following Milne & Witten, A_1 and A_2 are the sets of articles that link to a_1 and a_2, and W is the set of all Wikipedia articles.
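A direct transcription of the formula into Python is sketched below. It assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to the two candidate pages and that W is the total number of Wikipedia articles; the function name and the choice to return infinity when the two sets share no article are assumptions. Note that the value is a distance: 0 means the two pages have identical incoming links.

```python
import math
from typing import Set


def link_distance(links_a1: Set[int], links_a2: Set[int], n_articles: int) -> float:
    """Normalized link distance between two pages, from their incoming-link sets."""
    common = links_a1 & links_a2
    if not common:
        return math.inf  # assumption: no shared in-links means maximally distant
    larger = max(len(links_a1), len(links_a2))
    smaller = min(len(links_a1), len(links_a2))
    # Undefined when n_articles == smaller; real dumps are far larger than any link set.
    return ((math.log(larger) - math.log(len(common)))
            / (math.log(n_articles) - math.log(smaller)))


a1 = {1, 2, 3, 4}   # ids of articles linking to the first page (toy data)
a2 = {3, 4, 5}      # ids of articles linking to the second page (toy data)
distance = link_distance(a1, a2, n_articles=100)
print(distance)
```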
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20
• BAL data sets
  - 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100 Query set
    • 3 annotators, each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
COMMONNESS
bull the probability that a term t is a link to a SPECIFIC Wikipedia article a
bull for example the surface form Georgia was found to be linked to
ndash a1 = University_of_Georgia 166 times
commonness(t a1) = 166(166+18+5) = 08783
ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
euro
Commonness(ta) =count([a | t])
count(t)
Extracting dictionaries and statistics from a Wikipedia dump
bull Parsingbull In three phases
bull Identify articles of relevancebull Extract (among other things)
bull Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
bull Set of LINKS [article|surface_form]
bull [[Pedanius Dioscorides|Dioscorides]]
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
• 1,600,000 concepts
• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The "Developers Guide to Semantic Web Toolkits" lists development toolkits for processing DBpedia data in your preferred programming language.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
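The Calgary example can be written down as subject-predicate-object triples. The following Python sketch uses plain tuples rather than an RDF library. The predicate names are abbreviated as on the slide (full predicate URIs are omitted), and the `match` helper is a hypothetical illustration of pattern matching over triples.

```python
DBR = "http://dbpedia.org/resource/"

# (subject, predicate, object) triples transcribed from the slide.
# Literal values are kept as strings; objects starting with DBR are resources.
triples = [
    (DBR + "Calgary", "native_name", "Calgary"),
    (DBR + "Calgary", "altitude", "1048"),
    (DBR + "Calgary", "population_city", "988193"),
    (DBR + "Calgary", "population_metro", "1079310"),
    (DBR + "Calgary", "mayor_name", DBR + "Dave_Bronconnier"),
    (DBR + "Calgary", "governing_body", DBR + "Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


mayor = match(triples, predicate="mayor_name")[0][2]
print(mayor)
```

Matching a pattern with wildcards against a set of triples is the basic operation that a SPARQL query generalises.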
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information on the Web
• The W3C SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
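As a rough sketch, one of these questions ("all films by Quentin Tarantino") might be phrased as below and encoded as an HTTP GET request. The vocabulary (`dbo:director`, `dbr:Quentin_Tarantino`) is an assumption about the DBpedia ontology and may not match the dataset version described in these slides. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
""".strip()


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query in the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(QUERY)
# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(url)
```

Sending the request (for example with `urllib.request.urlopen`) and choosing a result format are left out, since both depend on the endpoint's current configuration.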
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
– IsA (IsA "apple" "fruit")
– PartOf (PartOf "CPU" "computer")
– PropertyOf (PropertyOf "coffee" "wet")
– MadeOf (MadeOf "bread" "flour")
– DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
– PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
– SubeventOf (SubeventOf "play sport" "score goal")
– FirstSubeventOf (FirstSubeventOf "start fire" "light match")
– LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
– CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
– LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
– EffectOf (EffectOf "view video" "entertainment")
– DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
– DesireOf (DesireOf "person" "not be depressed")
– MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
– UsedFor (UsedFor "fireplace" "burn wood")
– CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
– SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
– ThematicKLine (ThematicKLine "wedding dress" "veil")
– ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
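The matching rule in the last bullet can be sketched as follows. This is a simplified illustration, not the game's actual implementation: the case and whitespace normalisation in `normalize` is an assumption, and other rules of the real game are not modelled.

```python
def normalize(label):
    """Lower-case a label and collapse whitespace (an assumed normalisation)."""
    return " ".join(label.lower().split())


def agreed_labels(typed_by_a, typed_by_b):
    """Labels typed by player A that player B also typed, in A's typing order."""
    seen_by_b = {normalize(s) for s in typed_by_b}
    agreed = []
    for s in typed_by_a:
        n = normalize(s)
        if n in seen_by_b and n not in agreed:
            agreed.append(n)
    return agreed


# Hypothetical guesses by two players looking at the same image.
player_a = ["dog", "Grass", "ball"]
player_b = ["puppy", "grass", "park"]
matches = agreed_labels(player_a, player_b)
print(matches)
```

A label on which two independent players agree is much more likely to be a good description of the image than a label typed by only one of them, which is what makes the output usable as annotation.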
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
• By 2008
– 200,000 players
– 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
– choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1,000 with more than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
– 15 participants, 20 images
– 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
– Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
– Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen and M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, and M. Poesio. 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase and M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo, and Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
Extracting dictionaries and statistics from a Wikipedia dump
• Parsing
– In three phases
• Identify articles of relevance
• Extract (among other things)
– Set of SURFACE FORMS (terms that are used to link to Wikipedia articles)
– Set of LINKS [article|surface_form], e.g. [[Pedanius Dioscorides|Dioscorides]]
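A minimal Python sketch of the link-extraction step, assuming links appear in the standard `[[article|surface form]]` wikitext notation. The regular expression and the function name `extract_links` are illustrative only. File and category links, nested links and templates are ignored here, and a real dump parser has to handle them.

```python
import re

# [[article]] or [[article|surface form]]
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wikitext):
    """Return (article, surface_form) pairs for every [[...]] link.

    When no surface form is given, the article title itself is the surface form.
    """
    links = []
    for m in LINK_RE.finditer(wikitext):
        article = m.group(1).strip()
        surface = (m.group(2) or m.group(1)).strip()
        links.append((article, surface))
    return links


text = "See [[Pedanius Dioscorides|Dioscorides]] and [[daguerreotype]]."
links = extract_links(text)
print(links)
```

Collecting these pairs over the whole dump gives both dictionaries at once: the set of surface forms, and for each surface form the articles it links to.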
The Wikipedia Dump from July 2011
– 11,459,639 pages in total
– 12,525,583 links
• specifying surface word, target, frequency
– ranked by frequency
• for example, the mention "Georgia" is linked to
– University_of_Georgia 166 times
– Republic_of_Georgia 18 times
– Georgia_(United_States) 5 times
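These link counts are what commonness is computed from. The sketch below assumes that commonness is the fraction of a surface form's links that point to a given target, and that the three counts shown are the complete list for "Georgia" (the slide may only show an excerpt).

```python
def commonness(link_counts):
    """Turn raw link counts for one surface form into relative frequencies."""
    total = sum(link_counts.values())
    return {target: count / total for target, count in link_counts.items()}


# Counts for the mention "Georgia", as given on the slide.
georgia_counts = {
    "University_of_Georgia": 166,
    "Republic_of_Georgia": 18,
    "Georgia_(United_States)": 5,
}
scores = commonness(georgia_counts)
ranked = sorted(scores, key=scores.get, reverse=True)
for target in ranked:
    print(f"{target}: {scores[target]:.3f}")
```

Under these assumptions the first target receives 166 of the 189 links.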
Some statistics (all Wikidumps from July 2011)
Page Type        English      Italian     Polish
Redirected       4,465,652    323,591     134,148
List_of          138,581      836         5,021
Disambiguation   176,721      6,193       4,553
Relevant         4,361,020    917,354     920,486
Total            11,459,639   1,654,258   1,200,313
Surface forms titles articles
Dictionary       English      Italian     Polish
Titles           4,361,020    917,354     920,486
Surface forms    8,829,624    2,484,045   2,482,104
Files            745,724      72,126      n/a
Links            10,871,741   2,917,235   2,937,981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
• surface form: the occurrence of a mention inside an article
• target article: the target Wikipedia article that a surface form is linked to
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
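The steps above can be sketched as a small Python function. Several details are assumptions made for illustration: the input dictionaries, the reading of "top 10" as the ten best candidate articles per term, and all the numbers in the toy data, which are invented.

```python
def unsupervised_wikify(terms, keyphraseness, commonness, threshold=0.01, top_k=10):
    """Keep terms above the keyphraseness threshold and rank their candidates.

    keyphraseness: term -> score
    commonness:    term -> {candidate article -> commonness score}
    """
    result = {}
    for term in terms:
        if keyphraseness.get(term, 0.0) <= threshold:
            continue  # not likely enough to be a link anchor
        candidates = commonness.get(term, {})
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        result[term] = ranked[:top_k]
    return result


# Hypothetical toy statistics, invented for illustration only.
keyphraseness = {"Georgia": 0.35, "the": 0.0001}
commonness = {
    "Georgia": {"Candidate_A": 0.6, "Candidate_B": 0.3, "Candidate_C": 0.1},
    "the": {"The": 1.0},
}
links = unsupervised_wikify(["Georgia", "the"], keyphraseness, commonness, top_k=2)
print(links)
```

No training data is needed here: both statistics come directly from counting links in the dump.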
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
\[
\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}
\]
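A direct Python transcription of this formula, as a sketch. It assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles. Returning infinity when the two sets share no links is a choice made here, not something stated on the slide, and the case where the denominator is zero is not handled.

```python
import math


def relatedness(links_to_a1, links_to_a2, num_articles):
    """Link-based measure from the formula above, computed on two link sets."""
    a1, a2 = set(links_to_a1), set(links_to_a2)
    shared = a1 & a2
    if not shared:
        return math.inf  # assumption: log(0) is treated as "unrelated"
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(shared))
    denominator = math.log(num_articles) - math.log(min(len(a1), len(a2)))
    return numerator / denominator


# Toy example: article ids linking to each of two pages, in a wiki of 100 articles.
value = relatedness({1, 2, 3, 4}, {3, 4, 5}, num_articles=100)
same = relatedness({1, 2, 3}, {1, 2, 3}, num_articles=100)
print(value, same)
```

With the formula in this form, two articles with identical incoming links get the value 0, so smaller values indicate more closely related articles.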
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
• BAL data sets
– 1,049-query set
  • 1 annotator, up to 3 manual annotations
  • 1 automatic annotation
– 100-query set
  • 3 annotators, each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36%)
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
– As a source of features for normal NER
– To automatically create training materials (DISTANT LEARNING)
– To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
– Positive examples: a mention paired with the Wikipedia page it links to
– Negative examples: the same mention paired with pages that similar mentions link to
• Use positive and negative examples to train model
The Wikipedia Dump from July 2011
ndash 11459639 pages in totalndash 12525583 links
bull specifying surface word target frequency
ndash ranked by frequency bull for example the mention Georgia is linked to
ndash University_of_Georgia 166 times ndash Republic_of_Georgia 18 timesndash Georgia_(United_States) 5 times
May 2012 29Truc-Vien T Nguyen
Some statistics (all Wikidumps from July 2011)
Page Type English Italian Polish
Redirected 4465652 323591 134148
List_of 138581 836 5021
Disambiguation 176721 6193 4553
Relevant 4361020 917354 920486
Total 11459639 1654258 1200313
Surface forms titles articles
Dictionary English Italian Polish
Titles 4361020 917354 920486
Surface forms 8829624 2484045 2482104
Files 745724 72126 na
Links 10871741 2917235 2937981
Files in Polish are arranged in a repository different from EnglishItalian
Some definitions and figuressurface form the occurence of a mention inside an articletarget article the target Wiki article a surface form linked to
The Unsupervised Approach
bull Use Keyphraseness to identify candidate termsbull Retain terms whose keyphraseness is
above a certain threshold (currently 001)bull Use commonness to rank
bull Retain top 10
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
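As a rough illustration of what such a query might look like, the sketch below writes "all films by Quentin Tarantino" in SPARQL and encodes it as a GET URL for the endpoint. The property dbo:director, the resource name and the format parameter are assumptions made for the example and should be checked against the live dataset. No request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director and dbr:Quentin_Tarantino are illustrative.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
""".strip()


def build_request_url(endpoint, query,
                      result_format="application/sparql-results+json"):
    """Encode a SPARQL query as an HTTP GET URL (nothing is sent)."""
    # "query" is the standard SPARQL protocol parameter; "format" is an
    # assumed, server-specific way of asking for JSON results.
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url(ENDPOINT, QUERY))
```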
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
IsA (IsA "apple" "fruit")
PartOf (PartOf "CPU" "computer")
PropertyOf (PropertyOf "coffee" "wet")
MadeOf (MadeOf "bread" "flour")
DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
SubeventOf (SubeventOf "play sport" "score goal")
FirstSubeventOf (FirstSubeventOf "start fire" "light match")
LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
EffectOf (EffectOf "view video" "entertainment")
DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf (DesireOf "person" "not be depressed")
MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
UsedFor (UsedFor "fireplace" "burn wood")
CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
ThematicKLine (ThematicKLine "wedding dress" "veil")
ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
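Each ConceptNet assertion is a relation name followed by two concepts, which may be multi-word. A minimal Python sketch of reading such assertions follows. It assumes the two concepts are quoted so that multi-word concepts stay intact; the function names are hypothetical and are not part of any ConceptNet toolkit.

```python
import shlex


def parse_assertion(text):
    """Parse '(Relation "concept one" "concept two")' into a 3-tuple."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("assertion must be wrapped in parentheses")
    # shlex keeps each quoted multi-word concept together as one token.
    parts = shlex.split(inner[1:-1])
    if len(parts) != 3:
        raise ValueError("expected a relation and two concepts")
    relation, left, right = parts
    return relation, left, right


def index_by_relation(assertions):
    """Group (left, right) concept pairs under their relation name."""
    index = {}
    for relation, left, right in assertions:
        index.setdefault(relation, []).append((left, right))
    return index


if __name__ == "__main__":
    raw = ['(IsA "apple" "fruit")', '(CapableOf "dentist" "pull tooth")']
    print(index_by_relation(parse_assertion(r) for r in raw))
```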
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
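The matching rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the game's actual implementation: the normalization (lower-casing, whitespace collapsing) and the points_per_match value are invented for the example.

```python
def normalize(label):
    """Lower-case and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


def agreed_labels(guesses_a, guesses_b):
    """Return labels typed by both players, in player A's typing order."""
    seen_b = {normalize(g) for g in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = normalize(guess)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed


def round_score(guesses_a, guesses_b, points_per_match=100):
    """Award points only if at least one string matches (hypothetical value)."""
    return points_per_match if agreed_labels(guesses_a, guesses_b) else 0


if __name__ == "__main__":
    print(agreed_labels(["Dog", "grass", "ball"], ["park", "dog ", "Ball"]))
```

A label that two independent players both produce for the same image is likely to describe it well, which is why agreement is the event that gets rewarded.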
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES: THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems
Some statistics (all Wikidumps from July 2011)
Page Type         English     Italian    Polish
Redirected        4465652     323591     134148
List_of           138581      836        5021
Disambiguation    176721      6193       4553
Relevant          4361020     917354     920486
Total             11459639    1654258    1200313
Surface forms titles articles
Dictionary       English     Italian    Polish
Titles           4361020     917354     920486
Surface forms    8829624     2484045    2482104
Files            745724      72126      n/a
Links            10871741    2917235    2937981
Files in Polish are arranged in a repository different from English/Italian
Some definitions and figures
surface form: the occurrence of a mention inside an article
target article: the target Wiki article a surface form is linked to
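One way to collect surface forms and their target articles is to scan the wiki markup for links of the form [[Target]] or [[Target|surface form]]. The Python sketch below is an assumption about how such a dictionary could be built, not the procedure used for the figures in the table; the regular expression ignores nested brackets, and the sample sentence is made up for illustration.

```python
import re
from collections import Counter, defaultdict

# [[Target]] or [[Target|surface form]]; nested brackets are not handled.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def surface_form_counts(wikitext):
    """Map each surface form to a Counter of the target articles it links to."""
    counts = defaultdict(Counter)
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface form is the target title itself.
        surface = (match.group(2) or target).strip()
        counts[surface][target.replace(" ", "_")] += 1
    return counts


if __name__ == "__main__":
    sample = ("[[Giotto di Bondone|Giotto]] was called to work in [[Padua]]. "
              "[[Giotto Bizzarrini|Giotto]] designed cars.")
    for surface, targets in surface_form_counts(sample).items():
        print(surface, dict(targets))
```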
The Unsupervised Approach
• Use Keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
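The four steps can be sketched as follows. The definitions used here are assumptions: keyphraseness is taken as the share of a term's occurrences that appear as links, commonness as the share of a surface form's links that point to a given article, and "top 10" is read as the ten best-ranked candidate articles per term. The counts are toy values invented for the example.

```python
def keyphraseness(times_linked, times_seen):
    """Share of a term's occurrences that are links (0.0 if never seen)."""
    return times_linked / times_seen if times_seen else 0.0


def commonness(target_counts):
    """Share of a surface form's links that point to each target article."""
    total = sum(target_counts.values())
    return {t: c / total for t, c in target_counts.items()} if total else {}


def wikify_unsupervised(stats, threshold=0.01, top_n=10):
    """Keep terms above the threshold; rank their targets by commonness."""
    result = {}
    for term, s in stats.items():
        if keyphraseness(s["linked"], s["seen"]) <= threshold:
            continue  # term is rarely used as a link, so drop it
        ranked = sorted(commonness(s["targets"]).items(),
                        key=lambda item: item[1], reverse=True)
        result[term] = ranked[:top_n]
    return result


# Toy counts, invented for illustration only.
STATS = {
    "jaguar": {"linked": 30, "seen": 100,
               "targets": {"Jaguar": 20, "Jaguar_Cars": 10}},
    "the": {"linked": 1, "seen": 100000, "targets": {"The": 1}},
}

if __name__ == "__main__":
    print(wikify_unsupervised(STATS))
```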
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten’s NORMALIZED LINK DISTANCE)
\[
\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}
\]
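A direct Python transcription of the formula is given below as a sketch. It assumes, following Milne & Witten, that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the total number of Wikipedia articles; the slide itself does not define these symbols. As written, the value is 0 when the two link sets coincide and grows as they share fewer links, so it behaves like a distance. Returning infinity when no links are shared is a convention chosen for the example.

```python
import math


def relatedness(links_a1, links_a2, wikipedia_size):
    """Evaluate the link-based formula on two sets of linking articles."""
    a1, a2 = set(links_a1), set(links_a2)
    shared = a1 & a2
    if not shared:
        return math.inf  # assumed convention: no shared links at all
    numerator = math.log(max(len(a1), len(a2))) - math.log(len(shared))
    denominator = math.log(wikipedia_size) - math.log(min(len(a1), len(a2)))
    if denominator == 0:
        return 0.0  # degenerate case: every article links to both
    return numerator / denominator


if __name__ == "__main__":
    print(relatedness({1, 2, 3, 4}, {3, 4}, 16))
```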
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT    WIKIPEDIA
Our approach           85.66      84.37
Milne & Witten 2008    83.61      80.31
Ratinov et al. 2011    84.52      90.20
• BAL data sets
  – 1049 Query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  – 100 Query set
    • 3 annotators, each up to 3 manual annotations
Wikifying queries: the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
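The rows of this table report how often the correct article is found among the first k ranked candidates. A minimal sketch of that measurement in Python follows; the function name and the toy queries are hypothetical, and the numbers it prints are unrelated to the Bridgeman results.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Percentage of queries whose gold article is among the first k candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for query, answer in gold.items()
               if answer in ranked_candidates.get(query, [])[:k])
    return 100.0 * hits / len(gold)


# Toy data, invented for illustration only.
RANKED = {"q1": ["A", "B"], "q2": ["C", "D"], "q3": ["E"], "q4": []}
GOLD = {"q1": "A", "q2": "D", "q3": "X", "q4": "Y"}

if __name__ == "__main__":
    for k in (1, 2):
        print(k, accuracy_at_k(RANKED, GOLD, k))
```

By construction the figure can only stay the same or rise as k grows, which matches the pattern in the table.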
Results for the GALATEAS languages and Arabic
LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina’s demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
• Positive examples from mention-to-link Wikipedia page
• Negative examples from similar mentions with other links
• Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
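The Giotto dataset can be generated mechanically: the article that the sentence actually links to gives the positive example, and every other candidate for the same surface form gives a negative one. The Python sketch below illustrates this under the assumption that the candidate list is already known; make_examples is a hypothetical name.

```python
def make_examples(context, linked_article, candidates):
    """Label the linked article +1 and every other candidate -1."""
    examples = [(+1, context, linked_article)]
    examples += [(-1, context, c) for c in candidates if c != linked_article]
    return examples


if __name__ == "__main__":
    rows = make_examples(
        "Giotto was called to work",
        "Giotto_di_Bondone",
        ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
    )
    for label, context, article in rows:
        print(f"{label:+d} {context} -- {article}")
```

Because the labels come from links that Wikipedia editors already wrote, no manual annotation is needed to obtain the training set.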
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
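Because an infobox is a list of "key = value" fields, its attributes can be pulled out with very little code. The Python sketch below is a simplified illustration that assumes one field per line, each starting with "|"; real infobox markup is messier (multi-line values, nested templates), and this is not the extraction framework DBpedia uses.

```python
def parse_infobox(wikitext):
    """Collect 'key = value' pairs from the '|'-prefixed lines of an infobox."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name and the closing braces
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


# A shortened sample in the layout assumed above.
SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| magnum_opus = The Raven
}}"""

if __name__ == "__main__":
    for key, value in parse_infobox(SAMPLE).items():
        print(key, "->", value)
```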
bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012
READINGS
bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67
bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems
The Unsupervised Approach
• Use keyphraseness to identify candidate terms
• Retain terms whose keyphraseness is above a certain threshold (currently 0.01)
• Use commonness to rank
• Retain top 10
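The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the two lookup tables (`keyphraseness`, `commonness`) are hypothetical and assumed to be precomputed from a Wikipedia dump, and "retain top 10" is read here as keeping the ten most common candidate pages for each surviving term.

```python
def unsupervised_candidates(terms, keyphraseness, commonness,
                            threshold=0.01, top_n=10):
    """Return {term: [(wikipedia_title, commonness), ...]} for one document.

    keyphraseness: dict term -> score (hypothetical, precomputed)
    commonness:    dict term -> {wikipedia_title: score} (hypothetical, precomputed)
    """
    result = {}
    for term in terms:
        # Keep only terms whose keyphraseness is above the threshold.
        if keyphraseness.get(term, 0.0) <= threshold:
            continue
        # Rank the candidate pages of the term by commonness, best first.
        senses = commonness.get(term, {})
        ranked = sorted(senses.items(), key=lambda kv: kv[1], reverse=True)
        if ranked:
            # Retain at most top_n candidates per term.
            result[term] = ranked[:top_n]
    return result
```

Terms missing from either table are simply skipped, so the function never proposes a link it has no statistics for.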
The Supervised Approach
• Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page
• RELATEDNESS: a measure of similarity between the LINKS (cf. Milne & Witten's NORMALIZED LINK DISTANCE)
$$
\mathrm{Relatedness}(a_1, a_2) = \frac{\log(\max(|A_1|, |A_2|)) - \log(|A_1 \cap A_2|)}{\log(|W|) - \log(\min(|A_1|, |A_2|))}
$$
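A minimal sketch of the link-based measure, assuming (as in Milne & Witten's definition) that A1 and A2 are the sets of articles linking to a1 and a2 and that W is the set of all Wikipedia articles. The function name and the choice to return `None` for undefined cases are illustrative assumptions, not part of the lecture.

```python
import math


def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Evaluate the formula above from two sets of incoming links.

    Returns None when a logarithm would be undefined (an empty set or no
    shared links) or when the denominator is zero.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not a or not b or not shared:
        return None
    numerator = math.log(max(len(a), len(b))) - math.log(len(shared))
    denominator = math.log(total_articles) - math.log(min(len(a), len(b)))
    if denominator == 0:
        return None
    return numerator / denominator
```

As written, the formula gives 0 for two pages with identical link sets and grows as the overlap shrinks, so smaller values indicate more closely related pages.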
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as a source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  - 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
First 3                75.42
First 4                77.18
First 5                78.32
Accuracy up by 17 points (36)
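The rows of this table can be read as accuracy at rank k: the share of queries whose correct Wikipedia page appears among the first k candidates. A small sketch of that computation follows; the function name and input layout are assumptions made for illustration.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Percentage of items whose gold page is among the first k candidates.

    ranked_candidates: one ranked list of page titles per query (best first)
    gold:              the correct page title for each query
    """
    if not gold or len(ranked_candidates) != len(gold):
        raise ValueError("need one non-empty candidate list per gold answer")
    hits = sum(1 for candidates, answer in zip(ranked_candidates, gold)
               if answer in candidates[:k])
    return 100.0 * hits / len(gold)
```

By construction the value can only stay equal or rise as k grows, which matches the monotone pattern in the table.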
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal ...
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: each mention paired with the Wikipedia page it links to
  - Negative examples: the same mention paired with pages that similar mentions link to
• Use positive and negative examples to train the model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
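The construction of the dataset above can be sketched as follows. The function name and the tuple layout are illustrative assumptions; the candidate list is assumed to come from all pages that the surface form "Giotto" links to somewhere in Wikipedia.

```python
def make_training_examples(context, linked_title, candidate_titles):
    """Build (label, context, title) triples from one linked mention.

    The page the mention actually links to yields the positive example (+1);
    every other candidate page for the same surface form yields a negative (-1).
    """
    examples = []
    for title in candidate_titles:
        label = 1 if title == linked_title else -1
        examples.append((label, context, title))
    return examples


examples = make_training_examples(
    "Giotto was called to work",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
```

Because the labels come from links that Wikipedia editors already wrote, no manual annotation is needed to produce the training set.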
May 2012, Truc-Vien T Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
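To show why infoboxes count as semi-structured data, here is a rough sketch of pulling field/value pairs out of infobox wikitext. It assumes the usual `{{Template | field = value | ...}}` layout; the function name is hypothetical, and a real extractor (such as the one behind DBpedia) handles many more cases.

```python
def parse_infobox(wikitext):
    """Return (template_name, {field: raw_value}) for one infobox."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]
    # Split on '|' only at nesting depth 0, so that pipes inside
    # [[link|label]] or nested {{templates}} stay within their value.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return parts[0].strip(), fields
```

The values are kept as raw wikitext; turning `[[Boston, Massachusetts]]` into a link to another resource is a separate step.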
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name        "Calgary"
  dbpedia:altitude           "1048"
  dbpedia:population_city    "988193"
  dbpedia:population_metro   "1079310"
  mayor_name                 dbpedia:Dave_Bronconnier
  governing_body             dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
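The RDF representation boils down to (subject, predicate, object) triples. The sketch below stores the Calgary facts from the slide as plain Python tuples and answers simple patterns over them; the `match` helper is a hypothetical stand-in for what an RDF store does, not DBpedia's own code.

```python
CALGARY = "http://dbpedia.org/resource/Calgary"

# Triples transcribed from the slide; predicate names are kept as shown there.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]
```

Leaving a position as `None` plays the role of a query variable, which is the basic idea behind a SPARQL triple pattern.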
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
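As an illustration, the "all films by Quentin Tarantino" question might be written as the SPARQL query below and sent to the endpoint over HTTP. This is a sketch under assumptions: the `dbo:director` property and the `dbr:` resource prefix reflect the current DBpedia vocabulary rather than anything shown in the lecture, and the helper names are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Assumed vocabulary: dbo:director links a film to its director.
FILMS_BY_TARANTINO = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as the standard 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the list of result bindings (needs network)."""
    request = Request(
        build_request_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(FILMS_BY_TARANTINO)` would return one binding per film; each binding maps the variable name `film` to the resource URI.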
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
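Each assertion is a binary relation between two short phrases, so a collection of them forms a labeled graph. The sketch below parses assertions written in the parenthesized notation, assuming each argument is enclosed in double quotes, and indexes them by relation. The function names are illustrative and this is not the ConceptNet toolkit's API.

```python
import shlex
from collections import defaultdict


def parse_assertion(text):
    """Turn '(IsA "apple" "fruit")' into ('IsA', 'apple', 'fruit')."""
    tokens = shlex.split(text.strip().strip("()"))
    if len(tokens) != 3:
        raise ValueError(f"expected a relation and two arguments: {text!r}")
    relation, left, right = tokens
    return relation, left, right


def index_by_relation(assertions):
    """Group (left, right) pairs under their relation name."""
    index = defaultdict(list)
    for text in assertions:
        relation, left, right = parse_assertion(text)
        index[relation].append((left, right))
    return dict(index)
```

With this index, looking up everything stated with a given relation such as `IsA` is a single dictionary access.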
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
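The matching rule in the last bullet can be sketched as below. The normalization (lower-casing and collapsing whitespace) and the fixed `points_per_match` value are assumptions added for the example; the slides do not specify how strings are compared or how many points a match earns.

```python
def normalize(label):
    """Lower-case a typed string and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def agreed_labels(guesses_a, guesses_b):
    """Labels typed by both players, in the order player A typed them."""
    seen_b = {normalize(guess) for guess in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = normalize(guess)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed


def score_round(guesses_a, guesses_b, points_per_match=100):
    """Return (points, agreed labels); points are awarded if any label matches."""
    agreed = agreed_labels(guesses_a, guesses_b)
    return (points_per_match if agreed else 0), agreed
```

Because the players cannot communicate, a label that both of them type independently is likely to describe the image, which is what makes the agreed strings usable as image labels.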
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable, if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
The Supervised Approach
bull Features in addition to commonness use measures of SIMILARITY between text containing the term and the candidate Wikipedia page
bull RELATEDNESS a measure of similarity between the LINKS (cfr MilneampWittenrsquos NORMALIZED LINK DISTANCE)
euro
Re latedness(a1a2) =log(max( A1 A2 )) minus log( A1 cap A2 ))
log(W ) minus log(min( A1 A2 ))
Training a supervised wikifier
bull Using WIKIPEDIA ITSELF as source of training materials (see next)
Results on standard datasets
APPROACH AQUAINT WIKIPEDIA
Our approach 8566 8437
MilneampWitten 2008 8361 8031
Ratinov et al 2011 8452 9020
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
wwwphrasedetectivesorg
bull 2 tasks
ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric
ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user
wwwphrasedetectivescom
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
bull Mihalcea R and Csomai A Wikify linking documents to encyclopedic knowledge Proceedings of CIKMrsquo07 Lisbon Portugal
bull V Nguyen amp M Poesio 2012 Entity disambiguation and linking over queries using Encyclopedic Knowledge Proceedings of 6th workshop on Analytics for Noisy Unstructured Text Data
bull D Lungley M Trevisan V Nguyen M Althobaiti M Poesio 2013 GALATEAS D2W A Multi-lingual Disambiguation to Wikipedia Web Service Proc Of ENRICH
bull V Nastaseamp M Strube Transforming Wikipedia into a large scale multilingual concept network Artificial Intelligence 2012
READINGS
bull L von Ahn and L Dabbish (2008) Designing games with a purpose Communications of the ACM v 51 n8 58-67
bull Poesio Chamberlain Kruschwitz Robaldo amp Ducceschi 2013 Phrase Detectives Utilizing Collective Intelligence for Internet-Scale Language Resource Creation ACM Transactions on Intelligent Interactive Systems
Training a supervised wikifier
• Using WIKIPEDIA ITSELF as a source of training materials (see next)
Results on standard datasets
APPROACH               AQUAINT   WIKIPEDIA
Our approach           85.66     84.37
Milne & Witten 2008    83.61     80.31
Ratinov et al. 2011    84.52     90.20
Wikifying queries: the Bridgeman datasets
• BAL data sets
  - 1049-query set
    • 1 annotator, up to 3 manual annotations
    • 1 automatic annotation
  - 100-query set
    • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS   RESULTS
First candidate        64.77
Among first 2          71.59
Among first 3          75.42
Among first 4          77.18
Among first 5          78.32
Accuracy up by 17 points (36%)
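The table reports how often the correct Wikipedia candidate is found among the first k ranked candidates. A minimal Python sketch of that measure follows, assuming each query comes with a ranked candidate list and one gold title. The function name `accuracy_at_k` and the toy queries are hypothetical and are not taken from the Bridgeman data.

```python
def accuracy_at_k(ranked_candidates, gold_titles, k):
    """Percentage of queries whose gold title is among the first k candidates."""
    if not gold_titles or len(ranked_candidates) != len(gold_titles):
        raise ValueError("need a non-empty list with one gold title per query")
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_titles)
        if gold in candidates[:k]
    )
    return 100.0 * hits / len(gold_titles)


# Toy data (hypothetical): ranked candidate titles for three queries.
ranked = [
    ["Giotto_di_Bondone", "Giotto_Bizzarrini"],
    ["Padua_(other_sense)", "Padua"],
    ["Rimini_(sense_1)", "Rimini_(sense_2)", "Rimini"],
]
gold = ["Giotto_di_Bondone", "Padua", "Rimini"]

# Accuracy can only grow (or stay equal) as k increases.
scores = {k: accuracy_at_k(ranked, gold, k) for k in (1, 2, 3)}
```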
Results for the GALATEAS languages and Arabic
LANGUAGE   WIKIPEDIA SIZE   RESULTS (on Wikipedia subset)
English    4M articles      84.37
Italian    1M               79.64
French     1.4M             76-77
German     1.6M             72-73
Dutch      1.6M             70-71
Polish     900K             60.81
Arabic     200K             80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina's demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  - As a source of features for normal NER
  - To automatically create training materials (DISTANT LEARNING)
  - To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
  - Positive examples: a mention paired with the Wikipedia page it links to
  - Negative examples: the same or similar mentions paired with other pages they can link to
• Use the positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  - Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  - Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work … -- Giotto_di_Bondone
  - -1 Giotto was called to work … -- Giotto_Griffiths
  - -1 Giotto was called to work … -- Giotto_Bizzarrini
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
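The Giotto dataset above can be generated mechanically from one linked mention. The Python sketch below is only an illustration under stated assumptions: the function name `make_training_examples` is hypothetical, and the list of candidate titles for a surface form is assumed to be available already (for instance, collected from all links that use that surface form).

```python
def make_training_examples(context, linked_title, candidate_titles):
    """Turn one linked mention into (label, context, title) examples.

    The title the Wikipedia link points to yields the positive example (+1);
    every other candidate title for the same surface form yields a negative (-1).
    """
    examples = []
    for title in candidate_titles:
        label = 1 if title == linked_title else -1
        examples.append((label, context, title))
    return examples


# Candidate titles for the surface form "Giotto", as listed on the slide.
candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_training_examples(
    context="Giotto was called to work in Padua, and also in Rimini",
    linked_title="Giotto_di_Bondone",
    candidate_titles=candidates,
)

for label, context, title in examples:
    print(f"{label:+d} {context} -- {title}")
```

Each tuple can then be turned into features (context words, commonness of the title, and so on) for whatever classifier is being trained.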
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
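Because infoboxes are semi-structured, their attribute-value pairs can be pulled out with very little machinery. The Python sketch below works on a shortened copy of the Poe infobox and assumes one `| field = value` pair per line. The names `parse_infobox`, `FIELD` and `LINK` are hypothetical, and real infobox extraction has to cope with nested templates and multi-line values that this sketch ignores.

```python
import re

# Shortened copy of the infobox shown on the slide (one field per line assumed).
INFOBOX = """{{Infobox Writer
| name        = Edgar Allan Poe
| birth_date  = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""

# A field line looks like "| key = value".
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")
# A wiki link is [[Target]] or [[Target|label]]; group 1 is the visible text.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_infobox(wikitext):
    """Return {field: value} for every line of the form '| field = value'."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match:
            key, value = match.groups()
            # Replace each wiki link by its visible text; templates are left as-is.
            fields[key] = LINK.sub(r"\1", value).strip()
    return fields


poe = parse_infobox(INFOBOX)
```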
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you can find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name: dbpedia:Dave_Bronconnier
  governing_body: dbpedia:Calgary_City_Council
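The Calgary example can be read as a set of subject-predicate-object triples. The Python sketch below stores them as plain tuples and indexes them by subject. It is an illustration only: the namespace prefixes `RESOURCE` and `PROPERTY` and the helper `index_by_subject` are assumptions made for this sketch, not the DBpedia extraction code.

```python
from collections import defaultdict

# Assumed namespace prefixes for resources and infobox properties.
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

CALGARY = RESOURCE + "Calgary"

# (subject, predicate, object) triples from the Calgary infobox.
# Literal objects are plain strings; resource objects are full URIs.
triples = [
    (CALGARY, PROPERTY + "native_name", "Calgary"),
    (CALGARY, PROPERTY + "altitude", "1048"),
    (CALGARY, PROPERTY + "population_city", "988193"),
    (CALGARY, PROPERTY + "population_metro", "1079310"),
    (CALGARY, PROPERTY + "mayor_name", RESOURCE + "Dave_Bronconnier"),
    (CALGARY, PROPERTY + "governing_body", RESOURCE + "Calgary_City_Council"),
]


def index_by_subject(triples):
    """Group triples as {subject: {predicate: [objects]}}."""
    index = defaultdict(dict)
    for subject, predicate, obj in triples:
        index[subject].setdefault(predicate, []).append(obj)
    return index


graph = index_by_subject(triples)
mayor = graph[CALGARY][PROPERTY + "mayor_name"]
```

Objects that are themselves resource URIs (the mayor, the city council) are what link one infobox to another and turn the extracted data into a graph.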
SPARQL
• SPARQL is a query language for RDF
  - RDF is a directed, labeled graph data format for representing information in the Web
  - This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
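One of the example questions, "all films by Quentin Tarantino", might be written as a SPARQL query along the following lines. The Python sketch only builds the request URL and does not contact the endpoint. The vocabulary used in the query (`dbo:Film`, `dbo:director`, `dbr:Quentin_Tarantino`) is an assumption about how DBpedia models films and should be checked against the current dataset; `build_request_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film, dbo:director and dbr:Quentin_Tarantino.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET URL using the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(QUERY)
```

The resulting URL can be fetched with any HTTP client; the result format is negotiated through the `Accept` header.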
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA (IsA "apple" "fruit"), PartOf (PartOf "CPU" "computer"), PropertyOf (PropertyOf "coffee" "wet"), MadeOf (MadeOf "bread" "flour"), DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope"), SubeventOf (SubeventOf "play sport" "score goal"), FirstSubeventOf (FirstSubeventOf "start fire" "light match"), LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf (EffectOf "view video" "entertainment"), DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf (DesireOf "person" "not be depressed"), MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor (UsedFor "fireplace" "burn wood"), CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine (SuperThematicKLine "western civilization" "civilization"), ThematicKLine (ThematicKLine "wedding dress" "veil"), ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
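Assertions of the form (Relation concept1 concept2) map naturally onto tuples. The Python sketch below loads a handful of the example assertions listed for ConceptNet and looks up what a concept is related to. The function names `build_index` and `related` are hypothetical, and this is not the ConceptNet toolkit's own API.

```python
from collections import defaultdict

# A few of the example assertions, as (relation, concept1, concept2).
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("UsedFor", "fireplace", "burn wood"),
    ("CapableOf", "dentist", "pull tooth"),
    ("EffectOf", "view video", "entertainment"),
]


def build_index(assertions):
    """Index assertions by their first concept for quick lookup."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
    return index


def related(index, concept, relation=None):
    """Concepts linked from `concept`, optionally restricted to one relation type."""
    return [
        right
        for rel, right in index.get(concept, [])
        if relation is None or rel == relation
    ]


index = build_index(ASSERTIONS)
apple_is_a = related(index, "apple", "IsA")
```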
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP: THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
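The matching rule can be sketched in a few lines of Python. This is an illustration only: the slides do not say how strings are compared or how many points a match earns, so the normalisation step (lower-casing and collapsing spaces) and the names `normalise` and `agreed_labels` are assumptions.

```python
def normalise(label):
    """Lower-case a guess and collapse whitespace (an assumed matching policy)."""
    return " ".join(label.lower().split())


def agreed_labels(guesses_a, guesses_b):
    """Labels typed by both players, in the order player A typed them."""
    seen_b = {normalise(guess) for guess in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = normalise(guess)
        if label in seen_b and label not in agreed:
            agreed.append(label)
    return agreed


# Both players see the same image and type guesses independently.
player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "ball ", "park"]
matches = agreed_labels(player_a, player_b)
```

A string that two players who cannot communicate both type is likely to be a good description of the image, which is why an agreed label can be kept as an image label.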
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
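The second evaluation compares game labels with labels given by experiment participants. A figure of that kind could be computed for one image as in the Python sketch below. The function name `overlap_percentage` and the toy labels are hypothetical, and the exact protocol used in the original study may differ.

```python
def overlap_percentage(game_labels, participant_labels):
    """Percentage of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        raise ValueError("at least one game label is required")
    shared = game & set(participant_labels)
    return 100.0 * len(shared) / len(game)


# Toy labels for a single image (hypothetical data).
from_game = ["dog", "ball", "grass", "park"]
from_participants = ["dog", "ball", "grass", "tree"]
overlap = overlap_percentage(from_game, from_participants)
```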
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
bull BAL Data setsndash 1049 Query set
bull 1 annotator up to 3 manual annotationsbull 1 automatic annotation
ndash 100 Query setbull 3 annotators each up to 3 manual annotations
Wikifying queries the Bridgeman datasets
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS RESULTS
First candidate 6477
Among first 2 7159
First 3 7542
First 4 7718
First 5 7832
Accuracy up by 17 points (36)
Results for the GALATEAS languages and Arabic
LANGUAGE WIKIPEDIA SIZE RESULTS (on Wikipedia subset)
English 4M articles 8437
Italian 1M 7964
French 14M 76-77
German 16M 72-73
Dutch 16M 70-71
Polish 900K 6081
Arabic 200K 8078
The GALATEAS D2W web services
bull Available as open sourcebull Deployed within LinguaGridbull API based on the Morphosyntactic Annotation
Framework (MAF) an ISO standardbull Tested on 15M queries achieves throughput
of 600 characters per secondbull Integrated with LangLog tool
Use of the service in LangLog
(See Domoinarsquos demo)
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
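The RDF representation on this slide can be sketched as a list of (subject, predicate, object) triples. This is only an in-memory illustration using the values shown for Calgary; the constant names and the describe helper are hypothetical, and a real application would more likely use an RDF library than plain tuples.

```python
from collections import defaultdict

CALGARY = "http://dbpedia.org/resource/Calgary"

# Triples copied from the slide; predicate names are kept as they appear there.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def describe(triples, subject):
    """Group all (predicate, object) pairs of one subject by predicate."""
    props = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            props[p].append(o)
    return dict(props)


calgary = describe(TRIPLES, CALGARY)
print(calgary["dbpedia:altitude"])
print(calgary["mayor_name"])
```

Note that some objects are literals (the altitude) while others are themselves resources (the mayor), which is what turns the extracted infobox data into a graph rather than a flat record.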
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
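As a rough illustration, the "all films by Quentin Tarantino" question might be phrased as the SPARQL query below and turned into a GET request URL for the endpoint. The dbo: and dbr: prefixes, the dbo:director property, and the format parameter are assumptions about the current DBpedia vocabulary and the Virtuoso server, not something stated on the slides; the sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint, query,
                      result_format="application/sparql-results+json"):
    """Encode a SPARQL query as a GET request URL for the given endpoint."""
    params = {"query": query.strip(), "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client; the other example questions differ only in the triple patterns inside the WHERE clause.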
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
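A small sketch of how assertions of this kind could be loaded into a program, assuming each assertion is written as a parenthesised relation name followed by two double-quoted arguments. The function names parse_assertion and index_by_relation are hypothetical, and this is not the actual ConceptNet file format or API.

```python
import shlex
from collections import defaultdict


def parse_assertion(text):
    """Parse '(Relation "arg1" "arg2")' into a (relation, arg1, arg2) tuple."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("assertion must be wrapped in parentheses: %r" % text)
    # shlex keeps quoted multi-word arguments such as "flesh of animal" together.
    relation, arg1, arg2 = shlex.split(inner[1:-1])
    return relation, arg1, arg2


def index_by_relation(assertions):
    """Group (arg1, arg2) pairs under their relation name."""
    index = defaultdict(list)
    for relation, arg1, arg2 in assertions:
        index[relation].append((arg1, arg2))
    return dict(index)


RAW = [
    '(IsA "apple" "fruit")',
    '(DefinedAs "meat" "flesh of animal")',
    '(CapableOf "dentist" "pull tooth")',
    '(UsedFor "fireplace" "burn wood")',
]

index = index_by_relation(parse_assertion(r) for r in RAW)
print(index["IsA"])
```

Indexing by relation makes it cheap to ask, for example, for everything a given concept is UsedFor, which is the typical access pattern for a semantic network.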
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
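The matching rule can be sketched as follows. This is a simplified illustration, not the actual game implementation: the function names are hypothetical, and the lower-casing and whitespace normalisation are assumptions about what counts as "the same string".

```python
def normalize(label):
    """Lower-case a guess and collapse whitespace (an assumed notion of equality)."""
    return " ".join(label.lower().split())


def find_agreement(guesses_a, guesses_b):
    """Return the first guess of player A that player B also typed, else None."""
    typed_by_b = {normalize(g) for g in guesses_b}
    for guess in guesses_a:
        if normalize(guess) in typed_by_b:
            return normalize(guess)
    return None


label = find_agreement(["Dog", "grass", "ball"], ["puppy", "ball "])
print(label)
```

Because neither player can see the other's guesses, a label both of them typed independently is likely to describe the image, which is what makes the agreed string usable as an image label.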
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  – 13,630 players
  – 1.2 million labels for 293,760 images
  – 80% of players played more than once
• By 2008
  – 200,000 players
  – 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  – choose 10 labels among those produced, and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  – 15 participants, 20 images among the 1000 with more than 5 labels
  – 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  – 15 participants, 20 images
  – 85% of words rated useful
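The second evaluation amounts to measuring, per image, what share of the game labels was also produced by the experiment participants. A minimal sketch of that computation is given below; the function name and the toy labels are made up for illustration and are not data from the study.

```python
def overlap_percentage(game_labels, participant_labels):
    """Percentage of distinct game labels that participants also produced."""
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    participants = {label.lower() for label in participant_labels}
    return 100.0 * len(game & participants) / len(game)


# Hypothetical labels for one image.
score = overlap_percentage(
    ["dog", "ball", "grass", "sky"],
    ["Dog", "ball", "grass", "tree"],
)
print(score)
```

Averaging this figure over the sampled images would give a single number comparable to the one reported on the slide.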
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM ’07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube, 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish, 2008. Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
Results on Bridgeman 1000 Y3
CORRECT CANDIDATE IS    RESULTS
First candidate         64.77
Among first 2           71.59
First 3                 75.42
First 4                 77.18
First 5                 78.32
Accuracy up by 17 points (36%)
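The rows of this table are accuracy-at-k figures: the percentage of queries whose correct Wikipedia page appears among the first k candidates returned by the disambiguator. A minimal sketch of that metric follows; the function name and the toy rankings are hypothetical and are not taken from the Bridgeman data.

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Percentage of items whose gold answer is among the top-k candidates."""
    if not gold:
        raise ValueError("no gold annotations given")
    hits = sum(
        1 for candidates, answer in zip(ranked_candidates, gold)
        if answer in candidates[:k]
    )
    return 100.0 * hits / len(gold)


# Toy example: four queries, each with a ranked list of candidate pages.
RANKED = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"], ["J", "K", "L"]]
GOLD = ["A", "E", "I", "Z"]

for k in (1, 2, 3):
    print(k, accuracy_at_k(RANKED, GOLD, k))
```

By construction the figure can only grow with k, which matches the monotone increase from the first candidate to the first five in the table.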
Results for the GALATEAS languages and Arabic
LANGUAGE    WIKIPEDIA SIZE    RESULTS (on Wikipedia subset)
English     4M articles       84.37
Italian     1M                79.64
French      1.4M              76-77
German      1.6M              72-73
Dutch       1.6M              70-71
Polish      900K              60.81
Arabic      200K              80.78
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries, achieves throughput of 600 characters per second
• Integrated with LangLog tool
Use of the service in LangLog
(See Domoina’s demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
  – As a source of features for normal NER
  – To automatically create training materials (DISTANT LEARNING)
  – To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
• Positive examples: a mention paired with the Wikipedia page it links to
• Negative examples: similar mentions paired with other links
• Use the positive and negative examples to train a model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
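The dataset construction in the Giotto example can be sketched in a few lines: the page the mention actually links to yields a positive example, and every other candidate page for the same surface form yields a negative one. The function name make_examples and the tuple layout are assumptions made for illustration; the slides do not specify how the candidate list is obtained.

```python
def make_examples(context, linked_page, candidate_pages):
    """Return (label, context, page) tuples: +1 for the linked page, -1 otherwise."""
    examples = []
    for page in candidate_pages:
        label = 1 if page == linked_page else -1
        examples.append((label, context, page))
    return examples


CANDIDATES = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]

dataset = make_examples(
    context="Giotto was called to work",
    linked_page="Giotto_di_Bondone",
    candidate_pages=CANDIDATES,
)
for label, context, page in dataset:
    print("%+d %s -- %s" % (label, context, page))
```

Since Wikipedia links supply the gold page for free, running this over every linked mention in a dump produces training data without any manual annotation.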
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
The GALATEAS D2W web services
• Available as open source
• Deployed within LinguaGrid
• API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard
• Tested on 15M queries; achieves throughput of 600 characters per second
• Integrated with the LangLog tool
Use of the service in LangLog
(See Domoina’s demo)
Other applications
• The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154)
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
– As a source of features for normal NER
– To automatically create training materials (DISTANT LEARNING)
– To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
– Positive examples: a mention paired with the Wikipedia page it links to
– Negative examples: similar mentions paired with other linked pages
• Use the positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
– Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
– Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
– +1 Giotto was called to work -- Giotto_di_Bondone
– -1 Giotto was called to work -- Giotto_Griffiths
– -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
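The Giotto dataset can be generated mechanically from a linked sentence. The following is a minimal Python sketch, not the method used in the cited work: the candidate table CANDIDATES and the function make_examples are hypothetical names introduced here for illustration. It assumes we already know which Wikipedia titles a surface form can refer to, and which one the sentence actually links to.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate table: surface form -> Wikipedia titles it may refer to.
CANDIDATES: Dict[str, List[str]] = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}


def make_examples(
    context: str,
    mention: str,
    linked_title: str,
    candidates: Dict[str, List[str]],
) -> List[Tuple[int, str, str]]:
    """Return (label, context, candidate_title) triples for one linked mention.

    The title the mention really links to gets label +1;
    every other candidate for the same surface form gets label -1.
    """
    examples = []
    for title in candidates.get(mention, []):
        label = 1 if title == linked_title else -1
        examples.append((label, context, title))
    return examples


examples = make_examples(
    "Giotto was called to work", "Giotto", "Giotto_di_Bondone", CANDIDATES
)
for label, context, title in examples:
    print(f"{label:+d} {context} -- {title}")
```

Each triple would then be turned into a feature vector for whatever classifier is being trained; that step is left out because the slides do not describe the features.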
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
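Because an infobox is semi-structured, its attribute-value pairs can be pulled out with very little code. The sketch below is only an illustration in Python, under the assumption that each field sits on its own line starting with "|"; the function name parse_infobox and the shortened INFOBOX sample are introduced here and are not part of any Wikipedia tooling. Real infoboxes nest templates and span lines, so a production extractor needs a proper wikitext parser.

```python
import re
from typing import Dict

# Shortened sample in wikitext form, one field per line (an assumption of this sketch).
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| magnum_opus = The Raven
}}"""

# Matches [[Target]] and [[Target|label]]; group 1 is the text shown to the reader.
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def parse_infobox(wikitext: str) -> Dict[str, str]:
    """Return a dict of infobox fields with wiki links reduced to their visible text."""
    fields: Dict[str, str] = {}
    for line in wikitext.splitlines():
        line = line.strip()
        # Only lines of the form "| key = value" carry a field.
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = LINK.sub(r"\1", value).strip()
    return fields


for key, value in parse_infobox(INFOBOX).items():
    print(f"{key}: {value}")
```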
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
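The Calgary example can be written out as explicit subject-predicate-object triples. This is a rough Python sketch rather than DBpedia's actual extraction code: the helper infobox_to_triples is hypothetical, and the property namespace http://dbpedia.org/property/ is an assumption, since the slide only shows the abbreviated "dbpedia:" prefix. Plain values become literals; values that were wiki links become resource URIs.

```python
from typing import Dict, List, Tuple

RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"  # assumed namespace for infobox properties

Triple = Tuple[str, str, str]


def infobox_to_triples(
    subject: str, literals: Dict[str, str], links: Dict[str, str]
) -> List[Triple]:
    """Turn infobox fields into (subject, predicate, object) triples."""
    s = f"<{RESOURCE}{subject}>"
    triples: List[Triple] = []
    # Plain text values are kept as quoted literals.
    for key, value in literals.items():
        triples.append((s, f"<{PROPERTY}{key}>", f'"{value}"'))
    # Values that were wiki links point to other resources.
    for key, target in links.items():
        triples.append((s, f"<{PROPERTY}{key}>", f"<{RESOURCE}{target}>"))
    return triples


triples = infobox_to_triples(
    "Calgary",
    literals={
        "native_name": "Calgary",
        "altitude": "1048",
        "population_city": "988193",
        "population_metro": "1079310",
    },
    links={
        "mayor_name": "Dave_Bronconnier",
        "governing_body": "Calgary_City_Council",
    },
)
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```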
SPARQL
• SPARQL is a query language for RDF
– RDF is a directed, labeled graph data format for representing information in the Web
– This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
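As an illustration, one of the questions above ("All films by Quentin Tarantino") might be phrased in SPARQL as shown in the Python sketch below. The property dbo:director and the resource name dbr:Quentin_Tarantino are assumptions about how the current DBpedia dataset models this fact, and the "format" parameter is a Virtuoso convention rather than part of the SPARQL protocol. The code only builds the request URL; it does not contact the endpoint.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed modelling: films are linked to their director via dbo:director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
LIMIT 20
""".strip()


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL for the given endpoint."""
    params = {
        "query": query,  # standard SPARQL protocol parameter
        "format": "application/sparql-results+json",  # Virtuoso-specific (assumption)
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with any HTTP client should return the bindings for ?film, provided the endpoint is reachable and the dataset uses these names.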
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
– IsA: (IsA "apple" "fruit")
– PartOf: (PartOf "CPU" "computer")
– PropertyOf: (PropertyOf "coffee" "wet")
– MadeOf: (MadeOf "bread" "flour")
– DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
– PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
– SubeventOf: (SubeventOf "play sport" "score goal")
– FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
– LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
– CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
– LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
– EffectOf: (EffectOf "view video" "entertainment")
– DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
– DesireOf: (DesireOf "person" "not be depressed")
– MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
– UsedFor: (UsedFor "fireplace" "burn wood")
– CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
– SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
– ThematicKLine: (ThematicKLine "wedding dress" "veil")
– ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
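Each ConceptNet assertion is a relation name plus two concepts, so a handful of them can be held in memory as plain tuples and indexed by concept. The Python sketch below uses a few of the example assertions from the table; the names Assertion, ASSERTIONS and build_index are introduced here for illustration and are not the ConceptNet API.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Assertion = Tuple[str, str, str]  # (relation, first concept, second concept)

# A few example assertions taken from the relation table.
ASSERTIONS: List[Assertion] = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(
    assertions: List[Assertion],
) -> DefaultDict[str, List[Tuple[str, str]]]:
    """Map each first concept to the (relation, second concept) pairs it takes part in."""
    index: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


index = build_index(ASSERTIONS)
for relation, other in index["apple"]:
    print(f"apple --{relation}--> {other}")
```

Viewed this way, the assertions form a labeled directed graph: concepts are nodes and relation names are edge labels.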
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
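The matching rule can be sketched in a few lines of Python. This is a simplified illustration, not the game's real implementation: the functions normalize and agreed_label are hypothetical, guesses are compared after lower-casing and whitespace cleanup (an assumption), and the real game interleaves the two players' guesses in time, which is ignored here.

```python
from typing import List, Optional


def normalize(label: str) -> str:
    """Lower-case a guess and collapse whitespace so trivial variants still match."""
    return " ".join(label.lower().split())


def agreed_label(guesses_a: List[str], guesses_b: List[str]) -> Optional[str]:
    """Return the first guess of player A that player B also typed, or None.

    A returned label is what both partners independently agreed on,
    and is therefore taken as a description of the image.
    """
    typed_by_b = {normalize(guess) for guess in guesses_b}
    for guess in guesses_a:
        candidate = normalize(guess)
        if candidate in typed_by_b:
            return candidate
    return None


label = agreed_label(["dog", "grass", "ball"], ["puppy", "Ball ", "park"])
print(label)
```

Since the partners cannot communicate, a match is unlikely unless the string really is a plausible description of the image, which is why the agreed label is kept.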
THE TASK
SCORING BY MATCHING
Other applications
bull The UK Data Archive
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [ATampT] By a 4-0 vote it allowed ATampT to continue offering special discount packages to big customers called Tariff 12 rejecting appeals by ATampT competitors that the discounts were illegal hellip
WIKIPEDIA FOR NER
httpenwikipediaorgwikiFCC
The Federal Communications Commission (FCC) is an independent United States government agency created directed and empowered by Congressional statute (see 47 USC sect 151 and 47 USC sect 154)
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
bull Examplendash Giotto was called to work in Padua and also in Rimini (sentence taken from Wikipedia text with links avalable)ndash Giotto_di_Bondone (painter) Giotto_Griffiths (Welsh rugby player)
Giotto_Bizzarrini (automobile engineer)
bull Datasetndash +1 Giotto was called to work -- Giotto_di_Bondonendash -1 Giotto was called to work -- Giotto_Griffithsndash -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012 48Truc-Vien T Nguyen
httpenwikipediaorgwikiGiotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
IsA (IsA "apple" "fruit"), PartOf (PartOf "CPU" "computer"), PropertyOf (PropertyOf "coffee" "wet"), MadeOf (MadeOf "bread" "flour"), DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope"), SubeventOf (SubeventOf "play sport" "score goal"), FirstSubeventOf (FirstSubeventOf "start fire" "light match"), LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
EffectOf (EffectOf "view video" "entertainment"), DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf (DesireOf "person" "not be depressed"), MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
IsUsedFor (UsedFor "fireplace" "burn wood"), CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
SuperThematicKLine (SuperThematicKLine "western civilization" "civilization"), ThematicKLine (ThematicKLine "wedding dress" "veil"), ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
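Assertions of the form (Relation "concept" "concept") lend themselves to a simple in-memory index. The sketch below is an illustration only and not the ConceptNet toolkit's API: the `by_concept` index, the `^-1` suffix for inverse edges, and the `related` helper are all hypothetical names, and only a handful of the slide's example assertions are included.

```python
from collections import defaultdict

# A few of the example assertions from the table, as (relation, left, right).
assertions = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("EffectOf", "view video", "entertainment"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]

# Index every assertion under both of its concepts so lookups work in
# either direction; inverse edges are marked with a "^-1" suffix.
by_concept = defaultdict(list)
for relation, left, right in assertions:
    by_concept[left].append((relation, right))
    by_concept[right].append((relation + "^-1", left))


def related(concept):
    """Return (relation, other_concept) pairs touching the given concept."""
    return list(by_concept.get(concept, []))


apple_links = related("apple")
fruit_links = related("fruit")
```

Indexing in both directions is a design choice for this sketch: it makes "what is a fruit?" as cheap to answer as "what is an apple?".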
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches a string typed by the other player, they score points
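The matching rule in the last bullet can be sketched in a few lines. This is an assumption-laden illustration, not the game's actual implementation: the case and whitespace normalisation, the function names, and the `points_per_match` value are all hypothetical.

```python
def normalize(label):
    """Lower-case a typed label and collapse surrounding/internal whitespace."""
    return " ".join(label.lower().split())


def agreed_labels(guesses_a, guesses_b):
    """Labels typed by both players, in the order player A typed them."""
    seen_b = {normalize(g) for g in guesses_b}
    return [g for g in map(normalize, guesses_a) if g in seen_b]


def score(guesses_a, guesses_b, points_per_match=100):
    """Both players score if at least one of their strings matches."""
    return points_per_match if agreed_labels(guesses_a, guesses_b) else 0


player_a = ["Dog", "grass", "ball"]
player_b = ["puppy", "  ball "]
matches = agreed_labels(player_a, player_b)
points = score(player_a, player_b)
```

Because neither player can see the other's guesses, a label that both type independently is likely to describe the image, which is why agreement is used as the quality signal.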
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
– 13,630 players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
• By 2008
– 200,000 players
– 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
– choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1,000 with more than 5 labels
– 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
– Find The Culprit (Annotation): the user must identify the closest antecedent of a markable, if it is anaphoric
– Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen and M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti and M. Poesio. 2013. GALATEAS D2W: A multi-lingual disambiguation to Wikipedia web service. Proc. of ENRICH.
• V. Nastase and M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish. 2008. Designing games with a purpose. Communications of the ACM 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi. 2013. Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
WIKIPEDIA FOR NER
[The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …
WIKIPEDIA FOR NER
http://en.wikipedia.org/wiki/FCC
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER
Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action
WIKIPEDIA
WIKIPEDIA FOR NER
• Wikipedia has been used in NER systems
– As a source of features for normal NER
– To automatically create training materials (DISTANT LEARNING)
– To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning
• Automatically extract examples
– Positive examples: a mention paired with the Wikipedia page it links to
– Negative examples: the same mention paired with the pages that similar mentions link to
• Use positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
– Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
– Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
– +1 Giotto was called to work -- Giotto_di_Bondone
– -1 Giotto was called to work -- Giotto_Griffiths
– -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
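The Giotto dataset above can be generated mechanically from one linked mention and its candidate pages. The following is a minimal sketch under stated assumptions: the `make_examples` function and its tuple layout are hypothetical, and the candidate list is supplied by hand here, whereas a real system would collect it from the pages that the surface form "Giotto" links to across Wikipedia.

```python
def make_examples(context, linked_title, candidate_titles):
    """Label each candidate page +1 if the mention actually links to it, else -1."""
    examples = []
    for title in candidate_titles:
        label = 1 if title == linked_title else -1
        examples.append((label, context, title))
    return examples


candidates = ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"]
examples = make_examples("Giotto was called to work", "Giotto_di_Bondone", candidates)

for label, context, title in examples:
    print(f"{label:+d} {context} -- {title}")
```

No manual annotation is needed: the link that a Wikipedia editor already chose supplies the positive label, and every other candidate becomes a negative example for free.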
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
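Because an infobox is a list of `| key = value` fields, its attributes can be pulled out with very little code. The sketch below is illustrative only and is not DBpedia's extraction framework: it assumes wikitext with one field per line, uses a shortened copy of the Poe infobox, and the `parse_infobox` and `link_targets` helpers are hypothetical names.

```python
import re

# Shortened sample, assuming one "| key = value" field per line.
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""

FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")
# Captures the link target of [[Target]] or [[Target|displayed text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_infobox(wikitext):
    """Map each infobox field name to its raw wikitext value."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def link_targets(value):
    """Return the page titles linked from a field value."""
    return LINK.findall(value)


fields = parse_infobox(INFOBOX)
# One triple per field, with the article as the subject.
triples = [("Edgar_Allan_Poe", key, value) for key, value in fields.items()]
birth_links = link_targets(fields["birth_place"])
```

Values that contain wiki links point at other articles, so they can become links between resources rather than plain strings; nested templates such as the birth date would need extra handling that this sketch leaves out.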
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
WIKIPEDIA FOR NER
Numberofglucocorticoidreceptorsinlymphocytesandtheirsensitivitytohormoneaction
WIKIPEDIA
WIKIPEDIA FOR NER
bull Wikipedia has been used in NER systemsndash As a source of features for normal NER ndash To automatically create training materials
(DISTANT LEARNING)ndash To go beyond NE tagging towards proper ENTITY
DISAMBIGUATION
Distant learning
bull Automatically extract examples
bull positive examples from mention-to-link Wikipedia page
bull Negative examples from similar mentions with other links
bull Use positive and negative examples to train model
The Supervised Approach Using Wikipedia links to generate training data
• Example
  - "Giotto was called to work in Padua, and also in Rimini" (sentence taken from Wikipedia text, with links available)
  - Candidates: Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  - +1 Giotto was called to work -- Giotto_di_Bondone
  - -1 Giotto was called to work -- Giotto_Griffiths
  - -1 Giotto was called to work -- Giotto_Bizzarrini
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
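The construction of the Giotto dataset can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the lecture: the `SURFACE_FORMS` dictionary and the function name `examples_for_link` are assumptions made for the example, and a real system would build the dictionary from a Wikipedia dump and use a richer context than the raw sentence.

```python
# Hypothetical surface-form dictionary: mention string -> candidate Wikipedia titles.
# In practice this would be harvested from link anchors, titles and redirects.
SURFACE_FORMS = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}


def examples_for_link(context, mention, linked_title, surface_forms):
    """Turn one linked mention into (label, context, title) training examples.

    The page the mention actually links to gets label +1; every other page
    that the same surface form can refer to gets label -1.
    """
    candidates = list(surface_forms.get(mention, []))
    if linked_title not in candidates:
        # The observed link is always a valid candidate, even if unseen so far.
        candidates.insert(0, linked_title)
    return [
        (1 if title == linked_title else -1, context, title)
        for title in candidates
    ]


if __name__ == "__main__":
    rows = examples_for_link(
        "Giotto was called to work", "Giotto", "Giotto_di_Bondone", SURFACE_FORMS
    )
    for label, context, title in rows:
        print(f"{label:+d} {context} -- {title}")
```

Each returned row can then be turned into a feature vector and passed to any binary classifier.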
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
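To see why infoboxes count as semi-structured data, here is a rough Python sketch that reads attribute-value pairs out of infobox wikitext. It assumes one `| key = value` field per line, which real infoboxes do not guarantee; the function names `parse_infobox` and `link_targets` are made up for this example and are not part of any Wikipedia or DBpedia tool.

```python
import re

# A shortened sample in the same style as the Edgar Allan Poe infobox.
SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| magnum_opus = The Raven
}}"""


def parse_infobox(wikitext):
    """Return a dict of field name -> raw value, assuming one field per line."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name and the closing braces
        # Split on the first "=" only, so values such as "mf=y" stay intact.
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def link_targets(value):
    """Extract link targets from [[Target]] or [[Target|label]] markup."""
    return [m.split("|", 1)[0] for m in re.findall(r"\[\[(.*?)\]\]", value)]


if __name__ == "__main__":
    box = parse_infobox(SAMPLE)
    print(box["name"])
    print(link_targets(box["birth_place"]))
```

The link targets are what makes the values usable as references to other entities rather than plain strings.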
DBPEDIA
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
The DBpedia Dataset
• 1,600,000 concepts
• including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At the Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
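The Calgary example can be read as a set of subject-predicate-object triples. The Python sketch below stores them as plain tuples and matches them against a pattern with wildcards, which is the basic idea behind querying RDF. It is an illustration only: the `dbpedia:` prefix on `mayor_name` and `governing_body` is an assumption, the helper `match` is hypothetical, and a real application would use an RDF library.

```python
# Triples from the Calgary slide as (subject, predicate, object) tuples.
# Literals are kept as plain strings for simplicity.
TRIPLES = [
    ("dbpedia:Calgary", "dbpedia:native_name", "Calgary"),
    ("dbpedia:Calgary", "dbpedia:altitude", "1048"),
    ("dbpedia:Calgary", "dbpedia:population_city", "988193"),
    ("dbpedia:Calgary", "dbpedia:population_metro", "1079310"),
    ("dbpedia:Calgary", "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    ("dbpedia:Calgary", "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # "What is the altitude of Calgary?"
    for _, _, value in match(TRIPLES, s="dbpedia:Calgary", p="dbpedia:altitude"):
        print(value)
```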
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
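As an illustration of what such a query might look like, the Python sketch below builds a SPARQL query for "all films by Quentin Tarantino" and the corresponding request URL, without sending it. The `dbo:director` property, the `dbr:` resource prefix and the `format` request parameter are assumptions about the current DBpedia setup, not something stated on the slides, and they may differ between DBpedia releases.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def films_by_director_query(director_resource, limit=10):
    """Build a SPARQL SELECT query for films with the given director resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?film WHERE {\n"
        f"  ?film dbo:director dbr:{director_resource} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query):
    """Encode the query as a GET request URL for the endpoint (not sent here)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    query = films_by_director_query("Quentin_Tarantino")
    print(query)
    print(request_url(query))
```

The triple pattern inside `WHERE` plays the same role as a wildcard pattern over subject, predicate and object: `?film` is the unknown, the other two positions are fixed.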
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION: K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
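Each ConceptNet assertion is a binary relation between two short phrases. A small Python sketch, assuming the arguments are written in double quotes, shows how such assertions could be parsed and grouped by relation type. The names `Assertion`, `parse_assertion` and `index_by_relation` are invented for this example and are not part of the ConceptNet toolkit.

```python
import shlex
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    relation: str
    left: str
    right: str


def parse_assertion(text):
    """Parse '(Relation "arg one" "arg two")' into an Assertion."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError(f"not a parenthesised assertion: {text!r}")
    # shlex keeps each quoted multi-word phrase together as one token.
    parts = shlex.split(inner[1:-1])
    if len(parts) != 3:
        raise ValueError(f"expected a relation and two arguments: {text!r}")
    return Assertion(*parts)


def index_by_relation(assertions):
    """Group (left, right) argument pairs under their relation name."""
    index = defaultdict(list)
    for a in assertions:
        index[a.relation].append((a.left, a.right))
    return dict(index)


if __name__ == "__main__":
    raw = [
        '(IsA "apple" "fruit")',
        '(DefinedAs "meat" "flesh of animal")',
        '(CapableOf "dentist" "pull tooth")',
    ]
    print(index_by_relation(parse_assertion(r) for r in raw))
```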
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
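The matching rule can be sketched in Python as follows. This is a simplified reading of the rule stated above, not the actual game implementation: the player identifiers `"A"` and `"B"`, the case and whitespace normalisation, and the function name `play_round` are all assumptions made for the example, and the number of points awarded is left out.

```python
def normalize(text):
    """Lower-case a guess and collapse whitespace so trivial variants agree."""
    return " ".join(text.lower().split())


def play_round(guesses):
    """Return the first agreed label for one image, or None if none is reached.

    `guesses` is an iterable of (player, text) pairs in the order typed,
    where player is "A" or "B". A label is agreed as soon as a string typed
    by one player equals any string already typed by the other player.
    """
    seen = {"A": set(), "B": set()}
    for player, text in guesses:
        label = normalize(text)
        if not label:
            continue  # ignore empty input
        other = "B" if player == "A" else "A"
        if label in seen[other]:
            return label
        seen[player].add(label)
    return None


if __name__ == "__main__":
    stream = [("A", "dog"), ("B", "puppy"), ("B", "grass"), ("A", "Puppy ")]
    print(play_round(stream))
```

Because neither player sees the other's guesses, agreeing on a string is only likely when that string really describes the image, which is what makes the agreed label useful.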
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th, 2003 and December 10th, 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
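The second comparison amounts to an overlap ratio: the share of game labels that experiment participants also produced. A minimal Python sketch, assuming labels are compared case-insensitively and pooled for one image (the slides do not say how the comparison was normalised or aggregated); the function name `label_overlap` and the sample labels are hypothetical.

```python
def label_overlap(game_labels, participant_labels):
    """Fraction of distinct game labels that participants also produced."""
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    participants = {label.lower() for label in participant_labels}
    return len(game & participants) / len(game)


if __name__ == "__main__":
    game = ["dog", "grass", "ball", "sky"]
    people = ["Dog", "ball", "sky", "tree"]
    print(f"{label_overlap(game, people):.0%}")
```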
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
PHRASE DETECTIVES: THE TASKS
• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable, if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
Distant learning
• Automatically extract examples
• Positive examples: each mention paired with the Wikipedia page it links to
• Negative examples: similar mentions paired with other links
• Use the positive and negative examples to train a model
The Supervised Approach: Using Wikipedia links to generate training data
• Example
  – Giotto was called to work in Padua, and also in Rimini (sentence taken from Wikipedia text, with links available)
  – Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer)
• Dataset
  – +1 Giotto was called to work -- Giotto_di_Bondone
  – -1 Giotto was called to work -- Giotto_Griffiths
  – -1 Giotto was called to work -- Giotto_Bizzarrini
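The dataset construction above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's actual pipeline: the candidate table, the function name, and the tuple layout are assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate table: surface form -> Wikipedia titles it may refer to.
CANDIDATES: Dict[str, List[str]] = {
    "Giotto": ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
}


def make_examples(context: str, mention: str, linked_title: str,
                  candidates: Dict[str, List[str]]) -> List[Tuple[int, str, str]]:
    """Return (label, context, title) triples for one linked mention.

    The page the mention actually links to is the positive example (+1);
    every other candidate for the same surface form is a negative one (-1).
    """
    examples = []
    for title in candidates.get(mention, []):
        label = 1 if title == linked_title else -1
        examples.append((label, context, title))
    return examples


examples = make_examples("Giotto was called to work", "Giotto",
                         "Giotto_di_Bondone", CANDIDATES)
for label, context, title in examples:
    print(f"{label:+d} {context} -- {title}")
```

Each triple would then be turned into a feature vector before training; the choice of features and classifier is left open here.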
May 2012, Truc-Vien T. Nguyen
http://en.wikipedia.org/wiki/Giotto_di_Bondone
MORE ADVANCED USES OF WIKIPEDIA
• As a source of ONTOLOGICAL KNOWLEDGE
• DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
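Because infoboxes are semi-structured, their attribute-value pairs can be pulled out with simple string processing. The sketch below assumes the usual wikitext layout with one `| key = value` field per line; the function names are hypothetical, and real infoboxes (nested templates, multi-line values) need a proper wikitext parser.

```python
import re
from typing import Dict, List

INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""


def parse_infobox(wikitext: str) -> Dict[str, str]:
    """Collect 'key = value' fields from lines that start with '|'."""
    fields: Dict[str, str] = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def link_targets(value: str) -> List[str]:
    """Return the page titles of [[Target]] and [[Target|label]] links."""
    return re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)


fields = parse_infobox(INFOBOX)
print(fields["name"])
print(link_targets(fields["birth_place"]))
```

The extracted link targets are what make infobox values usable as entities rather than plain strings.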
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
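The RDF representation of the Calgary infobox is a set of subject-predicate-object triples. As a rough sketch, the same data can be held as Python tuples and queried by pattern matching; the `match` helper is a hypothetical stand-in for a real RDF library, and the predicate spellings follow the slide rather than any current DBpedia vocabulary.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

CALGARY = "http://dbpedia.org/resource/Calgary"

TRIPLES: List[Triple] = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


print(match(TRIPLES, p="dbpedia:altitude"))
```

A SPARQL basic graph pattern generalises this idea: each `None` corresponds to a variable, and several patterns are joined on shared variables.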
SPARQL
• SPARQL is a query language for RDF
  – RDF is a directed, labeled graph data format for representing information in the Web
  – The W3C specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
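One of the example questions, "all films by Quentin Tarantino", might be phrased as the SPARQL query below. This is a sketch under assumptions: the `dbo:Film` class, the `dbo:director` property, and the `format` request parameter reflect common DBpedia and Virtuoso usage but are not given on the slides, and the vocabulary at the time of the lecture may have differed. The code only builds the request URL and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director (not taken from the slides).
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request URL for a SPARQL endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with any HTTP client should return the variable bindings for `?film`, provided the endpoint is reachable and accepts the assumed parameters.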
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – CyCORP
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  – IsA (IsA "apple" "fruit")
  – PartOf (PartOf "CPU" "computer")
  – PropertyOf (PropertyOf "coffee" "wet")
  – MadeOf (MadeOf "bread" "flour")
  – DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  – PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
  – SubeventOf (SubeventOf "play sport" "score goal")
  – FirstSubeventOf (FirstSubeventOf "start fire" "light match")
  – LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  – CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  – LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  – EffectOf (EffectOf "view video" "entertainment")
  – DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  – DesireOf (DesireOf "person" "not be depressed")
  – MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  – IsUsedFor (UsedFor "fireplace" "burn wood")
  – CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  – SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
  – ThematicKLine (ThematicKLine "wedding dress" "veil")
  – ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
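Each ConceptNet assertion is a relation applied to two concepts, so a small sample can be stored as tuples and indexed by concept. The sketch below uses a handful of the example assertions listed above; the index layout and function name are assumptions made for illustration and do not describe ConceptNet's actual storage.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Assertion = Tuple[str, str, str]  # (relation, first concept, second concept)

ASSERTIONS: List[Assertion] = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("ConceptuallyRelatedTo", "bad breath", "mint"),
]


def build_index(assertions: List[Assertion]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each first concept to its outgoing (relation, second concept) edges."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


index = build_index(ASSERTIONS)
print(index["dentist"])
```

Viewing the assertions as labeled edges is what makes the resource a semantic network: following edges from one concept leads to related concepts.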
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  – ESP
  – Verbosity
  – TagATune
• Other games
  – Peekaboom
  – Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  – To train image search engines
  – To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  – Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
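The matching rule in the last bullet can be sketched as follows. This is a simplified illustration: the normalisation step and the function names are assumptions, and the real game has further rules that are not modelled here.

```python
from typing import Iterable, Optional


def normalize(label: str) -> str:
    """Lower-case a typed string and collapse whitespace before comparison."""
    return " ".join(label.lower().split())


def agreed_label(typed_a: Iterable[str], typed_b: Iterable[str]) -> Optional[str]:
    """Return the first string typed by player A that player B also typed.

    Returns None when the two players never agree on a string.
    """
    seen_b = {normalize(s) for s in typed_b}
    for s in typed_a:
        candidate = normalize(s)
        if candidate in seen_b:
            return candidate
    return None


print(agreed_label(["sky", "Dog", "grass"], ["puppy", "dog"]))
```

The agreed string becomes a label for the image; since neither player can see what the other typed, agreement is evidence that the label describes the image.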
THE TASK
SCORING BY MATCHING
SOME STATISTICS
MORE ADVANCED USES OF WIKIPEDIA
bull As a source of ONTOLOGICAL KNOWLEDGEbull DBPEDIA
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
bull Taxonomic information category structurebull Attributes infobox text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
bull Induce a subsumption hierarchy
INFOBOXES
bull Collaborative content
bull Semi-structured data
Infobox Writer| bgcolour = silver| name = Edgar Allan Poe| image = Edgar_Allan_Poe_2jpg| caption = This [[daguerreotype]] of Poe was taken in 1848 | birth_date = birth date|1809|1|19|mf=y| birth_place = [[Boston Massachusetts]] [[United States|US]]| death_date = death date and age|1849|10|07|1809|01|19| death_place = [[Baltimore Maryland]] [[United States|US]]| occupation = Poet short story writer editor literary critic| movement = [[Romanticism]] [[Dark romanticism]]| genre = [[Horror fiction]] [[Crime fiction]] [[Detective fiction]]| magnum_opus = The Raven| spouse = [[Virginia Eliza Clemm Poe]]
DBpediaorg is a effort to bull extract structured information from Wikipediabull make this information available on the Web under an
open licensebull interlink the DBpedia dataset with other datasets on the
Web
DBPEDIA
1048607 1600000 concepts
1048607 including
1048698 58000 persons
1048698 70000 places
1048698 35000 music albums
1048698 12000 films
1048607 described by 91 million triples
1048607 using 8141 different properties
1048607 557000 links to pictures
1048607 1300000 links external web pages
1048607 207000 Wikipedia categories
1048607 75000 YAGO categories
The DBpedia Dataset
The DBpediaorg project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web It uses the SPARQL query language to query this data At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data
REPRESENTING EXTRACTED INFORMATION
httpenwikipediaorgwikiCalgary
httpdbpediaorgresourceCalgary
dbpedianative_name Calgaryrdquo
dbpediaaltitude ldquo1048rdquo
dbpediapopulation_city ldquo988193rdquo
dbpediapopulation_metro ldquo1079310rdquo
mayor_name
dbpediaDave_Bronconnier
governing_body
dbpediaCalgary_City_Council
Extracting Infobox Data (RDF Representation)
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  – Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  – Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
www.phrasedetectives.com
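One way to see how the two tasks fit together is to combine annotations and validations into a score per candidate antecedent. The sketch below is a hypothetical aggregation scheme, not the game's documented scoring: the function name `rank_interpretations` and the +1/-1 weighting are assumptions chosen for illustration.

```python
from collections import defaultdict


def rank_interpretations(annotations, validations):
    """Pick the best-supported antecedent for each markable.

    annotations: iterable of (markable, antecedent) pairs from annotation mode.
    validations: iterable of (markable, antecedent, agrees) from validation mode.
    Assumed scoring: +1 per annotation, +1 per agreement, -1 per disagreement.
    """
    scores = defaultdict(int)
    for markable, antecedent in annotations:
        scores[(markable, antecedent)] += 1
    for markable, antecedent, agrees in validations:
        scores[(markable, antecedent)] += 1 if agrees else -1

    best = {}
    for (markable, antecedent), score in scores.items():
        if markable not in best or score > best[markable][1]:
            best[markable] = (antecedent, score)
    return best


if __name__ == "__main__":
    annotations = [("it", "the car"), ("it", "the garage")]
    validations = [
        ("it", "the car", True),
        ("it", "the car", True),
        ("it", "the garage", False),
    ]
    print(rank_interpretations(annotations, validations))
```

Keeping every interpretation with its score, rather than only the winner, would also preserve genuinely ambiguous cases for later inspection.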
PHRASE DETECTIVES: THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify!: linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
• Taxonomic information: category structure
• Attributes: infobox, text
Wikipedia category network
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Start with the category tree
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
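Because each infobox field is a `| key = value` line, attribute extraction can start from very simple pattern matching. The following is a minimal sketch under that assumption; it uses a shortened copy of the Poe infobox, and the names `parse_infobox`, `FIELD` and `LINK` are invented for this example. Real infoboxes contain nested templates and multi-line values that this sketch does not handle.

```python
import re

# A shortened copy of the infobox, one field per line (assumed layout).
INFOBOX = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""

# "| key = value" on a single line.
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.*)$")
# [[Target]] or [[Target|display text]]; captures the link target only.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def parse_infobox(wikitext):
    """Map each field name to its raw text and the wiki links it contains."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match:
            key, value = match.group(1), match.group(2).strip()
            fields[key] = {"text": value, "links": LINK.findall(value)}
    return fields


if __name__ == "__main__":
    for key, value in parse_infobox(INFOBOX).items():
        print(key, value)
```

The link targets are the useful part for building a semantic network: each one names another Wikipedia article, so a field such as `movement` directly yields attribute-value pairs between concepts.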
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
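The RDF representation amounts to a set of (subject, predicate, object) triples attached to the resource URI. As a rough illustration, the Calgary example can be held in plain Python tuples and queried by pattern; the helper `match` is hypothetical, and giving `mayor_name` and `governing_body` the same `dbpedia:` prefix as the other properties is an assumption.

```python
SUBJECT = "http://dbpedia.org/resource/Calgary"

# Triples taken from the slide; literals are kept as strings.
TRIPLES = [
    (SUBJECT, "dbpedia:native_name", "Calgary"),
    (SUBJECT, "dbpedia:altitude", "1048"),
    (SUBJECT, "dbpedia:population_city", "988193"),
    (SUBJECT, "dbpedia:population_metro", "1079310"),
    (SUBJECT, "dbpedia:mayor_name", "dbpedia:Dave_Bronconnier"),
    (SUBJECT, "dbpedia:governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    print(match(TRIPLES, p="dbpedia:altitude"))
    print(len(match(TRIPLES, s=SUBJECT)))
```

Objects such as `dbpedia:Dave_Bronconnier` are themselves resources, so following them leads to further triples; that is what turns the extracted infobox data into a graph rather than a flat record.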
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow
  – All films by Quentin Tarantino
  – All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
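As an illustration, the question "all films by Quentin Tarantino" might be written as the query below and sent to the endpoint as an HTTP GET request. This is a sketch, not a tested recipe: the property `dbo:director`, the class `dbo:Film` and the `format` parameter are assumptions about the current DBpedia ontology and the Virtuoso endpoint, and the function `build_request_url` is invented for the example. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
LIMIT 50
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {
        "query": query.strip(),
        # Assumed Virtuoso-style parameter asking for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url(QUERY))
```

The resulting URL could then be fetched with any HTTP client; each binding of `?film` in the response would be the URI of one film resource.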
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  – Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  – Open Mind Commonsense (Singh) (collecting facts)
  – Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  IsUsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
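Each assertion is a typed edge between two concepts, so the collection as a whole forms a labelled graph. A minimal sketch of that view follows, using a handful of the example assertions; the helper names `build_index` and `neighbours` and the `^-1` notation for inverse edges are assumptions for illustration and are not part of ConceptNet itself.

```python
from collections import defaultdict

# (relation, first concept, second concept), copied from the examples.
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("SubeventOf", "play sport", "score goal"),
    ("EffectOf", "view video", "entertainment"),
]


def build_index(assertions):
    """Index assertions by concept so edges can be followed both ways."""
    index = defaultdict(list)
    for relation, left, right in assertions:
        index[left].append((relation, right))
        # Store the inverse edge too, marked with ^-1 (assumed notation).
        index[right].append((relation + "^-1", left))
    return index


def neighbours(index, concept):
    """Return the sorted (relation, concept) edges touching a concept."""
    return sorted(index.get(concept, []))


if __name__ == "__main__":
    index = build_index(ASSERTIONS)
    print(neighbours(index, "apple"))
    print(neighbours(index, "computer"))
```

Indexing in both directions is what lets a system move from "fruit" back to "apple", which matters for tasks such as finding related concepts around a word in a text.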
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
Deriving a taxonomy from Wikipedia (AAAI 2007)
• Induce a subsumption hierarchy
INFOBOXES
• Collaborative content
• Semi-structured data
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in 1848.
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]], [[United States|US]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
}}
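Because an infobox is a list of `| key = value` fields, a few lines of code are enough to turn it into attribute-value pairs. The following is a minimal sketch, assuming one field per line and flat values; the helper names `parse_infobox` and `link_targets` are invented for this example, and real infoboxes (nested templates, multi-line values) would need a proper wikitext parser.

```python
import re


def parse_infobox(wikitext):
    """Collect '| key = value' fields of an infobox into a dict."""
    fields = {}
    # Only lines that start with '|' are treated as fields, so pipes inside
    # [[link|label]] later on the same line are left alone.
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def link_targets(value):
    """Return the targets of [[target]] or [[target|label]] links."""
    return re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)


sample = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_place = [[Boston, Massachusetts]], [[United States|US]]
| movement = [[Romanticism]], [[Dark romanticism]]
| magnum_opus = The Raven
}}"""

info = parse_infobox(sample)
print(info["name"])
print(link_targets(info["birth_place"]))
```

The link targets are what make infoboxes useful for knowledge extraction: each one names another Wikipedia article, so a field value can be read as a relation between two articles.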
DBpedia.org is an effort to
• extract structured information from Wikipedia
• make this information available on the Web under an open license
• interlink the DBpedia dataset with other datasets on the Web
DBPEDIA
• 1,600,000 concepts
• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
The DBpedia Dataset
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. At Developers Guide to Semantic Web Toolkits you find a development toolkit in your preferred programming language to process DBpedia data.
REPRESENTING EXTRACTED INFORMATION
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
dbpedia:native_name "Calgary"
dbpedia:altitude "1048"
dbpedia:population_city "988193"
dbpedia:population_metro "1079310"
mayor_name dbpedia:Dave_Bronconnier
governing_body dbpedia:Calgary_City_Council
Extracting Infobox Data (RDF Representation)
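The Calgary example can be written out as subject-predicate-object triples. The sketch below is an illustration only: the namespace constants, the helper names `infobox_to_triples` and `to_ntriples`, and the rule for deciding which fields point to other resources are assumptions of this example, not a description of the DBpedia extraction code.

```python
RESOURCE = "http://dbpedia.org/resource/"   # assumed namespace for entities
PROPERTY = "http://dbpedia.org/property/"   # assumed namespace for infobox keys


def infobox_to_triples(subject, fields, links=()):
    """Turn infobox fields into (subject, predicate, object) triples.

    Keys listed in `links` are treated as references to other resources;
    every other value becomes a plain literal.
    """
    s = RESOURCE + subject
    triples = []
    for key, value in fields.items():
        p = PROPERTY + key
        if key in links:
            o = RESOURCE + value.replace(" ", "_")
        else:
            o = f'"{value}"'
        triples.append((s, p, o))
    return triples


def to_ntriples(triples):
    """Serialise in an N-Triples-like form: URIs in <>, literals as given."""
    def term(t):
        return t if t.startswith('"') else f"<{t}>"
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


calgary = {
    "native_name": "Calgary",
    "altitude": "1048",
    "population_city": "988193",
    "mayor_name": "Dave Bronconnier",
    "governing_body": "Calgary City Council",
}
triples = infobox_to_triples(
    "Calgary", calgary, links={"mayor_name", "governing_body"}
)
print(to_ntriples(triples))
```

Literal-valued fields stay as strings, while link-valued fields become edges to other nodes, which is what gives the extracted data its graph shape.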
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information in the Web
• This specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
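As an illustration, the "all films by Quentin Tarantino" question could be phrased roughly as follows. This is a sketch under assumptions: the property `dbo:director` and the resource `dbr:Quentin_Tarantino` are taken to be the relevant DBpedia identifiers, and the vocabulary in use at the time of the lecture may have differed. The code only builds the request URL (a query passed as the `query` parameter of a GET request) and does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:director links a film to its director
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film dbo:director dbr:Quentin_Tarantino .
}
LIMIT 50
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with an `Accept: application/sparql-results+json` header would normally return the bindings of `?film` as JSON, assuming the endpoint is reachable and the identifiers above exist.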
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
www.phrasedetectives.com
• Open Mind Common Sense – Singh
• Crater mapping (results) – Kanefsky
• Learner, Learner2, 1001 Paraphrases – Chklovski
• FACTory – Cycorp
• Hot or Not – 8 Days
• ESP, Phetch, Verbosity, Peekaboom – von Ahn
• Galaxy Zoo – Oxford University
WEB COLLABORATION PROJECTS
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
IsA (IsA "apple" "fruit")
PartOf (PartOf "CPU" "computer")
PropertyOf (PropertyOf "coffee" "wet")
MadeOf (MadeOf "bread" "flour")
DefinedAs (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
PrerequisiteEventOf (PrerequisiteEventOf "read letter" "open envelope")
SubeventOf (SubeventOf "play sport" "score goal")
FirstSubeventOf (FirstSubeventOf "start fire" "light match")
LastSubeventOf (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
CapableOf (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
LocationOf (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
EffectOf (EffectOf "view video" "entertainment")
DesirousEffectOf (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
DesireOf (DesireOf "person" "not be depressed")
MotivationOf (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
UsedFor (UsedFor "fireplace" "burn wood")
CapableOfReceivingAction (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
SuperThematicKLine (SuperThematicKLine "western civilization" "civilization")
ThematicKLine (ThematicKLine "wedding dress" "veil")
ConceptuallyRelatedTo (ConceptuallyRelatedTo "bad breath" "mint")
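Assertions of this form are naturally stored as a directed labelled graph, with concepts as nodes and relation types as edge labels. The sketch below uses a handful of the example assertions listed above; the class `SemanticNetwork` and its method names are invented for illustration and are not the ConceptNet API.

```python
from collections import defaultdict

# A few of the example assertions above: (relation, concept1, concept2)
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]


class SemanticNetwork:
    """Directed labelled graph: concepts are nodes, relations label edges."""

    def __init__(self, assertions):
        self.outgoing = defaultdict(list)
        for relation, source, target in assertions:
            self.outgoing[source].append((relation, target))

    def related(self, concept, relation=None):
        """Targets reachable from `concept`, optionally filtered by relation."""
        return [
            target
            for rel, target in self.outgoing.get(concept, [])
            if relation is None or rel == relation
        ]


net = SemanticNetwork(ASSERTIONS)
print(net.related("apple", "IsA"))
print(net.related("dentist"))
```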
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
– ESP
– Verbosity
– TagATune
• Other games
– Peekaboom
– Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
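The matching rule can be sketched in a few lines: a round ends as soon as one player types a string the other has already typed. This is a simplified illustration, assuming the players type in strict alternation and that strings are compared after lower-casing; the function name `play_round` is invented here, and the real game's timing and scoring are not modelled.

```python
from itertools import zip_longest


def play_round(guesses_a, guesses_b):
    """Return the first label typed by both players, or None if none match."""
    seen_a, seen_b = set(), set()
    # Interleave the two guess streams to imitate players typing in turn
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = a.strip().lower()
            seen_a.add(a)
            if a in seen_b:
                return a
        if b is not None:
            b = b.strip().lower()
            seen_b.add(b)
            if b in seen_a:
                return b
    return None


label = play_round(["dog", "grass", "ball"], ["puppy", "ball", "dog"])
print(label)
```

The agreed string is what becomes a label for the image: since the partners cannot communicate, a match is only likely when the string really describes what both of them see.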
THE TASK
SCORING BY MATCHING
SOME STATISTICS
The DBpedia Dataset
• 1,600,000 concepts, including
  - 58,000 persons
  - 70,000 places
  - 35,000 music albums
  - 12,000 films
• described by 91 million triples
• using 8,141 different properties
• 557,000 links to pictures
• 1,300,000 links to external web pages
• 207,000 Wikipedia categories
• 75,000 YAGO categories
REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.
Extracting Infobox Data (RDF Representation)
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary"
  dbpedia:altitude "1048"
  dbpedia:population_city "988193"
  dbpedia:population_metro "1079310"
  mayor_name dbpedia:Dave_Bronconnier
  governing_body dbpedia:Calgary_City_Council
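The Calgary infobox data can be read as subject-property-value triples. A minimal Python sketch of that reading, assuming a plain list of tuples stands in for an RDF store; the `match` helper is hypothetical and not part of any DBpedia toolkit, and the values are the ones listed for Calgary:

```python
# Triples copied from the Calgary example. A real RDF store would use full
# URIs for properties as well; the short names here follow the slide.
CALGARY = "http://dbpedia.org/resource/Calgary"
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "dbpedia:population_metro", "1079310"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(triples, subject=None, prop=None, value=None):
    """Return the triples that agree with every non-None field (None is a wildcard)."""
    return [
        (s, p, v)
        for (s, p, v) in triples
        if (subject is None or s == subject)
        and (prop is None or p == prop)
        and (value is None or v == value)
    ]


if __name__ == "__main__":
    # Look up the altitude recorded for Calgary.
    for _, _, v in match(TRIPLES, subject=CALGARY, prop="dbpedia:altitude"):
        print(v)
```

Pattern matching with wildcards is also the basic idea behind a SPARQL triple pattern, which is why the tuple view is a convenient mental model.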
SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information on the Web
• The W3C SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, …
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
www.phrasedetectives.com
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - Cycorp
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
CONCEPT NET
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
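Each ConceptNet assertion in the table pairs a relation type with two short natural-language arguments. A minimal Python sketch of how such assertions could be stored and looked up by concept; the tuple layout and the `index_by_concept` helper are assumptions made for illustration, not ConceptNet's own data format or API:

```python
from collections import defaultdict

# A few assertions from the table, written as (relation, first argument, second argument).
ASSERTIONS = [
    ("IsA", "apple", "fruit"),
    ("PartOf", "CPU", "computer"),
    ("MadeOf", "bread", "flour"),
    ("CapableOf", "dentist", "pull tooth"),
    ("UsedFor", "fireplace", "burn wood"),
    ("EffectOf", "view video", "entertainment"),
]


def index_by_concept(assertions):
    """Map each concept to the (relation, other argument) pairs where it comes first."""
    index = defaultdict(list)
    for relation, first, second in assertions:
        index[first].append((relation, second))
    return index


if __name__ == "__main__":
    index = index_by_concept(ASSERTIONS)
    # What does the network say about dentists?
    print(index["dentist"])
```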
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can’t communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
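The matching rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the game's actual implementation: the event format, the `first_agreement` function, and the case-insensitive comparison in `normalize` are all hypothetical choices.

```python
def normalize(label):
    """Assumed normalization: ignore case and surrounding whitespace."""
    return label.strip().lower()


def first_agreement(events):
    """Return the first label typed by both players, or None if they never agree.

    `events` is a time-ordered list of (player, label) pairs, where player is
    "A" or "B". Neither player sees what the other has typed.
    """
    typed = {"A": set(), "B": set()}
    for player, label in events:
        other = "B" if player == "A" else "A"
        label = normalize(label)
        if label in typed[other]:
            return label  # agreement: this becomes a label for the image
        typed[player].add(label)
    return None


if __name__ == "__main__":
    round_events = [("A", "dog"), ("B", "grass"), ("A", "park"), ("B", "Dog")]
    print(first_agreement(round_events))
```

Because a label is accepted only when two players who cannot communicate produce it independently, agreement itself serves as the quality check.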
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels (‘would you use these labels to describe this image?’)
  - 15 participants, 20 images
  - 85% of words rated useful
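A figure such as "game labels also produced by participants" is a simple set overlap. The sketch below shows one way it could be computed for a single image; the slides do not say exactly how the reported percentage was calculated, so the function `share_also_produced` and the sample labels are assumptions made up for illustration.

```python
def share_also_produced(game_labels, participant_labels):
    """Fraction of distinct game labels that participants also produced for the image."""
    game = set(game_labels)
    if not game:
        return 0.0
    return len(game & set(participant_labels)) / len(game)


if __name__ == "__main__":
    # Made-up labels for one image, for illustration only.
    game = ["dog", "grass", "ball", "park"]
    participants = ["dog", "ball", "park", "puppy", "green"]
    print(share_also_produced(game, participants))
```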
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): User must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): User must agree/disagree with a coreference relation entered by another user
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM’07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
SPARQL
bull SPARQL is a query language for RDF
bullRDF is a directed labeled graph data format for representing information in the Web bullThis specification defines the syntax and semantics of the SPARQL query language for RDF
bull SPARQL can be used to express queries across diverse data sources whether the data is stored natively as RDF or viewed as RDF via middleware
1048607 httpdbpediaorgsparql
1048607 hosted on a OpenLink Virtuoso server
1048607 can answer SPARQL queries like
1048698 Give me all Sitcoms that are set in NYC
1048698 All tennis players from Moscow
1048698 All films by Quentin Tarentino
1048698 All German musicians that were born in Berlin in the 19th century
The DBpedia SPARQL Endpoint
bull Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing effortsndash Other initiatives Citizen Science Cognition and
Language Laboratory hellipbull This has been taken advantage of in AI
ndash Open Mind Commonsense (Singh) (collecting facts)
ndash Semantic Wikis
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
wwwphrasedetectivescom
bull Open Mind Common Sense ndash Singh
bull Crater mapping (results) ndash Kanefsky
bull Learner Learner2 1001 Paraphrases ndash Chklovski
bull FACTory ndash CyCORP
bull Hot or Not ndash 8 Days
bull ESP Phetch Verbosity Peekaboom ndash von Ahn
bull Galaxy Zoo ndash Oxford University
WEB COLLABORATION PROJECTS
wwwphrasedetectivescom
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh 2004)
THINGS (52000 assertions)
IsA (IsA apple fruit) Part of (PartOf CPU computer) PropertyOf (PropertyOf coffee wet) MadeOf (MadeOf bread flour) DefinedAs (DefinedAs meat flesh of animal)
EVENTS (38000 assertions)
PrerequisiteeventOf (PrerequisiteEventOf read letter open envelope) SubeventOf (SubeventOf play sport score goal) FirstSubeventOF (FirstSubeventOf start fire light match) LastSubeventOf (LastSubeventOf attend classical concert applaud)
AGENTS (104000 assertions)
CapableOf (CapableOf dentist pull tooth)
SPATIAL (36000 assertions)
LocationOf (LocationOf army in war)
TEMPORAL time amp sequence
CAUSAL (17000 assertions)
EffectOf (EffectOf view video entertainment) DesirousEffectOf (DesirousEffectOf sweat take shower)
AFFECTIONAL (mood feeling emotions) (34000 assertions)
DesireOf (DesireOf person not be depressed) MotivationOf (MotivationOf play game compete)
FUNCTIONAL (115000 assertions)
IsUsedFor (UsedFor fireplace burn wood) CapableOfReceivingAction (CapableOfReceivingAction drink serve)
ASSOCIATION K-LINES (125 million assertions)
SuperThematicKLine (SuperThematicKLine western civilization civilization) ThematicKLine (ThematicKLine wedding dress veil) ConceptuallyRelatedTo (ConceptuallyRelatedTo bad breath mint)
CONCEPT NET
GAMES WITH A PURPOSE
bull Luis von Ahn pioneered a new approach to resource creation on the Web GAMES WITH A PURPOSE or GWAP in which people as a side effect of playing perform tasks lsquocomputers are unable to performrsquo (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
bull GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
bull The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
bull Games at wwwgwapcomndash ESPndash Verbosityndash TagATune
bull Other gamesndash Peekaboomndash Phetch
ESP
bull The first GWAP developed by von Ahn and their group (2003 2004)
bull The problem obtain accurate description of images to be usedndash To train image search enginesndash To develop machine learning approaches to vision
bull The goal label the majority of the images on the Web
ESP the game
ESP THE GAMEbull Two partners are picked at random from the
large number of players onlinebull They are not told who their partner is and canrsquot
communicate with thembull They are both shown the same imagebull The goal guess how their partner will describe
the image and type that descriptionndash Hence the ESP game
bull If any of the strings typed by one player matches the string typed by the other player they score points
THE TASK
SCORING BY MATCHING
SOME STATISTICS
bull In the 4 months between August 9th 2003 and December 10th 2003ndash 13630 playersndash 12 million labels for 293760 imagesndash 80 of players played more than once
bull By 2008 ndash 200000 playersndash 50 million labels
QUALITY OF THE LABELSbull For IMAGE SEARCH
ndash choose 10 labels among those produced and look at which images are returned
bull Compare labels produced by players with labels produced by participants in an experimentndash 15 participants 20 images among the 1000 with more
than 5 labelsndash 83 of game labels also produced by participants
bull Manual assessment of labels (lsquowould you use these labels to describe this imagersquo)ndash 15 participants 20 imagesndash 85 of words rated useful
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
wwwphrasedetectivesorg
bull 2 tasks
ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric
ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user
wwwphrasedetectivescom
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio, 2012. Entity disambiguation and linking over queries using Encyclopedic Knowledge. Proceedings of 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 2012.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
The DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• hosted on an OpenLink Virtuoso server
• can answer SPARQL queries like
  - Give me all sitcoms that are set in NYC
  - All tennis players from Moscow
  - All films by Quentin Tarantino
  - All German musicians that were born in Berlin in the 19th century
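As an illustration, one of the example questions ("all films by Quentin Tarantino") could be phrased roughly as follows. This is a sketch under assumptions: the property and resource names (dbo:Film, dbo:director, dbr:Quentin_Tarantino) follow the DBpedia ontology as commonly used, and the `format` parameter is a Virtuoso convention rather than part of the SPARQL protocol itself. The code only builds the request URL; it does not contact the endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed vocabulary: dbo:Film and dbo:director from the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Quentin_Tarantino .
}
LIMIT 50
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {
        "query": query,
        # Assumption: Virtuoso accepts this parameter to request JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
print(url)
```

Fetching `url` with any HTTP client should return the result bindings, provided the endpoint is reachable and the vocabulary above still matches the current DBpedia release.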
WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
• Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
  - Other initiatives: Citizen Science, Cognition and Language Laboratory, ...
• This has been taken advantage of in AI
  - Open Mind Commonsense (Singh) (collecting facts)
  - Semantic Wikis
WEB COLLABORATION PROJECTS
• Open Mind Common Sense - Singh
• Crater mapping (results) - Kanefsky
• Learner, Learner2, 1001 Paraphrases - Chklovski
• FACTory - CyCORP
• Hot or Not - 8 Days
• ESP, Phetch, Verbosity, Peekaboom - von Ahn
• Galaxy Zoo - Oxford University
OPEN MIND COMMONSENSE
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
  - IsA: (IsA "apple" "fruit")
  - PartOf: (PartOf "CPU" "computer")
  - PropertyOf: (PropertyOf "coffee" "wet")
  - MadeOf: (MadeOf "bread" "flour")
  - DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  - PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  - SubeventOf: (SubeventOf "play sport" "score goal")
  - FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  - LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  - CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  - LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  - EffectOf: (EffectOf "view video" "entertainment")
  - DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf: (DesireOf "person" "not be depressed")
  - MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  - UsedFor: (UsedFor "fireplace" "burn wood")
  - CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  - ThematicKLine: (ThematicKLine "wedding dress" "veil")
  - ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
CONCEPT NET
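Each ConceptNet assertion is a relation applied to two concepts, so a handful of the examples above can be held as plain triples and indexed by concept. The sketch below is illustrative only: the `Assertion` type and `index_by_concept` helper are hypothetical names, not part of any ConceptNet toolkit.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """A binary ConceptNet-style assertion: (relation left right)."""
    relation: str
    left: str
    right: str


# A few of the example assertions from the relation table.
ASSERTIONS = [
    Assertion("IsA", "apple", "fruit"),
    Assertion("PartOf", "CPU", "computer"),
    Assertion("MadeOf", "bread", "flour"),
    Assertion("CapableOf", "dentist", "pull tooth"),
    Assertion("UsedFor", "fireplace", "burn wood"),
]


def index_by_concept(assertions):
    """Map every concept to the assertions it takes part in (either argument)."""
    index = defaultdict(list)
    for assertion in assertions:
        index[assertion.left].append(assertion)
        index[assertion.right].append(assertion)
    return index


index = index_by_concept(ASSERTIONS)
for relation, left, right in index["bread"]:
    print(f"({relation} {left!r} {right!r})")
```

Indexing both arguments makes each concept a node and each assertion a labelled edge, which is the semantic-network view of the data.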
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism or financial incentives to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and their group (2003, 2004)
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
  - IsA: (IsA "apple" "fruit")
  - PartOf: (PartOf "CPU" "computer")
  - PropertyOf: (PropertyOf "coffee" "wet")
  - MadeOf: (MadeOf "bread" "flour")
  - DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
  - PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  - SubeventOf: (SubeventOf "play sport" "score goal")
  - FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  - LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
  - CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
  - LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
  - EffectOf: (EffectOf "view video" "entertainment")
  - DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  - DesireOf: (DesireOf "person" "not be depressed")
  - MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
  - UsedFor: (UsedFor "fireplace" "burn wood")
  - CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
  - SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  - ThematicKLine: (ThematicKLine "wedding dress" "veil")
  - ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
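Each assertion above is a relation name followed by two concepts, so a small collection of them can be held as typed triples and looked up by concept. The following is a minimal sketch of that idea, not ConceptNet's own API: the names `Assertion`, `index_by_concept` and `related` are hypothetical, and the way multi-word arguments are split (for example "view video" / "entertainment") is an assumption about the slide's examples.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Assertion(NamedTuple):
    """One binary assertion: (relation left right). Hypothetical helper type."""
    relation: str
    left: str
    right: str


# A handful of the example assertions from the slide.
ASSERTIONS = [
    Assertion("IsA", "apple", "fruit"),
    Assertion("PartOf", "CPU", "computer"),
    Assertion("MadeOf", "bread", "flour"),
    Assertion("CapableOf", "dentist", "pull tooth"),
    Assertion("UsedFor", "fireplace", "burn wood"),
    Assertion("EffectOf", "view video", "entertainment"),
]


def index_by_concept(assertions):
    """Map every concept to the assertions it takes part in (either side)."""
    index = defaultdict(list)
    for a in assertions:
        index[a.left].append(a)
        index[a.right].append(a)
    return index


def related(index, concept: str, relation: Optional[str] = None):
    """Return assertions mentioning `concept`, optionally filtered by relation."""
    return [
        a for a in index.get(concept, [])
        if relation is None or a.relation == relation
    ]


INDEX = index_by_concept(ASSERTIONS)
for a in related(INDEX, "computer", "PartOf"):
    print(a.relation, a.left, a.right)
```

Indexing on both arguments means a lookup finds a concept whether it is the first or the second argument of a relation.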
CONCEPT NET
GAMES WITH A PURPOSE
• Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks 'computers are unable to perform' (sic)
GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
• GWAP do not rely on altruism (as Open Mind Commonsense does) or on financial incentives (as Mechanical Turk does) to entice people to perform certain actions
• The key property of games is that PEOPLE WANT TO PLAY THEM
EXAMPLES OF GWAP
• Games at www.gwap.com
  - ESP
  - Verbosity
  - TagATune
• Other games
  - Peekaboom
  - Phetch
ESP
• The first GWAP, developed by von Ahn and his group (2003, 2004)
• The problem: obtain accurate descriptions of images, to be used
  - To train image search engines
  - To develop machine learning approaches to vision
• The goal: label the majority of the images on the Web
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is, and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence, the ESP game
• If any of the strings typed by one player matches the string typed by the other player, they score points
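The matching rule in the last bullet can be sketched as a small function that reads the two players' guesses in the order they were typed and stops at the first string both have entered. This is an illustrative sketch only: the function names, the "A"/"B" player tags, and the case and whitespace normalisation are assumptions, not details given on the slide.

```python
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace (an assumed normalisation step)."""
    return " ".join(text.lower().split())


def agreed_label(events):
    """Return the first label typed by both players, or None if they never agree.

    `events` is an iterable of (player, text) pairs in typing order,
    where player is "A" or "B".
    """
    seen = {"A": set(), "B": set()}
    for player, text in events:
        label = normalise(text)
        other = "B" if player == "A" else "A"
        if label in seen[other]:
            # The partner already typed this string: the pair agrees on it.
            return label
        seen[player].add(label)
    return None


round_events = [
    ("A", "dog"),
    ("B", "puppy"),
    ("A", "Grass"),
    ("B", " grass "),
]
print(agreed_label(round_events))
```

Because neither player sees the other's guesses, a string only counts once both have typed it independently, which is what makes the agreed string a plausible label for the image.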
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th, 2003 and December 10th, 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced, and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
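The second comparison above is an overlap measure: the share of labels from the game that experiment participants also produced for the same image. The sketch below shows that arithmetic on made-up labels. The function name and the data are hypothetical, and the original study's exact protocol may differ.

```python
def overlap_rate(game_labels, participant_labels) -> float:
    """Fraction of distinct game labels that participants also produced."""
    game = set(game_labels)
    if not game:
        return 0.0
    return len(game & set(participant_labels)) / len(game)


# Hypothetical labels for a single image.
game = ["dog", "grass", "ball", "sky"]
participants = ["dog", "ball", "grass", "tree"]
print(f"{overlap_rate(game, participants):.0%}")
```

Averaging this rate over the sampled images gives a single figure comparable to the one reported on the slide.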
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): user must identify the closest antecedent of a markable, if it is anaphoric
  - Detectives Conference (Validation): user must agree/disagree with a coreference relation entered by another user
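One way to combine the two tasks is to give each candidate antecedent one point per annotator who chose it, plus one per validator who agreed and minus one per validator who disagreed, then keep the highest-scoring candidate. The sketch below follows that rule, which is assumed here for illustration and is not stated on the slide. The function name and sample data are hypothetical.

```python
from collections import Counter


def best_interpretation(annotations, validations):
    """Pick the highest-scoring antecedent for one markable.

    annotations: list of antecedents chosen in annotation mode.
    validations: list of (antecedent, agrees) pairs from validation mode.
    Returns (antecedent, score), or None when there are no judgements.
    """
    scores = Counter(annotations)
    for antecedent, agrees in validations:
        scores[antecedent] += 1 if agrees else -1
    if not scores:
        return None
    # Ties resolve to the candidate that was seen first.
    return max(scores.items(), key=lambda item: item[1])


annotations = ["the detective", "the detective", "the butler"]
validations = [("the butler", False), ("the detective", True)]
print(best_interpretation(annotations, validations))
```

Keeping the score alongside the winner makes it easy to flag markables where the margin is small and more judgements would help.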
www.phrasedetectives.com
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• R. Mihalcea and A. Csomai (2007). Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio (2013). GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
ESP the game
ESP THE GAME
• Two partners are picked at random from the large number of players online
• They are not told who their partner is and can't communicate with them
• They are both shown the same image
• The goal: guess how their partner will describe the image, and type that description
  - Hence "the ESP game"
• If any of the strings typed by one player matches a string typed by the other player, they score points
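The matching rule described above can be sketched in a few lines of Python. This is only an illustration: the normalisation step (lower-casing, collapsing whitespace) and the `points_per_match` value are assumptions made for the example, not details given on the slides.

```python
from typing import Iterable, Set


def normalize(label: str) -> str:
    """Lower-case a typed string and collapse runs of whitespace (assumed rule)."""
    return " ".join(label.lower().split())


def matching_labels(player_a: Iterable[str], player_b: Iterable[str]) -> Set[str]:
    """Return the labels typed by both players, ignoring empty strings."""
    a = {normalize(s) for s in player_a if s.strip()}
    b = {normalize(s) for s in player_b if s.strip()}
    return a & b


def score_round(player_a: Iterable[str], player_b: Iterable[str],
                points_per_match: int = 100) -> int:
    """Award points if at least one string matches; the point value is hypothetical."""
    return points_per_match if matching_labels(player_a, player_b) else 0


if __name__ == "__main__":
    a = ["Dog", "grass ", "ball"]
    b = ["puppy", "dog", "park"]
    print(matching_labels(a, b), score_round(a, b))
```

Because neither player sees what the other types, a match is treated as evidence that the agreed string is a reasonable label for the image.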
THE TASK
SCORING BY MATCHING
SOME STATISTICS
• In the 4 months between August 9th 2003 and December 10th 2003
  - 13,630 players
  - 1.2 million labels for 293,760 images
  - 80% of players played more than once
• By 2008
  - 200,000 players
  - 50 million labels
QUALITY OF THE LABELS
• For IMAGE SEARCH
  - choose 10 labels among those produced and look at which images are returned
• Compare labels produced by players with labels produced by participants in an experiment
  - 15 participants, 20 images among the 1000 with more than 5 labels
  - 83% of game labels also produced by participants
• Manual assessment of labels ('would you use these labels to describe this image?')
  - 15 participants, 20 images
  - 85% of words rated useful
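The second check, comparing game labels with labels produced by experiment participants, comes down to a simple overlap ratio. The sketch below assumes case-insensitive exact matching, and the labels in it are made up for illustration; they are not data from the study.

```python
from typing import Iterable


def label_overlap(game_labels: Iterable[str],
                  participant_labels: Iterable[str]) -> float:
    """Fraction of the game's labels that participants also produced."""
    game = {label.lower() for label in game_labels}
    if not game:
        return 0.0
    participants = {label.lower() for label in participant_labels}
    return len(game & participants) / len(game)


if __name__ == "__main__":
    # Toy labels for a single image (invented for this example).
    game = ["dog", "grass", "ball", "park", "brown", "running"]
    participants = ["Dog", "ball", "grass", "park", "running", "tree", "sky"]
    print(f"{label_overlap(game, participants):.0%} of game labels also produced")
```

Averaging this ratio over the sampled images gives one figure for the whole sample.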
GOOGLE IMAGE LABELLER
THE TASK
RESULTS
PHRASE DETECTIVES
www.phrasedetectives.org
• 2 tasks
  - Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
  - Detectives Conference (Validation): the user must agree or disagree with a coreference relation entered by another user
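One way to see how the two tasks fit together is to combine annotations and validations into a single judgement per markable. The sketch below is a hypothetical aggregation: the scoring rule (one point per annotation, plus one per agreement, minus one per disagreement) and all names are assumptions for illustration, not the scheme Phrase Detectives actually uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (markable, antecedent); an antecedent of None means "not anaphoric".
Annotation = Tuple[str, Optional[str]]
# (markable, antecedent, agrees)
Validation = Tuple[str, Optional[str], bool]


def best_interpretation(annotations: List[Annotation],
                        validations: List[Validation]) -> Dict[str, Optional[str]]:
    """Pick, for each markable, the antecedent with the highest assumed score."""
    score: Dict[Annotation, int] = defaultdict(int)
    for markable, antecedent in annotations:        # annotation mode
        score[(markable, antecedent)] += 1
    for markable, antecedent, agrees in validations:  # validation mode
        score[(markable, antecedent)] += 1 if agrees else -1

    best: Dict[str, Tuple[Optional[str], int]] = {}
    for (markable, antecedent), s in score.items():
        if markable not in best or s > best[markable][1]:
            best[markable] = (antecedent, s)
    return {m: antecedent for m, (antecedent, _) in best.items()}


if __name__ == "__main__":
    annotations = [("it", "the car"), ("it", "the road"), ("it", "the car")]
    validations = [("it", "the road", False), ("it", "the car", True)]
    print(best_interpretation(annotations, validations))
```

The reason for a validation step is that a relation entered by one user is not taken on trust; it gains or loses support as other users judge it.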
www.phrasedetectives.com
PHRASE DETECTIVES: THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM '07, Lisbon, Portugal.
• Nguyen, V. and Poesio, M. (2012). Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• Lungley, D., Trevisan, M., Nguyen, V., Althobaiti, M. and Poesio, M. (2013). GALATEAS D2W: A multi-lingual Disambiguation to Wikipedia web service. Proc. of ENRICH.
• Nastase, V. and Strube, M. (2012). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• von Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo and Ducceschi (2013). Phrase Detectives: Utilizing collective intelligence for Internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems.
- 807 - TEXT ANALYTICS
- WIKIPEDIA
- Slide 3
- Slide 4
- WIKIPEDIA FOR TEXT ANALYTICS
- Wikipedia as Thesaurus for text classification clustering
- Wikipedia as Thesaurus
- Slide 8
- Slide 9
- Slide 10
- Slide 11
- The concept Artificial Intelligence belongs to four categories Artificial intelligence Cybernetics Formal sciences amp Technology in society
- Slide 13
- The different meanings that Artificial intelligence may refer to are listed in its disambiguation page
- WIKIPEDIA FOR TEXT CATEGORIZATION CLUSTERING
- Using Wikipedia Categories for text classification
- WIKIPEDIA FOR TEXT CLASSIFICATION
- USING WIKIPEDIA FOR TEXT CLASSIFICATION
- TEXT WIKIFICATION
- WIKIFICATION
- Wikification pipeline
- Keyword Extraction
- Keyword Extraction using Wikipedia
- Slide 24
- Slide 25
- KEYPHRASENESS
- COMMONNESS
- Slide 28
- The Wikipedia Dump from July 2011
- Some statistics (all Wikidumps from July 2011)
- Surface forms titles articles
- Slide 32
- Slide 33
- Training a supervised wikifier
- Results on standard datasets
- Wikifying queries the Bridgeman datasets
- Results on Bridgeman 1000 Y3
- Results for the GALATEAS languages and Arabic
- The GALATEAS D2W web services
- Use of the service in LangLog
- Other applications
- WIKIPEDIA FOR NER
- Slide 43
- Slide 44
- Slide 45
- Slide 46
- Slide 47
- The Supervised Approach Using Wikipedia links to generate training data
- MORE ADVANCED USES OF WIKIPEDIA
- SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA
- Wikipedia category network
- Deriving a taxonomy from Wikipedia (AAAI 2007)
- Slide 53
- INFOBOXES
- Slide 56
- Slide 57
- Slide 58
- SPARQL
- Slide 60
- Slide 61
- Slide 62
- Slide 63
- OPEN MIND COMMONSENSE
- Slide 65
- CONCEPT NET
- GAMES WITH A PURPOSE
- GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
- EXAMPLES OF GWAP
- ESP
- ESP the game
- ESP THE GAME
- THE TASK
- SCORING BY MATCHING
- SOME STATISTICS
- QUALITY OF THE LABELS
- GOOGLE IMAGE LABELLER
- Slide 78
- RESULTS
- PHRASE DETECTIVES
- Slide 81
- NAME THE CULPRIT
- READINGS
- Slide 84
-
RESULTS
PHRASE DETECTIVES
wwwphrasedetectivesorg
bull 2 tasks
ndash Find The Culprit (Annotation)User must identify the closest antecedent of a markable if it is anaphoric
ndash Detectives Conference (Validation)User must agreedisagree with a coreference relation entered by another user
wwwphrasedetectivescom
PHRASE DETECTIVES THE TASKS
NAME THE CULPRIT
READINGS
• Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. Proceedings of CIKM'07, Lisbon, Portugal.
• V. Nguyen & M. Poesio. 2012. Entity disambiguation and linking over queries using encyclopedic knowledge. Proceedings of the 6th Workshop on Analytics for Noisy Unstructured Text Data.
• D. Lungley, M. Trevisan, V. Nguyen, M. Althobaiti, M. Poesio. 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. Proc. of ENRICH.
• V. Nastase & M. Strube. 2012. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence.
READINGS
• L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8, 58-67.
• Poesio, Chamberlain, Kruschwitz, Robaldo & Ducceschi. 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems.