Large Knowledge Collider (LarKC) : A Platform for Web Scale Reasoning
Ning Zhong (1,3), Frank van Harmelen (2), Yi Zeng (3), Zhisheng Huang (2)
(1) Maebashi Institute of Technology, Japan
(2) Vrije Universiteit Amsterdam, the Netherlands
(3) International WIC Institute, Beijing University of Technology, China
http://www.larkc.eu
Late breaking news: Google Video is now also annotated with RDFa (using vocabularies from Yahoo and Facebook).

The World is Creating Linked Data Every Day!
- toxic releases
- consumer expenditure
- recent earthquakes
- consumer price index
- crime statistics
- tornado reports
- assaults on police
- trade statistics
- social benefits
- river elevations
- unemployment rates
- energy consumption
<rdf:RDF>
  <rdf:Description rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.rdf">
    <rdfs:label>Description of the artist Yeah Yeah Yeahs</rdfs:label>
    <foaf:primaryTopic rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#artist"/>
  </rdf:Description>
  <mo:MusicArtist rdf:about="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4#artist">
    <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
    <foaf:name>Yeah Yeah Yeahs</foaf:name>
    <ov:sortLabel>Yeah Yeah Yeahs</ov:sortLabel>
    <bio:event>
      <bio:Birth>
        <bio:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">...</bio:date>
      </bio:Birth>
    </bio:event>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/Yeah_Yeah_Yeahs"/>
    <mo:image rdf:resource="/music/images/artists/7col_in/584c04d2-4acc-491b-8a0a-e63..."/>
    <foaf:page rdf:resource="/music/artists/584c04d2-4acc-491b-8a0a-e63133f4bfc4.html"/>
    <mo:musicbrainz rdf:resource="http://musicbrainz.org/artist/584c04d2-4acc-491b-8a0a-e63133f4bfc4"/>
    <foaf:homepage rdf:resource="http://www.yeahyeahyeahs.com/"/>
    <mo:wikipedia rdf:resource="http://en.wikipedia.org/wiki/Yeah_Yeah_Yeahs"/>
    <mo:myspace rdf:resource="http://www.myspace.com/yeahyeahyeahs"/>
    <mo:member rdf:resource="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#artist"/>
    <mo:member rdf:resource="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#artist"/>
    <mo:member rdf:resource="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#artist"/>
    <foaf:made>
      <mo:Record>
        <dc:title>It's Blitz!</dc:title>
        <mo:musicbrainz rdf:resource="http://musicbrainz.org/release/9c4177fe-bdce-4f9d-ab..."/>
        <rev:hasReview rdf:resource="/music/reviews/hnp2#review"/>
      </mo:Record>
    </foaf:made>
    ...
  </mo:MusicArtist>
  <mo:MusicArtist rdf:about="/music/artists/a1439b8d-672a-446f-a7ff-6f09d68254b3#artist">
    <foaf:name>Brian Chase</foaf:name>
  </mo:MusicArtist>
  <mo:MusicArtist rdf:about="/music/artists/14d44067-99c2-4f77-b58b-138f0b6911fa#artist">
    <foaf:name>Karen O</foaf:name>
  </mo:MusicArtist>
  <mo:MusicArtist rdf:about="/music/artists/20dc35ec-6cc1-4c66-98a3-4a6116cb3869#artist">
    <foaf:name>Nick Zinner</foaf:name>
  </mo:MusicArtist>
</rdf:RDF>
What to do for the success of Web-scale Semantic Data Processing?
Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]
- Refining Search by Reasoning [Berners-Lee 1999]
- Refining Reasoning by Search [Fensel & van Harmelen 2007]
"a configurable platform for infinitely scalable semantic web reasoning"

"Pipeline" suggests a linear structure, but LarKC pipelines need not be linear.
Parallelization

Cashier1: 53
Cashier2: 14
Cashier3: 33
Cashier4: 72
Cashier2: 34
Cashier3: 13
Cashier4: 32
--------------------
Total: 340

Now try this with Web-scale data: 7x10^8 triples.
Data dependencies

Two for the price of one? Second one at half price? Offers like these create dependencies between the items being summed.
Split Responsibility

Two for the price of one? Second one at half price?
Divide the items by category: Fruit, Vegetables, Household, Packaged, Rest.
Load Balancing

The same categories (Fruit, Vegetables, Household, Packaged, Rest), but the work is redistributed so that each cashier gets a comparable load.
Data dependencies

"With a box of detergent and a box of cereal, get a free pen!" Such an offer cuts across the categories (Fruit, Vegetables, Household, Packaged, Rest), so no category-based split can avoid cross-partition dependencies.

For RDF data, any triple can refer to any URI.
Towards Parallelization and Distribution

Different parallel computing models:
- Peer-to-peer (MaRVIN)
- MapReduce (Reasoning-Hadoop)
The MaRVIN Way: Divide-Conquer-Swap

[Diagram: input data is divided over several compute nodes, which repeatedly compute and swap partial results before producing the output data.]

Eyal Oren, Spyros Kotoulas
MARVIN (Massive RDF Versatile Inference Network)

... is:
- a distributed technique for computing the RDFS/OWL closure

... scales by:
- distributing computation over many nodes
- approximate (sound but incomplete) reasoning
- anytime convergence (more complete over time)

... runs on:
- in principle: any grid, using the Ibis middleware
- the DAS-3 distributed supercomputer (300 nodes)
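MaRVIN itself runs distributed on a grid; the divide-conquer-swap idea can still be sketched as a single-process toy. In the sketch below the rule set is just rdfs:subClassOf transitivity, and the peer and round counts are illustrative assumptions, not MaRVIN's actual configuration:

```python
import itertools
import random

def local_closure(triples):
    """One peer's 'conquer' step: apply rdfs:subClassOf transitivity
    ((a sub b) and (b sub c) entail (a sub c)) to the triples it holds."""
    triples = set(triples)
    while True:
        new = {(a, "subClassOf", d)
               for (a, p, b) in triples if p == "subClassOf"
               for (c, q, d) in triples if q == "subClassOf" and b == c}
        if new <= triples:
            return triples
        triples |= new

def marvin_round(peers):
    """One Divide-Conquer-Swap round: every peer closes its own partition,
    then all triples are reshuffled ('swapped') across the peers."""
    peers = [local_closure(p) for p in peers]
    pool = list(set(itertools.chain.from_iterable(peers)))
    random.shuffle(pool)
    n = len(peers)
    return [set(pool[i::n]) for i in range(n)]

# A subClassOf chain split over two peers, so no single peer can derive
# (A subClassOf D) on its own; repeated swapping makes it derivable.
data = [("A", "subClassOf", "B"), ("B", "subClassOf", "C"),
        ("C", "subClassOf", "D")]
peers = [set(data[:2]), set(data[2:])]
random.seed(0)  # deterministic demo
derived = set().union(*peers)
for _ in range(50):  # anytime behavior: more rounds give a more complete answer
    peers = marvin_round(peers)
    derived = set().union(*peers)
    if ("A", "subClassOf", "D") in derived:
        break
print(sorted(derived))
```

The union of all peers' triples only grows over rounds, which is exactly the sound, anytime, increasingly complete behavior listed above.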
The MapReduce Distributed Programming Model

Initially designed and developed by Google in 2004 for large-scale data processing [Dean & Ghemawat 2004]. The computation is expressed with two functions: map and reduce.

MapReduce on 64 machines:
- peak inference rate: 8M triples/sec
- sustained inference rate: 4M triples/sec

[Diagram: the input triples (A p C, A q B, D r D, E r D, F r C) are split over map tasks, which emit a keyed record such as <C,_,_> or <A,_,_> for every term occurrence; reduce tasks then aggregate per key, yielding the term counts C 2, p 1, r 3, q 1, D 3, F 1, ...]

MapReduce work by Jacopo Urbani.
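The map/reduce split in the diagram can be mimicked in a few lines of plain Python. This is an in-memory toy, not Hadoop; the triples and the term-count job are taken from the figure above:

```python
from collections import defaultdict

# Toy triples from the diagram: (subject, predicate, object).
triples = [("A", "p", "C"), ("A", "q", "B"), ("D", "r", "D"),
           ("E", "r", "D"), ("F", "r", "C")]

def map_fn(triple):
    """Map: emit one (key, 1) pair per term occurrence in the triple."""
    for term in triple:
        yield term, 1

def reduce_fn(key, values):
    """Reduce: aggregate all counts emitted for one key."""
    return key, sum(values)

# The framework's shuffle phase, simulated in memory: group pairs by key.
groups = defaultdict(list)
for t in triples:
    for k, v in map_fn(t):
        groups[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(counts["r"], counts["D"], counts["A"])  # → 3 3 2
```

In a real deployment the map and reduce functions stay the same; the framework distributes them over the 64 machines and performs the shuffle across the network.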
Stopping Rules

On very large datasets, incompleteness is the rule: we must stop before we are finished. But when do we stop? Stopping rules are important; they determine:
- the length of the computation (don't stop too late)
- the quality of the result (don't stop too early)

Take inspiration from economics, biology, and psychology.
Lael Schooler

Humans have good heuristics for when to stop problem solving. "Name capital cities in Europe": London, Paris, Berlin, Rome, Amsterdam, ... Milan, Madrid, ..., ..., Paris, ...

Cues for stopping: the time between solutions, wrong answers, repetitions.

When to switch between tasks? Humans (and animals) are very good at finding this optimum.

[Plot: performance over time on a hard task and an easy task separately, and on the combined task.]
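These cues translate directly into a machine stopping rule. A minimal sketch, where the thresholds and the answer stream are invented for illustration and are not LarKC parameters:

```python
import time

def anytime_answers(source, patience=2.0, max_repeats=3, max_wrong=3):
    """Collect answers from a stream, stopping when a cue fires:
    too long since the last *new* answer (time between solutions),
    too many repetitions, or too many known-wrong answers."""
    seen, wrong, repeats = set(), 0, 0
    last_new = time.monotonic()
    for answer, is_correct in source:
        if time.monotonic() - last_new > patience:  # time between solutions
            break
        if not is_correct:
            wrong += 1                              # wrong answers
        elif answer in seen:
            repeats += 1                            # repetitions
        else:
            seen.add(answer)
            last_new = time.monotonic()
        if repeats >= max_repeats or wrong >= max_wrong:
            break
    return seen

# The "capital cities" stream from the slide: Milan is wrong,
# Paris keeps repeating, so we stop before ever reaching Madrid.
stream = [("London", True), ("Paris", True), ("Berlin", True),
          ("Rome", True), ("Milan", False), ("Paris", True),
          ("Paris", True), ("Paris", True), ("Madrid", True)]
print(anytime_answers(iter(stream)))
```

The returned set is whatever was found before a cue fired, which is exactly the anytime trade-off: stopping late wastes computation, stopping early sacrifices answers.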
Where do the axioms come from?
- Which subset to use?
- Relevance measures
- Example, syntactic relevance:
  - δ(α,β) = 1 if α and β share a concept symbol
  - δ(α,β) = k if δ(α,γ) = k-1 and β and γ share a concept symbol
- A very simple, syntactically unstable measure, but: it gives a high-quality sound approximation (>90% recall, 100% precision for small k).

Zhisheng Huang
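A sketch of selecting an axiom subset by this syntactic relevance measure. Here axioms are modeled simply as sets of concept symbols, and the toy ontology is made up for illustration:

```python
def symbols(axiom):
    """Concept symbols of an axiom (axioms here are plain symbol sets)."""
    return set(axiom)

def relevance(alpha, axioms):
    """delta(alpha, beta) = 1 if alpha and beta share a concept symbol;
    delta(alpha, beta) = k if some gamma with delta(alpha, gamma) = k - 1
    shares a symbol with beta. Computed as breadth-first layers."""
    delta = {}
    frontier = [b for b in axioms if symbols(alpha) & symbols(b)]
    k = 1
    while frontier:
        for b in frontier:
            delta.setdefault(b, k)
        reached = set().union(*(symbols(b) for b in frontier))
        frontier = [b for b in axioms
                    if b not in delta and reached & symbols(b)]
        k += 1
    return delta

# Toy ontology: each axiom is the frozenset of concept symbols it mentions.
axioms = [frozenset({"Dog", "Mammal"}), frozenset({"Mammal", "Animal"}),
          frozenset({"Animal", "LivingThing"}), frozenset({"Car", "Vehicle"})]
query = frozenset({"Dog"})
d = relevance(query, axioms)

# Reason only over axioms within distance k: a sound approximation.
subset_k2 = [a for a in axioms if d.get(a, 99) <= 2]
```

Because only axioms are removed, never altered, any conclusion drawn from the selected subset also follows from the full ontology (soundness); small k simply bounds how much is found.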
Take data selection seriously

Exploit the grounding of logical symbols in natural language:
- Google distance as a relevance measure
- = symmetric conditional probability of co-occurrence
- = an estimate of semantic distance

NGD(x,y) = ( max{log f(x), log f(y)} - log f(x,y) ) / ( log M - min{log f(x), log f(y)} )

where f(x) is the number of pages containing x, f(x,y) the number of pages containing both x and y, and M the total number of indexed pages.

This gives an almost perfect "forgetting function" for matching class definitions in two vocabularies.

Zhisheng Huang

Take identifiers seriously
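The formula is straightforward to compute once the page counts are known. A sketch with made-up counts; a real implementation would obtain f(x), f(y) and f(x,y) from search-engine hit counts:

```python
import math

def ngd(fx, fy, fxy, M):
    """Normalized Google Distance from page counts:
    NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y))
             / (log M - min(log f(x), log f(y)))"""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))

# Illustrative (made-up) counts: terms that nearly always co-occur get a
# distance near 0; terms that rarely co-occur score much higher.
close = ngd(1000, 900, 850, 10**10)
far = ngd(1000, 900, 3, 10**10)
print(close < far)  # → True
```

Ranking candidate axioms or class definitions by NGD against the query terms, and keeping only the nearest ones, is the data-selection step the slide argues for.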
Unifying Search and Reasoning from the Viewpoint of Granularity

Unifying Search and Reasoning (ReaSearch) [Fensel & van Harmelen 2007]
- Refining Search by Reasoning [Berners-Lee 1999]
- Refining Reasoning by Search [Fensel & van Harmelen 2007]

Granularity: let human problem solving inspire Web problem solving.
- Human problem solving: basic-level advantage, cognitive memory retention
- Web problem solving: multi-level, multi-perspective, variable precision

Barriers for Web-scale problem solving:
(1) Finding the most relevant data in a huge space of search results [Berners-Lee 1999].
(2) Traditional reasoning systems cannot handle Web-scale data in a rational amount of time [Fensel 2007].
Concrete Strategies
- The Starting Point
- Multi-level Completeness
- Multi-level Specificity
- Multi-perspective
The Starting Point Strategy

[Collins 1969] Collins, A.M., Quillian, M.R.: Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behaviour 8 (1969) 240-247
(I) The Starting Point Strategy

The "basic level advantage" [Rogers 2007]: concepts at the basic level occur more frequently than other terms [Wisniewski 1989]. As a step beyond "familiar terms" at the basic level, "interest retention" takes frequency and recency into account at the same time. Interest retention models <--> cognitive memory retention models [Anderson & Schooler 1991].

- (Frequency) Total interest: TI(i) = sum_{j=1..n} m(i,j)
- (Frequency and recency) Exponential model for interest retention: EIR(i) = sum_{j=1..n} m(i,j) * A * e^(-b*T_j)
- (Frequency and recency) Power model for interest retention: PIR(i) = sum_{j=1..n} m(i,j) * A * T_j^(-b)

Here m(i,j) is the number of occurrences of interest i at time point j, T_j is the time elapsed since j, and A, b are model parameters.
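The power model can be computed directly from the formula above. In this sketch the parameter values A and b are placeholders, not the fitted values used in the study:

```python
def power_interest_retention(occurrences, current_year, A=1.0, b=0.5):
    """Power-law interest retention for one interest i:
    PIR(i) = sum_j m(i, j) * A * T_j**(-b),
    where occurrences maps year j -> m(i, j) (occurrence count) and
    T_j = current_year - j is the elapsed time. A and b are illustrative."""
    return sum(m * A * (current_year - year) ** (-b)
               for year, m in occurrences.items() if year < current_year)

# An interest used heavily long ago vs. one used moderately but recently:
old_interest = {1995: 10, 1996: 8}
recent_interest = {2007: 4, 2008: 5}
print(power_interest_retention(old_interest, 2009)
      < power_interest_retention(recent_interest, 2009))  # → True
```

Even though the old interest has more total occurrences (18 vs. 9), its retention is lower, which is how the model separates recency from raw frequency.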
Interest Retention and Interest Prediction

[Figures: a comparative study of TI during 1990-2008 and IR in 2009; the difference in contribution values from papers published in different years; comparative studies of predicted vs. real publication numbers under the power-law model and the exponential-law model.]
Evaluations and the Released Dataset

- Interest retentions vs. future interests: for the 1226 authors with at least 100 publications, taking their top 9 interests from 2000 to 2007, the model correctly predicts at least 3 of the 9 interests for 49.54% of them.
- 615,124 computer scientists in the SwetoDBLP dataset.
- Released at http://wiki.larkc.eu/csri-rdf
DBLP-SSE: DBLP Search Support Engine

Recent interests are extracted using the power-law interest retention model. Terms with high frequency do not necessarily have high interest retention (e.g. "Knowledge").
DBLP-SSE: DBLP Search Support Engine

Query: Artificial Intelligence (logged-in user: Dieter Fensel)

Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language

List 1: without current-interest constraints (top 5 results)
* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the Musician?
* ...

List 2: with current-interest constraints (top 5 results)
* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test Environments (AI-ESTATE) - A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular Computing.
* Open Information Systems Semantics for Distributed Artificial Intelligence.
* Artificial Intelligence and Financial Services.
* ...
Multi-level Completeness Strategy

Limited time --> low completeness; more time available --> high completeness.

One practical question: how do we choose the nodes to be reasoned over?
Choose the pivotal nodes in the network first!

Another practical question: if I stop here, what is the completeness like now?
Multi-level Completeness Strategy

"Who are the authors in Artificial Intelligence?" Nodes are grouped together by node degree under a perspective; unifying search and reasoning with multi-level completeness then gives anytime behavior.

degree(n, Pcn) to stop   Satisfied authors   AI authors
70                       2885                151
30                       17121               579
11                       78868               1142
4                        277417              1704
1                        575447              2225
0                        615124              2355

Completeness Prediction Function: PC(i), which predicts the completeness reached so far from |N| (all nodes), |Nrel(i)| (nodes relevant to query i), and |Nsub(i)|, |Nsub'(i)| (the selected node subsets).

[Figure: comparison of predicted and actual completeness values.]
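The "pivotal nodes first" idea amounts to ordering nodes by degree and spending the available budget on the top of that order. A minimal sketch, where the degree values echo the table's stopping thresholds and the budget is an assumed stand-in for a real time limit:

```python
def pivotal_first(nodes, degree, budget):
    """Multi-level completeness: reason over high-degree (pivotal) nodes
    first, so that stopping early still covers the best-connected part of
    the network. degree maps node -> degree under the chosen perspective;
    budget is how many nodes there is time to process."""
    ranked = sorted(nodes, key=lambda n: degree[n], reverse=True)
    return ranked[:budget]

# Toy network whose degrees mirror the table's thresholds.
degree = {"a": 70, "b": 30, "c": 11, "d": 4, "e": 1}
print(pivotal_first(degree.keys(), degree, budget=2))  # → ['a', 'b']
```

Raising the budget extends the same ranked list, so each completeness level is a strict superset of the previous one, matching the table's rows.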
A Case Study on the Multi-level Specificity Strategy

Answers to "Who are the authors in Artificial Intelligence?" in multiple levels of specificity, according to the hierarchical ontology of Artificial Intelligence:

Specificity   Relevant Keywords            Number of Authors
Level 1       Artificial Intelligence      2355
Level 2       Agents                       9157
              Automated Reasoning          222
              Cognition                    19775
              Constraints                  8744
              Games                        3817
              Knowledge Representation     1537
              Natural Language             2939
              Robot                        16425
              ...                          ...
Level 3       Case-Based Reasoning         1133
              Cognitive Modeling           76
              Decision Trees               1112
              Search                       32079
              Translation                  4414
              Web Intelligence             122
              ...                          ...

A comparative study of the answers at different levels of specificity:

Specificity   Number of authors   Completeness
Level 1       2355                0.85%
Level 1,2     207468              75.11%
Level 1,2,3   276205              100%
The Multi-perspective Strategy

Multiple representations of knowledge [Minsky 2006]. User needs differ from each other <--> users expect answers from different perspectives.

[Figure: normalized degree distribution of predicates in the SwetoDBLP dataset.]
The Multi-perspective Strategy

[Figures: coauthor-number distribution in the SwetoDBLP dataset (Fig. 2), its log-log diagram (Fig. 3) and a zoomed-in version (Fig. 4); a zoomed-in version of the coauthor distribution for "Artificial Intelligence" (Fig. 5); publication-number distribution in the SwetoDBLP dataset (Fig. 6) and its log-log diagram (Fig. 7).]

Under different perspectives, the distribution characteristics are different!
Comparison of Results from Different Perspectives

Publication number perspective   Coauthor number perspective
Thomas S. Huang (387)            Carl Kesselman (312)
John Mylopoulos (261)            Thomas S. Huang (271)
Hsinchun Chen (260)              Edward A. Fox (269)
Henri Prade (252)                Lei Wang (250)
Didier Dubois (241)              John Mylopoulos (245)
Thomas Eiter (219)               Ewa Deelman (237)
...                              ...

A partial result of the multi-level specificity reasoning task: the list of authors in "Artificial Intelligence" at level 1, from two perspectives.
Summarizing
- The Semantic Web is rapidly becoming real
- Scale is becoming a real problem
- Different ways of scaling up:
  - parallelization
  - exploiting cognitive heuristics (stopping rules, cognitive memory retention, etc.)
  - data selection for incomplete reasoning
  - new forms of reasoning
Acknowledgement

The slides for this talk are mainly from 3 previous talks:
- Frank van Harmelen. Large Scale Reasoning on the Semantic Web, or: When Success is Becoming a Problem. Invited talk at the 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
- Yi Zeng. Unifying Web-scale Search and Reasoning from the Viewpoint of Granularity. The 2009 International Joint Conferences on Active Media Technology and Brain Informatics.
- Spyros Kotoulas. Marvin and the Billion Triple Challenge. Super Computing Seminar, University of Amsterdam, 2008.
Contact Info

[email protected]
http://www.larkc.eu

Want to play with LarKC? Want to contribute plugins? Want to deploy LarKC?

Asia @ WIC:
Ning Zhong: [email protected]
Yi Zeng: [email protected]
References

[Berners-Lee1999] Berners-Lee, T., Fischetti, M.: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperSanFrancisco (1999)
[Fensel2007] Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet Computing 11(2) (2007) 94-96
[Michalski1986] Michalski, R.S., Winston, P.H.: Variable precision logic. Artificial Intelligence 29(2) (1986) 121-146
[Minsky2006] Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster (2006)
[Rogers2007] Rogers, T., Patterson, K.: Object categorization: Reversals and explanations of the basic-level advantage. Journal of Experimental Psychology: General 136(3) (2007) 451-469
[Wickelgren1976] Wickelgren, W.: Memory storage dynamics. In: Handbook of Learning and Cognitive Processes. Lawrence Erlbaum Associates, Hillsdale, NJ (1976) 321-361
[Aleman-Meza2007] Aleman-Meza, B., Hakimpour, F., Arpinar, I., Sheth, A.: SwetoDblp ontology of computer science publications. Web Semantics: Science, Services and Agents on the World Wide Web 5(3) (2007) 151-155
[Ebbinghaus1913] Ebbinghaus, H.: Memory: A Contribution to Experimental Psychology. Teachers College, Columbia University (1913)