The Growing Semantic Web Dr. Mark Greaves Vulcan Inc. [email protected] © 2009 Vulcan Inc.
-
Upload
milo-gallagher -
Category
Documents
-
view
216 -
download
1
Transcript of The Growing Semantic Web Dr. Mark Greaves Vulcan Inc. [email protected] © 2009 Vulcan Inc.
2
What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers
The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now
3
What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers
The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now
The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (Lite, DL, Full dialects), OWL2 (EL, QL, RL profiles)– Weaker than First Order Logic, more easily authorable, decidable, tractable in
most cases using tableaux provers
4
What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers
The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now
The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (3 versions), OWL2– Weaker than First Order Logic, more easily authorable, decidable, tractable in
most cases using tableaux provers
The largest formal knowledge base on Earth– Also the messiest formal knowledge base on Earth
5
What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers
The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now
The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (3 versions), OWL2– Weaker than First Order Logic, more easily authorable, decidable, tractable in
most cases using tableaux provers
The largest formal knowledge base on Earth– Also the messiest formal knowledge base on Earth
A revolution in the way we think of data, crowds, and schemas– Scaling at last comes to symbolic AI– Massive, partial, participatory, logically weak, dynamic, schema-last– A way to democratize and curate knowledge bases and knowledge systems
6
Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web
The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure
Semantic Web Evolution to 2009
Semantic Web Transformation in three areas of application– Markets and Companies
The Fourth Generation– A Scalable Revolution
7
Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web
The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure
Semantic Web Evolution to 2009
Semantic Web Transformation in three areas of application– Markets and Companies
The Fourth Generation– A Scalable Revolution
8
The Roots of the Semantic WebThe Roots of the Semantic Web
Semantic technology has been a distinct research field for decades– Symbolic Logic (from Russell and Frege)– Knowledge Representation Systems in AI
• Semantic Networks (Bill Woods, 1975)• US and EU R&D programs in information integration (80s and 90s)• Development of simple tractable “description logics” for classification• Conceptual Graphs and Formal Concept Analysis
– Relational Algebras and Schemas in Database Systems
Library Science (classifications, thesauri, taxonomies)
So, What Sparked the Semantic Web?
What’s new was the Web!– The material needed to answer almost any question is somewhere on the web– A massive infrastructure of data servers, protocols, authentication systems, presentation languages, and
thin clients that can be leveraged– A way around needing the “big data warehouse”
9
The Beginnings of the US Semantic Web: DARPA’s DAML ProgramThe Beginnings of the US Semantic Web: DARPA’s DAML Program
Solution:Augment the web to link machine-readable knowledge to web pages
Extend RDF with Description Logic Extensibility via frame-based language design Create the first fully distributed web-scale knowledge base
out of networks of hyperlinked facts and data
Approach:Design a family of new web languages
Basic knowledge representation (OWL)Reasoning (SWRL, OWL/P, OWL/T)Process representation (OWL/S)
Build definition and markup tools Link new knowledge to existing web page elements
Test design approach with operational pilots in US Government
Partner with parallel EU efforts to standardize the new web languages
People use implicit knowledge to reason with web pages
People use implicit knowledge to reason with web pages
Computers require explicit knowledge to reason with web pages
Computers require explicit knowledge to reason with web pages
Existing Web
(HTML/XML over HTTP)
Semantic Web
(OWL over HTTP)
Links via URLs
Problem: Computers cannot process most of the information stored on web pages
10
What is RDF?What is RDF?
RDF defines entities and relations for an area of knowledge – Assertions are triples of (resource, property, resource) or (resource, property, value)– Assertions are precise enough to be interpreted set-theoretically by machines– Graphically represented as a directed concept graph with typed links– Simple RDF semantics enables reuse of domain knowledge via class hierarchies
RDF is web data language with web-friendly syntax– All elements in triples are either primitive values or URIs– RDF Schema adds classes, domains, ranges, and inheritance of domains and ranges
typeOf
isa
VIN1234VIN1234
hasManufacturer
isa
isa
isa
isaisa
VehicleVehicle
TruckTruck
196419641964 Mustang1964 Mustang
FordFord
MustangMustang
CarCar
ResourcePropertyProperty Value
VehicleVehicleManufacturerManufacturer
YearYear
Joe’s MustangJoe’s Mustang
manufacturesmanufac
turedIn
manufacturedIn
hasVIN
hasManufacturer
Graphic from Mary Pulvermacher, MITRE
11
What Does OWL Add?What Does OWL Add?
Description Logic Semantics– Relations between classes
• Equivalent class (e.g., US_President and PrincipalResidentofWhiteHouse)• Disjoint class (e.g., Male and Female), subClassOf
– Derived classes (intersectionOf, unionOf, complementOf)– Property characteristics (inverseOf, transitive, symmetric, subPropertyOf, equivProperty, etc.)– Range and Cardinality constraints (e.g., birthMother has exactly one value, which is a person)
Generic (logically-grounded) ability to combine assertions into inferences according to the rules of the DL (for OWL, SHIQ )
Given... And...
spouseOf
Can conclude...
SymmetricPropertySymmetricPropertyJimJimJimJim
MaryMaryMaryMaryspouseOfspouseOfspouseOfspouseOf
rdf:type spouseOf
MaryMaryMaryMary
JimJimJimJim
Graphic from Mary Pulvermacher, MITRE
12
From XML to OWL Increasingly Expressive Options for Web Data MarkupFrom XML to OWL Increasingly Expressive Options for Web Data Markup
XMLIssue addressed: how to express data in text?XML Solution: “wrap” data within start tag/end tags, and empower users to create their own tagsExample:
<altitude>1500 feet</altitude>
Start tag End tag
altitude element
XML Schema (XMLS)
Issue addressed: how should the type structure of the data be expressed?XML Schema Solution: XML templatesExample:
<element name=“altitude” type=“integer”/>
altitude is constrained to be an integer
RDF and RDF SchemaIssue addressed: how can data support statements?RDF Solution: use a subject, property, object patternExample:
<Fighter rdf:ID=“F16”> <altitude>50,000 feet</altitude> <builder>Lockheed</builder></Fighter>
OWL
Issue addressed: how to express data semantics?OWL Solution: use inheritance and a description logic to express restrictions and describe entailmentExample:
<owl:Class rdf:ID=“Fighter”> <rdfs:subClassOf rdf:resource=“#Aircraft”/></owl:Class>
Unconstrained text string
Properties
Instance of the Class
Fighter class: a Fighter inherits properties type of Aircraft,
Values
HTML XML & XMLS RDF OWL
13
DAML Program ElementsDAML Program Elements
Web Ontology Language (OWL) (2/10/04)– Enables knowledge representation and
tractable inference in a web standard format– Based on Description Logics and RDF– Done by the W3C WebOnt Working Group
OWL Reasoning Languages– SWRL: Supports business rules and linking
between OWL ontologies (based on RuleML)– OWL/P Proof Language: Allows software
components to exchange chains of reasoning– OWL/T Trust Language: Represents trust that
OWL and SWRL inferences are valid
Semantic Web Services (OWL/S)– Allows discovery, matching, and execution of
web services based on action descriptions– Unifies semantic data models (OWL) with
process models (Agent) and shows how to dynamically compose web services
OWL Tools– www.semwebcentral.org and www.daml.org
Several US Govt pilots and prototypes
Completed standards process
Started standards process
Unfinished
SWRL: RulesOWL/P: Proof
SWRL: RulesOWL/P: Proof
OWL/S:Semantic Web
Services
OWL/S:Semantic Web
Services
Web OntologyLanguage (OWL)
Web OntologyLanguage (OWL)
OWL/T:Trust
OWL/T:Trust
DAML Program Technical Flow
Each DAML Program Element includesspecifications, software tools,
coordination teams, and use cases
~$45M over 5 years (FY01 – FY05)
14
The Semantic Web in 2009The Semantic Web in 2009
CuttingEdge
Mature
StillResearch
“The Famous Semantic Web Technology Stack”
15
The Semantic Web in 2009The Semantic Web in 2009
CommercialCuttingEdge
Mature
Active Researchand StandardsActivity
“The Famous Semantic Web Technology Stack”
16
CommercialCuttingEdge
The Semantic Web in 2009The Semantic Web in 2009
Mature
Active Researchand StandardsActivity
“The Famous Semantic Web Technology Stack”
17
The Semantic Web in 2009The Semantic Web in 2009
Mature
Active Researchand StandardsActivity
CommercialCuttingEdge
“The Famous Semantic Web Technology Stack”
18
Active Researchand StandardsActivity
Completing the Semantic Web PictureCompleting the Semantic Web Picture
Mature
Other Technologies Impact the Semantic Web
More OntologiesTag SystemsMicroformats
Social Authorship
Combined RDF/OWL and
RDBMS Systems
Scalable Reasoning Systems
A Huge Base of RDF data
CommercialCuttingEdge
Description LP, SWRL, SILK...
19
Beyond RDF and OWL: 2009 Semantic Web InfrastructureBeyond RDF and OWL: 2009 Semantic Web Infrastructure
Markup Languages– HTML-friendly markup dialects:
Microformats and RDFa– OWL 2 is a Candidate Recommendation
Triplestores and SPARQL Servers– Stores for 1B triples now available, though
with caveats around write performance– Commercial: AllegroGraph, Virtuoso,
BigOWLIM, Oracle 11g Semantic Technologies...
– Open Source: 4Store, Sesame, Redland...– Next step is parallel web delivery
architectures
Entity Name Service (Okkam, DBpedia)
Semantic Web Reasoners– Commercial: Oracle 11g RDFS/OWL
engine, Ontobroker, Ontotext, RacerPro– Open Source: Pellet, FaCT++...– RIF is at W3C Last Call status
Vocabularies and Design Tools– Ontologies: Dublin Core, FOAF, SIOC...– OpenSource: Protégé, SWOOP...– Commercial: TopBraid Composer, Schema
Semweb Data Generation– RDF / RDBMS front-ends – Entity extractors and taggers (OpenCalais)– Zemanta-type author’s assistants– Semantic wikis
Semweb Data Exploitation– Semweb search engines (Sindice, Watson,
Falcon...)– Yahoo SearchMonkey / Google Rich
Snippets– Browser extensions and facets
Visualization Tools– Simile Project (http://simile.mit.edu/) – Several Commercial Companies
User-layer ToolsServer Infrastructure
20
State of Semantic Web Work in the USState of Semantic Web Work in the US
DAML finished in 2005, with no followons– NIH (Protégé, NCBO), NSF, some small DoD funding– PAL/CALO funded broader semantic/AI work
But... leading-edge Venture Capital moved in– Vulcan, Crosslink, In-Q-Tel, Benchmark, Intel Capital...
An emerging commercialization ecosystem– Startup/Small: Radar, Metaweb, Evri, AdaptiveBlue...– Midsized: Metatomix, Dow Jones, Reuters/OpenCalais, Franz...– Large: Yahoo!, Google, Oracle, IBM, HP, Microsoft...– Recession is taking its toll on everybody
Emphasis is mostly Semantic dimension of Semantic Web– That was where the money and customers were– RDBMS scale and orientation, powerful analytics for Business Intelligence– Centralized workflows for ontology definition and management– Use cases surrounding corporate data integration and document markup
21
Semantic Web Work in the EUSemantic Web Work in the EU
Continuing Large Public-Sector Investments– Framework 6 (2002-6) – More than €100M in
several different programs– Framework 7 (2007-13) – ~€1B/year for information and
communications technologies• Semantics is more present as a general systems technology• Future Internet and Digital Libraries thrusts
Two Dedicated Multi-site Semweb R&D Institutes– DERI: 100+ people and the world leader in research– Semantic Technology Institute International– A strong and growing cadre of graduate students
Emphasis on the Social and Web Dimensions of the Semantic Web– Web-scale Linked Data, social networks, simple scalable imperfect inference– Ontology and data dynamics, imperfections, versioning– Semantically-boosted collaboration with limited knowledge engineer involvement
Clear R&D leadership but lags in Semweb commercialization
22
Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web
The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure
Semantic Web Evolution to 2009
Semantic Web Transformation in three areas of application– Markets and Companies
The Fourth Generation– A Scalable Revolution
23
Evolving Conceptions for the Semantic WebEvolving Conceptions for the Semantic Web
Initial Semantic Web Conception*
* By most people but not Tim Berners-Lee
Semantic markup would be tightly associated with individual web pages
– “Translate the Web for machines”– Markup lives in associated RDF/XML docs
Core problem is labeling free-text web pages with a (pre-defined) ontology markup vocabulary
– Entity extraction and ontology-aware NLP– Document segmentation technologies– Manual annotation tools
Need an all-encompassing ontology or set of logically compatible ontologies
– Agreement was always elusive
Need to achieve scale and consistency in semantic page markup
– Manual annotation has incentive, quality/ consistency, and cost/benefit issues
Hypertext W
eb
(HTML/X
ML over H
TTP)
Seman
tic W
eb
(RDF/O
WL o
ver H
TTP)
Links via URLs
24
The Situation for Semantic Web and ML/DM in 2007The Situation for Semantic Web and ML/DM in 2007
Problem: Semantic Web will never see network effects until a sufficiently large amount of semantic markup/metadata is available
Solution: ML/DM techniques can generate consistent, cost-effective semantic markup, either semi-automatically or fully automatically
– Data/instance level: ML/DM derives semantic metadata for web resources (URIs)• Segmentation of heterogeneous pages (simple web pages to blogs/forums/Amazon...)• IE-based and ontology-driven annotation
– Standard text-oriented techniques (catgorization, clustering, partial parsing, relation extraction, topic extraction, fully-/semi-/un-supervised learning, etc.)
– Schema induction from the web (see Google Squared)– Leverage links, tags, deep web query results & wrapper induction, MIME-types, – Stylized and ungrammatical language use including telegraphic speech
• Content cleaning and normalization• Duplicate entity recognition and merging (currently >6000 “George Bush” mentions...)
– Model level: ML/DM generates semantic annotation schema• Derive RDF/OWL vocabularies, taxonomies, lexical nets, ontologies from web data• Propose ontology alignments, restrictions, and mapping rules
Pretty Standard ML/DM Stuff: Find some data, clean it, annotate it, categorize/cluster it, and map it
25
Evolving Conceptions for the Semantic Web (Part 2)Evolving Conceptions for the Semantic Web (Part 2)
The Web is a publishing platform for formal knowledge as well as pages
– RDFa allows mixed webtext and RDF– RDF/OWL documents can be published by
web-connected databases– Huge numbers of knowledge publishers– Linkage via owl:sameAs,
owl:equivalentClass, owl:equivalentProperty, skos:exactMatch, etc.
Core problem is maintaining a set of evolving and partial agreements on semantic models and labels
– Semantic Web rule languages (SWRL, SILK, etc.) can be used to specify mappings and symbolic relations
– Overall semantic cohesion is increased as more data is mapped together
– Hard problem is cost-effectively maintaining semantic models
Supplemental semantics is carried in the free-text web
The Semantic Web in 2009
Hypertext W
eb
(HTML/X
ML over H
TTP)
Seman
tic W
eb
(RDF/O
WL o
ver H
TTP)
Links via URLs
...
Semantic Data Publishers(publication of RDF and SPARQL endpoints)
RDFa
RDFa
26
What does this mean for ML/DM on the Semantic Web?What does this mean for ML/DM on the Semantic Web?
Semantic Web = “The World Wide Database”– Anyone can read, anyone can write, with very loose version control– “Schema-last” data engineering
• The semweb is a database with a constantly-changing Entity/Relation diagram• High data flexibility at the cost of non-optimized queries
– Most current Semantic Web data originates in RDBMSs via RDF and web servers
ML/DM on the Semantic Web is really about extending, cleaning, and making useful the largest and most dynamic database in the world
Opens the door to a more collaborative type of semantic annotation– 2007 idea: SW would be tied to the text web, and so annotation systems were key
• OpenCalais is currently gaining a dominant position• Thompson/Reuters is also getting a vast amount of data about entities
– 2009 idea: SW is (mostly) an issue of translating relational DB semantics in order to define entities, slots and values, followed by schema mapping
• Much easier than NL annotation; data is typically cleaner and semantics are less complex• Easier to add value with a small amount of semantic linking• Network effects are beginning
The Semantic Web is developing into a web-scale social platform for data integration and fusion
27
Before we continue: 2 challenges about the Semantic WebBefore we continue: 2 challenges about the Semantic Web
The Semantic Web is not consistent– Semantic Web data is published by many people who have not read the
specifications• The extant Semantic Web is a mix of formal and informal usages• Consistent usage is only found in islands
– Semantic structures are common and often unmaintained • There is substantial non-RDF semantic data describing the same concepts (XML with
reasonable offline semantic models, microformats, tag data, link data, etc.)• Sindice catalogs >36000 uses of the term “publication” in RDF, tags, RDFa,
microformats); how do you choose?– Semantic Web rules are currently rare, but use is growing
RDF/OWL has very limited expressivity for quantities– N-ary relations are awkward to express
• Reification is only a partial solution• W3C recommends class-based accounts of
relations but this is clumsy– RDF/OWL core semantics are truth-functional,
with no good support for units, probabilities, weights, vectors, uncertainty, etc.
28
Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web
The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure
Semantic Web Evolution to 2009
Semantic Web Transformation in three areas of application– Markets and Companies
The Fourth Generation– A Scalable Revolution
29
First Generation Semantic Web ApplicationsFirst Generation Semantic Web Applications
A really old problem type– Semantics as the keystone technology for unstructured Information Retrieval– Requires powerful NLP and document interpretation systems
• Often also requires powerful semantic representations (e.g., events or causality)• Can use semantic web KR but usually augments it
Market Segments and Players– Enterprise Document Management (EDM) and search systems (Autonomy, Endeca...)– Email autoclassifiers and inbox managers (Xiant)– Web question answering: Hakia, Powerset, TrueKnowledge, Cycorp (inCyc)... – Semantics for Search Result Presentation: Yahoo! SearchMonkey
Some lessons with applying semantic web technology in this space– Still waiting for a compelling match between technical capability and business need
• Statistical methods are surprisingly good for basic relevance scoring (e.g., Latent Semantic Indexing, PageRank, SVMs)
• Vertical applications (esp. medical pharma) have seen some success– Semantic processing is only a small differentiator in these markets – you have to
be great at nonsemantic queries, data import, crawling, storage, performance...
Semantically-Boosted Search and Classification
30
Semantic Web in Search and PresentationSemantic Web in Search and Presentation
As far as I know, no one is yet using Semantic Web data for query interpretation or relevance judgments on the web
– Some issues are scale and trustworthiness (the lesson of the HTML <META> tag)– Powerset used to use DBpedia data in its knowledge base
Yahoo! SearchMonkey (see also Google’s Rich Snippets) use semantic data to enhance presentation
– GreaseMonkey-style presentation reformatting for search snippets– Yahoo’s crawler indexes and interprets RDFa, microformats, delicious data, etc.– Display URL as an enhanced result, with standard or custom presentations– Incentives: “Structured data is the new SEO” (Dries Buytaert, Drupal)
Alex Moskalyuk | FacebookAlex Moskalyuk is on Facebook. Join Facebook to connect with Alex Moskalyuk and others you may know. Facebook gives people the power to share and makes the ...www.facebook.com/.../Alex-Moskalyuk/100000220291330?... - Cached - Similar
31
Second Generation Semantic Web ApplicationsSecond Generation Semantic Web Applications
An only slightly newer problem type– Business exploitation of structured enterprise data (RDBMS, Spreadsheet, ERP data)
• Backwards to Data Management to reduce cost of managing, migrating, integrating• Forwards to Business Process Management
– Support for unified query, analytics, and application access• Includes SOA integration, Enterprise Application Integration
Markets Segments and Players– Gartner estimates that EII software and services alone is $14B/year, with 40% growth
over 5 years (pre-recession numbers, though)– Very complex market space includes EAI, Entity Analytics, MDM, BI, BPM, CPM...– Huge entrenched players (IBM, SAP, Oracle...) and major consulting shops
Some lessons with applying semantic web technology in this space– Fundamental problem is understanding the semantics from legacy systems, not in KR– Pure Semantic technology companies tend to be unsophisticated about the customer – RDF/OWL is typically too weak and must be augmented by rules, quantities, etc.– Raw performance is typically inferior to a well-designed RDBMS– Tends to be an IT sale (not Line-of-Business sale), with attendant cost pressures
Strategic Enterprise Information Technology
32
Layers of Semantic Data in Strategic Enterprise ITLayers of Semantic Data in Strategic Enterprise IT
Semantic data modeling
Automatic data management
EII - Information integration leadership
BI -- Better business intelligence
Financial modeling & intelligence
Semantic process modeling
BAM and CPM w/ predictive analytics
MDM – Master Data Management
Increasing Value of Semantic Data Models
Sales Pitch for Semantic Data Models– Promote flexibility and improvisation in the face of dynamism– Expose business processes as rules, for governance and compliance– Can be driven all the way through the architecture, from SOA to CPM dashboards
This vision has never been proven at scale outside the lab
33
Semantic Web in Strategic Enterprise ITSemantic Web in Strategic Enterprise IT
A few companies are using semantic web technology in Enterprise IT
– Majority of revenue is usually from consulting engagements, not software products or services
– Carefully-curated external ontologies with logic engines (e.g. F-Logic), are used for high-end data integration
– RDF has been useful as a KR language
General Data outsourcing– Do for general data what Bloomberg and Capital-IQ do for financial data
• Collect data, clean it, add value, sell as a service– Semweb-driven variant of Wolfram Alpha
• Ultra-calculator (Mathematica) combined with a massive hand-curated almanac
• Similar to Google’s special computations, but much more powerful• Displays the solution results using templates from Mathematica’s
visualization tools– Semantic web has huge amounts of data, but cleaning, maintaining, and
linking it has been not cost effective• ML/DM techniques on semantic web data are in use at Evri and Freebase
– [Unnamed Startups] use ML/DM and Semantic Web for personalized mobile results (huge opportunities here!)
34
Third Generation Semantic Web ApplicationsThird Generation Semantic Web Applications
A new problem type– “Semantic Web should allow people to have a better online experience” – Alex Iskold,
AdaptiveBlue– Enhance the human activities of content creation, publishing, linking my data to other
data, socializing, forming community, purchasing satisfying things, browsing, etc.– Improve the effectiveness of advertising
Market Segments and Players– Mashup systems and consumer-oriented semantic web services (Drupal, Ning, ...)– Semantic enhancements to blogs and wikis (Zemanta, Faviki, Ontoprise, Radar, ...)– Semantics for Social Networking (MySpace RDF service and microformats, Facebook
RDF models, etc.)
Some lessons with applying semantic web technology in this space– If we don’t have semantic convergence, then semantics isn’t a differentiator– No one really knows the design principles that allow some Web 2.0 sites to be
successful and others to never get traction
Web 2.0 and the Socio-Semantic Web
35
Web Knowledge Bases: Metaweb and FreebaseWeb Knowledge Bases: Metaweb and Freebase
Massive amounts of almanac-style RDF data (Creative Commons license) that is readily available from partners
Social authoring tools and wiki-style consensus combined with controlled reconciliation by Metaweb personnel
36
Semantic Blogger Support: ZemantaSemantic Blogger Support: Zemanta
Automatic link, image, keyword, tag suggestions for blogs and email– Average semi-professional blogger spends ~20 mins adding “decorative” content
Key technology is fast NLP and categorization, plus personalization and mining of aggregated choice streams
Accuracy is guaranteed because users explicitly add the suggestions (feedback)– Zemanta inserts RDFa and standard semantic markup in the background– Includes user specified friends/feeds/photos/etc as well as standard ones
37
Third Generation Example: Semantic WikisThird Generation Example: Semantic Wikis
Wikis are tools for Publication and Consensus
MediaWiki (software for Wikipedia, Wikimedia, Wikibooks, etc.)– Most successful Wiki software
• High performance: 10K pages/sec served, scalability demonstrated• LAMP web server architecture, GPL license
– Publication: simple distributed authoring model• Wikipedia: >2.9M English articles, 400K Russian, >2.5M images, #8 Alexa traffic rank
– Consensus achieved by global editing and rollback• Fixpoint hypothesis, although consensus is not static• Gardener/admin role for contentious cases
Semantic Wikis apply the wiki idea to structured (typically RDFS) information– Authoring includes instances, data types, vocabularies, classes– Natural language text used for explanations– Automatic list generation from structured data, basic analytics, database imports– Reuse of wiki knowledge– See e.g., http://smwforum.ontoprise.com for one powerful semantic wiki
Semantic Wiki Hypotheses:(1) Significant interesting Semantic Data can be collected cheaply
(2) Wiki mechanisms can be used to maintain consensus on vocabularies and classes
39
Semantic MediaWiki and RulesSemantic MediaWiki and Rules
SMW Simple Rules– Implement basic rules– Automates some authoring
and data maintenance– Templates to help users
create rules– Publishable in SWRL
SMW Importers/Exporters– Web Services– Vocabulary and data files– MS Office and Excel Docs– Semantic Web representations
40
More complex: National Cancer Institute (NCI) ThesaurusMore complex: National Cancer Institute (NCI) Thesaurus
41
Using Semantic Web data from Wikis: Project HaloUsing Semantic Web data from Wikis: Project Halo
Project Halo: SME-based authoring for scientific question-answering systems
Project Halo Goal: To determine whether tools can be built to facilitate robust knowledge formulation, query and evaluation by domain experts, with ever-decreasing reliance on knowledge engineers
– Can SMEs build robust question-answering systems that demonstrate excellent coverage of a given syllabus, the ability to answer novel questions, and produce readable domain appropriate justifications using reasonable computational resources?
– Will SMEs be capable of posing questions and complex problems to these systems?
– Do these systems address key failure, scalability and cost issues encountered in expert systems?
Experimental Scope: Selected portions of the AP syllabi for chemistry, biology and physics
– Example: Balance the following reactions, and indicate whether they are examples of combustion, decomposition, or combination
(a) C4H10 + O2 CO2 + H2O(b) KClO3 KCl + O2
(c) CH3CH2OH + O2 CO2 + H2O(d) P4 + O2 P2O5
(e) N2O5 + H2O HNO3
42
AURA – Automated User-centered Reasoning and Acquisition SystemAURA – Automated User-centered Reasoning and Acquisition System
Aura is a tool to help users formalize high school-level scientific knowledge Aura can then reason with that knowledge Users can ask English questions and get answers and explanations
43
Symbiosis Between Aura and Semantic WikisSymbiosis Between Aura and Semantic Wikis
Classical Knowledge Engineering– Expressive knowledge representation– Sophisticated testing and debugging
Knowledge Engineering in Aura– Acquires knowledge for deductive Q/A that
can be used for answering AP questions in sciences
• Uses a DL-style class taxonomy, and logic programming style rules with many extensions
– Requires substantial hours of training for knowledge formulation
Semantic Web Knowledge Engineering– Simple knowledge representation– Quantity at some expense of quality
Knowledge Engineering in SMW+– Tool for online authoring and consensus-
building around semantic web content– Captures knowledge at the level of RDFS– Collective editing for quality control– Gardening appropriate for scientific
knowledge– Almost walk up and use system
Can we use the Semantic Media Wiki to capture knowledge that could be used for Q/A in AURA?
– Factual knowledge (e.g., atomic number for carbon is 6, solubility constraints, etc.)– Taxonomic knowledge (e.g., eukaryotic and prokaryotic are two types of cells)
Knowledge creation would be faster, distributed, and cheaper
45
Source Text of Article on Organelle in Semantic MediaWikiSource Text of Article on Organelle in Semantic MediaWiki
46
Fact Box With the Annotations in Semantic MediaWikiFact Box With the Annotations in Semantic MediaWiki
47
Ontology Browser for Biology Data in Semantic MediaWIkiOntology Browser for Biology Data in Semantic MediaWIki
48
Aura/Semantic MediaWiki Use CaseAura/Semantic MediaWiki Use Case
Semantic MediaWiki includes relevant knowledge
Aura knowledge formulation engineer (KFE) searches for knowledge during knowledge formulation
The KFE notices useful information in SMW+ The KFE maps the knowledge into Aura
– Uses a derivative of Ontomap– Mapping suggestions from FOAM (Framework for Ontology
Alignment and Mapping) from Univ. Karlsruhe AIFB– ETL workflow
The knowledge is mapped into Aura and is available to the question-answering algorithms
51
Aura User Maps Wiki Knowledge into Aura KB (with FoaM proposals)Aura User Maps Wiki Knowledge into Aura KB (with FoaM proposals)
52
Wiki Knowledge Available in Aura for Question-AnsweringWiki Knowledge Available in Aura for Question-Answering
53
Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web
The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure
Semantic Web Evolution to 2009
Semantic Web Transformation in three areas of application– Markets and Companies
The Fourth Generation– A Scalable Revolution
54
Fourth Generation Semantic WebFourth Generation Semantic Web
A problem of scale– The number of Internet devices is starting to explode (again!)
• Mobile devices, embedded systems, and sensors• In 2008, Google reported 1 trillion unique URLs, ~200M web sites• Total 2008 web page estimates are ~30 billion (significant variation in these estimates)
– Gartner (May 2007, Report G00148725)• "By 2012, 70% of public Web pages will have some level of semantic markup,
20% will use more extensive Semantic Web-based ontologies”– Can Semantic Web technologies work at web scales?
• Sindice (www.sindice.com) is now indexing >10B triples/microformats over 65M pages• 20% of 30 billion pages @ 1000 triples per page = 6 trillion triples • 30 billion and 1000 are underestimates
Some lessons with applying semantic web technology in this space– Can we exploit billions of triples, microformats, ontologies, rules, and services?
• Are Semantic Web systems deployable on parallel web architectures, friendly to out-of-core algorithms, and compatible with giant databases?
– Is there a scaling limit to useful, profitable Semantic Web implementations?
The Web of Data meets the Future Internet
Material from Frank van Harmelen,Vrije Universiteit, Amsterdam
55
Fourth Generation Example: DBpediaFourth Generation Example: DBpedia
Mine Wikipedia for assertions– Mainly from Wikipedia Factboxes
• ~23M triples– Category assertions
DBpedia 3.3 dataset (May 09 Wikipedia)– ~2.6M things, ~274M triples
• 213K persons, 328K places, 57K music albums, 36K films, 609K links to images, 3.2M links to relevant external web pages, 4.9M links into RDF datasets
– Classifications via Wikipedia categories, YAGO, and WordNet synsets
– One of the largest broad knowledge bases in the world
Simple queries over extracted data– “Things near the Eiffel Tower”– “The official websites of companies with more
than 50000 employees” – “Soccer players from team with stadium with
>40000 seats, who were born in a country with more than 10M inhabitants”
57
Fourth Generation Example: Linked Open DataFourth Generation Example: Linked Open Data
Build an RDF Data Commons– Create a single, simple set of
rules for publishing and linking RDF data
• Use URIs as names for things
• Use HTTP URIs so that people can look up those names
• When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
• Include links to other URIs, so that they can discover more things
– Set RDF links between data items from different data sources
May 2009 LOD dataset– ~4.7B triples, and ~140M RDF
interlinks, and growing faster than I can track
– Database linkage means that LOD will soon be impossible to count except by order of magnitude
58
May 2007September 2007
March 2009September 2008
The Growing Web of Linked DataThe Growing Web of Linked Data
59
Topic Distribution in the March 2009 Linked DatasetsTopic Distribution in the March 2009 Linked Datasets
Life Sciences
Publications
Online ActivitiesMusic
Geography
Cross-Domain
4,500,000,000 triples
180,000,000 data links
60
Common Tag SpecificationCommon Tag Specification
Instead of tagging with language terms, tag with terms + RDFa– Distinguish between “jaguar” the animal, the car company, and the operating system – Provides metadata for each Common Tag and relations to other Common Tags
• The Barack Obama Common Tag includes <employment, President of the United States> and <spouse, Michelle Obama>
More discoverable, more connected, more web-like, more useful
61
Semantic Dynamism at Web ScaleSemantic Dynamism at Web Scale
Semantics are always changing– Per minute, there are:
• 100 edits in Wikipedia (144K/day)• 200 tags in del.icio.us (288K/day)• 270 image uploads to flickr (388K/day)• 1100 blog entries (1.6M/day)
– Will the Semantic Web be less dynamic?
There is no “right ontology”– Ontologies are abstractions
• Different applications lead to different ontologies• Ontology authors make design choices all the time
– Google Base: >250K schemas– “Ontologies = Politics”
Intentionally false material (Spam)– Lesson of the HTML <META> tag Material from Denny Vrandečić, AIFB
How Do We Use this Dynamic Data for Decision Support?
62
Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider
EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije
University Amsterdam, plus 12 partners
Goals of LarKC – Scaling to Infinity– A platform for massive distributed
incomplete reasoning – Remove the scalability barriers of
currently existing reasoning systems for the Semantic Web
– Combine reasoning/retrieval and search– Want to trade off answer quality and
answer timeliness Reasoning pipeline
– Heavy emphasis on probability, decision theory, anytime algorithms
– Plugin architecture, with sampling– Explicit cost models
Public releases of LarKC platform, with APIs
63
Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider
EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije
University Amsterdam, plus 12 partners
Goals of LarKC – Scaling to Infinity– A platform for massive distributed
incomplete reasoning – Remove the scalability barriers of
currently existing reasoning systems for the Semantic Web.
– Combine reasoning/retrieval and search– Want to trade off answer quality and
answer timeliness
Reasoning pipeline– Heavy emphasis on probability,
decision theory, anytime algorithms– Plugin architecture, with sampling– Explicit cost models
Encourage participation through Thinking@home
Public releases of LarKC platform, with APIs
Exploiting web-scale semantics is the new frontier– Generations 1 and 2 used web resources to support classical KR approaches– Generation 3 (social semantic web) leverages web social patterns for KR
Fourth generation applications address general web-scale KR
64
Reasoning pipeline– Heavy emphasis on probability,
decision theory, anytime algorithms– Plugin architecture, with sampling– Explicit cost models
Encourage participation through Thinking@home
Public releases of LarKC platform, with APIs
Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider
EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije
University Amsterdam, plus 12 partners
Goals of LarKC – Scaling to Infinity– A platform for massive distributed
incomplete reasoning – Remove the scalability barriers of
currently existing reasoning systems for the Semantic Web.
– Combine reasoning/retrieval and search– Want to trade off answer quality and
answer timeliness
The real money in semantics will be made in apps/tools that exploit web-scale data– The cost of semantic data creation is going to zero– The size of semantic data is going to web-scale
At this scale, logic-based techniques break down ML/DM are key technologies to allow this exploitation to happen
65
Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web
In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no
web involvement– A few early conferences (ISWC, SemTech) and session tracks
66
Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web
In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no
web involvement– A few early conferences (ISWC, SemTech) and session tracks
Now in 2009...– The Semantic Web is the most exciting thing happening on the web– RDF assertions scaling into the billions, with little to no programmatic control– Search engines are starting to develop products– Bestbuy is publishing products, store descriptions, and hours in RDFa
67
Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web
In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no
web involvement– A few early conferences (ISWC, SemTech) and session tracks
Now in 2009...– The Semantic Web is the most exciting thing happening on the web– RDF assertions scaling into the billions, with little to no programmatic control– Search engines are starting to develop products– Bestbuy is publishing products, store descriptions, and hours in RDFa
ML/DM is at the center of the needed techniques– Linked Data provides the necessary scale – Logic-based techniques are meeting their performance limits– The infrastructure to build non-tools-based companies is in place– This is what a revolution feels like!!
68
(Thank You)
Disclaimer: The preceding slides represent the views of the author only. All brands, logos and products are trademarks or registered trademarks of their respective companies.
Hvala