The Growing Semantic Web Dr. Mark Greaves Vulcan Inc. [email protected] © 2009 Vulcan Inc.

69
The Growing Semantic Web Dr. Mark Greaves Vulcan Inc. [email protected] © 2009 Vulcan Inc

Transcript of The Growing Semantic Web Dr. Mark Greaves Vulcan Inc. [email protected] © 2009 Vulcan Inc.

The Growing Semantic Web

Dr. Mark GreavesVulcan Inc.

[email protected]© 2009 Vulcan Inc.

2

What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers

The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now

3

What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers

The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now

The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (Lite, DL, Full dialects), OWL2 (EL, QL, RL profiles)– Weaker than First Order Logic, more easily authorable, decidable, tractable in

most cases using tableaux provers

4

What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers

The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now

The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (3 versions), OWL2– Weaker than First Order Logic, more easily authorable, decidable, tractable in

most cases using tableaux provers

The largest formal knowledge base on Earth– Also the messiest formal knowledge base on Earth

5

What is the Semantic Web? 4 AnswersWhat is the Semantic Web? 4 Answers

The Web of Data– A fully distributed web-based system to publish logical assertions – A way to link to someone else’s data, augment it, and add to it– Democratic, crowd-based, scalable knowledge engineering– The hottest area of Artificial Intelligence right now

The set of W3C standards for simple logic languages on the web– RDF, RDFS, OWL (3 versions), OWL2– Weaker than First Order Logic, more easily authorable, decidable, tractable in

most cases using tableaux provers

The largest formal knowledge base on Earth– Also the messiest formal knowledge base on Earth

A revolution in the way we think of data, crowds, and schemas– Scaling at last comes to symbolic AI– Massive, partial, participatory, logically weak, dynamic, schema-last– A way to democratize and curate knowledge bases and knowledge systems

6

Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web

The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure

Semantic Web Evolution to 2009

Semantic Web Transformation in three areas of application– Markets and Companies

The Fourth Generation– A Scalable Revolution

7

Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web

The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure

Semantic Web Evolution to 2009

Semantic Web Transformation in three areas of application– Markets and Companies

The Fourth Generation– A Scalable Revolution

8

The Roots of the Semantic WebThe Roots of the Semantic Web

Semantic technology has been a distinct research field for decades– Symbolic Logic (from Russell and Frege)– Knowledge Representation Systems in AI

• Semantic Networks (Bill Woods, 1975)• US and EU R&D programs in information integration (80s and 90s)• Development of simple tractable “description logics” for classification• Conceptual Graphs and Formal Concept Analysis

– Relational Algebras and Schemas in Database Systems

Library Science (classifications, thesauri, taxonomies)

So, What Sparked the Semantic Web?

What’s new was the Web!– The material needed to answer almost any question is somewhere on the web– A massive infrastructure of data servers, protocols, authentication systems, presentation languages, and

thin clients that can be leveraged– A way around needing the “big data warehouse”

9

The Beginnings of the US Semantic Web: DARPA’s DAML ProgramThe Beginnings of the US Semantic Web: DARPA’s DAML Program

Solution:Augment the web to link machine-readable knowledge to web pages

Extend RDF with Description Logic Extensibility via frame-based language design Create the first fully distributed web-scale knowledge base

out of networks of hyperlinked facts and data

Approach:Design a family of new web languages

Basic knowledge representation (OWL)Reasoning (SWRL, OWL/P, OWL/T)Process representation (OWL/S)

Build definition and markup tools Link new knowledge to existing web page elements

Test design approach with operational pilots in US Government

Partner with parallel EU efforts to standardize the new web languages

People use implicit knowledge to reason with web pages

People use implicit knowledge to reason with web pages

Computers require explicit knowledge to reason with web pages

Computers require explicit knowledge to reason with web pages

Existing Web

(HTML/XML over HTTP)

Semantic Web

(OWL over HTTP)

Links via URLs

Problem: Computers cannot process most of the information stored on web pages

10

What is RDF?What is RDF?

RDF defines entities and relations for an area of knowledge – Assertions are triples of (resource, property, resource) or (resource, property, value)– Assertions are precise enough to be interpreted set-theoretically by machines– Graphically represented as a directed concept graph with typed links– Simple RDF semantics enables reuse of domain knowledge via class hierarchies

RDF is web data language with web-friendly syntax– All elements in triples are either primitive values or URIs– RDF Schema adds classes, domains, ranges, and inheritance of domains and ranges

typeOf

isa

VIN1234VIN1234

hasManufacturer

isa

isa

isa

isaisa

VehicleVehicle

TruckTruck

196419641964 Mustang1964 Mustang

FordFord

MustangMustang

CarCar

ResourcePropertyProperty Value

VehicleVehicleManufacturerManufacturer

YearYear

Joe’s MustangJoe’s Mustang

manufacturesmanufac

turedIn

manufacturedIn

hasVIN

hasManufacturer

Graphic from Mary Pulvermacher, MITRE

11

What Does OWL Add?What Does OWL Add?

Description Logic Semantics– Relations between classes

• Equivalent class (e.g., US_President and PrincipalResidentofWhiteHouse)• Disjoint class (e.g., Male and Female), subClassOf

– Derived classes (intersectionOf, unionOf, complementOf)– Property characteristics (inverseOf, transitive, symmetric, subPropertyOf, equivProperty, etc.)– Range and Cardinality constraints (e.g., birthMother has exactly one value, which is a person)

Generic (logically-grounded) ability to combine assertions into inferences according to the rules of the DL (for OWL, SHIQ )

Given... And...

spouseOf

Can conclude...

SymmetricPropertySymmetricPropertyJimJimJimJim

MaryMaryMaryMaryspouseOfspouseOfspouseOfspouseOf

rdf:type spouseOf

MaryMaryMaryMary

JimJimJimJim

Graphic from Mary Pulvermacher, MITRE

12

From XML to OWL Increasingly Expressive Options for Web Data MarkupFrom XML to OWL Increasingly Expressive Options for Web Data Markup

XMLIssue addressed: how to express data in text?XML Solution: “wrap” data within start tag/end tags, and empower users to create their own tagsExample:

<altitude>1500 feet</altitude>

Start tag End tag

altitude element

XML Schema (XMLS)

Issue addressed: how should the type structure of the data be expressed?XML Schema Solution: XML templatesExample:

<element name=“altitude” type=“integer”/>

altitude is constrained to be an integer

RDF and RDF SchemaIssue addressed: how can data support statements?RDF Solution: use a subject, property, object patternExample:

<Fighter rdf:ID=“F16”> <altitude>50,000 feet</altitude> <builder>Lockheed</builder></Fighter>

OWL

Issue addressed: how to express data semantics?OWL Solution: use inheritance and a description logic to express restrictions and describe entailmentExample:

<owl:Class rdf:ID=“Fighter”> <rdfs:subClassOf rdf:resource=“#Aircraft”/></owl:Class>

Unconstrained text string

Properties

Instance of the Class

Fighter class: a Fighter inherits properties type of Aircraft,

Values

HTML XML & XMLS RDF OWL

13

DAML Program ElementsDAML Program Elements

Web Ontology Language (OWL) (2/10/04)– Enables knowledge representation and

tractable inference in a web standard format– Based on Description Logics and RDF– Done by the W3C WebOnt Working Group

OWL Reasoning Languages– SWRL: Supports business rules and linking

between OWL ontologies (based on RuleML)– OWL/P Proof Language: Allows software

components to exchange chains of reasoning– OWL/T Trust Language: Represents trust that

OWL and SWRL inferences are valid

Semantic Web Services (OWL/S)– Allows discovery, matching, and execution of

web services based on action descriptions– Unifies semantic data models (OWL) with

process models (Agent) and shows how to dynamically compose web services

OWL Tools– www.semwebcentral.org and www.daml.org

Several US Govt pilots and prototypes

Completed standards process

Started standards process

Unfinished

SWRL: RulesOWL/P: Proof

SWRL: RulesOWL/P: Proof

OWL/S:Semantic Web

Services

OWL/S:Semantic Web

Services

Web OntologyLanguage (OWL)

Web OntologyLanguage (OWL)

OWL/T:Trust

OWL/T:Trust

DAML Program Technical Flow

Each DAML Program Element includesspecifications, software tools,

coordination teams, and use cases

~$45M over 5 years (FY01 – FY05)

14

The Semantic Web in 2009The Semantic Web in 2009

CuttingEdge

Mature

StillResearch

“The Famous Semantic Web Technology Stack”

15

The Semantic Web in 2009The Semantic Web in 2009

CommercialCuttingEdge

Mature

Active Researchand StandardsActivity

“The Famous Semantic Web Technology Stack”

16

CommercialCuttingEdge

The Semantic Web in 2009The Semantic Web in 2009

Mature

Active Researchand StandardsActivity

“The Famous Semantic Web Technology Stack”

17

The Semantic Web in 2009The Semantic Web in 2009

Mature

Active Researchand StandardsActivity

CommercialCuttingEdge

“The Famous Semantic Web Technology Stack”

18

Active Researchand StandardsActivity

Completing the Semantic Web PictureCompleting the Semantic Web Picture

Mature

Other Technologies Impact the Semantic Web

More OntologiesTag SystemsMicroformats

Social Authorship

Combined RDF/OWL and

RDBMS Systems

Scalable Reasoning Systems

A Huge Base of RDF data

CommercialCuttingEdge

Description LP, SWRL, SILK...

19

Beyond RDF and OWL: 2009 Semantic Web InfrastructureBeyond RDF and OWL: 2009 Semantic Web Infrastructure

Markup Languages– HTML-friendly markup dialects:

Microformats and RDFa– OWL 2 is a Candidate Recommendation

Triplestores and SPARQL Servers– Stores for 1B triples now available, though

with caveats around write performance– Commercial: AllegroGraph, Virtuoso,

BigOWLIM, Oracle 11g Semantic Technologies...

– Open Source: 4Store, Sesame, Redland...– Next step is parallel web delivery

architectures

Entity Name Service (Okkam, DBpedia)

Semantic Web Reasoners– Commercial: Oracle 11g RDFS/OWL

engine, Ontobroker, Ontotext, RacerPro– Open Source: Pellet, FaCT++...– RIF is at W3C Last Call status

Vocabularies and Design Tools– Ontologies: Dublin Core, FOAF, SIOC...– OpenSource: Protégé, SWOOP...– Commercial: TopBraid Composer, Schema

Semweb Data Generation– RDF / RDBMS front-ends – Entity extractors and taggers (OpenCalais)– Zemanta-type author’s assistants– Semantic wikis

Semweb Data Exploitation– Semweb search engines (Sindice, Watson,

Falcon...)– Yahoo SearchMonkey / Google Rich

Snippets– Browser extensions and facets

Visualization Tools– Simile Project (http://simile.mit.edu/) – Several Commercial Companies

User-layer ToolsServer Infrastructure

20

State of Semantic Web Work in the USState of Semantic Web Work in the US

DAML finished in 2005, with no followons– NIH (Protégé, NCBO), NSF, some small DoD funding– PAL/CALO funded broader semantic/AI work

But... leading-edge Venture Capital moved in– Vulcan, Crosslink, In-Q-Tel, Benchmark, Intel Capital...

An emerging commercialization ecosystem– Startup/Small: Radar, Metaweb, Evri, AdaptiveBlue...– Midsized: Metatomix, Dow Jones, Reuters/OpenCalais, Franz...– Large: Yahoo!, Google, Oracle, IBM, HP, Microsoft...– Recession is taking its toll on everybody

Emphasis is mostly Semantic dimension of Semantic Web– That was where the money and customers were– RDBMS scale and orientation, powerful analytics for Business Intelligence– Centralized workflows for ontology definition and management– Use cases surrounding corporate data integration and document markup

21

Semantic Web Work in the EUSemantic Web Work in the EU

Continuing Large Public-Sector Investments– Framework 6 (2002-6) – More than €100M in

several different programs– Framework 7 (2007-13) – ~€1B/year for information and

communications technologies• Semantics is more present as a general systems technology• Future Internet and Digital Libraries thrusts

Two Dedicated Multi-site Semweb R&D Institutes– DERI: 100+ people and the world leader in research– Semantic Technology Institute International– A strong and growing cadre of graduate students

Emphasis on the Social and Web Dimensions of the Semantic Web– Web-scale Linked Data, social networks, simple scalable imperfect inference– Ontology and data dynamics, imperfections, versioning– Semantically-boosted collaboration with limited knowledge engineer involvement

Clear R&D leadership but lags in Semweb commercialization

22

Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web

The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure

Semantic Web Evolution to 2009

Semantic Web Transformation in three areas of application– Markets and Companies

The Fourth Generation– A Scalable Revolution

23

Evolving Conceptions for the Semantic WebEvolving Conceptions for the Semantic Web

Initial Semantic Web Conception*

* By most people but not Tim Berners-Lee

Semantic markup would be tightly associated with individual web pages

– “Translate the Web for machines”– Markup lives in associated RDF/XML docs

Core problem is labeling free-text web pages with a (pre-defined) ontology markup vocabulary

– Entity extraction and ontology-aware NLP– Document segmentation technologies– Manual annotation tools

Need an all-encompassing ontology or set of logically compatible ontologies

– Agreement was always elusive

Need to achieve scale and consistency in semantic page markup

– Manual annotation has incentive, quality/ consistency, and cost/benefit issues

Hypertext W

eb

(HTML/X

ML over H

TTP)

Seman

tic W

eb

(RDF/O

WL o

ver H

TTP)

Links via URLs

24

The Situation for Semantic Web and ML/DM in 2007The Situation for Semantic Web and ML/DM in 2007

Problem: Semantic Web will never see network effects until a sufficiently large amount of semantic markup/metadata is available

Solution: ML/DM techniques can generate consistent, cost-effective semantic markup, either semi-automatically or fully automatically

– Data/instance level: ML/DM derives semantic metadata for web resources (URIs)• Segmentation of heterogeneous pages (simple web pages to blogs/forums/Amazon...)• IE-based and ontology-driven annotation

– Standard text-oriented techniques (catgorization, clustering, partial parsing, relation extraction, topic extraction, fully-/semi-/un-supervised learning, etc.)

– Schema induction from the web (see Google Squared)– Leverage links, tags, deep web query results & wrapper induction, MIME-types, – Stylized and ungrammatical language use including telegraphic speech

• Content cleaning and normalization• Duplicate entity recognition and merging (currently >6000 “George Bush” mentions...)

– Model level: ML/DM generates semantic annotation schema• Derive RDF/OWL vocabularies, taxonomies, lexical nets, ontologies from web data• Propose ontology alignments, restrictions, and mapping rules

Pretty Standard ML/DM Stuff: Find some data, clean it, annotate it, categorize/cluster it, and map it

25

Evolving Conceptions for the Semantic Web (Part 2)Evolving Conceptions for the Semantic Web (Part 2)

The Web is a publishing platform for formal knowledge as well as pages

– RDFa allows mixed webtext and RDF– RDF/OWL documents can be published by

web-connected databases– Huge numbers of knowledge publishers– Linkage via owl:sameAs,

owl:equivalentClass, owl:equivalentProperty, skos:exactMatch, etc.

Core problem is maintaining a set of evolving and partial agreements on semantic models and labels

– Semantic Web rule languages (SWRL, SILK, etc.) can be used to specify mappings and symbolic relations

– Overall semantic cohesion is increased as more data is mapped together

– Hard problem is cost-effectively maintaining semantic models

Supplemental semantics is carried in the free-text web

The Semantic Web in 2009

Hypertext W

eb

(HTML/X

ML over H

TTP)

Seman

tic W

eb

(RDF/O

WL o

ver H

TTP)

Links via URLs

...

Semantic Data Publishers(publication of RDF and SPARQL endpoints)

RDFa

RDFa

26

What does this mean for ML/DM on the Semantic Web?What does this mean for ML/DM on the Semantic Web?

Semantic Web = “The World Wide Database”– Anyone can read, anyone can write, with very loose version control– “Schema-last” data engineering

• The semweb is a database with a constantly-changing Entity/Relation diagram• High data flexibility at the cost of non-optimized queries

– Most current Semantic Web data originates in RDBMSs via RDF and web servers

ML/DM on the Semantic Web is really about extending, cleaning, and making useful the largest and most dynamic database in the world

Opens the door to a more collaborative type of semantic annotation– 2007 idea: SW would be tied to the text web, and so annotation systems were key

• OpenCalais is currently gaining a dominant position• Thompson/Reuters is also getting a vast amount of data about entities

– 2009 idea: SW is (mostly) an issue of translating relational DB semantics in order to define entities, slots and values, followed by schema mapping

• Much easier than NL annotation; data is typically cleaner and semantics are less complex• Easier to add value with a small amount of semantic linking• Network effects are beginning

The Semantic Web is developing into a web-scale social platform for data integration and fusion

27

Before we continue: 2 challenges about the Semantic WebBefore we continue: 2 challenges about the Semantic Web

The Semantic Web is not consistent– Semantic Web data is published by many people who have not read the

specifications• The extant Semantic Web is a mix of formal and informal usages• Consistent usage is only found in islands

– Semantic structures are common and often unmaintained • There is substantial non-RDF semantic data describing the same concepts (XML with

reasonable offline semantic models, microformats, tag data, link data, etc.)• Sindice catalogs >36000 uses of the term “publication” in RDF, tags, RDFa,

microformats); how do you choose?– Semantic Web rules are currently rare, but use is growing

RDF/OWL has very limited expressivity for quantities– N-ary relations are awkward to express

• Reification is only a partial solution• W3C recommends class-based accounts of

relations but this is clumsy– RDF/OWL core semantics are truth-functional,

with no good support for units, probabilities, weights, vectors, uncertainty, etc.

28

Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web

The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure

Semantic Web Evolution to 2009

Semantic Web Transformation in three areas of application– Markets and Companies

The Fourth Generation– A Scalable Revolution

29

First Generation Semantic Web ApplicationsFirst Generation Semantic Web Applications

A really old problem type– Semantics as the keystone technology for unstructured Information Retrieval– Requires powerful NLP and document interpretation systems

• Often also requires powerful semantic representations (e.g., events or causality)• Can use semantic web KR but usually augments it

Market Segments and Players– Enterprise Document Management (EDM) and search systems (Autonomy, Endeca...)– Email autoclassifiers and inbox managers (Xiant)– Web question answering: Hakia, Powerset, TrueKnowledge, Cycorp (inCyc)... – Semantics for Search Result Presentation: Yahoo! SearchMonkey

Some lessons with applying semantic web technology in this space– Still waiting for a compelling match between technical capability and business need

• Statistical methods are surprisingly good for basic relevance scoring (e.g., Latent Semantic Indexing, PageRank, SVMs)

• Vertical applications (esp. medical pharma) have seen some success– Semantic processing is only a small differentiator in these markets – you have to

be great at nonsemantic queries, data import, crawling, storage, performance...

Semantically-Boosted Search and Classification

30

Semantic Web in Search and PresentationSemantic Web in Search and Presentation

As far as I know, no one is yet using Semantic Web data for query interpretation or relevance judgments on the web

– Some issues are scale and trustworthiness (the lesson of the HTML <META> tag)– Powerset used to use DBpedia data in its knowledge base

Yahoo! SearchMonkey (see also Google’s Rich Snippets) use semantic data to enhance presentation

– GreaseMonkey-style presentation reformatting for search snippets– Yahoo’s crawler indexes and interprets RDFa, microformats, delicious data, etc.– Display URL as an enhanced result, with standard or custom presentations– Incentives: “Structured data is the new SEO” (Dries Buytaert, Drupal)

Alex Moskalyuk | FacebookAlex Moskalyuk is on Facebook. Join Facebook to connect with Alex Moskalyuk and others you may know. Facebook gives people the power to share and makes the ...www.facebook.com/.../Alex-Moskalyuk/100000220291330?... - Cached - Similar

31

Second Generation Semantic Web ApplicationsSecond Generation Semantic Web Applications

An only slightly newer problem type– Business exploitation of structured enterprise data (RDBMS, Spreadsheet, ERP data)

• Backwards to Data Management to reduce cost of managing, migrating, integrating• Forwards to Business Process Management

– Support for unified query, analytics, and application access• Includes SOA integration, Enterprise Application Integration

Markets Segments and Players– Gartner estimates that EII software and services alone is $14B/year, with 40% growth

over 5 years (pre-recession numbers, though)– Very complex market space includes EAI, Entity Analytics, MDM, BI, BPM, CPM...– Huge entrenched players (IBM, SAP, Oracle...) and major consulting shops

Some lessons with applying semantic web technology in this space– Fundamental problem is understanding the semantics from legacy systems, not in KR– Pure Semantic technology companies tend to be unsophisticated about the customer – RDF/OWL is typically too weak and must be augmented by rules, quantities, etc.– Raw performance is typically inferior to a well-designed RDBMS– Tends to be an IT sale (not Line-of-Business sale), with attendant cost pressures

Strategic Enterprise Information Technology

32

Layers of Semantic Data in Strategic Enterprise ITLayers of Semantic Data in Strategic Enterprise IT

Semantic data modeling

Automatic data management

EII - Information integration leadership

BI -- Better business intelligence

Financial modeling & intelligence

Semantic process modeling

BAM and CPM w/ predictive analytics

MDM – Master Data Management

Increasing Value of Semantic Data Models

Sales Pitch for Semantic Data Models– Promote flexibility and improvisation in the face of dynamism– Expose business processes as rules, for governance and compliance– Can be driven all the way through the architecture, from SOA to CPM dashboards

This vision has never been proven at scale outside the lab

33

Semantic Web in Strategic Enterprise ITSemantic Web in Strategic Enterprise IT

A few companies are using semantic web technology in Enterprise IT

– Majority of revenue is usually from consulting engagements, not software products or services

– Carefully-curated external ontologies with logic engines (e.g. F-Logic), are used for high-end data integration

– RDF has been useful as a KR language

General Data outsourcing– Do for general data what Bloomberg and Capital-IQ do for financial data

• Collect data, clean it, add value, sell as a service– Semweb-driven variant of Wolfram Alpha

• Ultra-calculator (Mathematica) combined with a massive hand-curated almanac

• Similar to Google’s special computations, but much more powerful• Displays the solution results using templates from Mathematica’s

visualization tools– Semantic web has huge amounts of data, but cleaning, maintaining, and

linking it has been not cost effective• ML/DM techniques on semantic web data are in use at Evri and Freebase

– [Unnamed Startups] use ML/DM and Semantic Web for personalized mobile results (huge opportunities here!)

34

Third Generation Semantic Web ApplicationsThird Generation Semantic Web Applications

A new problem type– “Semantic Web should allow people to have a better online experience” – Alex Iskold,

AdaptiveBlue– Enhance the human activities of content creation, publishing, linking my data to other

data, socializing, forming community, purchasing satisfying things, browsing, etc.– Improve the effectiveness of advertising

Market Segments and Players– Mashup systems and consumer-oriented semantic web services (Drupal, Ning, ...)– Semantic enhancements to blogs and wikis (Zemanta, Faviki, Ontoprise, Radar, ...)– Semantics for Social Networking (MySpace RDF service and microformats, Facebook

RDF models, etc.)

Some lessons with applying semantic web technology in this space– If we don’t have semantic convergence, then semantics isn’t a differentiator– No one really knows the design principles that allow some Web 2.0 sites to be

successful and others to never get traction

Web 2.0 and the Socio-Semantic Web

35

Web Knowledge Bases: Metaweb and FreebaseWeb Knowledge Bases: Metaweb and Freebase

Massive amounts of almanac-style RDF data (Creative Commons license) that is readily available from partners

Social authoring tools and wiki-style consensus combined with controlled reconciliation by Metaweb personnel

36

Semantic Blogger Support: ZemantaSemantic Blogger Support: Zemanta

Automatic link, image, keyword, tag suggestions for blogs and email– Average semi-professional blogger spends ~20 mins adding “decorative” content

Key technology is fast NLP and categorization, plus personalization and mining of aggregated choice streams

Accuracy is guaranteed because users explicitly add the suggestions (feedback)– Zemanta inserts RDFa and standard semantic markup in the background– Includes user specified friends/feeds/photos/etc as well as standard ones

37

Third Generation Example: Semantic WikisThird Generation Example: Semantic Wikis

Wikis are tools for Publication and Consensus

MediaWiki (software for Wikipedia, Wikimedia, Wikibooks, etc.)– Most successful Wiki software

• High performance: 10K pages/sec served, scalability demonstrated• LAMP web server architecture, GPL license

– Publication: simple distributed authoring model• Wikipedia: >2.9M English articles, 400K Russian, >2.5M images, #8 Alexa traffic rank

– Consensus achieved by global editing and rollback• Fixpoint hypothesis, although consensus is not static• Gardener/admin role for contentious cases

Semantic Wikis apply the wiki idea to structured (typically RDFS) information– Authoring includes instances, data types, vocabularies, classes– Natural language text used for explanations– Automatic list generation from structured data, basic analytics, database imports– Reuse of wiki knowledge– See e.g., http://smwforum.ontoprise.com for one powerful semantic wiki

Semantic Wiki Hypotheses:(1) Significant interesting Semantic Data can be collected cheaply

(2) Wiki mechanisms can be used to maintain consensus on vocabularies and classes

38

An Example of (Manually Annotated) Semantic WikiAn Example of (Manually Annotated) Semantic Wiki

39

Semantic MediaWiki and RulesSemantic MediaWiki and Rules

SMW Simple Rules– Implement basic rules– Automates some authoring

and data maintenance– Templates to help users

create rules– Publishable in SWRL

SMW Importers/Exporters– Web Services– Vocabulary and data files– MS Office and Excel Docs– Semantic Web representations

40

More complex: National Cancer Institute (NCI) ThesaurusMore complex: National Cancer Institute (NCI) Thesaurus

41

Using Semantic Web data from Wikis: Project HaloUsing Semantic Web data from Wikis: Project Halo

Project Halo: SME-based authoring for scientific question-answering systems

Project Halo Goal: To determine whether tools can be built to facilitate robust knowledge formulation, query and evaluation by domain experts, with ever-decreasing reliance on knowledge engineers

– Can SMEs build robust question-answering systems that demonstrate excellent coverage of a given syllabus, the ability to answer novel questions, and produce readable domain appropriate justifications using reasonable computational resources?

– Will SMEs be capable of posing questions and complex problems to these systems?

– Do these systems address key failure, scalability and cost issues encountered in expert systems?

Experimental Scope: Selected portions of the AP syllabi for chemistry, biology and physics

– Example: Balance the following reactions, and indicate whether they are examples of combustion, decomposition, or combination

(a) C4H10 + O2 CO2 + H2O(b) KClO3 KCl + O2

(c) CH3CH2OH + O2 CO2 + H2O(d) P4 + O2 P2O5

(e) N2O5 + H2O HNO3

42

AURA – Automated User-centered Reasoning and Acquisition SystemAURA – Automated User-centered Reasoning and Acquisition System

Aura is a tool to help users formalize high school-level scientific knowledge Aura can then reason with that knowledge Users can ask English questions and get answers and explanations

43

Symbiosis Between Aura and Semantic WikisSymbiosis Between Aura and Semantic Wikis

Classical Knowledge Engineering– Expressive knowledge representation– Sophisticated testing and debugging

Knowledge Engineering in Aura– Acquires knowledge for deductive Q/A that

can be used for answering AP questions in sciences

• Uses a DL-style class taxonomy, and logic programming style rules with many extensions

– Requires substantial hours of training for knowledge formulation

Semantic Web Knowledge Engineering– Simple knowledge representation– Quantity at some expense of quality

Knowledge Engineering in SMW+– Tool for online authoring and consensus-

building around semantic web content– Captures knowledge at the level of RDFS– Collective editing for quality control– Gardening appropriate for scientific

knowledge– Almost walk up and use system

Can we use the Semantic Media Wiki to capture knowledge that could be used for Q/A in AURA?

– Factual knowledge (e.g., atomic number for carbon is 6, solubility constraints, etc.)– Taxonomic knowledge (e.g., eukaryotic and prokaryotic are two types of cells)

Knowledge creation would be faster, distributed, and cheaper

44

Example: Wikipedia Article on OrganelleExample: Wikipedia Article on Organelle

45

Source Text of Article on Organelle in Semantic MediaWikiSource Text of Article on Organelle in Semantic MediaWiki

46

Fact Box With the Annotations in Semantic MediaWikiFact Box With the Annotations in Semantic MediaWiki

47

Ontology Browser for Biology Data in Semantic MediaWIkiOntology Browser for Biology Data in Semantic MediaWIki

48

Aura/Semantic MediaWiki Use CaseAura/Semantic MediaWiki Use Case

Semantic MediaWiki includes relevant knowledge

Aura knowledge formulation engineer (KFE) searches for knowledge during knowledge formulation

The KFE notices useful information in SMW+ The KFE maps the knowledge into Aura

– Uses a derivative of Ontomap– Mapping suggestions from FOAM (Framework for Ontology

Alignment and Mapping) from Univ. Karlsruhe AIFB– ETL workflow

The knowledge is mapped into Aura and is available to the question-answering algorithms

49

AURA User Searches for InformationAURA User Searches for Information

50

Aura User Notices Useful Information in WikiAura User Notices Useful Information in Wiki

51

Aura User Maps Wiki Knowledge into Aura KB (with FoaM proposals)Aura User Maps Wiki Knowledge into Aura KB (with FoaM proposals)

52

Wiki Knowledge Available in Aura for Question-AnsweringWiki Knowledge Available in Aura for Question-Answering

53

Talk Outline: The Growing Semantic WebTalk Outline: The Growing Semantic Web

The Origins of the Semantic Web – DARPA’s DAML Program– RDF, OWL, and the Semweb Infrastructure

Semantic Web Evolution to 2009

Semantic Web Transformation in three areas of application– Markets and Companies

The Fourth Generation– A Scalable Revolution

54

Fourth Generation Semantic WebFourth Generation Semantic Web

A problem of scale– The number of Internet devices is starting to explode (again!)

• Mobile devices, embedded systems, and sensors• In 2008, Google reported 1 trillion unique URLs, ~200M web sites• Total 2008 web page estimates are ~30 billion (significant variation in these estimates)

– Gartner (May 2007, Report G00148725)• "By 2012, 70% of public Web pages will have some level of semantic markup,

20% will use more extensive Semantic Web-based ontologies”– Can Semantic Web technologies work at web scales?

• Sindice (www.sindice.com) is now indexing >10B triples/microformats over 65M pages• 20% of 30 billion pages @ 1000 triples per page = 6 trillion triples • 30 billion and 1000 are underestimates

Some lessons with applying semantic web technology in this space– Can we exploit billions of triples, microformats, ontologies, rules, and services?

• Are Semantic Web systems deployable on parallel web architectures, friendly to out-of-core algorithms, and compatible with giant databases?

– Is there a scaling limit to useful, profitable Semantic Web implementations?

The Web of Data meets the Future Internet

Material from Frank van Harmelen,Vrije Universiteit, Amsterdam

55

Fourth Generation Example: DBpediaFourth Generation Example: DBpedia

Mine Wikipedia for assertions– Mainly from Wikipedia Factboxes

• ~23M triples– Category assertions

DBpedia 3.3 dataset (May 09 Wikipedia)– ~2.6M things, ~274M triples

• 213K persons, 328K places, 57K music albums, 36K films, 609K links to images, 3.2M links to relevant external web pages, 4.9M links into RDF datasets

– Classifications via Wikipedia categories, YAGO, and WordNet synsets

– One of the largest broad knowledge bases in the world

Simple queries over extracted data– “Things near the Eiffel Tower”– “The official websites of companies with more

than 50000 employees” – “Soccer players from team with stadium with

>40000 seats, who were born in a country with more than 10M inhabitants”

56

DBpedia for UsersDBpedia for Users

Query Wikipedia like a database

DBpedia Mobile

57

Fourth Generation Example: Linked Open DataFourth Generation Example: Linked Open Data

Build an RDF Data Commons– Create a single, simple set of

rules for publishing and linking RDF data

• Use URIs as names for things

• Use HTTP URIs so that people can look up those names

• When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

• Include links to other URIs, so that they can discover more things

– Set RDF links between data items from different data sources

May 2009 LOD dataset– ~4.7B triples, and ~140M RDF

interlinks, and growing faster than I can track

– Database linkage means that LOD will soon be impossible to count except by order of magnitude

58

May 2007September 2007

March 2009September 2008

The Growing Web of Linked DataThe Growing Web of Linked Data

59

Topic Distribution in the March 2009 Linked DatasetsTopic Distribution in the March 2009 Linked Datasets

Life Sciences

Publications

Online ActivitiesMusic

Geography

Cross-Domain

4,500,000,000 triples

180,000,000 data links

60

Common Tag SpecificationCommon Tag Specification

Instead of tagging with language terms, tag with terms + RDFa– Distinguish between “jaguar” the animal, the car company, and the operating system – Provides metadata for each Common Tag and relations to other Common Tags

• The Barack Obama Common Tag includes <employment, President of the United States> and <spouse, Michelle Obama>

More discoverable, more connected, more web-like, more useful

61

Semantic Dynamism at Web ScaleSemantic Dynamism at Web Scale

Semantics are always changing– Per minute, there are:

• 100 edits in Wikipedia (144K/day)• 200 tags in del.icio.us (288K/day)• 270 image uploads to flickr (388K/day)• 1100 blog entries (1.6M/day)

– Will the Semantic Web be less dynamic?

There is no “right ontology”– Ontologies are abstractions

• Different applications lead to different ontologies• Ontology authors make design choices all the time

– Google Base: >250K schemas– “Ontologies = Politics”

Intentionally false material (Spam)– Lesson of the HTML <META> tag Material from Denny Vrandečić, AIFB

How Do We Use this Dynamic Data for Decision Support?

62

Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider

EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije

University Amsterdam, plus 12 partners

Goals of LarKC – Scaling to Infinity– A platform for massive distributed

incomplete reasoning – Remove the scalability barriers of

currently existing reasoning systems for the Semantic Web

– Combine reasoning/retrieval and search– Want to trade off answer quality and

answer timeliness Reasoning pipeline

– Heavy emphasis on probability, decision theory, anytime algorithms

– Plugin architecture, with sampling– Explicit cost models

Public releases of LarKC platform, with APIs

63

Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider

EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije

University Amsterdam, plus 12 partners

Goals of LarKC – Scaling to Infinity– A platform for massive distributed

incomplete reasoning – Remove the scalability barriers of

currently existing reasoning systems for the Semantic Web.

– Combine reasoning/retrieval and search– Want to trade off answer quality and

answer timeliness

Reasoning pipeline– Heavy emphasis on probability,

decision theory, anytime algorithms– Plugin architecture, with sampling– Explicit cost models

Encourage participation through Thinking@home

Public releases of LarKC platform, with APIs

Exploiting web-scale semantics is the new frontier– Generations 1 and 2 used web resources to support classical KR approaches– Generation 3 (social semantic web) leverages web social patterns for KR

Fourth generation applications address general web-scale KR

64

Reasoning pipeline– Heavy emphasis on probability,

decision theory, anytime algorithms– Plugin architecture, with sampling– Explicit cost models

Encourage participation through Thinking@home

Public releases of LarKC platform, with APIs

Fourth Generation Application: The Large Knowledge ColliderFourth Generation Application: The Large Knowledge Collider

EC Framework 7 Program– Lead partners: Univ. Innsbruck and Vrije

University Amsterdam, plus 12 partners

Goals of LarKC – Scaling to Infinity– A platform for massive distributed

incomplete reasoning – Remove the scalability barriers of

currently existing reasoning systems for the Semantic Web.

– Combine reasoning/retrieval and search– Want to trade off answer quality and

answer timeliness

The real money in semantics will be made in apps/tools that exploit web-scale data– The cost of semantic data creation is going to zero– The size of semantic data is going to web-scale

At this scale, logic-based techniques break down ML/DM are key technologies to allow this exploitation to happen

65

Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web

In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no

web involvement– A few early conferences (ISWC, SemTech) and session tracks

66

Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web

In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no

web involvement– A few early conferences (ISWC, SemTech) and session tracks

Now in 2009...– The Semantic Web is the most exciting thing happening on the web– RDF assertions scaling into the billions, with little to no programmatic control– Search engines are starting to develop products– Bestbuy is publishing products, store descriptions, and hours in RDFa

67

Summing up: The Growing Semantic WebSumming up: The Growing Semantic Web

In mid-2004...– RDF and OWL had just been standardized– Advances were made via traditional corporate/public R&D programs– The first wave of semantic web startups (many of which have since failed)– Implementations were technically very sophisticated, but fully custom and had no

web involvement– A few early conferences (ISWC, SemTech) and session tracks

Now in 2009...– The Semantic Web is the most exciting thing happening on the web– RDF assertions scaling into the billions, with little to no programmatic control– Search engines are starting to develop products– Bestbuy is publishing products, store descriptions, and hours in RDFa

ML/DM is at the center of the needed techniques– Linked Data provides the necessary scale – Logic-based techniques are meeting their performance limits– The infrastructure to build non-tools-based companies is in place– This is what a revolution feels like!!

68

(Thank You)

Disclaimer: The preceding slides represent the views of the author only. All brands, logos and products are trademarks or registered trademarks of their respective companies.

Hvala

69

Google SquaredGoogle Squared

Back