Presented By: Kiran Kancharlapalli DBMS - Topics 11 & 12.


Semantic Interoperability

What is Semantic Interoperability?

• Ability of computer systems to transmit data with unambiguous, shared meaning

• Data must be made available between heterogeneous agents

• Metadata must also be made available, allowing a software agent to learn how to interpret the data
– Document Type Definition
– XML-Schema
– RDF Annotations

• A requirement for enabling machine-computable logic, inference and knowledge discovery between information systems.

• Results in Semantic Web

How is it accomplished?

• By adding data about the data (metadata), linking each data element to a controlled, shared vocabulary

• The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system

• Syntactic interoperability is a prerequisite for semantic interoperability
– Syntactic interoperability refers to the packaging and transmission mechanisms for data

What is Semantic Web?

• Will be able to provide justified answers to natural language questions
– Current search engines provide lists of resources that are supposed to contain the answer

• Knowledge rather than plain data would be retrieved i.e. data which is relevant to the user’s task

• Social factors such as privacy and trust would also be taken into account

Benefits

• Search can often be frustrating because of the limitations of keyword-based matching techniques.
– Users frequently experience one of two problems:
  • either they get back no results, or
  • they get too many irrelevant results.

• The problem is that words can be synonymous (that is, two words have the same meaning) or polysemous (a single word has multiple meanings).

• However, if the languages used to describe web pages were semantically interoperable, then the user could specify a query in the terminology that was most convenient, and be assured that the correct results were returned, regardless of how the data was expressed in the sources.

Ontologies

What are Ontologies?

• Content theories about the objects that are possible in a specified domain

• A representation vocabulary, specialized to some domain or subject matter

• Provide potential terms for describing knowledge about the domain

• Translating the terms in an ontology from, say English to French, does not change the ontology conceptually

What are Ontologies?

• Designed for reuse across multiple applications and implementations

Motivation

• select EMPDAT from PERSTAB where POS="mgmnt"
– What does it mean?
– PERSTAB is a table which lists employee data

• What’s an employee? How is an employee different from a contractor? What if I want data on both?

• Even if this information is available in English, a human has to read it

Motivation (cntd…)

• "Parenthood is a more general relationship than motherhood."

• "Mary is the mother of Bill."

• "Who are Bill's parents?"

• "Mary is the parent of Bill."
– That fact is not stated anywhere, but can be derived by a DAML application.

• More formally stated, the statements

(motherOf subProperty parentOf)
(Mary motherOf Bill)

• when stated in DAML, allow you to conclude

(Mary parentOf Bill)

• Java code or a stored procedure could do this sort of inference for facts in XML or SQL

• But the DAML spec itself says the conclusion is true

• In contrast, different Java code could reach a different conclusion
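The subProperty inference above can be sketched in a few lines of Python. This is only an illustration of the rule, not DAML itself: the function name and the tuple representation of statements are assumptions made for this example. It is also exactly the kind of hand-written code the slide warns about, since nothing outside the code guarantees that another implementation would reach the same conclusion.

```python
def infer_superproperties(triples, sub_property_of):
    """Return the given triples plus every triple implied by subProperty links.

    triples:          set of (subject, property, object) tuples
    sub_property_of:  dict mapping a property to the set of its parent properties
    """
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple appears (handles chains of subProperty)
        changed = False
        for subject, prop, obj in list(inferred):
            for parent in sub_property_of.get(prop, ()):
                new_triple = (subject, parent, obj)
                if new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


# (motherOf subProperty parentOf) and (Mary motherOf Bill)
sub_property_of = {"motherOf": {"parentOf"}}
facts = {("Mary", "motherOf", "Bill")}

closure = infer_superproperties(facts, sub_property_of)
print(("Mary", "parentOf", "Bill") in closure)
```

The answer to "Who are Bill's parents?" is then every subject of a `parentOf` triple whose object is Bill.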

Everything is not a nail

• Ontology is not always the right tool for the job

• Face recognition, vehicle control systems etc – not the right applications for ontology

Many Ways to Use Ontology

• As an information engineering tool
– Create a database schema
– Map the schema to an upper ontology
– Use the ontology as a set of reminders for additional information that should be included

• As more formal comments
– Define an ontology that is used to create a DB or OO system
– Use a theorem prover at design time to check for inconsistencies

• For taxonomic reasoning
– Do limited run-time inference in Prolog, a description logic, or even Java

• For first order logical inference
– Full-blown use of all the axioms at run time

Upper Ontology

• An attempt to capture the most general and reusable terms and definitions

Motivation to capture Upper Ontology

• Ontologies may have different names for the same things
– type – a relation between a class and an instance
– instance – a relation between a class and an instance
– isa – a relation between a class and an instance
– …

• Ontologies may have the same name for different things, and no corresponding terms
– before – a relation between two time points
– before – a relation between two time intervals

• Either use the same upper ontology, or at least map to a common upper ontology

Some Formal Upper Ontologies

• DOLCE

• Cyc

• SUMO

Simple Methodology

• Extract nouns and verbs from a source text

• Find classes in SUMO for the nouns and verbs

• Record a mapping as being either equal, subsuming or instance.
– type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser

• Create a subclass of SUMO if it's a subsuming mapping

• Add properties to the subclass
– reusing SUMO properties
– extending SUMO properties by creating a &%subrelation of an existing property

• Add English definition to the class
– define constraints that express how the subclass is more specific than the superclass

• Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously

High Level Distinctions

• The first fundamental distinction is that between ‘Physical’ (things which have a position in space/time) and ‘Abstract’ (things which don’t)

[Diagram: the top level splits into Physical and Abstract]

High Level Distinctions

• Partition of ‘Physical’ into ‘Objects’ and ‘Processes’

[Diagram: Physical splits into Object and Process]

DBpedia:A Nucleus for a Web of Open Data

• DBpedia.org is an effort to:
– extract structured information from Wikipedia
– make this information available on the Web under an open license
– interlink the DBpedia dataset with other datasets on the Web

• Title

• Abstract

• Infoboxes

• Geo-coordinates

• Categories

• Images

• Links

• Other languages

• Other wiki pages

• To the web

• Redirects

• Disambiguates

Extracting Structured Information from Wikipedia

Wikipedia consists of
– 6.9 million articles
– in 251 languages
– monthly growth-rate: 4%

Wikipedia articles contain structured information
– infoboxes which use a template mechanism
– images depicting the article’s topic
– categorization of the article
– links to external webpages
– intra-wiki links to other articles
– inter-language links to articles about the same topic in different languages

[Figure: DBpedia architecture. Labels: Wikipedia Dumps (article texts, DB tables); Extraction; DBpedia datasets (Articles, Infobox, Categories) loaded into Virtuoso and MySQL; published via SPARQL Endpoint and Linked Data; clients: Traditional Web Browser, Web 2.0 Mashups, Semantic Web Browsers, SNORQL Browser, Query Builder]

Extracting Infobox Data (RDF Representation)

DBpedia Basics

• The structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content.

• The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data.

• The Developers Guide to Semantic Web Toolkits lists development toolkits, by programming language, that can be used to process DBpedia data.

The DBpedia Dataset

• 1,600,000 concepts

• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films

• described by 91 million triples
– using 8,141 different properties
– 557,000 links to pictures
– 1,300,000 links to external web pages
– 207,000 Wikipedia categories
– 75,000 YAGO categories

Accessing the DBpedia Dataset over the Web

1. SPARQL Endpoint

2. Linked Data Interface

3. DB Dumps for Download

SPARQL

• SPARQL is a query language for RDF.

• RDF is a directed, labeled graph data format for representing information in the Web.

• This specification defines the syntax and semantics of the SPARQL query language for RDF.

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql

• hosted on an OpenLink Virtuoso server

• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC
– All tennis players from Moscow
– All films by Quentin Tarantino
– All German musicians that were born in Berlin in the 19th century

Example

To find everything Bart wrote on the blackboard in season 12 of The Simpsons:

• The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples.

• The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".

• The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field.

SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}

[Table: query results]
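As a rough sketch of how such a query could be sent to the endpoint, the snippet below builds the HTTP GET URL in the way the SPARQL protocol describes, with the query text in a `query` parameter. The endpoint URL comes from the slides. The two PREFIX declarations are assumptions added so the query is self-contained, because the slide relies on prefixes predefined by the endpoint. No request is actually sent here, and property names on the live endpoint may differ from those on the slide.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given on the slide

# Prefix IRIs are assumptions; the slide uses prefixes predefined by the endpoint.
QUERY = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
"""


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client; the response format depends on the server and on the Accept header sent with the request.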

Possible Improvements

• Better data cleansing required.

• Improvement in the classification.

• Interlink DBpedia with more datasets.

• Improvement in the user interfaces.

• Performance

• Scalability

• More Expressiveness

Questions for Discussion

• DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors.
– Which one is the better approach?

• Can Freebase or DBpedia be a substitute for Wikipedia?
– Freebase: Not good in that we have two similar things (Wikipedia, Freebase)
– DBpedia: Not good in that it extracts data from a dump

• How can we interlink Freebase & DBpedia?

• What can be killer applications using DBpedia?
– If there are some, okay
– If there are none, do we really need a large general structured knowledge base?

Uncertainty propagation

• Every physical quantity has :

– A value or size

– Uncertainty (or ‘Error’)

– Units

• Without these three things, no physical quantity is complete.

• When quoting your measured result, follow these simple rules. Ex: A = 1.71 ± 0.01 m
– Always quote the main value to the same number of decimal places as the uncertainty
– Always include units! (But if the quantity is dimensionless, say so.)
– Never quote the uncertainty to more than 1 or 2 significant figures (this would make no sense)
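The quoting rules above can be applied mechanically. The helper below is a minimal sketch; the function name `quote_result` and its parameters are made up for this example. It keeps the requested number of significant figures in the uncertainty and prints the main value to the same number of decimal places.

```python
import math


def quote_result(value, uncertainty, units, sig_figs=1):
    """Format 'value ± uncertainty units' with matching decimal places.

    Limitation of this sketch: an uncertainty that rounds up to the next
    power of ten (e.g. 0.0098) is printed with one extra digit.
    """
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Power of ten of the leading digit of the uncertainty.
    exponent = math.floor(math.log10(uncertainty))
    # Decimal places needed to show sig_figs digits of the uncertainty.
    decimals = max(0, sig_figs - 1 - exponent)
    return f"{value:.{decimals}f} ± {uncertainty:.{decimals}f} {units}"


print(quote_result(1.7123, 0.012, "m"))
print(quote_result(1.7123, 0.012, "m", sig_figs=2))
```

For a dimensionless quantity the `units` argument could be a word such as "(dimensionless)", in line with the second rule.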

Terminology: ‘Uncertainty’ and ‘Error’

• The terms Uncertainty and Error are used interchangeably to describe a measured range of possible true values.

• The meaning of the term Error is:
– NOT the DIFFERENCE between your experimental result and that predicted by theory, or an accepted standard result!
– NOT a MISTAKE in the experimental procedure or analysis!

• Hence, the term Uncertainty is less ambiguous. Nevertheless, we still use terms like ‘propagation of errors’, ‘error bars’, ‘standard error’, etc.

• The term “human error” is imprecise - avoid using this as an explanation of the source of error.

Error Propagation using Calculus: Functions of one variable

If uncertainty in measured x is Δx, what is uncertainty in a derived quantity z(x) ?

Error propagation is just calculus – you do this formally in the “Data Handling” course

Basic principle is that, if (Δx)/x is small, then to first order:

$\Delta z = \frac{dz}{dx}\,\Delta x$

e.g., if $z = x^n$, then:

$\frac{dz}{dx} = n x^{n-1} \quad\Rightarrow\quad \Delta z = n x^{n-1}\,\Delta x$

Hence, for this particular function, the percent (or fractional) error in z is:

$\frac{\Delta z}{z} = n\,\frac{\Delta x}{x}$

or... just n times the percent error in x
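A quick numerical check of the power-law rule, as a sketch with made-up numbers (x = 2.0, Δx = 0.02, n = 3): the first-order estimate n·Δx/x is compared with the actual fractional change in z when x is shifted by Δx. The function names are chosen for this example only.

```python
def power_law_relative_error(n, x, dx):
    """First-order fractional error in z = x**n, i.e. n * (dx / x)."""
    return n * dx / x


def exact_relative_change(n, x, dx):
    """Actual fractional change in z = x**n when x shifts to x + dx."""
    return ((x + dx) ** n - x ** n) / x ** n


x, dx, n = 2.0, 0.02, 3
estimate = power_law_relative_error(n, x, dx)  # about 0.03
actual = exact_relative_change(n, x, dx)       # about 0.0303

print(estimate, actual)
```

The two values agree to within about 1% of each other because Δx/x is small here; the agreement gets worse as Δx/x grows.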

Error Propagation using Calculus: Functions of more than one variable

Suppose uncertainties in two measured quantities x and y are : Δx and Δy , what is the uncertainty in some derived quantity z(x,y) ?

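For such functions of 2 variables we use partial differentiation:

$\Delta z = \frac{\partial z}{\partial x}\,\Delta x + \frac{\partial z}{\partial y}\,\Delta y$

But, combining errors ALWAYS INCREASES total error - so make sure terms add with the same sign:

$\Delta z = \left|\frac{\partial z}{\partial x}\right|\Delta x + \left|\frac{\partial z}{\partial y}\right|\Delta y$

It is better to add in quadrature, i.e. “the root of the sum of the squares”:

$(\Delta z)^2 = \left(\frac{\partial z}{\partial x}\right)^2 (\Delta x)^2 + \left(\frac{\partial z}{\partial y}\right)^2 (\Delta y)^2$

We can usually handle error propagation in this way by calculus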
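The quadrature rule can be tried out numerically. The sketch below estimates the two partial derivatives with central finite differences instead of differentiating by hand; that is a substitution made for this example, and the function names, step size `h` and sample numbers are all assumptions.

```python
import math


def propagate_quadrature(f, x, dx, y, dy, h=1e-6):
    """Uncertainty in z = f(x, y), adding the x and y contributions in quadrature."""
    # Central-difference estimates of the partial derivatives.
    dz_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dz_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return math.sqrt((dz_dx * dx) ** 2 + (dz_dy * dy) ** 2)


def product(x, y):
    return x * y


# z = x * y with x = 3.0 ± 0.1 and y = 4.0 ± 0.2
dz = propagate_quadrature(product, 3.0, 0.1, 4.0, 0.2)
print(dz)  # analytically sqrt((4 * 0.1)**2 + (3 * 0.2)**2) = sqrt(0.52)
```

For comparison, adding the same two terms linearly gives 0.4 + 0.6 = 1.0, which is larger than the quadrature result.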

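Simplified Error Propagation: A short-cut avoiding calculus

Instead of evaluating the derivatives ∂z/∂x, ∂z/∂y etc, a simpler approach is also acceptable:

1. In the derived quantity z, replace x by x + Δx, say

2. Evaluate Δz in the approximation that Δx is small

Ex. 1 : $z = x + a$ , where a = constant

$z + \Delta z = (x + \Delta x) + a \quad\Rightarrow\quad \Delta z = \Delta x$

Ex. 2 : $z = bx$ , where b = constant

$z + \Delta z = b(x + \Delta x) = bx + b\,\Delta x \quad\Rightarrow\quad \Delta z = b\,\Delta x \quad\Rightarrow\quad \frac{\Delta z}{z} = \frac{\Delta x}{x}$

Ex. 3 : $z = bx^2$ , where b = constant

$z + \Delta z = b(x + \Delta x)^2 = b\left(x^2 + 2x\,\Delta x + (\Delta x)^2\right) \approx bx^2\left(1 + \frac{2\,\Delta x}{x}\right) \quad\Rightarrow\quad \frac{\Delta z}{z} = \frac{2\,\Delta x}{x}$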
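The short-cut translates directly into code: evaluate z at x + Δx, subtract z at x, and take the size of the difference. The sketch below applies it to functions of the form used in Ex. 1 and Ex. 3; the helper name and the values chosen for a, b, x and Δx are assumptions for this example.

```python
def shortcut_error(z, x, dx):
    """Short-cut estimate of the error in z: |z(x + dx) - z(x)|."""
    return abs(z(x + dx) - z(x))


a = 7.0
b = 2.5


def shifted(x):
    """Ex. 1 form: z = x + a."""
    return x + a


def square_law(x):
    """Ex. 3 form: z = b * x**2."""
    return b * x ** 2


x, dx = 4.0, 0.04

dz_shifted = shortcut_error(shifted, x, dx)                    # close to dx
frac_square = shortcut_error(square_law, x, dx) / square_law(x)  # close to 2 * dx / x

print(dz_shifted, frac_square)
```

For the square law the short-cut gives a fractional error of about 0.0201, against 0.02 from the first-order formula; the small difference is the (Δx)² term that the approximation drops.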

Synthetic Data

• Any production data applicable to a given situation that are not obtained by direct measurement

• Used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data.

• Often these aspects are personal information (e.g. name, home address, IP address, telephone number, social security number, credit card number)

Importance

• Obtaining actual or real data sets could be difficult, and sometimes impossible due to impediments such as
– Privacy issues
– Image control
– Logistics issues
– Time
– Cost

• Protecting information confidentiality
– Data cannot be traced back to an individual

• Certain conditions may not be found in the original data

Importance (cntd.)

• Used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment
– By creating realistic behavior profiles of users and attackers
– Ex: Intrusion Detection Systems are trained using Synthetic Data

• Allow a baseline to be set
– Ex: Researchers doing clinical trials generate synthetic data to aid in creating a baseline for future studies and testing

• More or less realism could be exhibited according to the selected properties of the original data sets

Synthetic Data Generation

• Mostly Scenario based
– Evaluating Information Analytics Software
– Matching Data Mining Patterns
– Evaluate quality of extraction algorithms

• Specific Algorithms and generators for a scenario or a set of (similar) scenarios

• Patterns from data mining techniques could be used to generate synthetic data sets

• Researchers frequently need to explore the effects of certain data characteristics on their models.
– To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, synthetic data could be generated having one of several types of graph structure:
  • random graphs
  • independent and identically distributed (i.i.d.) connected components
  • lattice graphs having a ring structure
  • lattice graphs having a grid structure
  • forest fire graphs
  • cluster graphs with nodes arranged in separate clusters (cliques)

• Synthetic data is generated with simple forms of realism by:
– Domain sampling within a field
– Preserving cardinality relationships

• In all cases, the data generation process follows the same steps:
– Generate the empty graph structure.
– Generate attribute values based on user-supplied prior probabilities.

• Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
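The two-step process above can be sketched for the ring-lattice case. Everything below is illustrative: the function names, the `copy_prob` parameter and the label values are assumptions, and the attribute step is a simple sequential stand-in for collective assignment, in which a node sometimes copies an already-labelled neighbour (which produces autocorrelation) and otherwise draws from the user-supplied prior.

```python
import random


def ring_lattice(n):
    """Step 1: empty graph structure, n nodes each linked to its two ring neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def generate_attributes(graph, prior, copy_prob, rng):
    """Step 2: assign one label per node.

    With probability copy_prob a node copies the label of an already-labelled
    neighbour; otherwise it draws a label from the prior probabilities.
    """
    values = list(prior)
    weights = [prior[v] for v in values]
    labels = {}
    for node in graph:
        labelled = [labels[m] for m in graph[node] if m in labels]
        if labelled and rng.random() < copy_prob:
            labels[node] = rng.choice(labelled)
        else:
            labels[node] = rng.choices(values, weights=weights)[0]
    return labels


graph = ring_lattice(6)
prior = {"fraud": 0.1, "normal": 0.9}  # user-supplied prior probabilities
labels = generate_attributes(graph, prior, copy_prob=0.5, rng=random.Random(0))
print(labels)
```

Raising `copy_prob` makes neighbouring nodes agree more often, which is one way to explore how autocorrelation affects a model.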

Data Quality

• Some Definitions

– The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.

– The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.

– Complete, standards based, consistent, accurate and time stamped.

Data Quality

• Data are of high quality if
– they are fit for their intended uses in operations, decision making and planning
– they correctly represent the real-world construct to which they refer

• As data volume increases
– the question of internal consistency within data arises, regardless of fitness for use for any external purpose
  • e.g. a person's age and birth date may conflict within different parts of a database
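The age versus birth date example can be turned into a small internal-consistency check. This is a sketch under assumed field names (`id`, `birth_date`, `age`) and an assumed reference date; a real database would need its own column names and rules.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Age in whole years on the date as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_conflicts(records, as_of):
    """Return the ids of records whose stored age disagrees with the birth date."""
    return [r["id"] for r in records if age_on(r["birth_date"], as_of) != r["age"]]


records = [
    {"id": 1, "birth_date": date(1990, 5, 20), "age": 34},  # consistent
    {"id": 2, "birth_date": date(1990, 7, 15), "age": 34},  # stored age is one year too high
]

print(age_conflicts(records, as_of=date(2024, 6, 1)))
```

A check like this says nothing about which of the two fields is wrong, only that they cannot both be right.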

Data Attributes

• There are nearly 200 such attributes, and there is little agreement on their definitions and measures

• Most common are
– Accuracy
– Correctness
– Currency
– Completeness
– Relevance

Incorrect Data

• Includes
– invalid and outdated information
– can originate from different data sources resulting from
  • data entry, or data migration and conversion projects

• Total cost to the US economy due to data quality problems is over US$600 billion per annum

Frameworks for understanding data quality

• A systems-theoretical approach
– influenced by American pragmatism, expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision

• One framework seeks to integrate
– product perspective (conformance to specifications) and
– service perspective (meeting consumers' expectations)

• One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously

• Another framework evaluates the quality of the form, meaning and use of the data

Data Quality Assurance

• Service providers clean the data on a contract basis

• Consultants advise on fixing processes or systems to avoid data quality problems in the first place

• Tools for analyzing and repairing poor quality data

• Data profiling - initially assessing the data to understand its quality challenges

• Data standardization - a business rules engine ensures that data conforms to quality rules

• Geocoding - for name and address data. Corrects data to US and Worldwide postal standards

• Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned.
– Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual.
– It might be able to find links between husband and wife at the same address.
– It often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.

• Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.

• Batch and Real time - Once the data is initially cleansed (batch), companies build the processes into enterprise applications to keep it clean.
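The matching step can be illustrated with Python's standard `difflib` module. This is a toy sketch, not how any particular data quality product works: the nickname table, the 0.85 threshold and the function names are all assumptions. Known nicknames are first mapped to a canonical form so that 'Bob' and 'Robert' compare as equal, and a string similarity ratio then catches small spelling differences.

```python
from difflib import SequenceMatcher

# Hypothetical nickname lookup; a real tool would use a much larger table.
NICKNAMES = {"bob": "robert", "bill": "william"}


def canonical(name):
    """Lower-case the name and replace known nicknames with their full form."""
    return " ".join(NICKNAMES.get(part, part) for part in name.lower().split())


def likely_same(a, b, threshold=0.85):
    """True if the two names are similar enough to be flagged as a possible duplicate."""
    ratio = SequenceMatcher(None, canonical(a), canonical(b)).ratio()
    return ratio >= threshold


print(likely_same("Bob Smith", "Robert Smith"))
print(likely_same("Robert Smith", "Robert Smyth"))
print(likely_same("Robert Smith", "Alice Jones"))
```

Pairs flagged this way are candidates only; address or date-of-birth fields would normally be compared as well before records are merged into a single super-record.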
