© 2003 IBM Corporation 1 Mukesh Mohania June 7, 2007 Mukesh Mohania IBM India research Lab [email protected] A Journey from Data Warehousing to Active Information Integration


Page 1: © 2003 IBM Corporation 1 Mukesh Mohania Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179 Recommended maximum length: 2 lines Confidentiality/date.

A Journey from Data Warehousing to Active Information Integration

Mukesh Mohania

IBM India Research Lab

[email protected]

June 7, 2007

© 2003 IBM Corporation

Page 2:

Outline

• What is Data Warehousing (DW)?
• What is Information Integration (II)?
• Existing Solutions
• From DW to II
• Event based (Active) Information Integration
• Context-Oriented Information Integration

Page 3:

IBM India Research Lab

IBM Research © 2007 IBM Corporation

Structured and Unstructured Information

Information content in an enterprise can be structured or unstructured.

• Structured content: payroll, sales orders, invoices, customer profiles, etc.
• Unstructured content: emails, reports, web pages, complaints, and information on sales, customers, competitors, products, suppliers and people, etc.

According to recent estimates, structured content accounts for less than 20% of enterprise information and unstructured content for more than 80%.

Historically, structured and unstructured data management technologies have evolved separately, creating an artificial separation between these two “kinds” of information.

Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content.

Page 4:

From Data Warehouse to Information Integration

Data warehousing was first driven by a need for consistent business information from disparate systems.

Business needs
• Support for decision making, based on
– Historical or point-in-time view of the business
– Aligned across different departments

Technical limitations
• OLTP systems and performance must be protected
• Historical and summary data stores needed
• Reconciliation of data in different systems is slow
• Ad hoc query performance needs to be optimized

Page 5:

Mid-1980s: Data Warehousing

[Diagram labels: Operational systems; Business data warehouse; Data marts; Metadata]

Data warehouse
• Reconciling disparate data
• Single version of the truth
• Historical record

Characteristics
• Information not needed immediately
• Structured data
• Unidirectional data flow
• Trusted sources

Page 6:

The need for consistency extends to more immediate business information

Technical limitations
• OLTP systems and performance must be protected
• Historical and summary data stores needed
• Reconciliation of data in different systems is slow
• Ad hoc query performance needs to be optimized

Business needs
• Support for decision making, based on
– Historical or point-in-time view of the business, as well as near real-time views
– Aligned across different departments

Page 7:

Mid-1990s: Operational Data Store

[Diagram labels: Operational systems; Operational data store; Business data warehouse; Data marts; Metadata]

Operational data store
• Near real-time
• Reconciling a subset of data

Characteristics
• Immediate and historical information needs
• Structured data
• Partial bi-directional data flow
• Increasing technical metadata

Page 8:

The vision: Comprehensive integration of information

[Diagram labels: Operational systems; Information integration; Business data warehouse; (Operational) data marts; Metadata; Untrusted & unstructured sources (e.g. Internet)]

Integrated information
• Real-time knowledge
• Integrating all information

Characteristics
• Immediate and historical information needs
• Fully merged informational & operational needs
• Structured and unstructured data
• Bi-directional data flow
• Caching reduces data flow
• Complete business & technical metadata

VLDB 2006, SIGMOD 2007, PODS 2007

Page 9:

Existing Solutions

Page 10:

Structured and Unstructured Information Integration: A Brief Background on Existing Solutions

Existing solutions can be classified in terms of the query paradigm used:

Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02])
• Relational data exposed to search engine as virtual text documents
• Query both structured and unstructured information using keywords

SQL Query Based Solutions (SQL LIKE predicate, DB2 Net Search Extender)
• Text data exposed to relational engine as virtual tables with text columns
• Query both structured and unstructured information using SQL
• Provide SQL primitives to search text in table columns using a set of keywords
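The keyword-query paradigm can be illustrated with a minimal sketch in which each relational row is flattened into a "virtual text document" and then matched against keywords. This is only an illustration of the idea, not how DB2 ESE, DbXplorer or BANKS are implemented; the `stocks` table, its columns, and the helper names `rows_as_documents` and `keyword_search` are assumptions made for the example.

```python
import sqlite3


def rows_as_documents(conn, table):
    """Flatten every row of `table` into a (row, set-of-tokens) pair.

    Hypothetical helper: column names and values are simply concatenated,
    which is the crudest possible 'virtual document' for a row.
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    docs = []
    for row in cur:
        text = " ".join(f"{c} {v}" for c, v in zip(columns, row))
        docs.append((row, set(text.lower().split())))
    return docs


def keyword_search(docs, keywords):
    """Return the rows whose virtual document contains every keyword."""
    wanted = {k.lower() for k in keywords}
    return [row for row, tokens in docs if wanted <= tokens]


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, sector TEXT, price REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?, ?)",
    [("IBM", "software", 105.0), ("ORCL", "software", 19.5), ("INTC", "hardware", 22.0)],
)

docs = rows_as_documents(conn, "stocks")
hits = keyword_search(docs, ["software", "IBM"])
print(hits)
```

The sketch also shows the limitation of the paradigm: a request such as "the five best performing stocks" has no natural expression as a set of keywords.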

Page 11:

Keyword Query Based Solution: DB2 ESE

[Diagram labels: Keyword Query; DB2 Enterprise Search Extender]

Page 12:

Keyword Query Based Solution: DbXplorer/BANKS [ICDE02]

[Diagram labels: Keyword Query; Search Engine; DbXplorer/BANKS; sample relational tables with columns C1, C2, C3]

Page 13:

Keyword Query Based Solutions: Summary

Advantage: Simplicity!

Disadvantages
• Less expressive (as compared to SQL)
– How to ask for the information related to the five best performing stocks in the past week?
• Need to specify a set of keywords that succinctly encodes the information need
– Not always easy

Page 14:

SQL Query Based Solution: Standard SQL LIKE Predicate

[Diagram labels: DB2 UDB / DB2 Information Integrator; sample relational tables with columns C1, C2, C3]

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM' AND docs.text LIKE '% IBM %')
   OR (stocks.name = 'ORCL' AND docs.text LIKE '% ORCL %')
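The LIKE-based query on this slide can be tried out with a small sketch that uses Python's built-in `sqlite3` module as a stand-in for DB2. The table contents are invented for illustration; only the query text comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, price REAL)")
conn.execute("CREATE TABLE docs (text TEXT)")

# Illustrative rows, not data from the presentation.
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?)",
    [("IBM", 105.0), ("ORCL", 19.5), ("DELL", 27.0)],
)
conn.executemany(
    "INSERT INTO docs VALUES (?)",
    [
        ("Analysts expect IBM to grow its services business",),
        ("IBM announces a new database release",),  # no space before "IBM"
        ("Quarterly results for ORCL beat estimates",),
        ("DELL ships new laptops",),
    ],
)

QUERY = """
SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM' AND docs.text LIKE '% IBM %')
   OR (stocks.name = 'ORCL' AND docs.text LIKE '% ORCL %')
"""

rows = conn.execute(QUERY).fetchall()
for price, text in sorted(rows):
    print(price, text)
```

Because the pattern `'% IBM %'` requires a space on both sides of the keyword, the document that begins with "IBM" is not returned, which hints at why plain LIKE is a weak text-search primitive.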

Page 15:

SQL Query Based Solution: Net Search Extender

[Diagram labels: DB2 UDB / DB2 Information Integrator; Net Search Extender; CONTAINS(…); sample relational tables with columns C1, C2, C3]

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM' AND CONTAINS(docs.text, "IBM"))
   OR (stocks.name = 'ORCL' AND CONTAINS(docs.text, "ORCL"))

Page 16:

SQL Query Based Solutions: Summary

Advantages
• More expressive: can specify more involved and sophisticated queries

Disadvantages
• The unstructured data is still queried using keywords
– Need to specify a set of keywords that succinctly encodes the information need
– Not always easy
• The SQL query and the embedded keyword query encode the same information need
– Redundant effort
• Association of documents with tuples (local context), not with the entire result (global context)
– Same documents get attached to “IBM” when “IBM” is queried with “ORCL” as when “IBM” is queried with “DELL”

Page 17:

Active Information Integration

Page 18:

Data Stream

A data stream is a sequence of data items X1, X2, …, Xn arriving continuously from one or more sources, where random access to the data is not allowed.

Data Stream Characteristics
• Strongly regular: strongly periodic (including a zero time interval between two data items), only one type of data, schema can be derived or the data conforms to a schema.
• Weakly regular: weakly periodic (follows some time interval), mixed types of data that follow an order, schema can be derived.
• Irregular: aperiodic, types of data unknown, no order, schema cannot be derived.
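One of the three criteria above, periodicity, can be sketched as a check on the gaps between arrival times. This is a rough illustration only: the function name `classify_periodicity` and the 20% tolerance used for "weakly periodic" are assumptions, and the data-type and schema criteria from the slide are not modelled.

```python
def classify_periodicity(timestamps, tolerance=0.2):
    """Classify a stream by the gaps between consecutive arrival times.

    Assumptions (not from the slides): 'strongly periodic' means all gaps
    are identical (a constant gap of zero is allowed), and 'weakly periodic'
    means every gap lies within `tolerance` (as a fraction) of the mean gap.
    """
    if len(timestamps) < 3:
        raise ValueError("need at least three timestamps")
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if any(g < 0 for g in gaps):
        return "aperiodic"  # items arrived out of time order
    if all(g == gaps[0] for g in gaps):
        return "strongly periodic"
    mean = sum(gaps) / len(gaps)
    if all(abs(g - mean) <= tolerance * mean for g in gaps):
        return "weakly periodic"
    return "aperiodic"


# Integer timestamps (e.g. seconds) keep the equality test exact.
print(classify_periodicity([0, 10, 20, 30]))
print(classify_periodicity([0, 10, 21, 30]))
print(classify_periodicity([0, 1, 50, 51]))
```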

Page 19:

Active functionalities over streaming data

Provides real-time functionality that is needed in several advanced applications.
– Alert a doctor when the blood pressure of a patient goes below X, the heart rate falls below Y and the ECG touches Z.
– Sell all my INTC stocks at the exchange with the higher trading price if the price difference between two exchanges exceeds 2% at any time.
– Cancel my flight tomorrow if there is a terrorist attack in the region of travel.

Events can be defined on a composition of data streams and can trigger pre-defined actions (notification and alert, database change, etc.).

Context can be associated with the events
– INTC was trading higher at NASDAQ at 9:32 AM since the CEO of INTC rang the opening bell.
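The stock example, an event defined over the composition of two price streams, could be sketched as follows. The `Tick` type, the function `detect_spread_events`, the exchange name "NYSE" and the prices are all hypothetical; measuring the 2% difference relative to the lower price is also an assumption, since the slide does not say which base is meant.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    exchange: str
    symbol: str
    price: float


def detect_spread_events(ticks, symbol, threshold=0.02):
    """Scan a merged tick stream and report when the relative price gap
    between the highest- and lowest-priced exchange exceeds `threshold`.

    Returns (sell_exchange, other_exchange, spread) tuples. Only the latest
    price per exchange is kept, so no random access to the stream is needed.
    """
    latest = {}
    events = []
    for tick in ticks:
        if tick.symbol != symbol:
            continue
        latest[tick.exchange] = tick.price
        if len(latest) < 2:
            continue  # need prices from at least two exchanges
        high = max(latest, key=latest.get)
        low = min(latest, key=latest.get)
        spread = (latest[high] - latest[low]) / latest[low]
        if spread > threshold:
            events.append((high, low, spread))
    return events


stream = [
    Tick("NASDAQ", "INTC", 20.00),
    Tick("NYSE", "INTC", 20.10),    # gap of 0.5%: no event
    Tick("NASDAQ", "INTC", 20.60),  # gap of about 2.5%: event
]
events = detect_spread_events(stream, "INTC")
for sell_at, other, spread in events:
    print(f"sell INTC at {sell_at} (vs {other}), spread {spread:.2%}")
```

In a fuller system the returned event would feed the action part of a rule (for example, placing the sell order) rather than being printed.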

Page 20:

Active Rules

An active rule is composed of three components:
• Event (E): Monitor - Detect - Evaluate
• Condition (C): Derive - Analyze - Evaluate
• Action (A): Collaborate - Integrate - Effect

Events:
• customer-based (first purchase, new subscription, etc.)
• time-based (birthday, retirement, etc.)
• product-based (launch of a new product, decline in sales, etc.)
• calendar-based (Christmas, Diwali, etc.)
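The Event-Condition-Action structure can be sketched as a small data type plus a dispatcher. The names `ActiveRule` and `RuleEngine`, the event dictionary layout, and the "welcome offer" rule are assumptions for illustration; the slides do not prescribe a rule representation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class ActiveRule:
    name: str
    event_type: str                      # E: which event the rule listens for
    condition: Callable[[Event], bool]   # C: predicate evaluated on the event
    action: Callable[[Event], None]      # A: effect carried out if C holds


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[ActiveRule] = []

    def register(self, rule: ActiveRule) -> None:
        self.rules.append(rule)

    def dispatch(self, event: Event) -> List[str]:
        """Run every matching rule and return the names of rules that fired."""
        fired = []
        for rule in self.rules:
            if rule.event_type != event["type"]:
                continue
            if rule.condition(event):
                rule.action(event)
                fired.append(rule.name)
        return fired


notifications: List[str] = []
engine = RuleEngine()
engine.register(ActiveRule(
    name="welcome-offer",
    event_type="first_purchase",  # a customer-based event
    condition=lambda e: e["amount"] >= 50,
    action=lambda e: notifications.append(f"send welcome offer to {e['customer']}"),
))

fired = engine.dispatch({"type": "first_purchase", "customer": "c42", "amount": 80})
ignored = engine.dispatch({"type": "first_purchase", "customer": "c43", "amount": 10})
print(fired, ignored, notifications)
```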

Page 21:

Rule Examples

• If the value of a transaction is less than $2 and the transaction is made by credit card, notify the fraud detection system and send a notification to the customer service representative to call the credit card holder immediately and check the validity of the transaction.

• If a customer has made at least 3 transactions, or the total value of all transactions is more than $2000, during the Christmas holidays, then offer the customer a 10% discount between January 10 and January 31.

• If the duration of a telephone call exceeds 40 minutes, then send a notification to the fraud detection system.

• If less than 10% of the stock is sold in a retail store by the end of the week, offer a 20% discount on the non-luxury items for the next week.
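As a worked illustration, the condition of the holiday-discount rule can be written as a simple predicate over a customer's transactions. The function name, the `(date, amount)` tuple layout and the chosen holiday window are assumptions; the slide does not define the exact dates of the "Christmas holidays".

```python
from datetime import date


def qualifies_for_discount(transactions, window_start, window_end,
                           min_count=3, min_total=2000.0):
    """Condition part of the rule: at least `min_count` transactions OR a
    total value above `min_total` within the holiday window (inclusive)."""
    amounts = [amount for day, amount in transactions
               if window_start <= day <= window_end]
    return len(amounts) >= min_count or sum(amounts) > min_total


# Hypothetical holiday window and customers.
START, END = date(2006, 12, 20), date(2006, 12, 31)

frequent = [(date(2006, 12, 21), 40.0), (date(2006, 12, 24), 15.0), (date(2006, 12, 30), 60.0)]
big_spender = [(date(2006, 12, 23), 2500.0)]
occasional = [(date(2006, 12, 22), 100.0), (date(2006, 11, 5), 5000.0)]  # second one is outside the window

for label, txns in [("frequent", frequent), ("big_spender", big_spender), ("occasional", occasional)]:
    print(label, qualifies_for_discount(txns, START, END))
```

The action part (offering the 10% discount between January 10 and January 31) would be triggered for each customer for whom the predicate holds.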

Page 22:

Architecture

[Architecture diagram labels: Data Sources (DB, Data Stream, Web, MDB); Adaptors; Monitors; Connectors; Business Logic/Process; Business Logic/Process Feedback; Active Functionalities]

Page 23:

[Diagram labels: Rule Execution Engine; Event Composer and Detector; Workflow Execution; Information Flow Engine; Rules DB; Information Integrator and Decision Analysis; Metadata; Integration Hub]

Page 24:

Context-Oriented Information Integration

Page 25:

Motivation

Current Scenario: Isolated Management of Structured and Unstructured Information

• Structured Data Management (20% of enterprise data): Relational Query to the RDBMS, returning a DB2 Result
• Unstructured Data (Content) Management (80% of enterprise data): Non-relational Query to the CM System, returning retrieved data and documents

Needed: Consolidated Management

Page 26:

Motivation

[Diagram labels: RDBMS; Broker; CM System]

• Relational Query
– Main result: Relational Result
– Addition: Relevant Documents
• Content Query
– Main result: Retrieved documents
– Addition: Relevant database fragment

Consolidated Management (100% of enterprise data)

A broker enables consolidation of information stored in relational and non-relational (CM) systems.

Page 27:

Problem

To enhance structured and unstructured data retrieval through symbiotic consolidation of related information. Specifically:

• Enhance structured data retrieval by associating additional documents relevant to the user context with the query result
• Enhance document contents by associating additional information derived from structured data

Structured data = relations, schema-based (XML) documents
Unstructured data = schema-less (free-flow) documents, web pages

Page 28:

IBM India Research Lab

IBM Research © 2004 IBM Corporation

Solution Overview

[Diagram labels: DB2 Information Integrator; DB2 Enterprise Search Extender; SCORE; sample relational tables with columns C1, C2, C3]

Information need: “Get the 3 companies with max price variation”

SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY

“Doctype:Patents”

“IBM” “ORCL” “MSFT” “Database” “Software”

CIKM 2005 Best Paper Award

Page 29:

Solution Overview

[Diagram labels: DB2 Information Integrator; DB2 Enterprise Search Extender; SCORE; SQL Query Result; SQL Query Context; sample relational tables with columns C1, C2, C3]

SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY

“Doctype:Patents”

“IBM” “ORCL” “MSFT” “Database” “Software”

Page 30:

Main Idea

Specify information need in terms of SQL over the structured database

Additional information needs specified using “directives” (optional)

Automatically synthesize the “context” of the SQL query from its result and the known semantic dependencies in the structured data

Use this context and the directives to retrieve the unstructured data

Page 31:

Overall Architecture

[Architecture diagram. Components: User Interface/Application; SCORE; Query Handler; Context Handler; Metadata mapping; Metadata Repository (Criollo); DB2 II; Structured Data Source; Enterprise Search; CM; Unstructured Data Source. Flow labels: Query + Directives; Query; Modified Query; Modified Query Result; Query Result + Context; Context + Directives; Relevant Documents; Query Result + Relevant Documents; Metadata]

Page 32:

Applications

• Financial: Customer-centric investment account and risk assessment documents

• Health: Patient specific report and medical articles

• Telecommunications and Manufacturing: Defect statistics and engineering specifications

• Marketing: Customer transaction history and marketing documents

Page 33:

Research problems in Active Information Integration

• How to model the active rules? Is it just ECA or something more?
• How to define events and rules along dimension and fact tables in data warehousing?
• Do we need different data modeling schemes for active/real-time data warehousing?
• How to monitor the data sources for active integration?
• What data needs to be materialized for computing the exact change at the data warehouse site?
• How to provide keyword search on a data warehouse, considering that the data is exposed to users as business objects?
• How to translate a keyword query into a more semantic query for OLAP analysis?
• How to handle uncertainty in keyword based queries?
• Is a single data indexing mechanism possible for both unstructured and structured data?
• New architectures for active/real-time data warehousing: handling a large number of queries with near sub-second response times.

Page 34:

Conclusions

• Integration Evolution
– 1980: Federated Data Integration
– 1985: Data warehouse
– 1995: Operational data store
– 1999: Client information integration
– 2003: Information integration

• Active Information Integration Approaches
– Rule based Information Integration
– Correlating structured and unstructured data