Chaps26.28.29-1 CSE 4701 Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e Database System Architectures,...


Chaps26.28.29-1

CSE 4701

Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e

Database System Architectures, Data

Mining/Warehousing, Web DB

Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department

The University of Connecticut, 191 Auditorium Road, Box U-155

Storrs, CT 06269-3155 [email protected]

http://www.engr.uconn.edu/~steve (860) 486 - 4818

A portion of these slides are being used with the permission of Dr. Ling Lui, Associate Professor, College of Computing, Georgia Tech.

Remaining slides represent new material.

Chaps26.28.29-2

CSE 4701

Classical and Distributed Architectures
- Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward
- Problems of this Approach
  - Difficult to Scale w.r.t. Performance Gains
  - If DB Overloaded, Replace with a Faster Computer
  - This can Only Go So Far - Disk Bottlenecks
- Distributed DBMS have Evolved to Address a Number of Issues
  - Improved Performance
  - Putting Data “Near” Location where it is Needed
  - Replication of Data for Fault Tolerance
  - Vertical and Horizontal Partitioning of DB Tuples

Chaps26.28.29-3

CSE 4701

Common Features of Centralized DBMS
- Data Independence
  - High-Level Representation via Conceptual and External Schemas
  - Physical Representation (Internal Schema) Hidden
- Program Independence
  - Multiple Applications can Share Data
  - Views/External Schema Support this Capability
- Reduction of Program/Data Redundancy
  - Single, Unique, Conceptual Schema
  - Shared Database
  - Almost No Data Redundancy
- Controlled Data Access Reduces Inconsistencies
  - Programs Execute with Consistent Results

Chaps26.28.29-4

CSE 4701

Common Features of Centralized DBMS
- Promote Sharing: Automatically Provided via CC
  - No Longer Programmatic Issue
  - Most DBMS Offer Locking for Key Shared Data
  - Oracle Allows Locks on Data Item (Attributes)
    - For Example, Controlling Access to Shared Identifier
- Coherent and Central DB Administration
- Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules
- Data Resiliency
  - Physical Integrity of Data in the Presence of Faults and Errors
  - Supported by DB Recovery
- Data Security: Control Access for Authorized Users Against Sensitive Data

Chaps26.28.29-5

CSE 4701

Shared Nothing Architecture
- In this Architecture, Each DBMS Operates Autonomously
- There is No Sharing
- Three Separate DBMSs on Three Different Computers
- Applications/Clients Must Know About the External Schemas of all Three DBMSs for Database Retrieval
- Client Processing Complicates Client
  - Different DBMS Platforms (Oracle, Sybase, Informix, ...)
  - Different Access Modes (Query, Embedded, ODBC)
  - Difficult for SWE to Code

Chaps26.28.29-6

CSE 4701

Difficulty in Access - Manage Multiple APIs
- Each Platform has a Different API: API1, API2, ..., APIn
- An App Programmer Must Utilize All Three APIs, which could Differ by PL - C++, C, Java, REST, etc.
- Any Interactions Across 3 DBs Must be Programmatically Handled without DB Capabilities

Chaps26.28.29-7

CSE 4701

NW Architecture with Centralized DB
- High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide
- Clients at Any Site can Access Repository
- Data May be “Far” Away - Increased Access Time
- In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2
- Inefficient, no Replication w.r.t. Failure

Chaps26.28.29-8

CSE 4701

Fully Distributed Architecture
- The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - it is Distributed
- Replication is Possible for Fault Tolerance
- Queries at one Site May Need to Access Data at Another Site (e.g., for a Join)
- Increased Transaction Processing Complexity

Chaps26.28.29-9

CSE 4701

Distributed Database Concepts
- A transaction can be executed by multiple networked computers in a unified manner.
- A distributed database (DDB) processes a unit of execution (a transaction) in a distributed manner.
- A distributed database (DDB) can be defined as a collection of multiple logically related databases distributed over a computer network.
- A distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user.

Chaps26.28.29-10

CSE 4701

Goals of DDBMS
- Support User Distribution Across Multiple Sites
  - Remote Access by Users Regardless of Location
  - Distribution and Replication of Database Content
- Provide Location Transparency
  - Users Manipulate their Own Data
  - Non-Local Sites “Appear” Local to Any User
- Provide Transaction Control Akin to Centralized Case
  - Transaction Control Hides Distribution
  - CC and Serializability - Must be Extended
- Minimize Communications Cost
  - Optimize Use of Network - a Critical Issue
  - Distribute DB Design Supported by Partitioning (Fragmentation) and Replication

Chaps26.28.29-11

CSE 4701

Goals of DDBMS
- Improve Response Time for DB Access
  - Use a More Sophisticated Load Control for Transaction Processing
  - However, Synchronization Across Sites May Introduce Additional Overhead
- System Availability
  - Site Independence in the Presence of Site Failure
  - Subset of Database is Always Available
  - Replication can Keep All Data Available, Even When Multiple Sites Fail
- Modularity
  - Incremental Growth with the Addition of Sites
  - Dedicate Sites to Specific Tasks

Chaps26.28.29-12

CSE 4701

Advantages of DDBMS
- There are Four Major Advantages
- Transparency
  - Distribution/NW Transparency
    - User Doesn’t Know about NW Configuration (Location Transparency)
    - User can Find Object at any Site (Naming Transparency)
  - Replication Transparency (see next PPT)
    - User Doesn’t Know Location of Data
    - Replicas are Transparently Accessible
  - Fragmentation Transparency
    - Horizontal Fragmentation (Distribute by Row)
    - Vertical Fragmentation (Distribute by Column)

Chaps26.28.29-13

CSE 4701

Data Distribution and Replication

Chaps26.28.29-14

CSE 4701

Other Advantages of DDBMS
- Increased Reliability and Availability
  - Reliability - System Always Running
  - Availability - Data Always Present
  - Achieved via Replication and Distribution
  - Ability to Make Single Query for Entire DDBMS
- Improved Performance
  - Sites Able to Utilize Data that is Local for Majority of Queries
- Easier Expansion
  - Improve Performance of Site by
    - Upgrading Processor of Computer
    - Adding Additional Disks
    - Splitting a Site into Two or More Sites
  - Expansion over Time as Business Grows

Chaps26.28.29-15

CSE 4701

Challenges of DDBMS
- Tracking Data - Meta Data More Complex
  - Must Track Distribution (where is the Data)
  - V & H Fragmentation (How is Data Split)
  - Replication (Multiple Copies for Consistency)
- Distributed Query Processing
  - Optimization, Accessibility, etc., More Complex
  - Block Analysis of Data Size Must also Now Consider the NW Transmitting Time
- Distributed Transaction Processing
  - TP Potentially Spans Multiple Sites
  - Submit Query to Multiple Sites
  - Collect and Collate Results
- Distributed Concurrency Control Across Nodes

Chaps26.28.29-16

CSE 4701

Challenges of DDBMS
- Replicated Data Management
  - TP Must Choose the Replica to Access
  - Updates Must Modify All Replica Copies
- Distributed Database Recovery
  - Recovery of Individual Sites
  - Recovery Across DDBMS
- Security
  - Local and Remote Authorization
  - During TP, be Able to Verify Remote Privileges
- Distributed Directory Management
  - Meta-Data on Database - Local and Remote
  - Must Maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites

Chaps26.28.29-17

CSE 4701

A Complete Schema with Keys ...

Keys Allow us to Establish Links Between Relations

Chaps26.28.29-18

CSE 4701

…and Corresponding DB Tables

which Represent Tuples/Instances of Each Relation


Chaps26.28.29-19

CSE 4701

…with Remaining DB Tables

Chaps26.28.29-20

CSE 4701

What is Fragmentation?
- Fragmentation Divides a DB Across Multiple Sites
- Two Types of Fragmentation
  - Horizontal Fragmentation
    - Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites
    - Each Site has a Subset of the n Tuples
    - Essentially Fragmentation is a Selection
  - Vertical Fragmentation
    - Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites
    - Essentially Fragmentation is a Projection
    - Not Generally Utilized in Practice
- In Both Cases, Sites can Overlap for Replication

Chaps26.28.29-21

CSE 4701

Horizontal Fragmentation
- A horizontal fragment is a subset of the tuples of a relation that satisfy a selection condition.
- Consider the Employee relation with the condition DNO = 5.
  - All tuples satisfying this condition form a subset that is a horizontal fragment of the Employee relation.
- A selection condition may be composed of several conditions connected by AND or OR.
- Derived horizontal fragmentation:
  - The partitioning of a primary relation is applied to secondary relations that are related to it through foreign keys.
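A minimal Python sketch of horizontal fragmentation as a selection, assuming hypothetical in-memory EMPLOYEE and WORKS_ON relations. The tuples, the attribute names, and the helper `horizontal_fragment` are illustrative assumptions and do not come from the slides.

```python
# Hypothetical tuples; only the DNO = 5 condition comes from the slide.
EMPLOYEE = [
    {"ssn": "111", "lname": "Smith", "dno": 5},
    {"ssn": "222", "lname": "Wong", "dno": 5},
    {"ssn": "333", "lname": "Zelaya", "dno": 4},
    {"ssn": "444", "lname": "Borg", "dno": 1},
]
WORKS_ON = [
    {"essn": "111", "pno": 1},
    {"essn": "333", "pno": 30},
    {"essn": "222", "pno": 2},
]


def horizontal_fragment(relation, predicate):
    """Return the tuples of `relation` that satisfy the selection predicate."""
    return [row for row in relation if predicate(row)]


# Primary fragments: one selection condition per fragment.
emp_d5 = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] == 5)
emp_d4 = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] == 4)
emp_rest = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] not in (4, 5))

# Derived fragment: WORKS_ON tuples follow their employee via the foreign key.
d5_ssns = {e["ssn"] for e in emp_d5}
works_on_d5 = horizontal_fragment(WORKS_ON, lambda w: w["essn"] in d5_ssns)

# Disjoint fragments that cover every tuple reconstruct the relation by union.
reconstructed = emp_d5 + emp_d4 + emp_rest
```

Because the three selection conditions here are disjoint and together cover all tuples, concatenating the fragments returns every EMPLOYEE tuple exactly once.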

Chaps26.28.29-22

CSE 4701

Horizontal Fragmentation
- Site 2 Tracks All Information Related to Dept. 5

Chaps26.28.29-23

CSE 4701

Horizontal Fragmentation
- Site 3 Tracks All Information Related to Dept. 4
- Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments

Chaps26.28.29-24

CSE 4701

Refined Horizontal Fragmentation
- Further Fragment from Site 2 based on Dept. that Employee Works in
- Notice that G1 + G2 + G3 is the Same as WORKS_ON5
- There is no Overlap

Chaps26.28.29-25

CSE 4701

Refined Horizontal Fragmentation
- Further Fragment from Site 3 based on Dept. that Employee Works in
- Notice that G4 + G5 + G6 is the Same as WORKS_ON4
- Note Some Fragments can be Empty

Chaps26.28.29-26

CSE 4701

Vertical Fragmentation
- Subset of a relation created via a subset of columns.
  - A vertical fragment of a relation contains the values of selected columns.
  - No selection condition is used in vertical fragmentation.
- A strict vertical slice/partition
  - Consider the Employee relation.
  - A vertical fragment can be created by keeping the values of Name, Bdate, Sex, and Address.
- Since no condition is used for creating a vertical fragment:
  - Each fragment must include the primary key attribute of the parent relation Employee.
  - All vertical fragments of a relation are connected through this key.
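A small Python sketch of vertical fragmentation as a projection that always keeps the key, followed by reconstruction through a join on that key. The sample tuples and the helpers `vertical_fragment` and `rejoin` are assumptions made for illustration.

```python
# Hypothetical Employee tuples; "ssn" plays the role of the primary key.
EMPLOYEE = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "salary": 30000, "dno": 5},
    {"ssn": "333", "name": "Zelaya", "bdate": "1968-07-19", "salary": 25000, "dno": 4},
]


def vertical_fragment(relation, key, attributes):
    """Project each tuple onto the key plus the chosen attributes."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]


def rejoin(left, right, key):
    """Reconnect two vertical fragments by matching on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


emp_demo = vertical_fragment(EMPLOYEE, "ssn", ["name", "bdate"])
emp_work = vertical_fragment(EMPLOYEE, "ssn", ["salary", "dno"])
rebuilt = rejoin(emp_demo, emp_work, "ssn")
```

Without the key column in every fragment, there would be no way to tell which salary belongs to which name, which is why the slide insists on it.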

Chaps26.28.29-27

CSE 4701

Vertical Fragmentation Example
- Partition the Employee Table as Below
- Notice Each Vertical Fragment Needs Key Column
[Figure: the Employee table split into two vertical fragments, EmpDemo and EmpSupvrDept]

Chaps26.28.29-28

CSE 4701

Homogeneous DDBMS
- Homogeneous
  - Identical Software (w.r.t. Database)
  - One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites
  - Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.
  - Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …)

Chaps26.28.29-29

CSE 4701

Non-Federated Heterogeneous DDBMS
- Non-Federated Heterogeneous
  - Different Software (w.r.t. Database)
  - Multiple DB Products (e.g., Oracle at One Site, Access at Another, Sybase, Informix, etc.)
  - Replicated Administration (e.g., Users Need Accounts on Multiple Systems)
  - Varied Programmatic Access - SWEs Must Know All Platforms/Client Software
  - Complicated
  - Very Close to Shared Nothing Architecture

Chaps26.28.29-30

CSE 4701

Federated DDBMS
- Federated
  - Multiple DBMS Platforms Overlaid with a Global Schema View
  - Single External Schema Combines Schemas from all Sites
- Multiple Data Models
  - Relational in one Component DBS
  - Object-Oriented in another DBS
  - Hierarchical in a 3rd DBS

Chaps26.28.29-31

CSE 4701

Federated DBMS Issues
- Differences in Data Models
  - Reconcile Relational vs. Object-Oriented Models
  - Each Different Model has Different Capabilities
  - These Differences Must be Addressed in Order to Present a Federated Schema
- Differences in Constraints
  - Referential Integrity Constraints in Different DBSs
  - Different Constraints on “Similar” Data
  - Federated Schema Must Deal with these Conflicts
- Differences in Query Languages
  - SQL-89, SQL-92, SQL2, SQL3
  - Specific Types in Different DBMS (Oracle Blobs)
- Differences in Key Processing & Timestamping

Chaps26.28.29-32

CSE 4701

Heterogeneous Distributed Database Systems
- Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
  - The degree of local autonomy is minimum.
  - Each site must adhere to a centralized access policy.
  - There may be a global schema.
- Multi-database: There is no one conceptual global schema.
  - For data access, a schema is constructed dynamically as needed by the application software.
[Figure: five sites (Site 1 to Site 5) connected by a communications network, running relational, object-oriented, network, and hierarchical DBMSs on Unix, Linux, and Windows platforms]

Chaps26.28.29-33

CSE 4701

Query Processing in Distributed Databases
- Issues
  - Cost of transferring data (files and results) over the network.
  - This cost is usually high, so some optimization is necessary.
- Example relations: Employee at site 1 and Department at site 2
  - Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
  - Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
  - Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
  - Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
- Q: For each employee, retrieve the employee name and the name of the department where the employee works.
  - Q: $\pi_{\text{Fname, Lname, Dname}}(\text{Employee} \bowtie_{\text{Dno} = \text{Dnumber}} \text{Department})$

Chaps26.28.29-34

CSE 4701

Query Processing in Distributed Databases
- Result
  - The result of this query will have 10,000 tuples, assuming that every employee is related to a department.
  - Suppose each result tuple is 40 bytes long.
  - The query is submitted at site 3 and the result is sent to this site.
  - Problem: Employee and Department relations are not present at site 3.

Chaps26.28.29-35

CSE 4701

Query Processing in Distributed Databases
- Strategies:
  1. Transfer Employee and Department to site 3.
     - Total transfer bytes = 1,000,000 + 3,500 = 1,003,500 bytes.
  2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
     - Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
  3. Transfer Department relation to site 1, execute the join at site 1, and send the result to site 3.
     - Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
- Optimization criteria: minimizing data transfer.
- Preferred approach: strategy 3.
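The arithmetic behind the three strategies can be checked with a short Python sketch. The sizes come from the slides; the function name `strategy_costs` and the dictionary layout are assumptions for illustration.

```python
# Sizes taken from the example: 10,000 x 100 bytes and 100 x 35 bytes.
EMPLOYEE_BYTES = 10_000 * 100
DEPARTMENT_BYTES = 100 * 35
RESULT_TUPLE_BYTES = 40


def strategy_costs(result_tuples):
    """Bytes shipped by each strategy when the result is needed at site 3."""
    result_bytes = result_tuples * RESULT_TUPLE_BYTES
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,      # join at site 2, ship result
        3: DEPARTMENT_BYTES + result_bytes,    # join at site 1, ship result
    }


costs_q = strategy_costs(result_tuples=10_000)
best = min(costs_q, key=costs_q.get)  # strategy with the fewest bytes moved
```

Only the result cardinality changes between queries over the same two relations, so the same function can be reused with a different `result_tuples` value.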

Chaps26.28.29-36

CSE 4701

Query Processing in Distributed Databases
- Consider the query
  - Q’: For each department, retrieve the department name and the name of the department manager
- Relational Algebra expression:
  - $\pi_{\text{Fname, Lname, Dname}}(\text{Employee} \bowtie_{\text{Mgrssn} = \text{SSN}} \text{Department})$

Chaps26.28.29-37

CSE 4701

Query Processing in Distributed Databases
- The result of this query has 100 tuples, assuming that every department has a manager. The execution strategies are:
  1. Transfer Employee and Department to the result site and perform the join at site 3.
     - Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes.
  2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
     - Query result size = 40 * 100 = 4,000 bytes. Total transfer size = 4,000 + 1,000,000 = 1,004,000 bytes.
  3. Transfer Department relation to site 1, execute join at site 1 and send the result to site 3.
     - Total transfer size = 4,000 + 3,500 = 7,500 bytes.
- Preferred strategy: strategy 3.

Chaps26.28.29-38

CSE 4701

Query Processing in Distributed Databases
- Now suppose the result site is 2. Possible strategies:
  1. Transfer Employee relation to site 2, execute the query and present the result to the user at site 2.
     - Total transfer size = 1,000,000 bytes for both queries Q and Q’.
  2. Transfer Department relation to site 1, execute join at site 1 and send the result back to site 2.
     - Total transfer size for Q = 400,000 + 3,500 = 403,500 bytes and for Q’ = 4,000 + 3,500 = 7,500 bytes.
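A slightly more general Python sketch, which parameterizes the join site and the result site, reproduces these figures. The names `SITE_OF`, `SIZE`, and `transfer_cost` are hypothetical, and the cost model counts only bytes shipped, as the slides do.

```python
# Placement and sizes from the example; the cost model counts shipped bytes only.
SITE_OF = {"Employee": 1, "Department": 2}
SIZE = {"Employee": 1_000_000, "Department": 3_500}


def transfer_cost(join_site, result_site, result_tuples, tuple_bytes=40):
    """Bytes moved to run the join at `join_site` and deliver it to `result_site`."""
    # Every relation not already stored at the join site must be shipped there.
    shipped = sum(SIZE[rel] for rel, site in SITE_OF.items() if site != join_site)
    # The result travels only if it is produced somewhere other than where it is needed.
    result = result_tuples * tuple_bytes if join_site != result_site else 0
    return shipped + result


# Result site 2, query Q (10,000 result tuples) and query Q' (100 result tuples).
q_join_at_2 = transfer_cost(join_site=2, result_site=2, result_tuples=10_000)
q_join_at_1 = transfer_cost(join_site=1, result_site=2, result_tuples=10_000)
qp_join_at_1 = transfer_cost(join_site=1, result_site=2, result_tuples=100)
```

Under this model, joining at site 1 is cheaper for both queries even though the result has to be shipped back, because Department is far smaller than Employee.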

Chaps26.28.29-39

CSE 4701

DDBS Concurrency Control and Recovery
- Distributed databases encounter a number of concurrency control and recovery problems which are not present in centralized databases, including:
  - Dealing with multiple copies of data items
    - How are they All Updated if Needed?
  - Failure of individual sites
    - How are Queries Restarted or Rerouted?
  - Communication link failure
    - Network Failure
  - Distributed commit
    - How to Know All Updates Done at all Sites?
  - Distributed deadlock
    - How to Detect and Recover?

Chaps26.28.29-40

CSE 4701

Data Warehousing and Data Mining
- Data Warehousing
  - Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making
  - Underlying Infrastructure in Support of Mining
  - Provides Means to Interact with Multiple DBs
  - OLAP (On-Line Analytical Processing) vs. OLTP
- Data Mining
  - Discovery of Information in Vast Data Sets
  - Search for Patterns and Common Features
  - Discover Information not Previously Known
    - Medical Records Accessible Nationwide
    - Research/Discover Cures for Rare Diseases
  - Relies on Knowledge Discovery in DBs (KDD)

Chaps26.28.29-41

CSE 4701

What is Purpose of a Data Warehouse?
- Traditional databases are not optimized purely for data access; they must balance data access against the need to ensure integrity.
- Most data warehouse users need only read access, but need the access to be fast over a large volume of data.
- Most of the data required for data warehouse analysis comes from multiple databases, and these analyses are recurrent and predictable enough that software can be designed to meet the requirements.
- Critical for tools that provide decision makers with information to make decisions quickly and reliably based on historical data.
- The aforementioned characteristics are achieved by data warehousing and online analytical processing (OLAP).
- W. H. Inmon characterized a data warehouse as:
  - “A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions.”

Chaps26.28.29-42

CSE 4701

Data Warehousing and OLAP
- A Data Warehouse
  - Database is Maintained Separately from an Operational Database
  - “A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support for Management’s Decision Making Process” [W.H. Inmon]
- OLAP (On-Line Analytical Processing)
  - Analysis of Complex Data in the Warehouse
  - Attempt to Attain “Value” through Analysis
  - Relies on Trained and Adept Skilled Knowledge Workers who Discover Information
- Data Mart
  - Organized Data for a Subset of an Organization

Chaps26.28.29-43

CSE 4701

Conceptual Structure of Data Warehouse
- Data Warehouse processing involves
  - Cleaning and reformatting of data
  - OLAP
  - Data Mining
[Figure: conceptual structure with the labels Databases, Cleaning, Reformatting, Data Warehouse (Data, Metadata), Updates/New Data, Back Flushing, Other Data Inputs, OLAP, DSSI, EIS, and Data Mining]

Chaps26.28.29-44

CSE 4701

Building a Data Warehouse
- Option 1: Consolidate Data Marts
  - Leverage Existing Repositories
  - Collate and Collect
  - May Not Capture All Relevant Data
- Option 2: Build from Scratch
  - Start from Scratch
  - Utilize Underlying Corporate Data
[Figure: a corporate data warehouse built either by consolidating several data marts (Option 1) or from scratch out of corporate data (Option 2)]

Chaps26.28.29-45

CSE 4701

Comparison with Traditional Databases
- Data warehouses are mainly optimized for appropriate data access.
  - Traditional databases are transactional.
  - Optimized for both access mechanisms and integrity assurance measures.
- Data warehouses emphasize historical data, as they support time-series and trend analysis.
- Compared with transactional databases, data warehouses are nonvolatile.
- In transactional databases, the transaction is the mechanism of change to the database.
- In a warehouse, data is relatively coarse grained and the refresh policy is carefully chosen, usually incremental.

Chaps26.28.29-46

CSE 4701

Classification of Data Warehouses
- Generally, Data Warehouses are an order of magnitude larger than the source databases.
- The sheer volume of data is an issue, based on which Data Warehouses could be classified as follows.
  - Enterprise-wide data warehouses
    - Huge projects requiring massive investment of time and resources.
  - Virtual data warehouses
    - Provide views of operational databases that are materialized for efficient access.
  - Data marts
    - Generally targeted to a subset of an organization, such as a department, and are more tightly focused.

Chaps26.28.29-47

CSE 4701

Data Warehouse Characteristics
- Utilizes a “Multi-Dimensional” Data Model
- Warehouse Comprised of
  - Store of Integrated Data from Multiple Sources
  - Processed into Multi-Dimensional Model
- Warehouse Supports Time Series and Trend Analysis
  - “Super-Excel” Integrated with DB Technologies
- Data is Less Volatile than Regular DB
  - Doesn’t Dramatically Change Over Time
  - Updates at Regular Intervals
  - Specific Refresh Policy Regarding Some Data

Chaps26.28.29-48

CSE 4701

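Three Tier Architecture
[Figure: operational databases and external data sources are extracted, transformed, loaded, and refreshed through a monitor and integrator into the Data Warehouse (with metadata) and data marts; these serve an OLAP Server, which supports summarization reports, query reports, and data mining]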

Chaps26.28.29-49

CSE 4701

Data Modeling for Data Warehouses
- Traditional databases generally deal with two-dimensional data (similar to a spreadsheet).
  - However, querying performance in a multi-dimensional data storage model is much more efficient.
- Data warehouses can take advantage of this feature as generally
  - These are non-volatile.
  - The degree of predictability of the analysis that will be performed on them is high.

Chaps26.28.29-50

CSE 4701

What is a Multi-Dimensional Data Cube?
- Representation of Information in Two or More Dimensions
- Typical Two-Dimensional - Spreadsheet
- In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful
- Aggregate Raw Data!

Chaps26.28.29-51

CSE 4701

Multi-Dimensional Schemas
- Supporting Multi-Dimensional Schemas Requires Two Types of Tables:
  - Dimension Table: Tuples of Attributes for Each Dimension
  - Fact Table: Measured/Observed Variables with Pointers into Dimension Table
- Star Schema
  - Characterizes Data Cubes by having a Single Fact Table with a Single Table for Each Dimension
- Snowflake Schema
  - Dimension Tables from Star Schema are Organized into Hierarchy via Normalization
- Both Represent Storage Structures for Cubes

Chaps26.28.29-52

CSE 4701

Data Modeling for Data Warehouses
- Advantages of a multi-dimensional model
  - Multi-dimensional models lend themselves readily to hierarchical views in what is known as roll-up display & drill-down display.
  - The data can be directly queried in any combination of dimensions, bypassing complex database queries.

Chaps26.28.29-53

CSE 4701

Data Warehouse Design
- Most Data Warehouses use a Star Schema to Represent Multi-Dimensional Data Model
- Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates and Stores Measures for those Coordinates
- A Fact Table Connects All Dimension Tables with a Multiple Join
  - Each Tuple in Fact Table Represents the Content of One Dimension
  - Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimensional Tables
- Links Between the Fact Table and the Dimensional Tables Form a Shape Like a Star

Chaps26.28.29-54

CSE 4701

Sample Fact Tables

Chaps26.28.29-55

CSE 4701

Example of Star Schema
- Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
- Dimension Tables
  - Product: ProductNo, ProdName, ProdDesc, Category
  - Customer: CustID, CustName, CustCity, CustCountry
  - Date: Date, Month, Year
  - Store: StoreID, City, State, Country, Region
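A reduced version of this star schema can be sketched with Python's built-in `sqlite3` module. Only the Product and Store dimensions are included, the rows are invented sample data, and the lower-case table names are assumptions; the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (ProductNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
CREATE TABLE store   (StoreID INTEGER PRIMARY KEY, City TEXT, Region TEXT);
-- Fact table: one pointer per dimension plus the measured values.
CREATE TABLE sale (
    Product      INTEGER REFERENCES product(ProductNo),
    Store        INTEGER REFERENCES store(StoreID),
    Unit_Sales   INTEGER,
    Dollar_Sales REAL
);
INSERT INTO product VALUES (1, 'Skim milk', 'Dairy'), (2, 'Wheat bread', 'Bakery');
INSERT INTO store   VALUES (10, 'Atlanta', 'South'), (20, 'Chicago', 'Central');
INSERT INTO sale    VALUES (1, 10, 3, 9.0), (2, 10, 1, 4.0), (1, 20, 2, 6.0);
""")

# Join the fact table to its dimensions and aggregate by dimension attributes.
rows = conn.execute("""
    SELECT s.Region, p.Category, SUM(f.Dollar_Sales)
    FROM sale AS f
    JOIN product AS p ON f.Product = p.ProductNo
    JOIN store   AS s ON f.Store = s.StoreID
    GROUP BY s.Region, p.Category
    ORDER BY s.Region, p.Category
""").fetchall()
conn.close()
```

Each fact row carries one foreign key per dimension, so any combination of dimension attributes can be used for grouping with the same join pattern.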

Chaps26.28.29-56

CSE 4701

A Second Example of Star Schema …

Chaps26.28.29-57

CSE 4701

and Corresponding Snowflake Schema

Chaps26.28.29-58

CSE 4701

Multi-dimensional Schemas
- Fact Constellation
  - A fact constellation is a set of fact tables that share some dimension tables.
  - However, fact constellations limit the possible queries for the warehouse.

Chaps26.28.29-59

CSE 4701

Fact Table i2b2 (Integrating Biology & Bedside)

Chaps26.28.29-60

CSE 4701

Data Warehouse Issues
- Data Acquisition
  - Extraction from Heterogeneous Sources
  - Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent
  - Data Cleaning for Validity and Quality
    - Is the Data as Expected w.r.t. Content? Value?
  - Transition of Data into Data Model of Warehouse
  - Loading of Data into the Warehouse
- Other Issues Include:
  - How Current is the Data? Frequency of Update?
  - Availability of Warehouse?
  - Dependencies of Data?
  - Distribution, Replication, and Partitioning Needs?
  - Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?

Chaps26.28.29-61

CSE 4701

OLAP Strategies
- OLAP Strategies
  - Roll-Up: Summarization of Data
  - Drill-Down: from the General to Specific (Details)
  - Pivot: Cross Tabulate the Data Cubes
  - Slice and Dice: Projection Operations Across Dimensions
  - Sorting: Ordering Result Sets
  - Selection: Access by Value or Value Range
- Implementation Issues
  - Persistent with Infrequent Updates (Loading)
  - Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes
  - Recovery Less Critical - Mostly Read Only
  - Temporal Aspects of Data (Versions) Important

Chaps26.28.29-62

CSE 4701

Knowledge Discovery
- Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully
- Knowledge Discovery
  - Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set
  - Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from Large Collection of Data
- Data Mining
  - A Critical Step in the Knowledge Discovery Process
  - Extracts Implicit Information from Large Data Set
- KDD: Knowledge Discovery and Data Mining

Chaps26.28.29-63

CSE 4701

Steps in a KDD Process
- Learning the Application Domain (goals)
- Gathering and Integrating Data
- Data Cleaning
- Data Integration
- Data Transformation/Consolidation
- Data Mining
  - Choosing the Mining Method(s) and Algorithm(s)
  - Mining: Search for Patterns or Rules of Interest
- Analysis and Evaluation of the Mining Results
- Use of Discovered Knowledge in Decision Making
- Important Caveats
  - This is Not an Automated Process!
  - Requires Significant Human Interaction!

Chaps26.28.29-64

CSE 4701

Processing in a Data Warehouse
- Processing Types are Varied and Include:
  - Roll-up: Data is summarized with increasing generalization.
  - Drill-Down: Increasing levels of detail are revealed.
  - Pivot: Cross tabulation is performed.
  - Slice and dice: Performing projection operations on the dimensions.
  - Sorting: Data is sorted by ordinal value.
  - Selection: Data is available by value or range.
  - Derived attributes: Attributes are computed by operations on stored and derived values.

Chaps26.28.29-65

CSE 4701

On-Line Analytical Processing
- Data Cube
  - A Multidimensional Array
  - Each Attribute is a Dimension
- In Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date

Product        Store      Date      Sale
acron          Rolla,MO   7/3/99    325.24
budwiser       LA,CA      5/22/99   833.92
large pants    NY,NY      2/12/99   771.24
3’ diaper      Cuba,MO    7/30/99   81.99

[Figure: data cube with dimensions Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April)]

Chaps26.28.29-66

CSE 4701

Examples of Data Mining
- The Slicing Action
  - A Vertical or Horizontal Slice Across Entire Cube
[Figure: multi-dimensional data cube of Sales with dimensions Months, Cities, and Products, and the result of a slice on city Atlanta]

Chaps26.28.29-67

CSE 4701

Examples of Data Mining
- The Dicing Action
  - A Slice First Identifies One Dimension
  - A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions
[Figure: the Atlanta slice of the Sales cube (Months by Products), and a dice on Electronics and Atlanta for March 2000]
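Slicing and dicing can be sketched in Python over a cube stored as a dictionary keyed by (month, city, product). The sales figures and the helper `select_cells` are invented for illustration; only the Atlanta, Electronics, and March coordinates echo the slides.

```python
# Hypothetical cube cells: (month, city, product) -> sales.
DIMS = ("month", "city", "product")
CUBE = {
    ("Jan", "Atlanta", "Electronics"): 120,
    ("Feb", "Atlanta", "Electronics"): 90,
    ("Mar", "Atlanta", "Electronics"): 150,
    ("Mar", "Atlanta", "Clothing"): 60,
    ("Mar", "Chicago", "Electronics"): 200,
}


def select_cells(cube, **fixed):
    """Keep the cells whose coordinates match every fixed dimension value."""
    wanted = {DIMS.index(dim): value for dim, value in fixed.items()}
    return {
        coords: sales
        for coords, sales in cube.items()
        if all(coords[i] == value for i, value in wanted.items())
    }


# Slice: fix one dimension, keep the other two free.
atlanta_slice = select_cells(CUBE, city="Atlanta")
# Dice: constrain all three dimensions down to a sub-cube (here a single cell).
diced = select_cells(CUBE, city="Atlanta", product="Electronics", month="Mar")
```

In this representation a slice and a dice differ only in how many dimensions are constrained.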

Chaps26.28.29-68

CSE 4701

Examples of Data Mining
- Drill Down - Takes a Facet (e.g., Q1) and Decomposes into Finer Detail
- Roll Up: Combines Multiple Dimensions, From Individual Cities to State
[Figure: Products Sales cube by quarter (Q1 to Q4) and Location (city, GA: Atlanta, Columbus, Gainesville, Savannah); a drill down on Q1 gives Jan, Feb, March; a roll up on Location (State, USA) gives Arizona, California, Georgia, Iowa]
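Roll-up and drill-down both amount to re-aggregating the same facts at a different level of a dimension hierarchy. A minimal Python sketch follows, with invented sales rows (including the hypothetical Iowa city) and an assumed helper `aggregate`.

```python
from collections import defaultdict

# Hypothetical facts: (quarter, month, state, city, sales).
FIELDS = ("quarter", "month", "state", "city")
SALES = [
    ("Q1", "Jan", "Georgia", "Atlanta", 100),
    ("Q1", "Feb", "Georgia", "Atlanta", 80),
    ("Q1", "Mar", "Georgia", "Savannah", 50),
    ("Q2", "Apr", "Georgia", "Columbus", 70),
    ("Q1", "Jan", "Iowa", "Des Moines", 40),
]


def aggregate(rows, *group_by):
    """Sum sales over the named grouping fields."""
    idx = [FIELDS.index(name) for name in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


# Roll up: summarize from individual cities to states, per quarter.
by_state_quarter = aggregate(SALES, "state", "quarter")
# Drill down: take the Q1 facet and decompose it into months.
q1_rows = [row for row in SALES if row[0] == "Q1"]
q1_by_month = aggregate(q1_rows, "month")
```

The same `aggregate` call serves both directions; only the chosen level of each hierarchy (city vs. state, quarter vs. month) changes.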

Chaps26.28.29-69

CSE 4701

Mining Other Types of Data
- Analysis and Access Dramatically More Complicated!
  - Time series data
  - Geographical and Satellite Data
  - Spatial databases
  - Multimedia databases
  - World Wide Web

Chaps26.28.29-70

CSE 4701

Advantages/Objectives of Data Mining
- Descriptive Mining
  - Discover and Describe General Properties
  - 60% of People who Buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months
- Predictive Mining
  - Infer Interesting Properties based on Available Data
  - People who Buy Beer on Friday usually also Buy Nuts or Chips
- Result of Mining
  - Order from Chaos
  - Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.
  - Impact on Marketing Strategy

Chaps26.28.29-71

CSE 4701

Data Mining Methods
- Association
  - Discover the Frequency of Items Occurring Together in a Transaction or an Event
  - Example
    - 80% of Customers who Buy Milk also Buy Bread
      - Hence - Bread and Milk Adjacent in Supermarket
    - 50% of Customers Forget to Buy Milk/Soda/Drinks
      - Hence - Available at Register
- Prediction
  - Predicts Some Unknown or Missing Information based on Available Data
  - Example
    - Forecast Sale Value of Electronic Products for Next Quarter via Available Data from Past Three Quarters

Chaps26.28.29-72

CSE 4701

Association Rules
- Motivated by Market Analysis
- Rules of the Form
  - $Item_1 \wedge Item_2 \wedge \dots \wedge Item_k \rightarrow Item_{k+1} \wedge \dots \wedge Item_n$
  - Example: “Beer ∧ Soft Drink → Pop Corn”
- Problem: Discovering All Interesting Association Rules in a Large Database is Difficult!
  - Issues
    - Interestingness
    - Completeness
    - Efficiency
- Basic Measurement for Association Rules
  - Support of the Rule
  - Confidence of the Rule
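The slide names support and confidence without defining them. Under the usual definitions (support = fraction of transactions containing every item of the rule; confidence = support of the whole rule divided by support of its left-hand side), they can be computed as in the Python sketch below. The transactions are invented sample data.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"beer", "soft drink", "popcorn"},
    {"beer", "soft drink"},
    {"beer", "popcorn"},
    {"milk", "bread"},
    {"beer", "soft drink", "popcorn", "chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


# Rule: Beer AND Soft Drink -> Pop Corn
lhs, rhs = {"beer", "soft drink"}, {"popcorn"}
rule_support = support(lhs | rhs, TRANSACTIONS)
rule_confidence = confidence(lhs, rhs, TRANSACTIONS)
```

A rule is typically reported only if both values exceed user-chosen thresholds, which is one way the interestingness and efficiency issues listed above are addressed.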

Chaps26.28.29-73

CSE 4701

Data Mining Methods
- Classification
  - Determine the Class or Category of an Object based on its Properties
  - Example
    - Classify Companies based on the Final Sale Results in the Past Quarter
- Clustering
  - Organize a Set of Multi-dimensional Data Objects in Groups to Minimize Inter-group Similarity and Maximize Intra-group Similarity
  - Example
    - Group Crime Locations to Find Distribution Patterns
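The slides do not name a clustering algorithm. As one common choice, k-means is swapped in here as a minimal Python sketch; the 2-D points, the starting centroids, and the function name `kmeans` are all assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points: assign to nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared distance to p.
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical locations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Points in the same cluster end up close to a shared centroid (high intra-group similarity), while the two centroids stay far apart (low inter-group similarity).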

Chaps26.28.29-74

CSE 4701

Classification
- Classification is the process of learning a model that is able to describe different classes of data.
- Learning is supervised as the classes to be learned are predetermined.
- Learning is accomplished by using a training set of pre-classified data.
- The model produced is usually in the form of a decision tree or a set of rules.

Chaps26.28.29-75

CSE 4701

One Classification Example
- Rule extracted from the decision tree of Figure 28.7.
  IF 50K > salary >= 20K AND age >= 25
  THEN class is “yes”
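The extracted rule translates directly into a predicate. In the Python sketch below, the function name `classify` and the use of `None` for "this rule does not fire" are assumptions; the remaining branches of the decision tree are not shown in the transcript and are therefore not modeled.

```python
def classify(salary, age):
    """Apply the single rule: IF 20K <= salary < 50K AND age >= 25 THEN 'yes'."""
    if 20_000 <= salary < 50_000 and age >= 25:
        return "yes"
    # The rule does not fire; other branches of the tree would decide the class.
    return None


in_range = classify(salary=30_000, age=30)
too_young = classify(salary=30_000, age=20)
too_rich = classify(salary=60_000, age=30)
```

A full rule-based classifier would chain one such test per root-to-leaf path of the tree.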

Chaps26.28.29-76

CSE 4701

Classification
- Two Stages
  - Learning Stage: Construction of a Classification Function or Model
  - Classification Stage: Prediction of Classes of Objects Using the Function or Model
- Tools for Classification
  - Decision Tree
  - Bayesian Network
  - Neural Network
  - Regression
- Problem
  - Given a Set of Objects whose Classes are Known (Training Set), Derive a Classification Model which can Correctly Classify Future Objects

Chaps26.28.29-77

CSE 4701

An Example

Attributes
  Class Attribute - Play/Don’t Play the Game

Training Set
  Values that Set the Condition for the Classification
  What are the Patterns Below?

Attribute     Possible Values
outlook       sunny, overcast, rain
temperature   continuous
humidity      continuous
windy         true, false

Outlook    Temperature  Humidity  Windy  Play
sunny      85           85        false  No
overcast   83           78        false  Yes
sunny      80           90        true   No
sunny      72           95        false  No
sunny      72           70        false  Yes
…          …            …         …      ...
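One common way for a decision-tree learner to look for patterns in such a training set is to compare the information gain of each attribute. The slides do not say which splitting criterion is intended, so treat the sketch below as an assumption. It uses only the five rows listed on the slide, and the elided rows are left out.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# (outlook, temperature, humidity, windy, play): the five rows shown on the slide.
ROWS: List[Tuple[str, int, int, bool, str]] = [
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows: Sequence[tuple], attr_index: int) -> float:
    """Entropy of the class column minus the weighted entropy after splitting."""
    labels = [r[-1] for r in rows]
    groups: dict = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gain_outlook = information_gain(ROWS, 0)  # split on the categorical "outlook"
gain_windy = information_gain(ROWS, 3)    # split on the boolean "windy"
```

On these five rows, splitting on outlook separates the classes better than splitting on windy. The continuous attributes (temperature, humidity) would first need a threshold, which this sketch does not attempt.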

Chaps26.28.29-78

CSE 4701

Data Mining Methods

Summarization
  Characterization (Summarization) of General Features of Objects in the Target Class
  Example: Characterize People’s Buying Patterns on the Weekend
    Potential Impact on “Sale Items” & “When Sales Start”
    Department Stores with Bonus Coupons

Discrimination
  Comparison of General Features of Objects Between a Target Class and a Contrasting Class
  Example: Comparing Students in Engineering and in Art
    Attempt to Arrive at Commonalities/Differences

Chaps26.28.29-79

CSE 4701

Summarization Technique

Attribute-Oriented Induction
  Generalization using Concept Hierarchy (Taxonomy)

barcode  category    brand       content    size
14998    milk        Dairyland   Skim       2L
12998    mechanical  MotorCraft  valve 23a  12in
…        …           …           …          ...

Concept hierarchy (top to bottom)
  food
    Milk … bread
      Skim milk … 2% milk; White whole bread … wheat
        Lucern … Dairyland; Wonder … Safeway

Category  Content  Count
milk      skim     280
milk      2%       98
…         …        ...
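The generalization step can be sketched as follows: drop the attributes that are too specific (barcode, brand, size), count the tuples that become identical, and optionally climb one level of the concept hierarchy. Only the first item below comes from the slide. The other items, the `PARENT` mapping, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Item = Tuple[int, str, str, str, str]  # (barcode, category, brand, content, size)

ITEMS: List[Item] = [
    (14998, "milk", "Dairyland", "skim", "2L"),   # row from the slide
    (14999, "milk", "Lucern", "skim", "1L"),      # hypothetical
    (15001, "milk", "Dairyland", "2%", "2L"),     # hypothetical
    (20001, "bread", "Wonder", "white", "loaf"),  # hypothetical
]

# One level of the concept hierarchy: category -> more general concept.
PARENT: Dict[str, str] = {"milk": "food", "bread": "food"}


def generalize(items: List[Item]) -> Counter:
    """Remove barcode, brand, and size; count identical (category, content) pairs."""
    return Counter((category, content) for _, category, _, content, _ in items)


def climb(counts: Counter, parent: Dict[str, str]) -> Counter:
    """Replace each category by its parent concept and merge the counts."""
    merged: Counter = Counter()
    for (category, _content), n in counts.items():
        merged[parent.get(category, category)] += n
    return merged


by_category_content = generalize(ITEMS)
by_parent = climb(by_category_content, PARENT)
```

The first result has the same shape as the Category/Content/Count table above. The second shows how the same data looks one level higher in the taxonomy.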

Chaps26.28.29-80

CSE 4701

Building A Data Warehouse

The builders of a data warehouse should take a broad view of the anticipated use of the warehouse.
  The design should support ad-hoc querying.
  An appropriate schema should be chosen that reflects the anticipated usage.

The design of a data warehouse involves the following steps:
  Acquisition of data for the warehouse.
  Ensuring that data storage meets the query requirements efficiently.
  Giving full consideration to the environment in which the data warehouse resides.

Chaps26.28.29-81

CSE 4701

Building A Data Warehouse

Acquisition of data for the warehouse
  The data must be extracted from multiple, heterogeneous sources.
  Data must be formatted for consistency within the warehouse.
  The data must be cleaned to ensure validity.
    Difficult to automate cleaning process.
    Back flushing: upgrading the data with cleaned data.
  The data must be fitted into the data model of the warehouse.
  The data must be loaded into the warehouse.
    Proper design for refresh policy should be considered.
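The acquisition steps can be sketched as a tiny extract/format/clean/load pipeline. Everything concrete here is an assumption for illustration: the two source layouts, the `sales` table, and the use of an in-memory SQLite database as a stand-in for the warehouse.

```python
import sqlite3
from datetime import datetime
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[float], str]  # (id, amount, ISO date)

# Two hypothetical, heterogeneous sources with different field names and formats.
SOURCE_A = [{"id": "1", "amount": "10.50", "date": "2015-03-01"}]
SOURCE_B = [{"ID": 2, "amt": 7, "day": "03/02/2015"},
            {"ID": 3, "amt": None, "day": "03/02/2015"}]


def from_a(rec: dict) -> Row:
    return (int(rec["id"]), float(rec["amount"]), rec["date"])


def from_b(rec: dict) -> Row:
    # Format for consistency: convert MM/DD/YYYY to ISO dates.
    iso = datetime.strptime(rec["day"], "%m/%d/%Y").date().isoformat()
    amount = None if rec["amt"] is None else float(rec["amt"])
    return (int(rec["ID"]), amount, iso)


def clean(rows: List[Row]) -> List[Row]:
    """Drop rows that fail a simple validity check (missing amount)."""
    return [r for r in rows if r[1] is not None]


def load(conn: sqlite3.Connection, rows: List[Row]) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, amount REAL, sale_date TEXT)")
    # A simple refresh policy: re-loading a known id replaces the old row.
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn: sqlite3.Connection) -> None:
    extracted = [from_a(r) for r in SOURCE_A] + [from_b(r) for r in SOURCE_B]
    load(conn, clean(extracted))
```

Real cleaning is far harder to automate than this one-line check, which is the point the slide makes.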

Chaps26.28.29-82

CSE 4701

Building A Data Warehouse

Storing the data according to the data model of the warehouse
  Creating and maintaining required data structures
  Creating and maintaining appropriate access paths
  Providing for time-variant data as new data are added
  Supporting the updating of warehouse data
  Refreshing the data
  Purging data

Chaps26.28.29-83

CSE 4701

Why is Data Mining Popular?

Technology Push
  Technology for Collecting Large Quantities of Data
    Bar Codes, Scanners, Satellites, Cameras
  Technology for Storing Large Collections of Data
    Databases, Data Warehouses
    Variety of Data Repositories, such as Virtual Worlds, Digital Media, World Wide Web

Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances
  Targeted Marketing by Age, Region, Income, etc.
  Exploiting User Preferences/Customized Shopping

Chaps26.28.29-84

CSE 4701

Requirements & Challenges in Data Mining

Security and Social
  What Information is Available to Mine?
  Preferences via Store Cards/Web Purchases
  What is Your Comfort Level with Trends?

User Interfaces and Visualization
  What Tools Must be Provided for End Users of Data Mining Systems?
  How are Results for Multi-Dimensional Data Displayed?

Performance Guarantees
  Range from Real-Time for Some Queries to Long-Term for Other Queries

Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets

Chaps26.28.29-85

CSE 4701

Data Mining Visualization

Leverage Improving 3D Graphics and Increased PC Processing Power for Displaying Results
Significant Research in Visualization w.r.t. Displaying Multi-Dimensional Data

Chaps26.28.29-86

CSE 4701

Successful Data Mining Applications

Business Data Analysis and Decision Support
  Marketing, Customer Profiling, Market Analysis and Management, Risk Analysis and Management

Fraud Detection
  Detecting Telephone Fraud, Automotive and Health Insurance Fraud, Credit-card Fraud, Suspicious Money Transactions (Money Laundering)

Text Mining
  Message Filtering (Email, Newsgroups, Etc.)
  Newspaper Articles Analysis

Sports
  IBM Advanced Scout Analyzed NBA Game Statistics (Shots Blocked, Assists and Fouls) to Gain Competitive Advantage

Chaps26.28.29-87

CSE 4701

Select Data Mining Products

Chaps26.28.29-88

CSE 4701

Databases on WWW

Web has changed the way we do Business & Research

Facts:
  Industry Saw an Opportunity, knew it had to Move Quickly to Capitalize
  Lots of Action, Lots of Money, Lots of Releases
  Line Between Research and Development is Very Narrow
  Many Researchers Moved to Industry (Trying to Return to Academia)

Emergence of Java
  Java changed the way that Software was Designed, Developed, and Utilized
  Particularly w.r.t. Web-Based Applications, Database Interoperability, Web Architectures, etc.

Emergence of Enterprise Computing

Chaps26.28.29-89

CSE 4701

Internet and the Web

A Major Opportunity for Business
  A Global Marketplace
    Business Across State and Country Boundaries
  A Way of Extending Services
    Online Payment vs. VISA, Mastercard
  A Medium for Creation of New Services
    Publishers, Travel Agents, Teller, Virtual Yellow Pages, Online Auctions …

A Boon for Academia
  Research Interactions and Collaborations
  Free Software for Classroom/Research Usage
  Opportunities for Exploration of Technologies in Student Projects

Chaps26.28.29-90

CSE 4701

WWW: Three Market Segments

[Figure: servers on corporate networks connected through the Internet, serving three market segments]

Intranet
  Decision support
  Mfg. System monitoring
  Corporate repositories
  Workgroups

Internet
  Sales
  Marketing
  Information Services

Business to Business
  Information sharing
  Ordering info./status
  Targeted electronic commerce

Chaps26.28.29-91

CSE 4701

Information Delivery Problems on the Net

Everyone can Publish Information on the Web Independently at Any Time
  Consequently, there is an Information Explosion
  Identifying Information Content is More Difficult

There are Too Many Search Engines but Too Few Capable of Returning High Quality Data
  Is this Still True?

Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes
  Is this Still True?

Chaps26.28.29-92

CSE 4701

Example Web Applications

Scenario 1: World Wide Wait
  A Major Event is Underway and the Latest, Up-to-the-Minute Results are Being Posted on the Web
  You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Pointing at the Result Posting Site, and Wait, and Wait, and Wait …

What is the Problem?
  The Scalability Problems are the Result of a Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application

Changed with Emergence of Mobile Computing?

Chaps26.28.29-93

CSE 4701

Example Web Applications

Scenario 2:
  Many Applications Today Need to Track Changes in Local and Remote Data Sources and Notify Users of Changes If Some Condition Over the Data Source(s) is Met
  If You Want to Monitor the Changes on the Web, You Need to Fire up Your Trusty Web Browser from Time to Time, Cache the Most Recent Result, and Compute the Difference Manually Each Time You Poll the Data Source(s) …

What is the Problem?
  Pure Pull is Not the Answer to All Problems

Changed with Emergence of Mobile Computing?

Chaps26.28.29-94

CSE 4701

What is the Problem?

Applications are Asymmetric but the Web is Not
  Computation Centric vs. Information Flow Centric

Types of Asymmetry
  Network Asymmetry
    Satellite, CATV, Mobile Clients, Etc.
  Client to Server Ratio
    Too Many Clients can Swamp Servers
  Data Volume
    Mouse and Key Click vs. Content Delivery
  Update and Information Creation
    Clients Need to be Informed or Must Poll

What have we Seen re. Cell Networks Over Time?

Chaps26.28.29-95

CSE 4701

Useful Solutions

Combination/Interleave of Pull and Push Protocols
  User-initiated, Comprehensive Search-based Information Delivery (Pull)
  Server-initiated Information Dissemination (Push)

Provide Support for a Variety of Data Delivery Protocols, Frequencies, and Delivery Modes
  Information Delivery Frequencies
    Periodic, Conditional, Ad-Hoc
  Information Delivery Modes
  Information Delivery Protocols (IDP)
    Request/Respond, Polling, Publish/Subscribe, Broadcast
  Information Delivery Styles (IDS)
    Pull, Push, Hybrid

Chaps26.28.29-96

CSE 4701

Information Delivery Frequencies

Periodic
  Data is Delivered from a Server to Clients Periodically
  Period can be Defined by System-default or by Clients Using their Profiles
  Period can be Influenced by Client and Bandwidth
    Mobile Device vs. PC w/Modem
    PC w/DSL vs. PC w/Cable Modem
    Multiple Mobile Devices of All Types
    Streaming of Videos, Live Streaming of Events

Conditional (Aperiodic)
  Data is Delivered from a Server when Conditions Installed by Clients in their Profiles are Satisfied

Ad-hoc (or Irregular)

Chaps26.28.29-97

CSE 4701

Information Delivery Modes

Uni-cast
  Data is Sent from a Data Source (a Single Server) to Another Machine

1-to-n
  Data is Sent by a Single Data Source and Received by Multiple Machines
  Multicast vs. Broadcast
    Multicast: Data is Sent to a Specific Set of Clients
    Broadcast: Data is Sent Over a Medium on which an Unidentified or Unbounded Set of Clients can Listen

Chaps26.28.29-98

CSE 4701

IDP: Request/Respond

Semantics of Request/Respond
  Clients Send Requests to Servers to Ask for the Information of Interest
  Servers Respond to the Client Request by Delivering the Information Requested
  Client can Wait (Synchronous) or Not

Applications
  Most Database Systems and Web Search Engines Use the Request/Respond Protocol for Client-Server Communication

What has Changed with Mobile Computing?

Chaps26.28.29-99

CSE 4701

IDP: Programmed Polling vs. User Polling

Semantics:
  Programmed Polling: a System Periodically Sends Requests to Other Sites to Obtain Status Information or Detect Changed Values
  User Polling: a User or Application Periodically or Aperiodically Polls the Data Sites and Obtains the Changes

Applications
  Programmed Polling: Saves the Users from having to Click, but does Nothing to Solve the Scalability Problems Caused by the Request/Respond Mechanism

What do Today’s Mobile Devices Use?
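Programmed polling with change detection can be sketched as a small class that caches the last value and reports only differences. The `Poller` class and its `fetch` callback are hypothetical names. In practice `fetch` would issue a request to a remote site.

```python
from typing import Callable, Optional


class Poller:
    """Polls a data source and reports a value only when it has changed."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch                  # stands in for a request to a remote site
        self._cached: Optional[str] = None   # most recent result seen so far
        self.requests_sent = 0

    def poll(self) -> Optional[str]:
        self.requests_sent += 1              # every poll costs one request/respond cycle
        current = self._fetch()
        if current == self._cached:
            return None                      # nothing new to report
        self._cached = current
        return current
```

The `requests_sent` counter makes the scalability issue visible: the server is contacted on every poll, even when nothing has changed.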

Chaps26.28.29-100

CSE 4701

IDP: Publish/Subscribe

Semantics: Servers Publish/Clients Subscribe
  Servers Publish Information Online
  Clients Subscribe to the Information of Interest (Subscription-based Information Delivery)
  Data Flow is Initiated by the Data Sources (Servers) and is Aperiodic
  Danger: Subscriptions can Lead to Other Unwanted Subscriptions

Applications
  Unicast: Database Triggers and Active Databases
  1-to-n: Online News Groups

How is this Utilized in Mobile Devices?
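A minimal in-process sketch of publish/subscribe follows. It assumes a single broker object that keeps a list of callbacks per topic, and the `Broker` class and topic names are made up for illustration. A real system would deliver over a network.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str], None]


class Broker:
    """Keeps subscriptions per topic and pushes each published message to them."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver `message` to every subscriber of `topic`; return how many got it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)    # data flow is initiated by the publishing side
        return len(callbacks)
```

Unlike polling, a subscriber does nothing until the source has something to say. With one subscriber per topic this behaves like unicast, and with many it is 1-to-n delivery.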

Chaps26.28.29-101

CSE 4701

Information Delivery Styles

Pull-Based System
  Transfer of Data from Server to Client is Initiated by a Client Pull
  Clients Determine when to Get Information
  Potential for Information to be Old Unless Client Periodically Pulls

Push-Based System
  Transfer of Data from Server to Client is Initiated by a Server Push
  Clients may get Overloaded if Push is Too Frequent

Hybrid
  Pull and Push Combined
  Pull First and then Push Continually

Chaps26.28.29-102

CSE 4701

Request/Respond

Publish/Subscribe

Broadcast Periodic Conditional Ad-hoc

Pure Pull Y Y

Pure Push Y Y Y Y*

Hybrid Y Y Y Y Y Y*

Summary: Pull vs. Push

Chaps26.28.29-103

CSE 4701

Design Options for Nodes

Three Types of Nodes:
  Data Sources
    Provide Base Data which is to be Disseminated
  Clients
    Who are the Net Consumers of the Information
  Information Brokers
    Acquire Information from Other Data Sources, Add Value to that Information and then Distribute this Information to Other Consumers
    By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Need of Many Users

How has this Changed with Today’s Mobile Computing?

Chaps26.28.29-104

CSE 4701

The Next Big Challenge

Interoperability
  Heterogeneous Distributed Databases
  Heterogeneous Distributed Systems
  Autonomous Applications

Scalability
  Rapid and Continuous Growth
    Amount of Data
    Variety of Data Types
  Dealing with personally identifiable information (PII) and personal health information (PHI)
  Emergence of Fitness and Health Monitoring Apps
    Google Fit and Apple HealthKit
    New Apple ResearchKit for Medical Research

Chaps26.28.29-105

CSE 4701

Interoperability: A Classic View

[Figure: two federation styles.
  Simple Federation: an FDB Global Schema produced by Federated Integration of several Local Schemas.
  Multiple Nested Federation: FDB Global Schema 4 produced by Federated Integration of FDB 1, a Local Schema, and FDB 3.]

Chaps26.28.29-106

CSE 4701

Java Client with Wrapper to Legacy Application

[Figure: Java Client with C++/C Wrapper. Labels: Java Client, Java Application Code, WRAPPER, Mapping Classes, JAVA LAYER, NATIVE LAYER, Native Functions (C++), RPC Client Stubs (C), Network, Legacy Application]

Interactions Between Java Client and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange

Chaps26.28.29-107

CSE 4701

COTS and Legacy Appls. to Java Clients

[Figure: two Java Clients (Java Application Code) connected over the Network to a COTS Application and a Legacy Application. Each application has a JAVA NETWORK WRAPPER with Mapping Classes (JAVA LAYER) and Native Functions (NATIVE LAYER) that Map to the COTS Appl or the Legacy Appl.]

Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers

Chaps26.28.29-108

CSE 4701

Java Client to Legacy App via RDBS

[Figure: Java Client, Legacy Application, and a Relational Database System (RDS). Arrow labels: Extract and Generate Data, Transform and Store Data, Transformed Legacy Data, Updated Data]

Chaps26.28.29-109

CSE 4701

Database Interoperability in the Internet

Technology
  Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL

Architecture
  Information Broker
    Mediator-Based Systems
    Agent-Based Systems

Chaps26.28.29-110

CSE 4701

JDBC

[Figure: Java Application, JDBC API, Driver Manager, one Driver per database (Oracle, Sybase, Access)]

JDBC API Provides DB Access Protocols for Open, Query, Close, etc.
Different Drivers for Different DB Platforms
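The open/query/close pattern is not specific to Java. As an analogous sketch (this is Python's DB-API with the standard `sqlite3` module, not JDBC itself), the same three steps look like this. The `person` table and the function name are made up for illustration.

```python
import sqlite3
from typing import List


def fetch_names(min_age: int) -> List[str]:
    """Open a connection, run a parameterized query, and always close."""
    conn = sqlite3.connect(":memory:")                  # open
    try:
        conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO person VALUES (?, ?)",
                         [("Ada", 36), ("Bob", 19)])
        rows = conn.execute(                            # query
            "SELECT name FROM person WHERE age >= ? ORDER BY name",
            (min_age,),
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()                                    # close
```

As with JDBC drivers, only the connect call is tied to a particular database product. The query and close calls follow a common interface.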

Chaps26.28.29-111

CSE 4701

Connecting a DB to the Web

Web Servers are Stateless
DB Interactions Tend to be Stateful
Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open

[Figure: Browser, Internet, Web Server, CGI Script Invocation or JDBC Invocation, DBMS]

Chaps26.28.29-112

CSE 4701

Connecting More Efficiently

To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection
Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
System is Still Stateless

[Figure: Browser, Internet, Web Server, CGI Script or JDBC Invocation, Helper Processes, DBMS]
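The helper-process idea can be sketched in a single process: one long-lived object opens the database once and serves many short-lived requests. The `DbHelper` class, the `hits` table, and the in-memory SQLite database are assumptions for illustration. A real deployment would run the helper as a separate process that scripts connect to.

```python
import sqlite3


class DbHelper:
    """Stands in for a long-lived helper that keeps the database open."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)   # the expensive open happens once
        self.opens = 1
        self._conn.execute("CREATE TABLE IF NOT EXISTS hits (page TEXT)")

    def handle(self, page: str) -> int:
        """Serve one web request on the already-open connection."""
        self._conn.execute("INSERT INTO hits VALUES (?)", (page,))
        self._conn.commit()
        (count,) = self._conn.execute("SELECT COUNT(*) FROM hits").fetchone()
        return count

    def close(self) -> None:
        self._conn.close()
```

Each call to `handle` carries everything it needs, so the web side stays stateless. Only the open connection is kept between requests.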

Chaps26.28.29-113

CSE 4701

DB-Internet Architecture

[Figure: WWW Clients (Netscape, HotJava, Info. Explore), Internet, HTTP Server, DBWeb Dispatcher, four DBWeb Gateways]

Chaps26.28.29-114

CSE 4701

EJB Architecture

Chaps26.28.29-115

CSE 4701

The focus is to make information available to users, in the right form, at the right time, in the appropriate place.

Technology Push
  Computer/Communication Technology (Almost Free)
    Plenty of Affordable CPU, Memory, Disk, Network Bandwidth
  Next Generation Internet: Gigabit Now
  Wireless: Ubiquitous, High Bandwidth

Information Growth
  Massively Parallel Generation of Information on the Internet and from New Generation of Sensors
  Disk Capacity on the Order of Peta-bytes

Small, Handy Devices to Access Information

Chaps26.28.29-116

CSE 4701

Research Challenges

Ubiquitous/Pervasive: Many computers and information appliances everywhere, networked together

Inherent Complexity:
  Coping with Latency (Sometimes Unpredictable)
  Failure Detection and Recovery (Partial Failure)
  Concurrency, Load Balancing, Availability, Scale
  Service Partitioning
  Ordering of Distributed Events

“Accidental” Complexity:
  Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.
  Autonomy: Change and Evolve Autonomously
  Tool Deficiencies: Language Support (Sockets, rpc), Debugging, Etc.

Chaps26.28.29-117

CSE 4701

[Figure: Infosphere. Labels: Problem: too many sources, too much information; Internet: Information Jungle; Clean, Reliable, Timely Information, Anywhere; Digital Earth; Sensors; Personalized Filtering & Info. Delivery; Infopipes; Resource Adaptation; Property Mgmt; Information Quality; Continual Queries; Microfeedback; specialization]

Chaps26.28.29-118

CSE 4701

Current State-of-Art – Has Mobile Changed This?

[Figure: Thin Client, Web Server, Mainframe Database Server]

Chaps26.28.29-119

CSE 4701

Infosphere Scenario – Where Does Mobile Fit?

[Figure: Infotaps & Fat Clients, Variety of Servers, Sensors, Database Server, Many sources]

Chaps26.28.29-120

CSE 4701

Heterogeneity and Autonomy

Heterogeneity: How Much can we Really Integrate?
  Syntactic Integration
    Different Formats and Models
    XML/JSON/RDF/OWL/SQL Query Languages
  Semantic Interoperability
    Basic Research on Ontology, Etc.

Autonomy
  No Central DBA on the Net
  Independent Evolution of Schema and Content
  Interoperation is Voluntary
  Interface Technology
    DCOM: Microsoft Standard
    CORBA, Etc...

Chaps26.28.29-121

CSE 4701

Security and Data Quality

Security
  System Security in the Broad Sense
  Attacks: Penetrations, Denial of Service
  System (and Information) Survivability
    Security Fault Tolerance
    Replication for Performance, Availability, and Survivability

Data Quality
  Web Data Quality Problems
    Local Updates with Global Effects
    Unchecked Redundancy (Mutual Copying)
    Registration of Unchecked Information
    Spam on the Rise

Chaps26.28.29-122

CSE 4701

Legacy Data Challenge

Legacy Applications and Data
  Definition: Important and Difficult to Replace
  Typically, Mainframe Mission Critical Code
  Most are OLTP and Database Applications

Evolution of Legacy Databases
  Client-server Architectures
  Wrappers
  Expensive and Gradual in Any Case

Chaps26.28.29-123

CSE 4701

Potential Value Added/Jumping on Bandwagon

Sophisticated Query Capability
  Combining SQL with Keyword Queries

Consistent Updates
  Atomic Transactions and Beyond

But Everything has to be in a Database!
  Only If we Stick with Classic DB Assumptions

Relaxing DB Assumptions
  Interoperable Query Processing
  Extended Transaction Updates

Commodity DB Software
  A Little Help is Still Good If it is Cheap
  Internet Facilitates Software Distribution
  Databases as Middleware

Chaps26.28.29-124

CSE 4701

Concluding Remarks

Four-Fold Objective
  Distributed Database Processing
  Data Warehouses
  Data Mining of Vast Information Repositories
  Web-Based Architectures for DB Interoperability

All Four are Tightly Related
  DDBMS can Improve Performance of Mining Repositories as Backend Database Processors
  Web-Based Architectures Provide Access Means for DDBMS or Mining
  Warehouses are Infrastructure to Facilitate Mining

Geographic Information Systems, Deductive DBMS, Multi-Media DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.