Candidate: Rishi Kanth Saripalle Major Advisor: Prof. Steven A. Demurjian
Chaps26.28.29-1 CSE 4701 Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e Database System Architectures,...
Chaps26.28.29-1
CSE 4701
Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e
Database System Architectures, Data
Mining/Warehousing, Web DB
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
A portion of these slides is used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech.
Remaining slides represent new material.
Chaps26.28.29-2
CSE 4701
Classical and Distributed Architectures
Classic/Centralized DBMS Dominated the Commercial Market from the 1970s Forward
Problems of this Approach
  Difficult to Scale w.r.t. Performance Gains
  If DB Overloaded, Replace with a Faster Computer
  This can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a Number of Issues
  Improved Performance
  Putting Data “Near” Location where it is Needed
  Replication of Data for Fault Tolerance
  Vertical and Horizontal Partitioning of DB Tuples
Chaps26.28.29-3
CSE 4701
Common Features of Centralized DBMS
Data Independence
  High-Level Representation via Conceptual and External Schemas
  Physical Representation (Internal Schema) Hidden
Program Independence
  Multiple Applications can Share Data
  Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
  Single, Unique, Conceptual Schema
  Shared Database
  Almost No Data Redundancy
  Controlled Data Access Reduces Inconsistencies
  Programs Execute with Consistent Results
Chaps26.28.29-4
CSE 4701
Common Features of Centralized DBMS
Promote Sharing: Automatically Provided via CC
  No Longer Programmatic Issue
  Most DBMS Offer Locking for Key Shared Data
  Oracle Allows Locks on Data Item (Attributes)
  For Example, Controlling Access to Shared Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules
Data Resiliency
  Physical Integrity of Data in the Presence of Faults and Errors
  Supported by DB Recovery
Data Security: Control Access for Authorized Users Against Sensitive Data
Chaps26.28.29-5
CSE 4701
Shared Nothing Architecture In this Architecture, Each DBMS
Operates Autonomously There is No Sharing
Three Separate DBMSs on Three Different Computers
Applications/Clients Must Know About the External Schemas of all Three DBMSs for Database Retrieval Client Processing
Complicates Client
  Different DBMS Platforms (Oracle, Sybase, Informix, ...)
  Different Access Modes (Query, Embedded, ODBC)
Difficult for SWE to Code
Chaps26.28.29-6
CSE 4701
Difficulty in Access – Manage Multiple APIs
Each Platform has a Different API
  API1, API2, …, APIn
An App Programmer Must Utilize All Three APIs, which could Differ by PL – C++, C, Java, REST, etc.
Any Interactions Across 3 DBs Must be Programmatically Handled without DB Capabilities
Chaps26.28.29-7
CSE 4701
NW Architecture with Centralized DB
High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide
Clients at Any Site can Access Repository
Data May be “Far” Away - Increased Access Time
In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2
Inefficient, no Replication w.r.t. Failure
Chaps26.28.29-8
CSE 4701
Fully Distributed Architecture
The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - it's Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Chaps26.28.29-9
CSE 4701
Distributed Database Concepts
A transaction can be executed by multiple networked computers in a unified manner.
A distributed database (DDB) processes a unit of execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as a collection of multiple logically related databases distributed over a computer network.
A distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user.
Chaps26.28.29-10
CSE 4701
Goals of DDBMS
Support User Distribution Across Multiple Sites
  Remote Access by Users Regardless of Location
  Distribution and Replication of Database Content
Provide Location Transparency
  Users Manipulate their Own Data
  Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
  Transaction Control Hides Distribution
  CC and Serializability - Must be Extended
Minimize Communications Cost
  Optimize Use of Network - a Critical Issue
  Distribute DB Design Supported by Partitioning (Fragmentation) and Replication
Chaps26.28.29-11
CSE 4701
Goals of DDBMS
Improve Response Time for DB Access
  Use a More Sophisticated Load Control for Transaction Processing
  However, Synchronization Across Sites May Introduce Additional Overhead
System Availability
  Site Independence in the Presence of Site Failure
  Subset of Database is Always Available
  Replication can Keep All Data Available, Even When Multiple Sites Fail
Modularity
  Incremental Growth with the Addition of Sites
  Dedicate Sites to Specific Tasks
Chaps26.28.29-12
CSE 4701
Advantages of DDBMS
There are Four Major Advantages
Transparency
  Distribution/NW Transparency
    User Doesn’t Know about NW Configuration (Location Transparency)
    User can Find Object at any Site (Naming Transparency)
  Replication Transparency (see next PPT)
    User Doesn’t Know Location of Data
    Replicas are Transparently Accessible
  Fragmentation Transparency
    Horizontal Fragmentation (Distribute by Row)
    Vertical Fragmentation (Distribute by Column)
Chaps26.28.29-14
CSE 4701
Other Advantages of DDBMS
Increased Reliability and Availability
  Reliability - System Always Running
  Availability - Data Always Present
  Achieved via Replication and Distribution
  Ability to Make Single Query for Entire DDBMS
Improved Performance
  Sites Able to Utilize Data that is Local for Majority of Queries
Easier Expansion
  Improve Performance of Site by
    Upgrading Processor of Computer
    Adding Additional Disks
    Splitting a Site into Two or More Sites
  Expansion over Time as Business Grows
Chaps26.28.29-15
CSE 4701
Challenges of DDBMS
Tracking Data - Meta Data More Complex
  Must Track Distribution (Where is the Data)
  V & H Fragmentation (How is Data Split)
  Replication (Multiple Copies for Consistency)
Distributed Query Processing
  Optimization, Accessibility, etc., More Complex
  Block Analysis of Data Size Must also Now Consider the NW Transmitting Time
Distributed Transaction Processing
  TP Potentially Spans Multiple Sites
  Submit Query to Multiple Sites
  Collect and Collate Results
Distributed Concurrency Control Across Nodes
Chaps26.28.29-16
CSE 4701
Challenges of DDBMS
Replicated Data Management
  TP Must Choose the Replica to Access
  Updates Must Modify All Replica Copies
Distributed Database Recovery
  Recovery of Individual Sites
  Recovery Across DDBMS
Security
  Local and Remote Authorization
  During TP, be Able to Verify Remote Privileges
Distributed Directory Management
  Meta-Data on Database - Local and Remote
  Must Maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites
Chaps26.28.29-17
CSE 4701
A Complete Schema with Keys ...
Keys Allow us to Establish Links Between Relations
Chaps26.28.29-18
CSE 4701
…and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
Chaps26.28.29-20
CSE 4701
What is Fragmentation?
Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
Horizontal Fragmentation
  Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites
  Each Site has a Subset of the n Tuples
  Essentially Fragmentation is a Selection
Vertical Fragmentation
  Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites
  Essentially Fragmentation is a Projection
  Not Generally Utilized in Practice
In Both Cases, Sites can Overlap for Replication
Chaps26.28.29-21
CSE 4701
Horizontal Fragmentation
A horizontal fragment is a subset of the tuples of a relation: those tuples that satisfy a selection condition.
Consider the Employee relation with the condition DNO = 5.
All tuples satisfying this condition form a subset that is a horizontal fragment of the Employee relation.
A selection condition may be composed of several conditions connected by AND or OR.
Derived horizontal fragmentation: the partitioning of a primary relation is applied to secondary relations that are related to it by foreign keys.
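A minimal Python sketch of horizontal fragmentation as a selection. The tuples, attribute names, and the helper `horizontal_fragment` are hypothetical and only illustrate the DNO = 5 condition; a real DDBMS would define fragments in its catalog rather than in application code.

```python
# Hypothetical Employee tuples; only the attributes needed for the example.
EMPLOYEE = [
    {"ssn": "111", "lname": "Smith", "dno": 5},
    {"ssn": "222", "lname": "Wong", "dno": 5},
    {"ssn": "333", "lname": "Zelaya", "dno": 4},
    {"ssn": "444", "lname": "Borg", "dno": 1},
]

def horizontal_fragment(relation, predicate):
    """Return the tuples of `relation` that satisfy `predicate` (a selection)."""
    return [row for row in relation if predicate(row)]

# One fragment per selection condition; these conditions are disjoint and
# together cover every tuple.
emp_d5 = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] == 5)
emp_d4 = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] == 4)
emp_rest = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] not in (4, 5))

# The union of the fragments reconstructs the original relation.
reconstructed = emp_d5 + emp_d4 + emp_rest
```

Because the three conditions do not overlap, every tuple lands in exactly one fragment; overlapping conditions would instead replicate tuples across sites.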
Chaps26.28.29-23
CSE 4701
Horizontal Fragmentation
Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments
Chaps26.28.29-24
CSE 4701
Refined Horizontal Fragmentation
Further Fragment from Site 2 based on Dept. that Employee Works in
Notice that G1 + G2 + G3 is the Same as WORKS_ON5 - there is no Overlap
Chaps26.28.29-25
CSE 4701
Refined Horizontal Fragmentation
Further Fragment from Site 3 based on Dept. that Employee Works in
Notice that G4 + G5 + G6 is the Same as WORKS_ON4
Note Some Fragments can be Empty
Chaps26.28.29-26
CSE 4701
Vertical Fragmentation
Subset of a relation created via a subset of columns.
  A vertical fragment of a relation will contain values of selected columns.
  There is no selection condition used in vertical fragmentation.
A strict vertical slice/partition
  Consider the Employee relation.
  A vertical fragment of Employee can be created by keeping the values of Name, Bdate, Sex, and Address.
Since there is no condition for creating a vertical fragment, each fragment must include the primary key attribute of the parent relation Employee.
  All vertical fragments of a relation are connected through this key.
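A small Python sketch of vertical fragmentation as a projection that always carries the primary key, plus the join that rebuilds the original tuples. The sample rows and the helpers `vertical_fragment` and `rejoin` are assumptions made for illustration.

```python
# Hypothetical Employee tuples with SSN as the primary key.
EMPLOYEE = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong", "bdate": "1955-12-08", "salary": 40000, "dno": 5},
]

def vertical_fragment(relation, key, attributes):
    """Project `relation` onto `key` plus `attributes` (a projection)."""
    columns = [key] + [a for a in attributes if a != key]
    return [{c: row[c] for c in columns} for row in relation]

def rejoin(left, right, key):
    """Reconstruct tuples by joining two vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

emp_demo = vertical_fragment(EMPLOYEE, "ssn", ["name", "bdate"])
emp_pay = vertical_fragment(EMPLOYEE, "ssn", ["salary", "dno"])
restored = rejoin(emp_demo, emp_pay, "ssn")
```

Without the key column in both fragments there would be no way to pair a name with its salary again, which is why each fragment repeats it.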
Chaps26.28.29-27
CSE 4701
Vertical Fragmentation Example
Partition the Employee Table as Below
Notice Each Vertical Fragment Needs Key Column
[Figure: Employee split into vertical fragments EmpDemo and EmpSupvrDept.]
Chaps26.28.29-28
CSE 4701
Homogeneous DDBMS
Homogeneous
  Identical Software (w.r.t. Database)
  One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites
  Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.
  Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …)
Chaps26.28.29-29
CSE 4701
Non-Federated Heterogeneous DDBMS
Non-Federated Heterogeneous
  Different Software (w.r.t. Database)
  Multiple DB Products (e.g., Oracle at One Site, Access at Another, Sybase, Informix, etc.)
  Replicated Administration (e.g., Users Need Accounts on Multiple Systems)
  Varied Programmatic Access - SWEs Must Know All Platforms/Client Software
  Complicated
  Very Close to Shared Nothing Architecture
Chaps26.28.29-30
CSE 4701
Federated DDBMS
Federated
  Multiple DBMS Platforms Overlaid with a Global Schema View
  Single External Schema Combines Schemas from all Sites
Multiple Data Models
  Relational in one Component DBS
  Object-Oriented in another DBS
  Hierarchical in a 3rd DBS
Chaps26.28.29-31
CSE 4701
Federated DBMS Issues
Differences in Data Models
  Reconcile Relational vs. Object-Oriented Models
  Each Different Model has Different Capabilities
  These Differences Must be Addressed in Order to Present a Federated Schema
Differences in Constraints
  Referential Integrity Constraints in Different DBSs
  Different Constraints on “Similar” Data
  Federated Schema Must Deal with these Conflicts
Differences in Query Languages
  SQL-89, SQL-92, SQL2, SQL3
  Specific Types in Different DBMS (Oracle Blobs)
Differences in Key Processing & Timestamping
Chaps26.28.29-32
CSE 4701
Heterogeneous Distributed Database Systems
Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
  The degree of local autonomy is minimal.
  Each site must adhere to a centralized access policy.
  There may be a global schema.
Multi-database: There is no one conceptual global schema.
  For data access, a schema is constructed dynamically as needed by the application software.
[Figure: five sites (Site 1 to Site 5) connected by a communications network, running network, relational, object-oriented, and hierarchical DBMSs on Linux, Unix, and Windows platforms.]
Chaps26.28.29-33
CSE 4701
Query Processing in Distributed Databases
Issues
  Cost of transferring data (files and results) over the network.
  This cost is usually high, so some optimization is necessary.
Example relations: Employee at Site 1 and Department at Site 2
  – Employee at Site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
  – Department at Site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
  Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
  Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Q: For each employee, retrieve the employee name and the name of the department where the employee works.
Q: $\pi_{Fname,Lname,Dname}(Employee \bowtie_{Dno=Dnumber} Department)$
Chaps26.28.29-34
CSE 4701
Query Processing in Distributed Databases
Result
  The result of this query will have 10,000 tuples, assuming that every employee is related to a department.
  Suppose each result tuple is 40 bytes long.
  The query is submitted at site 3 and the result is sent to this site.
Problem: Employee and Department relations are not present at site 3.
Chaps26.28.29-35
CSE 4701
Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3.
   Total transfer = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
   Query result size = 40 * 10,000 = 400,000 bytes. Total transfer = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
   Total transfer = 400,000 + 3,500 = 403,500 bytes.
Optimization criterion: minimizing data transfer.
Preferred approach: strategy 3.
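The arithmetic for the three strategies can be checked with a few lines of Python. The function name `transfer_costs` is made up for this sketch; the sizes are the ones given in the example, and only bytes shipped over the network are counted (no CPU or disk cost).

```python
# Sizes taken from the example above.
EMPLOYEE_BYTES = 10_000 * 100    # Employee at site 1
DEPARTMENT_BYTES = 100 * 35      # Department at site 2
RESULT_TUPLE_BYTES = 40

def transfer_costs(result_tuples):
    """Bytes shipped by each strategy when the result is needed at site 3."""
    result_bytes = result_tuples * RESULT_TUPLE_BYTES
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,   # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,       # join at site 2, ship result
        3: DEPARTMENT_BYTES + result_bytes,     # join at site 1, ship result
    }

costs_q = transfer_costs(result_tuples=10_000)
best = min(costs_q, key=costs_q.get)
print(costs_q, "-> strategy", best)
```

Calling `transfer_costs` with a smaller `result_tuples` value shows how strongly the ranking of strategies 2 and 3 depends on the size of the join result.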
Chaps26.28.29-36
CSE 4701
Query Processing in Distributed Databases
Consider the query
  Q’: For each department, retrieve the department name and the name of the department manager.
Relational Algebra expression:
  $\pi_{Fname,Lname,Dname}(Employee \bowtie_{Mgrssn=SSN} Department)$
Chaps26.28.29-37
CSE 4701
Query Processing in Distributed Databases
The result of this query has 100 tuples, assuming that every department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and perform the join at site 3.
   Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
   Query result size = 40 * 100 = 4,000 bytes. Total transfer size = 4,000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
   Total transfer size = 4,000 + 3,500 = 7,500 bytes.
Preferred strategy: strategy 3.
Chaps26.28.29-38
CSE 4701
Query Processing in Distributed Databases
Now suppose the result site is 2. Possible strategies:
1. Transfer Employee to site 2, execute the query, and present the result to the user at site 2.
   Total transfer size = 1,000,000 bytes for both queries Q and Q’.
2. Transfer Department to site 1, execute the join at site 1, and send the result back to site 2.
   Total transfer size for Q = 400,000 + 3,500 = 403,500 bytes and for Q’ = 4,000 + 3,500 = 7,500 bytes.
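One way to generalize the comparison is to treat the join site and the result site as parameters. The sketch below is an assumption-laden toy model, not how a real optimizer works: `plan_cost` and `cheapest_plan` are hypothetical names, and the model counts only bytes shipped for whole relations and the final result.

```python
SIZES = {"Employee": 1_000_000, "Department": 3_500}   # bytes, from the example
HOME = {"Employee": 1, "Department": 2}                # site holding each relation

def plan_cost(join_site, result_site, result_bytes):
    """Bytes moved if the join runs at `join_site` and the answer is needed at `result_site`."""
    # Every relation not already stored at the join site must be shipped there.
    shipped = sum(size for rel, size in SIZES.items() if HOME[rel] != join_site)
    # The result travels only if it is produced somewhere other than where it is needed.
    if join_site != result_site:
        shipped += result_bytes
    return shipped

def cheapest_plan(result_site, result_bytes, sites=(1, 2, 3)):
    """Return (join_site, bytes) for the cheapest plan among the candidate join sites."""
    costs = {s: plan_cost(s, result_site, result_bytes) for s in sites}
    best = min(costs, key=costs.get)
    return best, costs[best]

print(cheapest_plan(result_site=2, result_bytes=400_000))   # query Q
print(cheapest_plan(result_site=2, result_bytes=4_000))     # query Q'
```

With the result needed at site 2, joining at site 1 wins for both queries because shipping the small Department relation plus the result costs less than shipping all of Employee.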
Chaps26.28.29-39
CSE 4701
DDBS Concurrency Control and Recovery
Distributed Databases encounter a number of concurrency control and recovery problems which are not present in centralized databases, including:
  Dealing with multiple copies of data items
    How are they All Updated if Needed?
  Failure of individual sites
    How are Queries Restarted or Rerouted?
  Communication link failure
    Network Failure
  Distributed commit
    How to Know All Updates Done at all Sites?
  Distributed deadlock
    How to Detect and Recover?
Chaps26.28.29-40
CSE 4701
Data Warehousing and Data Mining
Data Warehousing
  Provide Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making
  Underlying Infrastructure in Support of Mining
  Provides Means to Interact with Multiple DBs
  OLAP (On-Line Analytical Processing) vs. OLTP
Data Mining
  Discovery of Information in Vast Data Sets
  Search for Patterns and Common Features
  Discover Information not Previously Known
    Medical Records Accessible Nationwide
    Research/Discover Cures for Rare Diseases
  Relies on Knowledge Discovery in DBs (KDD)
Chaps26.28.29-41
CSE 4701
What is Purpose of a Data Warehouse?
Traditional databases are not optimized for data access alone; they have to balance the requirement of data access with the need to ensure integrity.
Most data warehouse users need only read access, but need the access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes from multiple databases, and these analyses are recurrent and predictable enough that software can be designed to meet the requirements.
Critical for tools that provide decision makers with information to make decisions quickly and reliably based on historical data.
The aforementioned characteristics are achieved by Data Warehousing and Online Analytical Processing (OLAP).
W. H. Inmon characterized a data warehouse as:
  “A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions.”
Chaps26.28.29-42
CSE 4701
Data Warehousing and OLAP
A Data Warehouse
  Database is Maintained Separately from an Operational Database
  “A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support of Management’s Decision Making Process” [W. H. Inmon]
OLAP (On-Line Analytical Processing)
  Analysis of Complex Data in the Warehouse
  Attempt to Attain “Value” through Analysis
  Relies on Trained and Adept Skilled Knowledge Workers who Discover Information
Data Mart
  Organized Data for a Subset of an Organization
Chaps26.28.29-43
CSE 4701
Conceptual Structure of Data Warehouse
Data Warehouse processing involves
  Cleaning and reformatting of data
  OLAP
  Data Mining
[Figure: databases and other data inputs pass through cleaning and reformatting into the data warehouse (data and metadata), which receives updates/new data, supports back flushing, and feeds OLAP, DSSI, EIS, and data mining.]
Chaps26.28.29-44
CSE 4701
Building a Data Warehouse
Option 1: Consolidate Data Marts
  Leverage Existing Repositories
  Collate and Collect
  May Not Capture All Relevant Data
Option 2: Build from Scratch
  Start from Scratch
  Utilize Underlying Corporate Data
[Figure: a corporate data warehouse built either by consolidating several data marts (Option 1) or directly from corporate data (Option 2).]
Chaps26.28.29-45
CSE 4701
Comparison with Traditional Databases
Data Warehouses are mainly optimized for appropriate data access.
  Traditional databases are transactional.
  They are optimized for both access mechanisms and integrity assurance measures.
Data warehouses emphasize historical data, as they support time-series and trend analysis.
Compared with transactional databases, data warehouses are nonvolatile.
In transactional databases, the transaction is the mechanism of change to the database.
In a warehouse, data is relatively coarse grained and the refresh policy is carefully chosen, usually incremental.
Chaps26.28.29-46
CSE 4701
Classification of Data Warehouses
Generally, Data Warehouses are an order of magnitude larger than the source databases.
The sheer volume of data is an issue, based on which Data Warehouses could be classified as follows.
  Enterprise-wide data warehouses
    Huge projects requiring massive investment of time and resources.
  Virtual data warehouses
    Provide views of operational databases that are materialized for efficient access.
  Data marts
    Generally targeted to a subset of an organization, such as a department, and are more tightly focused.
Chaps26.28.29-47
CSE 4701
Data Warehouse Characteristics
Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
  Store of Integrated Data from Multiple Sources
  Processed into Multi-Dimensional Model
Warehouse Supports Time Series and Trend Analysis
  “Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
  Doesn’t Dramatically Change Over Time
  Updates at Regular Intervals
  Specific Refresh Policy Regarding Some Data
Chaps26.28.29-48
CSE 4701
Three Tier Architecture
[Figure: external data sources and operational databases are extracted, transformed, loaded, and refreshed (via a monitor and integrator, with metadata) into the data warehouse and data marts; these serve an OLAP server, which supports summarization reports, query reports, and data mining.]
Chaps26.28.29-49
CSE 4701
Data Modeling for Data Warehouses
Traditional Databases generally deal with two-dimensional data (similar to a spreadsheet).
  However, querying performance in a multi-dimensional data storage model is much more efficient.
Data warehouses can take advantage of this feature, as generally
  They are nonvolatile.
  The degree of predictability of the analysis that will be performed on them is high.
Chaps26.28.29-50
CSE 4701
What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful
Aggregate Raw Data!
Chaps26.28.29-51
CSE 4701
Multi-Dimensional Schemas
Supporting Multi-Dimensional Schemas Requires Two Types of Tables:
  Dimension Table: Tuples of Attributes for Each Dimension
  Fact Table: Measured/Observed Variables with Pointers into Dimension Table
Star Schema
  Characterizes Data Cubes by having a Fact Table with a Single Table for Each Dimension
Snowflake Schema
  Dimension Tables from Star Schema are Organized into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
Chaps26.28.29-52
CSE 4701
Data Modeling for Data Warehouses
Advantages of a multi-dimensional model
  Multi-dimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display.
  The data can be directly queried in any combination of dimensions, bypassing complex database queries.
Chaps26.28.29-53
CSE 4701
Data Warehouse Design
Most Data Warehouses use a Star Schema to Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates and Stores Measures for those Coordinates
A Fact Table Connects All Dimension Tables with a Multiple Join
  Each Tuple in Fact Table Represents the Content of One Dimension
  Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimensional Tables
  Links Between the Fact Table and the Dimensional Tables Form a Shape Like a Star
Chaps26.28.29-55
CSE 4701
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Dimension Tables
  Product: ProductNo, ProdName, ProdDesc, Category
  Customer: CustID, CustName, CustCity, CustCountry
  Date: Date, Month, Year
  Store: StoreID, City, State, Country, Region
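A rough Python sketch of how a star schema is queried: each fact row holds one key per dimension plus the measures, and a query joins facts to dimension rows before aggregating. Only the Store and Date dimensions are modeled here, and all rows and the helper `dollar_sales_by` are hypothetical.

```python
from collections import defaultdict

# Dimension tables keyed by their identifiers (hypothetical rows).
STORE = {1: {"City": "Atlanta", "Region": "South"},
         2: {"City": "Chicago", "Region": "Central"}}
DATE = {"2000-03-01": {"Month": "March", "Year": 2000},
        "2000-04-01": {"Month": "April", "Year": 2000}}

# Fact table: one foreign key per dimension plus the measures.
SALE = [
    {"Date": "2000-03-01", "Store": 1, "Unit_Sales": 2, "Dollar_Sales": 50.0},
    {"Date": "2000-03-01", "Store": 2, "Unit_Sales": 1, "Dollar_Sales": 20.0},
    {"Date": "2000-04-01", "Store": 1, "Unit_Sales": 4, "Dollar_Sales": 90.0},
]

def dollar_sales_by(store_attr, date_attr):
    """Join each fact row to its dimension rows and sum Dollar_Sales per group."""
    totals = defaultdict(float)
    for fact in SALE:
        group = (STORE[fact["Store"]][store_attr], DATE[fact["Date"]][date_attr])
        totals[group] += fact["Dollar_Sales"]
    return dict(totals)

by_region_month = dollar_sales_by("Region", "Month")
by_city_year = dollar_sales_by("City", "Year")
```

The same fact rows answer both groupings; only the dimension attributes chosen for the group key change.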
Chaps26.28.29-58
CSE 4701
Multi-dimensional Schemas
Fact Constellation
  A fact constellation is a set of fact tables that share some dimension tables.
  However, fact constellations limit the possible queries for the warehouse.
Chaps26.28.29-60
CSE 4701
Data Warehouse Issues
Data Acquisition
  Extraction from Heterogeneous Sources
  Reformatted into Warehouse Context - Names, Meanings, Data Domains Must be Consistent
  Data Cleaning for Validity and Quality
    Is the Data as Expected w.r.t. Content? Value?
  Transition of Data into Data Model of Warehouse
  Loading of Data into the Warehouse
Other Issues Include:
  How Current is the Data? Frequency of Update?
  Availability of Warehouse?
  Dependencies of Data?
  Distribution, Replication, and Partitioning Needs?
  Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?
Chaps26.28.29-61
CSE 4701
OLAP Strategies
OLAP Strategies
  Roll-Up: Summarization of Data
  Drill-Down: from the General to Specific (Details)
  Pivot: Cross Tabulate the Data Cubes
  Slice and Dice: Projection Operations Across Dimensions
  Sorting: Ordering Result Sets
  Selection: Access by Value or Value Range
Implementation Issues
  Persistent with Infrequent Updates (Loading)
  Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes
  Recovery Less Critical - Mostly Read Only
  Temporal Aspects of Data (Versions) Important
Chaps26.28.29-62
CSE 4701
Knowledge Discovery
Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully
Knowledge Discovery
  Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set
  Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from Large Collection of Data
Data Mining
  A Critical Step in the Knowledge Discovery Process
  Extracts Implicit Information from Large Data Set
KDD: Knowledge Discovery and Data Mining
Chaps26.28.29-63
CSE 4701
Steps in a KDD Process
Learning the Application Domain (Goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
  Choosing the Mining Method(s) and Algorithm(s)
  Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
  This is Not an Automated Process!
  Requires Significant Human Interaction!
Chaps26.28.29-64
CSE 4701
Processing in a Data Warehouse
Processing Types are Varied and Include:
  Roll-up: Data is summarized with increasing generalization.
  Drill-Down: Increasing levels of detail are revealed.
  Pivot: Cross tabulation is performed.
  Slice and dice: Performing projection operations on the dimensions.
  Sorting: Data is sorted by ordinal value.
  Selection: Data is available by value or range.
  Derived attributes: Attributes are computed by operations on stored and derived values.
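A few of these operations can be sketched over a tiny in-memory cube. This is only an illustration: the cube contents, the `QUARTER` hierarchy, and the function names are invented, and slice and dice are modeled here as fixing values along one or more dimensions, which is one common reading of those terms.

```python
from collections import defaultdict

# A tiny sales cube stored as {(product, city, month): sales}; values are made up.
CUBE = {
    ("Electronics", "Atlanta", "Jan"): 10,
    ("Electronics", "Atlanta", "Feb"): 12,
    ("Electronics", "Savannah", "Jan"): 4,
    ("Clothing", "Atlanta", "Jan"): 7,
}
QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}   # month -> quarter hierarchy

def roll_up_to_quarter(cube):
    """Roll-up: summarize months into quarters."""
    out = defaultdict(int)
    for (product, city, month), sales in cube.items():
        out[(product, city, QUARTER[month])] += sales
    return dict(out)

def slice_city(cube, city):
    """Slice: fix one dimension (city) and keep the remaining two."""
    return {(p, m): s for (p, c, m), s in cube.items() if c == city}

def dice(cube, product, city):
    """Dice: constrain several dimensions at once."""
    return {k: s for k, s in cube.items() if k[0] == product and k[1] == city}

by_quarter = roll_up_to_quarter(CUBE)
atlanta = slice_city(CUBE, "Atlanta")
```

Drill-down is the inverse of `roll_up_to_quarter`: it requires going back to the finer-grained cube, since the quarterly totals alone cannot be split into months.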
Chaps26.28.29-65
CSE 4701
On-Line Analytical Processing
Data Cube
  A Multidimensional Array
  Each Attribute is a Dimension
In the Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date

Product      Store     Date     Sale
acron        Rolla,MO  7/3/99   325.24
budwiser     LA,CA     5/22/99  833.92
large pants  NY,NY     2/12/99  771.24
3’ diaper    Cuba,MO   7/30/99  81.99

[Figure: data cube with dimensions Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April).]
Chaps26.28.29-66
CSE 4701
Examples of Data Mining
The Slicing Action
  A Vertical or Horizontal Slice Across Entire Cube
[Figure: multi-dimensional data cube of Sales with dimensions Months, Cities, and Products; slice on city Atlanta.]
Chaps26.28.29-67
CSE 4701
Examples of Data Mining
The Dicing Action
  A Slice First Identifies One Dimension
  A Selection of Any Cube within the Slice which Essentially Constrains All Three Dimensions
[Figure: Sales cube with dimensions Months, Cities, and Products; the Atlanta slice is diced on Electronics and Atlanta for March 2000.]
Chaps26.28.29-68
CSE 4701
Examples of Data Mining
Drill Down - Takes a Facet (e.g., Q1) and Decomposes into Finer Detail
Roll Up: Combines Multiple Dimensions From Individual Cities to State
[Figure: Products Sales cube by quarter (Q1 to Q4) and Location (city, GA: Atlanta, Columbus, Gainesville, Savannah); drill down on Q1 gives Jan, Feb, March; roll up on Location (State, USA) gives Arizona, California, Georgia, Iowa.]
Chaps26.28.29-69
CSE 4701
Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
  Time series data
  Geographical and Satellite Data
  Spatial databases
  Multimedia databases
  World Wide Web
Chaps26.28.29-70
CSE 4701
Advantages/Objectives of Data Mining
Descriptive Mining
  Discover and Describe General Properties
  60% of People who Buy Beer on Friday also have Bought Nuts or Chips in the Past Three Months
Predictive Mining
  Infer Interesting Properties based on Available Data
  People who Buy Beer on Friday usually also Buy Nuts or Chips
Result of Mining
  Order from Chaos
  Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.
  Impact on Marketing Strategy
Chaps26.28.29-71
CSE 4701
Data Mining Methods
Association
  Discover the Frequency of Items Occurring Together in a Transaction or an Event
  Example
    80% of Customers who Buy Milk also Buy Bread
      Hence - Bread and Milk Adjacent in Supermarket
    50% of Customers Forget to Buy Milk/Soda/Drinks
      Hence - Available at Register
Prediction
  Predicts Some Unknown or Missing Information based on Available Data
  Example
    Forecast Sale Value of Electronic Products for Next Quarter via Available Data from Past Three Quarters
Chaps26.28.29-72
CSE 4701
Association Rules
Motivated by Market Analysis
Rules of the Form
  $Item_1 \wedge Item_2 \wedge \dots \wedge Item_k \Rightarrow Item_{k+1} \wedge \dots \wedge Item_n$
Example
  $\text{Beer} \wedge \text{Soft Drink} \Rightarrow \text{Pop Corn}$
Problem: Discovering All Interesting Association Rules in a Large Database is Difficult!
Issues
  Interestingness
  Completeness
  Efficiency
Basic Measurement for Association Rules
  Support of the Rule
  Confidence of the Rule
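Support and confidence can be computed directly from a list of transactions. The sketch below uses the standard definitions (support as the fraction of transactions containing all items of the rule, confidence as support of the whole rule divided by support of its left-hand side); the transactions themselves are invented for illustration.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"beer", "soft drink", "pop corn"},
    {"beer", "soft drink"},
    {"beer", "pop corn"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

lhs, rhs = {"beer", "soft drink"}, {"pop corn"}
rule_support = support(lhs | rhs, TRANSACTIONS)        # 1 of 4 transactions
rule_confidence = confidence(lhs, rhs, TRANSACTIONS)   # 1 of the 2 containing lhs
```

A rule is usually reported only when both numbers exceed user-chosen thresholds, which is what makes exhaustive discovery over a large database expensive.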
Chaps26.28.29-73
CSE 4701
Data Mining Methods
Classification
  Determine the Class or Category of an Object based on its Properties
  Example
    Classify Companies based on the Final Sale Results in the Past Quarter
Clustering
  Organize a Set of Multi-dimensional Data Objects in Groups to Minimize Inter-group Similarity and Maximize Intra-group Similarity
  Example
    Group Crime Locations to Find Distribution Patterns
Chaps26.28.29-74
CSE 4701
Classification
Classification is the process of learning a model that is able to describe different classes of data.
Learning is supervised, as the classes to be learned are predetermined.
Learning is accomplished by using a training set of pre-classified data.
The model produced is usually in the form of a decision tree or a set of rules.
Chaps26.28.29-75
CSE 4701
One Classification Example
Rule extracted from the decision tree of Figure 28.7:
  IF 50K > salary >= 20K AND age >= 25
  THEN class is “yes”
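Written as code, the extracted rule is a single guarded test. The function name `classify` and the choice to return `None` when the rule does not fire are assumptions of this sketch; the remaining branches of the decision tree are not given in the text and are not guessed here.

```python
def classify(salary, age):
    """Apply the extracted rule; return "yes" when it fires, else None.

    None means "this rule does not decide": the other branches of the
    decision tree are not shown in the text, so they are left out.
    """
    if 20_000 <= salary < 50_000 and age >= 25:
        return "yes"
    return None

print(classify(30_000, 30))
print(classify(50_000, 30))
```

Note the boundary handling: a salary of exactly 20K satisfies the rule, while exactly 50K does not.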
Chaps26.28.29-76
CSE 4701
Classification
Two Stages
  Learning Stage: Construction of a Classification Function or Model
  Classification Stage: Prediction of Classes of Objects Using the Function or Model
Tools for Classification
  Decision Tree
  Bayesian Network
  Neural Network
  Regression
Problem
  Given a Set of Objects whose Classes are Known (Training Set), Derive a Classification Model which can Correctly Classify Future Objects
Chaps26.28.29-77
CSE 4701
An Example
Attributes
  Class Attribute - Play/Don’t Play the Game
Training Set
  Values that Set the Condition for the Classification
  What are the Patterns Below?

Attribute    Possible Values
outlook      sunny, overcast, rain
temperature  continuous
humidity     continuous
windy        true, false

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  No
overcast  83           78        false  Yes
sunny     80           90        true   No
sunny     72           95        false  No
sunny     72           70        false  Yes
…         …            …         …      ...
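One way to explore the question is to write down a candidate pattern by hand and score it against the listed rows. The rule below (overcast means play; sunny means play only when humidity is at most 75) is a guess for illustration, not a learned model, and the threshold of 75 is an assumption; only the five fully listed rows are used.

```python
# The five fully listed training rows: (outlook, temperature, humidity, windy, play).
TRAINING = [
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]

def candidate_rule(outlook, temperature, humidity, windy):
    """A hand-written guess at the pattern, not a learned model."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity <= 75 else "No"
    return None   # rain rows are elided in the table, so no guess is made

correct = sum(candidate_rule(*row[:4]) == row[4] for row in TRAINING)
accuracy = correct / len(TRAINING)
```

Agreement on five rows says little about unseen data; a learning algorithm would pick the splits from the full training set and be evaluated on held-out examples.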
Chaps26.28.29-78
CSE 4701
Data Mining Methods
Summarization
  Characterization (Summarization) of General Features of Objects in the Target Class
  Example
    Characterize People’s Buying Patterns on the Weekend
    Potential Impact on “Sale Items” & “When Sales Start”
    Department Stores with Bonus Coupons
Discrimination
  Comparison of General Features of Objects Between a Target Class and a Contrasting Class
  Example
    Comparing Students in Engineering and in Art
    Attempt to Arrive at Commonalities/Differences
Chaps26.28.29-79
CSE 4701
Summarization Technique
Attribute-Oriented Induction
Generalization using Concept Hierarchy (Taxonomy)

barcode  category    brand       content    size
14998    milk        Dairyland   Skim       2L
12998    mechanical  MotorCraft  valve 23a  12in
…        …           …           …          ...

[Figure: concept hierarchy with food at the root; below it Milk, …, bread; below those Skim milk, …, 2% milk and White whole bread, …, wheat; at the leaves the brands Lucern, …, Dairyland and Wonder, …, Safeway.]

Category  Content  Count
milk      skim     280
milk      2%       98
…         …        ...
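The core step of attribute-oriented induction, dropping overly specific attributes and counting the tuples that become identical, can be sketched in a few lines. The item rows and the helper `generalize` are hypothetical; a fuller version would also climb the concept hierarchy, for example replacing a brand by its category.

```python
from collections import Counter

# Item-level rows (hypothetical, shaped like the item table).
ITEMS = [
    {"barcode": 14998, "category": "milk", "brand": "Dairyland", "content": "skim"},
    {"barcode": 14999, "category": "milk", "brand": "Lucern", "content": "skim"},
    {"barcode": 15001, "category": "milk", "brand": "Lucern", "content": "2%"},
]

def generalize(rows, keep):
    """Drop all attributes except `keep`, then merge identical tuples and count them."""
    return Counter(tuple(row[a] for a in keep) for row in rows)

summary = generalize(ITEMS, keep=("category", "content"))
for (category, content), count in summary.items():
    print(category, content, count)
```

The resulting counts have the same shape as the Category/Content/Count summary: distinct barcodes and brands collapse into one generalized tuple with a tally.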
Chaps26.28.29-80
CSE 4701
Building A Data Warehouse
The builders of a data warehouse should take a broad view of the anticipated use of the warehouse.
  The design should support ad-hoc querying.
  An appropriate schema should be chosen that reflects the anticipated usage.
The design of a data warehouse involves the following steps.
  Acquisition of data for the warehouse.
  Ensuring that data storage meets the query requirements efficiently.
  Giving full consideration to the environment in which the data warehouse resides.
Chaps26.28.29-81
CSE 4701
Building A Data Warehouse
  Acquisition of data for the warehouse
    The data must be extracted from multiple, heterogeneous sources.
    Data must be formatted for consistency within the warehouse.
    The data must be cleaned to ensure validity.
      It is difficult to automate the cleaning process.
      Back flushing: upgrading the source data with the cleaned data.
    The data must be fitted into the data model of the warehouse.
    The data must be loaded into the warehouse.
      A proper design for the refresh policy should be considered.
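The acquisition steps above (extract, format, clean, fit, load) can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the two sources, their field names, the `sales_fact` table, and the cleaning rule are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def extract():
    """Two hypothetical heterogeneous sources with different field names and formats."""
    source_a = [{"cust": "ann", "amount": "12.50", "date": "2015-03-01"}]
    source_b = [
        {"customer_name": "BOB", "total_cents": 990, "day": "03/02/2015"},
        {"customer_name": "", "total_cents": 100, "day": "03/03/2015"},
    ]
    return source_a, source_b


def transform(source_a, source_b):
    """Format both sources consistently: (customer, amount in dollars, ISO date)."""
    rows = []
    for r in source_a:
        rows.append((r["cust"].title(), float(r["amount"]), r["date"]))
    for r in source_b:
        month, day, year = r["day"].split("/")
        rows.append((r["customer_name"].title(), r["total_cents"] / 100,
                     f"{year}-{month}-{day}"))
    return rows


def clean(rows):
    """Keep only rows with a customer name and a non-negative amount."""
    return [r for r in rows if r[0] and r[1] >= 0]


def load(conn, rows):
    """Fit the rows into the warehouse's (assumed) fact table and load them."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(customer TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, clean(transform(*extract())))
print(warehouse.execute("SELECT * FROM sales_fact ORDER BY sale_date").fetchall())
```

A refresh policy would decide how often this pipeline is rerun and whether it reloads everything or only the changes since the last run.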
Chaps26.28.29-82
CSE 4701
Building A Data Warehouse
  Storing the data according to the data model of the warehouse
  Creating and maintaining required data structures
  Creating and maintaining appropriate access paths
  Providing for time-variant data as new data are added
  Supporting the updating of warehouse data
  Refreshing the data
  Purging data
Chaps26.28.29-83
CSE 4701
Why is Data Mining Popular? Technology Push
Technology for Collecting Large Quantity of Data Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collection of Data Databases, Data Warehouses Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances Targeted Marketing by Age, Region, Income, etc. Exploiting User Preferences/Customized Shopping
Chaps26.28.29-84
CSE 4701
Requirements & Challenges in Data Mining Security and Social
What Information is Available to Mine? Preferences via Store Cards/Web Purchases What is Your Comfort Level with Trends?
User Interfaces and Visualization What Tools Must be Provided for End Users of
Data Mining Systems? How are Results for Multi-Dimensional Data
Displayed? Performance Guarantees
Range from Real-Time for Some Queries to Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured Data - Ability to Format, Clean, and Load Data Sets
Chaps26.28.29-85
CSE 4701
Data Mining Visualization Leverage Improving 3D Graphics and Increased PC
Processing Power for Displaying Results Significant Research in Visualization w.r.t. Displaying
Multi-Dimensional Data
Chaps26.28.29-86
CSE 4701
Successful Data Mining Applications Business Data Analysis and Decision Support
Marketing, Customer Profiling, Market Analysis and Management, Risk Analysis and Management
Fraud Detection Detecting Telephone Fraud, Automotive and
Health Insurance Fraud, Credit-card Fraud, Suspicious Money Transactions (Money Laundering)
Text Mining Message Filtering (Email, Newsgroups, Etc.) Newspaper Articles Analysis
Sports IBM Advanced Scout Analyzed NBA Game
Statistics (Shots Blocked, Assists and Fouls) to Gain Competitive Advantage
Chaps26.28.29-88
CSE 4701
Databases on WWW
  Web has changed the way we do Business & Research
  Facts:
    Industry Saw an Opportunity, knew it had to Move Quickly to Capitalize
    Lots of Action, Lots of Money, Lots of Releases
    Line Between Research and Development is Very Narrow
    Many Researchers Moved to Industry (Trying to Return to Academia)
  Emergence of Java
    Java changed the way that Software was Designed, Developed, and Utilized
    Particularly w.r.t. Web-Based Applications, Database Interoperability, Web Architectures, etc.
  Emergence of Enterprise Computing
Chaps26.28.29-89
CSE 4701
Internet and the Web A Major Opportunity for Business
A Global Marketplace Business Across State and Country Boundaries
A Way of Extending Services Online Payment vs. VISA, Mastercard
A Medium for Creation of New Services Publishers, Travel Agents, Teller, Virtual Yellow Pages,
Online Auctions … A Boon for Academia
Research Interactions and Collaborations Free Software for Classroom/Research Usage Opportunities for Exploration of Technologies in
Student Projects
Chaps26.28.29-90
CSE 4701
WWW: Three Market Segments
  Intranet
    Decision support
    Mfg. System monitoring
    Corporate repositories
    Workgroups
  Internet
    Sales
    Marketing
    Information Services
  Business to Business
    Information sharing
    Ordering info./status
    Targeted electronic commerce
[Figure: Servers, Corporate Networks, Internet]
Chaps26.28.29-91
CSE 4701
Information Delivery Problems on the Net Everyone can Publish Information on the Web
Independently at Any Time Consequently, there is an Information Explosion Identifying Information Content More Difficult
There are too Many Search Engines but too Few Capable of Returning High Quality Data Is this Still True?
Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes Is this Still True?
Chaps26.28.29-92
CSE 4701
Example Web Applications Scenario 1: World Wide Wait
A Major Event is Underway and the Latest, Up-to-the Minute Results are Being Posted on the Web
You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Pointing at the Result Posting Site, and Wait, and Wait, and Wait …
What is the Problem? The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application
Changed with Emergence of Mobile Computing?
Chaps26.28.29-93
CSE 4701
Example Web Applications
  Scenario 2:
    Many Applications Today Need to Track Changes in Local and Remote Data Sources and Send Change Notifications If Some Condition Over the Data Source(s) is Met
    If You Want to Monitor Changes on the Web, You Need to Fire up Your Trusty Web Browser from Time to Time, Cache the Most Recent Result, and Compute the Difference Manually Each Time You Poll the Data Source(s) …
  What is the Problem?
    Pure Pull is Not the Answer to All Problems
    Changed with Emergence of Mobile Computing?
Chaps26.28.29-94
CSE 4701
What is the Problem? Applications are Asymmetric but the Web is Not
Computation Centric vs. Information Flow Centric Type of Asymmetry
Network Asymmetry Satellite, CATV, Mobile Clients, Etc.
Client to Server Ratio Too Many Clients can Swamp Servers
Data Volume Mouse and Key Click vs. Content Delivery
Update and Information Creation Clients Need to be Informed or Must Poll
What have we Seen re. Cell Networks Over Time?
Chaps26.28.29-95
CSE 4701
Useful Solutions Combination/Interleave of Pull and Push Protocols
User-initiated, Comprehensive Search-based Information Delivery (Pull)
Server-initiated Information Dissemination (Push) Provide Support for a Variety of Data Delivery
Protocols, Frequencies, and Delivery Modes Information Delivery Frequencies
Periodic, Conditional, Ad-Hoc Information Delivery Modes Information Delivery Protocols (IDP)
Request/Respond, Polling, Publish/Subscribe, Broadcast
Information Delivery Styles (IDS) Pull, Push, Hybrid
Chaps26.28.29-96
CSE 4701
Information Delivery Frequencies Periodic
Data is Delivered from a Server to Clients Periodically
Period can be Defined by System-default or by Clients Using their Profiles
Period can be Influenced by Client and Bandwidth Mobile Device vs. PC w/Modem PC w/DSL vs. PC w/Cable Modem Multiple Mobile Devices of All Types Streaming of Videos, Live Streaming of Events
Conditional (Aperiodic) Data is Delivered from a Server when Conditions
Installed by Clients in their Profiles are Satisfied Ad-hoc (or Irregular)
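The difference between periodic and conditional delivery can be sketched with client profiles that a server consults on every tick. This is an illustrative sketch only: the profile layout, the `due_clients` function, and the price feed are hypothetical, not part of any particular system.

```python
def due_clients(profiles, tick, value):
    """Return the clients that should receive `value` at this tick."""
    due = []
    for name, profile in profiles.items():
        if profile["kind"] == "periodic" and tick % profile["period"] == 0:
            # Periodic: deliver every `period` ticks regardless of the data.
            due.append(name)
        elif profile["kind"] == "conditional" and profile["condition"](value):
            # Conditional (aperiodic): deliver only when the client's condition holds.
            due.append(name)
    return due


# Hypothetical client profiles installed at the server.
PROFILES = {
    "ticker": {"kind": "periodic", "period": 2},
    "alert": {"kind": "conditional", "condition": lambda price: price > 100},
}

prices = [98, 101, 99, 105]
schedule = [due_clients(PROFILES, tick, price) for tick, price in enumerate(prices)]
print(schedule)
```

The periodic client is served on ticks 0 and 2 whatever the price is, while the conditional client is served only on the ticks where the price exceeds its threshold.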
Chaps26.28.29-97
CSE 4701
Information Delivery Modes
  Uni-cast
    Data is Sent from a Data Source (a Single Server) to Another Machine
  1-to-n
    Data is Sent by a Single Data Source and Received by Multiple Machines
    Multicast vs. Broadcast
      Multicast: Data is Sent to a Specific Set of Clients
      Broadcast: Data is Sent Over a Medium to which an Unidentified or Unbounded Set of Clients can Listen
Chaps26.28.29-98
CSE 4701
IDP: Request/Respond
  Semantics of Request/Respond
    Clients Send Requests to Servers Asking for the Information of Interest
    Servers Respond to the Client Request by Delivering the Information Requested
    Client can Wait (Synchronous) or Not
  Applications
    Most Database Systems and Web Search Engines Use the Request/Respond Protocol for Client-Server Communication
  What has Changed with Mobile Computing?
Chaps26.28.29-99
CSE 4701
IDP: Programmed Polling vs. User Polling
  Semantics:
    Programmed Polling: a System Periodically Sends Requests to Other Sites to Obtain Status Information or Detect Changed Values
    User Polling: a User or Application Periodically or Aperiodically Polls the Data Sites and Obtains the Changes
  Applications
    Programmed Polling: Saves the Users from having to Click, but does Nothing to Solve the Scalability Problems Caused by the Request/Respond Mechanism
  What do Today’s Mobile Devices Use?
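Programmed polling can be sketched as a loop that pulls a snapshot, compares it with the cached copy, and reports what changed. The sketch below is an assumption-laden illustration: the snapshots are made-up dictionaries standing in for a remote results page, and `poll_for_changes` is a hypothetical helper; a real poller would fetch over the network and sleep between rounds.

```python
def poll_for_changes(fetch, cached):
    """Pull the current snapshot and diff it against the cached one."""
    current = fetch()
    changed = {key: value for key, value in current.items()
               if cached.get(key) != value}
    removed = [key for key in cached if key not in current]
    return current, changed, removed


# Hypothetical data source: successive snapshots of a results page.
snapshots = iter([
    {"game1": "2-1", "game2": "0-0"},
    {"game1": "2-1", "game2": "1-0"},
    {"game1": "2-2", "game2": "1-0", "game3": "0-0"},
])

cache = {}
change_log = []
for _ in range(3):  # each round is still a full request/respond exchange
    cache, changed, removed = poll_for_changes(lambda: next(snapshots), cache)
    change_log.append(changed)

print(change_log)
```

The user no longer clicks, but every round still transfers the whole snapshot, which is the scalability problem the slide points out.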
Chaps26.28.29-100
CSE 4701
IDP: Publish/Subscribe Semantics: Servers Publish/Clients Subscribe
Servers Publish Information Online Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery) Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic Danger: Subscriptions can Lead to Other
Unwanted Subscriptions Applications
Unicast: Database Triggers and Active Databases 1-to-n: Online News Groups
How is this Utilized in Mobile Devices?
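The publish/subscribe semantics above can be sketched with a small in-process broker. This is a minimal illustration under stated assumptions: the `Broker` class and the topic names are hypothetical, callbacks stand in for network delivery, and nothing here models a real messaging product.

```python
from collections import defaultdict


class Broker:
    """Keeps subscriptions per topic and pushes each published message to them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Clients register interest once; no further requests are needed.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Data flow is initiated by the source: every subscriber is called.
        for callback in list(self._subscribers[topic]):
            callback(message)
        return len(self._subscribers[topic])


inbox_a, inbox_b = [], []
broker = Broker()
broker.subscribe("news", inbox_a.append)
broker.subscribe("news", inbox_b.append)
broker.subscribe("sports", inbox_b.append)

delivered = broker.publish("news", "headline 1")
broker.publish("sports", "final score")
print(delivered, inbox_a, inbox_b)
```

One publish call reaches every subscriber of the topic (1-to-n), and a client that never subscribed to a topic receives nothing from it.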
Chaps26.28.29-101
CSE 4701
Information Delivery Styles Pull-Based System
Transfer of Data from Server to Client is Initiated by a Client Pull
Clients Determine when to Get Information Potential for Information to be Old Unless Client
Periodically Pulls Push-Based System
Transfer of Data from Server to Client is Initiated by a Server Push
Clients may get Overloaded if Push is Too Frequent
Hybrid Pull and Push Combined Pull First and then Push Continually
Chaps26.28.29-102
CSE 4701
Summary: Pull vs. Push
Columns: Request/Respond, Publish/Subscribe, Broadcast, Periodic, Conditional, Ad-hoc
Pure Pull   Y Y
Pure Push   Y Y Y Y*
Hybrid      Y Y Y Y Y Y*
Chaps26.28.29-103
CSE 4701
Design Options for Nodes Three Types of Nodes:
Data Sources Provide Base Data which is to be Disseminated
Clients Who are the Net Consumers of the Information
Information Brokers Acquire Information from Other Data Sources, Add
Value to that Information and then Distribute this Information to Other Consumers
By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Need of Many Users
How has this Changed with Today’s Mobile Computing?
Chaps26.28.29-104
CSE 4701
The Next Big Challenge Interoperability
Heterogeneous Distributed Databases Heterogeneous Distributed Systems Autonomous Applications
Scalability Rapid and Continuous Growth
Amount of Data Variety of Data Types Dealing with personally identifiable information (PII)
and personal health information (PHI) Emergence of Fitness and Health Monitoring Apps Google Fit and Apple HealthKit New Apple ResearchKit for Medical Research
Chaps26.28.29-105
CSE 4701
Interoperability: A Classic View
[Figure: Simple Federation - an FDB Global Schema built by Federated Integration over three Local Schemas]
[Figure: Multiple Nested Federation - FDB Global Schema 4 built by Federated Integration over FDB 1, a Local Schema, and FDB 3, where FDB 1 and FDB 3 are Federations]
Chaps26.28.29-106
CSE 4701
Java Client with Wrapper to Legacy Application
[Figure: Java Client with C++/C Wrapper. Labels: Legacy Application, Network, Java Client, Java Application Code, WRAPPER, JAVA LAYER (Mapping Classes), NATIVE LAYER (Native Functions (C++), RPC Client Stubs (C))]
Interactions Between Java Client and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Chaps26.28.29-107
CSE 4701
COTS and Legacy Appls. to Java Clients
[Figure: two Java Clients connected over the Network to a COTS Application and a Legacy Application. Labels on each side: Java Application Code, JAVA NETWORK WRAPPER, Mapping Classes, JAVA LAYER, NATIVE LAYER, Native Functions that Map to COTS Appl / Native Functions that Map to Legacy Appl]
Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
Chaps26.28.29-108
CSE 4701
Java Client to Legacy App via RDBS
[Figure: Java Client, Legacy Application, and Relational Database System (RDS). Arrow labels: Extract and Generate Data, Transform and Store Data, Transformed Legacy Data, Updated Data]
Chaps26.28.29-109
CSE 4701
Database Interoperability in the Internet
  Technology
    Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL
  Architecture
    Information Broker
      Mediator-Based Systems
      Agent-Based Systems
Chaps26.28.29-110
CSE 4701
JDBC
[Figure: Java Application, JDBC API, Driver Manager, and one Driver per database: Oracle, Sybase, Access]
JDBC API Provides DB Access Protocols for Open, Query, Close, etc.
Different Drivers for Different DB Platforms
Chaps26.28.29-111
CSE 4701
Connecting a DB to the Web
  Web Servers are Stateless
  DB Interactions Tend to be Stateful
  Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open
[Figure: Browser, Internet, Web Server, CGI Script Invocation or JDBC Invocation, DBMS]
Chaps26.28.29-112
CSE 4701
Connecting More Efficiently
  To Avoid the Cost of Opening the Database, One can Use Helper Processes that Always Keep the Database Open and Outlive the Web Connection
  Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
  System is Still Stateless
[Figure: Browser, Internet, Web Server, CGI Script or JDBC Invocation, Helper Processes, DBMS]
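The saving from a long-lived helper can be sketched by counting database opens. This is a hedged illustration, not a CGI implementation: an in-memory SQLite database stands in for the DBMS, and `CountingOpener`, `handle_request_cgi_style`, and `Helper` are hypothetical names introduced for the example.

```python
import sqlite3


class CountingOpener:
    """Wraps sqlite3.connect so the number of database opens can be observed."""

    def __init__(self):
        self.opens = 0

    def __call__(self):
        self.opens += 1
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (42)")
        return conn


def handle_request_cgi_style(opener):
    """Naive approach: every request opens and closes the database."""
    conn = opener()
    try:
        return conn.execute("SELECT x FROM t").fetchone()[0]
    finally:
        conn.close()


class Helper:
    """Stands in for a helper process that keeps the database open across requests."""

    def __init__(self, opener):
        self._conn = opener()

    def handle_request(self):
        return self._conn.execute("SELECT x FROM t").fetchone()[0]

    def close(self):
        self._conn.close()


per_request = CountingOpener()
naive_results = [handle_request_cgi_style(per_request) for _ in range(3)]

shared = CountingOpener()
helper = Helper(shared)
helper_results = [helper.handle_request() for _ in range(3)]
helper.close()

print(per_request.opens, shared.opens)
```

Each request is still handled independently, so the system stays stateless from the browser's point of view; only the expensive open is shared.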
Chaps26.28.29-113
CSE 4701
DB-Internet Architecture
[Figure: WWW Clients (Netscape, HotJava, Info. Explore), Internet, HTTP Server, DBWeb Dispatcher, four DBWeb Gateways]
Chaps26.28.29-115
CSE 4701
The focus is to make information available to users, in the right form, at the right time, in the appropriate place.
Technology Push Computer/Communication Technology (Almost Free)
Plenty of Affordable CPU, Memory, Disk, Network Bandwidth
Next Generation Internet: Gigabit Now Wireless: Ubiquitous, High Bandwidth
Information Growth Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors Disk Capacity on the Order of Peta-bytes
Small, Handy Devices to Access Information
Chaps26.28.29-116
CSE 4701
Ubiquitous/Pervasive: Many computers and information appliances everywhere, networked together
Research Challenges
  Inherent Complexity:
    Coping with Latency (Sometimes Unpredictable)
    Failure Detection and Recovery (Partial Failure)
    Concurrency, Load Balancing, Availability, Scale
    Service Partitioning
    Ordering of Distributed Events
  “Accidental” Complexity:
    Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades
    Autonomy: Change and Evolve Autonomously
    Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
Chaps26.28.29-117
CSE 4701
Infosphere
[Figure: Problem: too many sources, too much information (Internet: Information Jungle). Goal: Clean, Reliable, Timely Information, Anywhere. Labels: Digital Earth, Sensors, Personalized Filtering & Info. Delivery, Infopipes, Resource Adaptation, Property Mgmt, Information Quality, Continual Queries, Microfeedback, specialization]
Chaps26.28.29-118
CSE 4701
Current State-of-Art – Has Mobile Changed This?
[Figure: Thin Client, Web Server, Mainframe Database Server]
Chaps26.28.29-119
CSE 4701
Infosphere Scenario – Where Does Mobile Fit?
[Figure: Infotaps & Fat Clients, Variety of Servers, Sensors, Database Server, Many sources]
Chaps26.28.29-120
CSE 4701
Heterogeneity and Autonomy Heterogeneity:
How Much can we Really Integrate? Syntactic Integration
Different Formats and Models XML/JSON/RDF/OWL/SQL Query Languages
Semantic Interoperability Basic Research on Ontology, Etc.
Autonomy No Central DBA on the Net Independent Evolution of Schema and Content Interoperation is Voluntary Interface Technology DCOM: Microsoft Standard
CORBA, Etc...
Chaps26.28.29-121
CSE 4701
Security and Data Quality Security
System Security in the Broad Sense Attacks: Penetrations, Denial of Service System (and Information) Survivability
Security Fault Tolerance Replication for Performance, Availability, and
Survivability Data Quality
Web Data Quality Problems Local Updates with Global Effects Unchecked Redundancy (Mutual Copying) Registration of Unchecked Information Spam on the Rise
Chaps26.28.29-122
CSE 4701
Legacy Data Challenge Legacy Applications and Data
Definition: Important and Difficult to Replace Typically, Mainframe Mission Critical Code Most are OLTP and Database Applications
Evolution of Legacy Databases Client-server Architectures Wrappers Expensive and Gradual in Any Case
Chaps26.28.29-123
CSE 4701
Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability Combining SQL with Keyword Queries
Consistent Updates Atomic Transactions and Beyond
But Everything has to be in a Database! Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions Interoperable Query Processing Extended Transaction Updates
Commodities DB Software A Little Help is Still Good If it is Cheap Internet Facilitates Software Distribution Databases as Middleware
Chaps26.28.29-124
CSE 4701
Concluding Remarks
  Four-Fold Objective
    Distributed Database Processing
    Data Warehouses
    Data Mining of Vast Information Repositories
    Web-Based Architectures for DB Interoperability
  All Four are Tightly Related
    DDBMS can Improve Performance of Mining Repositories as Backend Database Processors
    Web-Based Architectures Provide Access Means for DDBMS or Mining
    Warehouses are Infrastructure to Facilitate Mining
Geographic Information Systems, Deductive DBMS, Multi-Media DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.