CPSC-608 Database Systems
Fall 2011
Instructor: Jianer Chen
Office: HRBB 315C
Phone: 845-4259
Email: [email protected]
Notes #15
Brief Overview on
• Data/information integration (data warehouse)
• Data mining
Data Warehouse (Overview)
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. [Wikipedia]
What is a Warehouse?
• Collection of (possibly diverse) data
– subject oriented
– aimed at executives, decision makers, analysts
– often a copy of operational data
– with value-added data (e.g., summaries, history)
– integrated schema
– time-varying
– non-volatile
What is a Warehouse?
• Collection of tools/services
– gathering data
– cleansing, integrating, ...
– querying, reporting, aggregation, analysis
– data mining
– monitoring, administration
Why a Warehouse?
• Ship and integrate data from different sources to the analyst
• Three approaches:
– Database federations (legacy)
– Query-driven (lazy)
– Warehouse (eager)
Database Federations
• An application program for each connection
• Simple; good if DB communications are limited
• Requires writing many application programs
Warehouse Architecture
[Figure: warehouse architecture. Clients; metadata; SQL and data stored in a unified DB schema.]
Each source has a wrapper/extractor that consists of a collection of predefined queries on the source, and communication mechanisms
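The extractor idea can be sketched as follows. This is a minimal illustration, not the architecture from the slide: the `orders` and `sales` tables, the `PREDEFINED_QUERIES` dictionary, and `extract_and_load` are all hypothetical names, and two in-memory SQLite databases stand in for a source and the warehouse.

```python
import sqlite3

# Hypothetical operational source with its own schema.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, cust TEXT, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "c1", 1250), (2, "c2", 300), (3, "c1", 700)])

# Warehouse with a unified schema (table and column names are illustrative).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (source TEXT, cust TEXT, amount REAL)")

# The extractor is a fixed collection of predefined queries on the source.
PREDEFINED_QUERIES = {
    "all_orders": "SELECT cust, amount_cents FROM orders",
}

def extract_and_load(source_name, query_name):
    """Run one predefined query at the source and copy the rows into the warehouse."""
    rows = source.execute(PREDEFINED_QUERIES[query_name]).fetchall()
    warehouse.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(source_name, cust, cents / 100.0) for cust, cents in rows],
    )
    warehouse.commit()
    return len(rows)

loaded = extract_and_load("store_db", "all_orders")
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Once the rows are loaded, the `SUM` query runs entirely at the warehouse, so the source is not touched at query time.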
Query-Driven Approach
[Figure: query-driven approach. Queries and results are passed along at each step; SQL, but no data stored.]
Each source has a wrapper, which classifies queries into templates, and translates them into queries for the source. The wrapper can be generated from templates using modern compiler techniques.
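A template-based wrapper might look like the sketch below. The mini query syntax ("cars where color = ..."), the `autos` table, and the `TEMPLATES` list are assumptions made up for illustration; a real wrapper generator would produce something more general.

```python
import re
import sqlite3

# Hypothetical source; nothing is copied out of it ahead of time.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE autos (serial TEXT, model TEXT, color TEXT)")
source.executemany("INSERT INTO autos VALUES (?, ?, ?)",
                   [("a1", "taurus", "red"), ("a2", "van", "blue"), ("a3", "taurus", "blue")])

# Each template pairs a pattern over incoming queries with a query for the source.
TEMPLATES = [
    (re.compile(r"^cars where color = (\w+)$"),
     "SELECT serial, model FROM autos WHERE color = ?"),
    (re.compile(r"^cars where model = (\w+)$"),
     "SELECT serial, color FROM autos WHERE model = ?"),
]

def wrapper(query):
    """Classify the query by template, translate it, and run it at the source."""
    for pattern, source_sql in TEMPLATES:
        match = pattern.match(query)
        if match:
            return source.execute(source_sql, match.groups()).fetchall()
    raise ValueError(f"no template matches: {query!r}")

blue_cars = wrapper("cars where color = blue")
```

A query that fits no template is rejected, which reflects the limitation of this approach: the wrapper can only answer the query shapes it was built for.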
Advantages of Query-Driven
• No need to copy data
– less storage
– no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information
OLTP vs. OLAP
• OLTP: On-Line Transaction Processing
– Describes processing at operational sites (sources)
• OLAP: On-Line Analytical Processing
– Describes processing at warehouse
OLTP vs. OLAP
OLTP:
• Mostly updates
• Many small transactions
• Megabytes to terabytes of data
• Raw data
• Up-to-date data
• Consistency, recoverability critical
• Clerical users

OLAP:
• Mostly reads
• Queries long, typically complex aggregations
• Gigabytes to terabytes of data
• Summarized, consolidated data
• Decision-makers, analysts as users
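The contrast in workload shape can be illustrated with a small sketch. For brevity it runs both styles against one in-memory SQLite database with a hypothetical `sales` table; in practice the OLTP statements would run at a source and the OLAP query at the warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")

# OLTP-style: many small transactions, each touching one or a few rows.
with db:  # commits on success, rolls back on error
    db.execute("INSERT INTO sales VALUES (?, ?)", ("west", 10))
with db:
    db.execute("INSERT INTO sales VALUES (?, ?)", ("east", 5))
with db:
    db.execute("INSERT INTO sales VALUES (?, ?)", ("west", 7))

# OLAP-style: a read-only query that scans and aggregates many rows.
totals = dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```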
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Data loading, cleaning,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
Monitoring Issues
• Frequency
– periodic: daily, weekly, …
– triggered: on “big” change, lots of changes, ...
• Data transformation/normalization
– convert data to uniform format
– remove & add fields (e.g., add date to get history)
• Standards
• Gateways (Intranet/internet, firewalls, VPN, etc.)
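The transformation/normalization step above could be sketched like this. All field names (`custId`, `customer`, `cents`, `dollars`) and the two source layouts are hypothetical; the point is only that differing source formats are mapped to one uniform record, with source-specific fields dropped and a load date added.

```python
from datetime import date

def normalize(record, source_name, load_date):
    """Convert one source record to a uniform format (field names are hypothetical)."""
    return {
        # Sources disagree on the key name and on case/whitespace.
        "cust_id": str(record.get("custId") or record.get("customer")).strip().lower(),
        # Sources may report cents or dollars; store dollars everywhere.
        "amount": record["cents"] / 100.0 if "cents" in record else float(record["dollars"]),
        "source": source_name,               # added field: where the row came from
        "load_date": load_date.isoformat(),  # added field: date, to keep history
    }

a = normalize({"custId": " C12 ", "cents": 1999}, "store_a", date(2011, 11, 1))
b = normalize({"customer": "c12", "dollars": "19.99"}, "store_b", date(2011, 11, 2))
```

After normalization the two records agree on customer id and amount and differ only in the added `source` and `load_date` fields.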
Integration
• Data Cleaning
• Data Loading
• Derived Data
Processing
• Index Structures
• What to Materialize?
• Algorithms
[Figure: clients, warehouse (query & analysis, integration, metadata), and sources.]
Managing
• Metadata
• Warehouse Design
• Tools
Warehouse Design
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?
Conclusions
• Massive amounts of data and complexity of queries will push limits of current warehouses
• Need better systems:
– easier to use
– provide quality information
– scalability
Data Mining (Overview)
What is data mining?
A process of examining data and finding simple rules or models that summarize the data.
Mining Techniques:
• Decision Trees
• Clustering
• Association Rules
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new car model
• Want to select customers for an advertising campaign

Training set (table "sale"):

custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
One Possibility
car=taurus?
  Y: city=sf?
       Y: likely
       N: unlikely
  N: age<45?
       Y: likely
       N: unlikely
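The tree on this slide can be written as a small function and checked against the training set. The branch layout used here (car=taurus at the root, city=sf on the Y side, age<45 on the N side) is a reading of the diagram; it is the assignment that agrees with all six training rows.

```python
# Training set from the slides: (custId, car, age, city, newCar).
TRAINING = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]

def predict(car, age, city):
    """Decision tree with car=taurus at the root."""
    if car == "taurus":
        return "likely" if city == "sf" else "unlikely"
    return "likely" if age < 45 else "unlikely"

predictions = {cid: predict(car, age, city) for cid, car, age, city, _ in TRAINING}

# True when every prediction matches the recorded survey answer.
agrees = all((predictions[cid] == "likely") == (label == "yes")
             for cid, _, _, _, label in TRAINING)
```

Fitting the training set perfectly says little about reliability on new customers, since each leaf is supported by only one or two rows.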
Another Possibility
age<30?
  Y: city=sf?
       Y: likely
       N: unlikely
  N: car=van?
       Y: likely
       N: unlikely
Issues
• Decision tree should not be “too deep”
– would not have statistically significant amounts of data for lower decisions
• Need to select the tree that most reliably predicts outcomes
– automatic decision tree construction from training data (supervised learning, since the training set is labeled)
– exploit training data statistics to detect the most “discriminative” attribute/value conditions at each level
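One common way to score how "discriminative" a condition is uses information gain (the reduction in label entropy after splitting). The slides do not name a specific measure, so treat this as one possible choice, sketched on the six-row training set with a hand-picked list of candidate conditions.

```python
from collections import Counter
from math import log2

TRAINING = [
    {"car": "taurus", "age": 27, "city": "sf", "newCar": "yes"},
    {"car": "van",    "age": 35, "city": "la", "newCar": "yes"},
    {"car": "van",    "age": 40, "city": "sf", "newCar": "yes"},
    {"car": "taurus", "age": 22, "city": "sf", "newCar": "yes"},
    {"car": "merc",   "age": 50, "city": "la", "newCar": "no"},
    {"car": "taurus", "age": 25, "city": "la", "newCar": "no"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, condition):
    """Entropy of newCar before the split minus the weighted entropy after it."""
    yes = [r["newCar"] for r in rows if condition(r)]
    no = [r["newCar"] for r in rows if not condition(r)]
    before = entropy([r["newCar"] for r in rows])
    after = sum(len(part) / len(rows) * entropy(part) for part in (yes, no) if part)
    return before - after

# Candidate conditions taken from the two example trees.
CONDITIONS = {
    "car=taurus": lambda r: r["car"] == "taurus",
    "city=sf":    lambda r: r["city"] == "sf",
    "age<30":     lambda r: r["age"] < 30,
    "age<45":     lambda r: r["age"] < 45,
}
gains = {name: information_gain(TRAINING, cond) for name, cond in CONDITIONS.items()}
best = max(gains, key=gains.get)
```

On this data the root conditions of the two example trees (car=taurus and age<30) have zero gain, while city=sf scores highest, which illustrates why the choice of condition at each level is made from the statistics rather than by hand.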
Clustering
[Figure: data points plotted against three axes: age, income, education.]
Another Example: Text
• Each document is a vector
• Clusters contain “similar” documents
• Useful for understanding, searching documents
[Figure: document clusters labeled international news, sports, business.]
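To make "each document is a vector" concrete, the sketch below uses word-count vectors and cosine similarity. Both are assumptions chosen for illustration (the slides do not say which representation or similarity measure is meant), and the three tiny documents are made up.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "team wins the final game",
    "d2": "the team loses the game",
    "d3": "stock market prices fall",
}
vecs = {name: vectorize(text) for name, text in docs.items()}
sim_12 = cosine(vecs["d1"], vecs["d2"])  # two sports-like documents
sim_13 = cosine(vecs["d1"], vecs["d3"])  # sports-like vs. business-like
```

Documents that share vocabulary get a higher similarity, so a clustering method working on these vectors would tend to group d1 with d2 rather than with d3.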
Issues
• Given desired number of clusters?
• Finding “best” clusters
• Are clusters semantically meaningful?
• Using clusters for disk storage
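When the desired number of clusters is given, k-means is one standard way to look for "best" clusters. The slides do not name an algorithm, so this is an illustrative choice; the (age, income) points and the initial centers are hypothetical.

```python
def kmeans(points, centers, rounds=10):
    """Plain k-means on 2-D points; the initial centers are supplied by the caller."""
    for _ in range(rounds):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical (age, income in $1000s) points forming two obvious groups.
points = [(25, 30), (27, 35), (30, 32), (55, 90), (60, 95), (58, 88)]
centers, clusters = kmeans(points, centers=[(25, 30), (60, 95)])
```

The result depends on the number of clusters and on the starting centers, and nothing in the procedure guarantees that the groups it finds are semantically meaningful.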
Association Rule Mining
Sales records (market-basket data):

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend 1) Products p5, p8 often bought together
• Trend 2) Customer 12 likes product p9
Association Rule
• Rule: {p5, p8}, {cust12, p9}, …
• Support: number of “baskets” where these products appear
• High-support set: a set whose support is at least a given threshold s
• Problem: find all high support sets
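Support counting can be sketched directly on the six sales records. This brute-force version counts every candidate set of a given size; it uses only the products in each basket (customer ids are left out), and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Products bought in tran1..tran6 from the sales records.
BASKETS = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)

def high_support_sets(baskets, size, s):
    """Brute force: count every set of the given size that occurs in some basket."""
    counts = Counter()
    for b in baskets:
        for combo in combinations(sorted(b), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= s}

pair_support = support({"p5", "p8"}, BASKETS)
frequent_pairs = high_support_sets(BASKETS, size=2, s=2)
```

Counting every candidate set works on six baskets but grows quickly with the number of products, which motivates the pruning idea that follows.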
Association Rules
• How do we perform rule mining efficiently?
• Observation:
– If set X has support t, then each subset of X must have support at least t
• For 2-sets:
– if we need support s for {i, j}
– then each of i and j must appear in at least s baskets
• A-Priori Algorithm
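The observation for 2-sets leads to a two-pass procedure: first find the items that are frequent on their own, then count only pairs made of those items. The sketch below covers pairs only and uses the product baskets from the sales records; `apriori_pairs` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Find all pairs with support >= s, pruning with single-item counts."""
    # Pass 1: count single items; keep those appearing in at least s baskets.
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, n in item_counts.items() if n >= s}
    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(i for i in b if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= s}

BASKETS = [
    {"p2", "p5", "p8"},
    {"p5", "p8", "p11"},
    {"p1", "p9"},
    {"p5", "p8", "p11"},
    {"p2", "p9"},
    {"p9"},
]
pairs = apriori_pairs(BASKETS, s=3)
```

With s = 3 only p5, p8, and p9 survive the first pass, so the second pass never counts pairs involving p1, p2, or p11.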
CSCE-608 Course Summary
• Overview of DB and DBMS systems;
• The memory architecture;
• Indexing and hashing;
• Query processing;
• Crash recovery;
• Concurrency control;
• Transaction processing;
• Data integrity and data mining;
Indexing and Hashing
• B+ trees
– structure
– operations: search, insert, delete
• Hashing
– hash table and hash function
– operations: search, insert, delete
– extensible hashing
– linear hashing
Query Processing
• Query compiler, parse tree
• Logical query plan, physical query plan
• Disk I/O efficient algorithms
• Cost estimation of query plans
Crash Recovery
• Undo logging
• Redo logging
• Undo/redo logging
• Recovery algorithms
• Checkpoints
Concurrency Control
• Serialization
• Locking systems
• Timestamp
• Validation
Transaction processing
• Recoverability
• Handling deadlocks