1. Database: for operating the business
2. Data Warehouse: for analyzing the business
3. Data Mining: for discovery in the business
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
OLTP vs. OLAP

                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
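The contrast in the table can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical and chosen only for illustration; a real OLTP system and a real warehouse would normally be separate databases.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2022, 120.0),
        (2, "North", 2023, 80.0),
        (3, "South", 2022, 200.0),
        (4, "South", 2023, 50.0),
    ],
)

# OLTP-style unit of work: a short read/write transaction that
# touches a single record located through the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (2,))
oltp_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style unit of work: a read-only query that scans many
# records and returns a summary for decision support.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query reads and writes one row; the second scans the whole table, which is why the two workloads are usually measured by different metrics (transaction throughput versus query throughput and response time).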
Knowledge Discovery
Discovering useful patterns
What is a Data Warehouse?
Common definitions of a Data Warehouse
• A decision support database that is maintained separately from the organization's operational database
– Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data Warehouse Implementation Road Map
Extract and transform
ETL Importance
Extraction
Ensure data is:
1. Relevant
2. Useful
3. High-quality
4. Accurate
5. Accessible
Transformation
1. Anomalies exist in operational data because of inconsistent development approaches
2. Eliminates anomalies
• Cleans
• Standardizes
• Presents subject-oriented data

Online Analytical Processing: OLAP
• Extends the capabilities of query and reporting
• Enables users to view the data in complex relationships (multiple dimensions)
• Provides drill-down and roll-up
• Supports slice and dice
• Supports what-if analysis
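The ETL transformation step listed above (clean, then standardize data from inconsistently developed operational systems) might look like the following minimal sketch. The two source systems, the field names, the gender codes, and the date formats are all hypothetical assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical rows extracted from two operational systems that
# encode the same facts in inconsistent ways.
source_a = [{"cust": " Alice ", "gender": "F", "joined": "2023-01-15"}]
source_b = [{"cust": "BOB", "gender": "male", "joined": "15/02/2023"}]

GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats of the two sources


def standardize_date(text):
    """Try each known source format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(row):
    """Clean whitespace and casing, then map codes to one standard."""
    return {
        "customer": row["cust"].strip().title(),
        "gender": GENDER_CODES[row["gender"].strip().lower()],
        "joined": standardize_date(row["joined"]),
    }


warehouse_rows = [transform(r) for r in source_a + source_b]
print(warehouse_rows)
```

After the transformation both rows share one name style, one gender code set, and one date format, so they can be loaded into the same warehouse table.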
Data Warehouse Architecture
Figure 1. Basic Architecture
Figure 2. Architecture with a Staging Area
Business Intelligence
Data Warehouse Data Transformation Services
Fact constellations
Star Schema
Snowflake
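As a rough illustration of the star schema named above, the sketch below builds one fact table surrounded by two dimension tables in an in-memory SQLite database. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes (hypothetical).
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        time_id    INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_time VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
    INSERT INTO dim_product VALUES
        (10, 'Laptop', 'Computers'), (11, 'Mouse', 'Accessories');
    INSERT INTO fact_sales VALUES
        (1, 10, 2, 2000.0), (1, 11, 5, 100.0), (2, 10, 1, 1000.0);
    """
)

# A typical star-schema query: join the fact table to a dimension
# and aggregate a measure by one of the dimension's attributes.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
print(revenue_by_category)
```

In a snowflake schema, `category` would be normalized out of `dim_product` into its own table; in a fact constellation, a second fact table (for example, shipments) would share the same dimension tables.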
Typical OLAP Operations
• Roll up (drill‐up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll‐up
– from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back‐end relational tables (using SQL)
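The operations above can be sketched on a tiny in-memory cube. The fact rows, the dimension names, and the helper functions `aggregate` and `select` are hypothetical, written only to show what each operation does to the data.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, city, product, amount).
facts = [
    (2023, "Q1", "Bangkok", "TV", 100),
    (2023, "Q1", "Chiang Mai", "TV", 40),
    (2023, "Q2", "Bangkok", "Phone", 70),
    (2024, "Q1", "Bangkok", "TV", 90),
]
DIMS = ("year", "quarter", "city", "product")


def aggregate(rows, keep):
    """Sum the amounts, grouped by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(rows, **conditions):
    """Keep rows whose dimension values fall in every allowed set."""
    return [
        r
        for r in rows
        if all(r[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    ]


drill_down = aggregate(facts, ("year", "quarter"))  # finer level of the time hierarchy
roll_up = aggregate(facts, ("year",))               # climb quarter -> year
slice_2023 = select(facts, year={2023})             # slice: fix one dimension
dice = select(facts, year={2023}, city={"Bangkok"}) # dice: restrict several dimensions
pivot = aggregate(facts, ("product", "year"))       # same cells, axes reoriented

print(roll_up)
print(drill_down)
```

Roll-up and drill-down differ only in how many levels of the hierarchy are kept in the grouping key, which is why one is described as the reverse of the other.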
Ten Common Mistakes
1. Starting with the wrong sponsors
   a. Data warehousing manager
   b. Executive sponsor with a great deal of money
   c. Project "driver"
   a. Has already earned the respect of the other executives
   b. Has a healthy skepticism about technology
   c. Is decisive but flexible
2. Setting unrealistic expectations that can't be met
   a. Data warehousing has two phases:
      Selling phase: persuade people
      Struggle phase: meet the expectations
   b. Frustrates executives at the moment of truth
3. Promoting the wrong value of the data warehouse
   a. Engaging in politically naive behavior
      a. "Help managers make better decisions"
   b. Lose potential supporters
4. Loading the data warehouse with unnecessary information
   a. Sends a list of tables and data elements to the end user along with a request
   b. Gets back long lists of unnecessary information
   c. Slows responsiveness and increases the data warehouse storage requirements
5. Data warehouse database design vs. transactional database design
   a. Transaction processing:
      - A programmer develops a query that will be used many times
      - Usually contains only the basic data
   b. Data warehousing:
      - An end user develops the query and may use it only one time
      - Users expect to find aggregates (sums, averages, trends, time-series information) already calculated for them and ready for immediate display
6. Data warehousing manager: technology-oriented rather than user-oriented
   a. A user-hostile project manager puts the entire project in danger of being scrapped
   b. Data warehousing is a service business, not a storage business
   c. Don't make clients angry!
7. Too much emphasis on traditional internal record-oriented data
   a. Senior executives see data warehouses as irrelevant
   b. Consider including images, graphics, audio, video, etc.
8. Delivering data with overlapping and confusing definitions
   a. Finance manager: sales means revenue less returns
   b. Distribution people: sales means what needs to be delivered
   c. Salesperson: sales means the amount committed by clients
9. Performance, capacity, and scalability
   a. Within 4 months, purchase at least one additional processor equal to or larger than the current computer
   b. Budget for additional hardware
   c. Budget for unforeseen difficulties
   d. Network overloads are very common
10. Believing that once the data warehouse is up and running, your problems are finished
   a. The data warehousing project team needs to maintain high energy over long periods of time
   b. Data warehousing is a journey, not a destination
Data Mining
CRISP-DM workflow cycle
The data mining workflow cycle consists of the following six main steps:
1. Understand the business, in order to identify business opportunities or the problems the business faces.
2. Understand the data and its sources, in order to define the scope of the data to be analyzed to solve the problem.
3. Transform the raw data into a form that can actually be used by the business.
4. Apply data mining techniques to the data to discover relationships and patterns.
5. Measure model performance: evaluate the data mining technique to be used based on its results, which can be checked in several ways.
6. Deploy the evaluated model in the actual business.

Data Mining Process
Problem formulation
Data Selection
Data Cleaning
Data Transformation
Data Mining
Result evaluation and Visualization
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
• Intrinsic DQ: Accuracy, objectivity, believability, and reputation.
• Accessibility DQ: Accessibility and access security.
• Contextual DQ: Relevancy, value added, timeliness, completeness, amount of data.
• Representation DQ: Interpretability, ease of understanding, concise representation, consistent representation.

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
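Two of the data cleaning tasks listed above, filling in missing values and identifying outliers, can be sketched with the standard library. The sample values, the choice of the median as the fill value, and the 1.5 × IQR outlier rule are assumptions made for this example; the slides do not prescribe a particular method.

```python
from statistics import median, quantiles

# Hypothetical attribute values; None marks a missing value.
ages = [23, 25, None, 27, 24, 26, 95, None, 25]

# Fill in missing values with the median of the observed values
# (the median is used here because it is less sensitive to outliers).
observed = [a for a in ages if a is not None]
fill_value = median(observed)
filled = [fill_value if a is None else a for a in ages]

# Identify outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in observed if a < low or a > high]

print(filled)
print(outliers)
```

Whether a flagged value such as 95 is removed, smoothed, or kept is a separate decision that depends on the data and the analysis.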
Major Tasks in Data Preprocessing
Data integration
Data cleaning
Data transformation: -5, 32, 100, 59, 45 → -0.005, 0.032, 0.100, 0.059, 0.045
Data reduction: attributes A1, A2, A3, ..., A226 and transactions T1, T2, ..., T2000 are reduced to attributes A1, A2, ..., A105 and transactions T1, T2, ..., T459

Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct inconsistent data
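The data transformation example in the figure (-5, 32, 100, 59, 45 becoming -0.005, 0.032, 0.100, 0.059, 0.045) is consistent with normalization by decimal scaling, although the slide does not name the method. Under that assumption, a minimal sketch is shown below; the function name `decimal_scaling` is hypothetical.

```python
def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


scaled = decimal_scaling([-5, 32, 100, 59, 45])
print(scaled)
```

Because the largest absolute value is 100, the divisor is 10^3, which matches the values shown in the figure.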
Data Mining Strategies
Predictive or Supervised Modeling: Classification, Prediction (Estimation/Regression)
Descriptive or Unsupervised Modeling: Associations, Clustering
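As a small illustration of the descriptive (unsupervised) side, the sketch below computes the two standard measures used to judge an association rule, support and confidence. The transactions and item names are hypothetical, and no class label is involved.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


sup = support({"bread", "milk"})           # 2 of 4 transactions
conf = confidence({"bread"}, {"milk"})     # 2 of the 3 bread transactions
print(sup, conf)
```

A predictive (supervised) method would instead need a labeled target column, such as the class attribute in a classification task.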
Supervised
Goal: learn the pattern of people who cheat on their taxes.
[Figure: a training dataset (columns ID, Refund, ..., Income, Cheat) is passed to a learning classifier to produce a model. The model is applied to a testing dataset and to new cases whose Cheat value is unknown (??) to output a predicted class (Yes/No).]
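The supervised workflow in the figure (learn a model from a training dataset, check it on a testing dataset, then predict the class of a new case) can be sketched as follows. The records are hypothetical, and the "classifier" is a deliberately simple majority-vote lookup table chosen for brevity, not a method taken from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records: (refund, income_band, cheat).
training = [
    ("Yes", "high", "No"),
    ("No", "medium", "No"),
    ("No", "low", "Yes"),
    ("Yes", "medium", "No"),
    ("No", "low", "Yes"),
    ("No", "high", "No"),
]
testing = [
    ("No", "low", "Yes"),
    ("Yes", "high", "No"),
    ("No", "medium", "Yes"),
]


def learn(rows):
    """Learning step: store the majority class for each (refund, income_band) pair."""
    votes = defaultdict(Counter)
    for refund, income, label in rows:
        votes[(refund, income)][label] += 1
    # Fall back to the overall majority class for unseen combinations.
    default = Counter(label for *_, label in rows).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda refund, income: table.get((refund, income), default)


model = learn(training)

# Testing step: compare predicted classes with the known labels.
accuracy = sum(model(r, i) == y for r, i, y in testing) / len(testing)

# New case: the class is unknown, so the model supplies a prediction.
new_case_prediction = model("No", "low")
print(accuracy, new_case_prediction)
```

Keeping the testing records separate from the training records is what makes the accuracy figure an estimate of how the model behaves on cases it has not seen.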