Data warehouse,data mining & Big Data


Page 1: Data warehouse,data mining & Big Data

1

Data Warehouse, Data Mart, OLTP, OLAP, Data Mining & Big Data

Page 2: Data warehouse,data mining & Big Data

2

Part 1: Data Warehouses
Part 2: OLAP
Part 3: Data Mining
Part 4: Big Data

Overview

Page 3: Data warehouse,data mining & Big Data

3

Part 1: Data Warehouses

Page 4: Data warehouse,data mining & Big Data

4

Data, data everywhere, yet ...

I can’t find the data I need
◦ data is scattered over the network
◦ many versions, subtle differences

I can’t get the data I need
◦ need an expert to get the data

I can’t understand the data I found
◦ available data poorly documented

I can’t use the data I found
◦ results are unexpected
◦ data needs to be transformed from one form to another

Page 5: Data warehouse,data mining & Big Data

5

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Page 6: Data warehouse,data mining & Big Data

6

Why Data Warehousing?

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?

Page 7: Data warehouse,data mining & Big Data

7

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Used by managers and end-users to understand the business and make judgements

Decision Support

Page 8: Data warehouse,data mining & Big Data

8

Since the 1970s, organizations have gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.

This resulted in accumulation of growing amounts of data in operational databases.

The Evolution of Data Warehousing

Page 9: Data warehouse,data mining & Big Data

9

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).

Data Warehousing Concepts

Page 10: Data warehouse,data mining & Big Data

10

The warehouse is organized around the major subjects of the enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and product sales).

This is reflected in the need to store decision-support data rather than application-oriented data.

Subject-oriented Data

Page 11: Data warehouse,data mining & Big Data

11

The data warehouse integrates corporate application-oriented data from different source systems, which often includes data that is inconsistent.

The integrated data source must be made consistent to present a unified view of the data to the users.

Integrated Data

Page 12: Data warehouse,data mining & Big Data

12

Data in the warehouse is only accurate and valid at some point in time or over some time interval.

Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

Time-variant Data

Page 13: Data warehouse,data mining & Big Data

13

Data in the warehouse is not updated in real-time but is refreshed from operational systems on a regular basis.

New data is always added as a supplement to the database, rather than a replacement.

Non-volatile Data

Page 14: Data warehouse,data mining & Big Data

14

Potential high returns on investment

Competitive advantage

Increased productivity of corporate decision-makers

Benefits of Data Warehousing

Page 15: Data warehouse,data mining & Big Data

15

Comparison of OLTP Systems and Data Warehousing

Page 16: Data warehouse,data mining & Big Data

16

The types of queries that a data warehouse is expected to answer range from the relatively simple to the highly complex and depend on the type of end-user access tools used.

End-user access tools include:
◦ Reporting, query, and application development tools
◦ Executive information systems (EIS)
◦ OLAP tools
◦ Data mining tools

Data Warehouse Queries

Page 17: Data warehouse,data mining & Big Data

17

What was the total revenue for Scotland in the third quarter of 2004?

What was the total revenue for property sales for each type of property in Great Britain in 2003?

What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?

What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?

What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?

Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?

What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?

Examples of Typical Data Warehouse Queries
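
As a rough illustration, the first query above (total revenue for Scotland in the third quarter of 2004) can be sketched in SQL against a tiny in-memory table. The table name `property_sale`, its columns, and all figures are hypothetical; the slides do not define a schema.

```python
import sqlite3

# Hypothetical fact table; names and figures are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE property_sale (region TEXT, sale_date TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO property_sale VALUES (?, ?, ?)",
    [
        ("Scotland", "2004-07-15", 120000.0),
        ("Scotland", "2004-09-30", 80000.0),
        ("Scotland", "2004-10-01", 50000.0),  # fourth quarter, excluded
        ("England", "2004-08-20", 200000.0),  # other region, excluded
    ],
)


def total_revenue(region, start, end):
    """Sum revenue for one region over the half-open date range [start, end)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM property_sale "
        "WHERE region = ? AND sale_date >= ? AND sale_date < ?",
        (region, start, end),
    ).fetchone()
    return row[0]


# ISO-formatted dates compare correctly as plain strings.
scotland_q3_2004 = total_revenue("Scotland", "2004-07-01", "2004-10-01")
print(scotland_q3_2004)
```

The later queries in the list (rankings per city, rolling 12-month comparisons, what-if analysis) need grouping, window calculations, or modeling on top of this pattern, which is where OLAP and data mining tools come in.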

Page 18: Data warehouse,data mining & Big Data

18

Underestimation of resources for data loading

Hidden problems with source systems

Required data not captured

Increased end-user demands

Data homogenization

Problems of Data Warehousing

Page 19: Data warehouse,data mining & Big Data

19

High demand for resources

Data ownership

High maintenance

Long duration projects

Complexity of integration

Problems of Data Warehousing

Page 20: Data warehouse,data mining & Big Data

20

Typical Architecture of a Data Warehouse

Page 21: Data warehouse,data mining & Big Data

21

A subset of a data warehouse that supports the requirements of a particular department or business function.

Characteristics include:
◦ Focuses on only the requirements of one department or business function.
◦ Does not normally contain detailed operational data, unlike data warehouses.
◦ More easily understood and navigated.

Data Mart

Page 22: Data warehouse,data mining & Big Data

22

To give users access to the data they need to analyze most often.

To provide data in a form that matches the collective view of the data by a group of users in a department or business function area.

To improve end-user response time due to the reduction in the volume of data to be accessed.

Reasons for Creating a Data Mart

Page 23: Data warehouse,data mining & Big Data

23

To provide appropriately structured data as dictated by the requirements of the end-user access tools.

Building a data mart is simpler compared with establishing a corporate data warehouse.

The cost of implementing data marts is normally less than that required to establish a data warehouse.

Reasons for Creating a Data Mart

Page 24: Data warehouse,data mining & Big Data

24

The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.

Reasons for Creating a Data Mart

Page 25: Data warehouse,data mining & Big Data

25

From the Data Warehouse to Data Marts

[Figure: the data warehouse (organizationally structured), followed by departmentally structured and individually structured data marts. Scale labels: Less / More (history, normalized, detailed); Data / Information.]

Page 26: Data warehouse,data mining & Big Data

26

Part 2: OLAP

Page 27: Data warehouse,data mining & Big Data

27

Aggregation -- (total sales, percent-to-total)

Comparison -- Budget vs. Expenses

Ranking -- Top 10, quartile analysis

Access to detailed and aggregate data

Complex criteria specification

Visualization

Need interactive response to aggregate queries

Nature of OLAP Analysis

Page 28: Data warehouse,data mining & Big Data

28

Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.

There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining.

Business Intelligence Technologies OLAP & Data Mining

Page 29: Data warehouse,data mining & Big Data

29

OLAP and Data Mining differ in what they offer the user and because of this they are complementary technologies.

An environment that includes a data warehouse (or more commonly one or more data marts) together with tools such as OLAP and/or data mining is collectively referred to as Business Intelligence (BI) technologies.

Business Intelligence Technologies

Page 30: Data warehouse,data mining & Big Data

30

The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data, Codd (1993).

Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.

Online Analytical Processing (OLAP)

Page 31: Data warehouse,data mining & Big Data

31

Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.

Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise.

Online Analytical Processing (OLAP)

Page 32: Data warehouse,data mining & Big Data

32

Can easily answer ‘who?’ and ‘what?’ questions; however, the ability to answer ‘what if?’ and ‘why?’ type questions distinguishes OLAP from general-purpose query tools.

Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling.

Online Analytical Processing (OLAP)

Page 33: Data warehouse,data mining & Big Data

33

Examples of OLAP applications in various functional areas

Page 34: Data warehouse,data mining & Big Data

34

Although OLAP applications are found in widely divergent functional areas, they all have the following key features:
◦ multi-dimensional views of data
◦ support for complex calculations
◦ time intelligence

OLAP Applications

Page 35: Data warehouse,data mining & Big Data

35

Must provide a range of powerful computational methods such as those required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.

OLAP Applications - support for complex calculations
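
The two trend calculations named above can be sketched in a few lines. This is a minimal illustration with invented sales figures, not how any particular OLAP product implements them; the function names are hypothetical.

```python
def moving_average(values, window):
    """Return the simple moving average for each full window of `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def percentage_growth(values):
    """Return period-over-period growth in percent.

    Assumes every prior-period value is non-zero.
    """
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(values, values[1:])
    ]


monthly_sales = [100.0, 110.0, 121.0, 133.1]  # invented figures
smoothed = moving_average(monthly_sales, window=3)
growth = percentage_growth(monthly_sales)
print(smoothed, growth)
```

With these figures each month grows by about 10% over the previous one, and the three-month moving average smooths the series into two values.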

Page 36: Data warehouse,data mining & Big Data

36

Key feature of almost any analytical application as performance is almost always judged over time.

Time hierarchy is not always used in the same manner as other hierarchies.

Concepts such as year-to-date and period-over-period comparisons should be easily defined.

OLAP Applications – time intelligence
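
A minimal sketch of the two time-intelligence concepts mentioned above, year-to-date and period-over-period comparison. The revenue figures and function names are invented for illustration.

```python
from itertools import accumulate

# Invented monthly revenue keyed by (year, month).
revenue = {
    (2003, 1): 90.0, (2003, 2): 95.0, (2003, 3): 100.0,
    (2004, 1): 100.0, (2004, 2): 105.0, (2004, 3): 120.0,
}


def year_to_date(data, year):
    """Return {month: running total since January} for one year."""
    months = sorted(m for (y, m) in data if y == year)
    totals = accumulate(data[(year, m)] for m in months)
    return dict(zip(months, totals))


def year_over_year(data, year, month):
    """Change versus the same month of the previous year, in percent."""
    prev = data[(year - 1, month)]
    return (data[(year, month)] - prev) / prev * 100.0


ytd_2004 = year_to_date(revenue, 2004)
march_change = year_over_year(revenue, 2004, 3)
print(ytd_2004, march_change)
```

Both functions depend on the calendar ordering of periods, which is one reason the time hierarchy is treated differently from other dimensions.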

Page 37: Data warehouse,data mining & Big Data

37

Increased productivity of end-users.

Reduced backlog of applications development for IT staff.

Retention of organizational control over the integrity of corporate data.

Reduced query drag and network traffic on OLTP systems or on the data warehouse.

Improved potential revenue and profitability.

OLAP Benefits

Page 38: Data warehouse,data mining & Big Data

38

Example of two-dimensional query.
◦ ‘What is the total revenue generated by property sales in each city, in each quarter of 2004?’

Choice of representation is based on types of queries end-user may ask.

Compare representation - three-field relational table versus two-dimensional matrix.

Representation of Multi-dimensional Data

Page 39: Data warehouse,data mining & Big Data

39

Multi-dimensional Data as Three-field table versus Two-dimensional Matrix
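
The two representations can be compared with a small sketch: the same revenue facts held as three-field rows (city, quarter, revenue) and pivoted into a city-by-quarter matrix. The figures are invented and `to_matrix` is a hypothetical helper.

```python
# Three-field relational representation: (city, quarter, revenue).
rows = [
    ("Glasgow", "Q1", 100), ("Glasgow", "Q2", 120),
    ("London", "Q1", 300), ("London", "Q2", 310),
    ("Aberdeen", "Q1", 50), ("Aberdeen", "Q2", 60),
]


def to_matrix(rows):
    """Pivot three-field rows into a matrix with one row per city
    and one column per quarter. Missing combinations become 0."""
    cities = sorted({city for city, _, _ in rows})
    quarters = sorted({quarter for _, quarter, _ in rows})
    cell = {(city, quarter): rev for city, quarter, rev in rows}
    matrix = [[cell.get((c, q), 0) for q in quarters] for c in cities]
    return cities, quarters, matrix


cities, quarters, matrix = to_matrix(rows)
for city, line in zip(cities, matrix):
    print(city, line)
```

In the matrix form, a question such as "revenue for London in Q2" is a direct cell lookup, while the table form needs a scan or an index.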

Page 40: Data warehouse,data mining & Big Data

40

Example of three-dimensional query.
◦ ‘What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 2004?’

Compare representation - four-field relational table versus three-dimensional cube.

Representation of Multi-dimensional Data

Page 41: Data warehouse,data mining & Big Data

41

Multi-dimensional Data as Four-field Table versus Three-dimensional Cube

Page 42: Data warehouse,data mining & Big Data

42

Cube represents data as cells in an array.

Relational table only represents multi-dimensional data in two dimensions.

Representation of Multi-dimensional Data
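
One simple way to picture a cube as cells is a mapping from coordinates to a measure. The sketch below uses a plain dictionary keyed by (property type, city, quarter) with invented values; real OLAP servers use dense or compressed array storage, so treat this only as a conceptual model.

```python
from collections import defaultdict

# Each cell: (property_type, city, quarter) -> revenue. Values are invented.
cube = {
    ("Flat", "Glasgow", "Q1"): 10,
    ("House", "Glasgow", "Q1"): 20,
    ("Flat", "London", "Q1"): 30,
    ("House", "London", "Q2"): 40,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a value and return the remaining cells."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


def roll_up(cube, axis):
    """Aggregate away one dimension by summing over it."""
    totals = defaultdict(int)
    for key, measure in cube.items():
        totals[key[:axis] + key[axis + 1:]] += measure
    return dict(totals)


flats = slice_cube(cube, axis=0, value="Flat")
by_city_quarter = roll_up(cube, axis=0)
print(flats)
print(by_city_quarter)
```

Slicing on property type gives a two-dimensional city-by-quarter view, and rolling up the same dimension reproduces the two-dimensional query from the earlier slide.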

Page 43: Data warehouse,data mining & Big Data

43

Multi-dimensional Data

Measure - sales (actual, plan, variance)

Dimensions: Product, Region, Time

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7).]

Hierarchical summarization paths

Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month / Week > Day

Page 44: Data warehouse,data mining & Big Data

44

Strengths of OLAP

It is a powerful visualization tool

It provides fast, interactive response times

It is good for analyzing time series

It can be useful for finding some clusters and outliers

Many vendors offer OLAP tools

Page 45: Data warehouse,data mining & Big Data

45

Andyne Computing -- Pablo

Arbor Software -- Essbase

Cognos -- PowerPlay

Comshare -- Commander OLAP

Holistic Systems -- Holos

Information Advantage -- AXSYS, WebOLAP

Informix -- Metacube

Microstrategies -- DSS/Agent

Oracle -- Express

Pilot -- LightShip

Planning Sciences -- Gentium

Platinum Technology -- ProdeaBeacon, Forest & Trees

SAS Institute -- SAS/EIS, OLAP++

Speedware -- Media

OLAP and Executive Information Systems

Page 46: Data warehouse,data mining & Big Data

46

Part 3: Data Mining

Page 47: Data warehouse,data mining & Big Data

47

The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).

Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Data Mining

Page 48: Data warehouse,data mining & Big Data

48

Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.

Patterns and relationships are identified by examining the underlying rules and features in the data.

Data Mining

Page 49: Data warehouse,data mining & Big Data

49

Most accurate results normally require large volumes of data to deliver reliable conclusions.

Starts by developing an optimal representation of structure of sample data

Data Mining

Page 50: Data warehouse,data mining & Big Data

50

Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.

Relatively new technology, but already used in a number of industries.

Data Mining

Page 51: Data warehouse,data mining & Big Data

51

Retail / Marketing
◦ Identifying buying patterns of customers
◦ Finding associations among customer demographic characteristics
◦ Predicting response to mailing campaigns
◦ Market basket analysis

Examples of Applications of Data Mining

Page 52: Data warehouse,data mining & Big Data

52

Banking
◦ Detecting patterns of fraudulent credit card use
◦ Identifying loyal customers
◦ Predicting customers likely to change their credit card affiliation
◦ Determining credit card spending by customer groups

Examples of Applications of Data Mining

Page 53: Data warehouse,data mining & Big Data

53

Insurance
◦ Claims analysis
◦ Predicting which customers will buy new policies

Medicine
◦ Characterizing patient behavior to predict surgery visits
◦ Identifying successful medical therapies for different illnesses

Examples of Applications of Data Mining

Page 54: Data warehouse,data mining & Big Data

54

Four main operations include:
◦ Predictive modeling
◦ Database segmentation
◦ Link analysis
◦ Deviation detection

There are recognized associations between the applications and the corresponding operations.
◦ e.g. Direct marketing strategies use database segmentation.

Data Mining Operations

Page 55: Data warehouse,data mining & Big Data

55

Techniques are specific implementations of the data mining operations.

Each operation has its own strengths and weaknesses.

Data Mining Techniques

Page 56: Data warehouse,data mining & Big Data

56

Data Mining Operations and Associated Techniques

Page 57: Data warehouse,data mining & Big Data

57

Similar to the human learning experience
◦ uses observations to form a model of the important characteristics of some phenomenon.

Uses generalizations of ‘real world’ and ability to fit new data into a general framework.

Can analyze a database to determine essential characteristics (model) about the data set.

Predictive Modeling

Page 58: Data warehouse,data mining & Big Data

58

Model is developed using a supervised learning approach, which has two phases: training and testing.
◦ Training builds a model using a large sample of historical data called a training set.
◦ Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.

Predictive Modeling
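
The two phases can be illustrated with a deliberately tiny classifier: a single-threshold rule (a "decision stump") learned from a training set and then scored on unseen records. The data, the feature (years spent renting), and the function names are all invented; real tools use far larger samples and richer models.

```python
def train_stump(training_set):
    """Training phase: pick the threshold on one numeric feature that
    classifies the most training records correctly (predict True if x >= threshold)."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in training_set}):
        correct = sum((x >= threshold) == label for x, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, test_set):
    """Testing phase: share of previously unseen records classified correctly."""
    hits = sum((x >= threshold) == label for x, label in test_set)
    return hits / len(test_set)


# Invented records: (years renting, later bought a property?)
training_set = [(0.5, False), (1.0, False), (1.5, False),
                (2.5, True), (3.0, True), (4.0, True)]
test_set = [(0.8, False), (2.0, True), (3.5, True), (1.2, False)]

threshold = train_stump(training_set)
test_accuracy = accuracy(threshold, test_set)
print(threshold, test_accuracy)
```

The model fits the training set perfectly but misclassifies one of the four test records, which is why accuracy is measured on data the model has not seen.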

Page 59: Data warehouse,data mining & Big Data

59

Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.

There are two techniques associated with predictive modeling: classification and value prediction, which are distinguished by the nature of the variable being predicted.

Predictive Modeling

Page 60: Data warehouse,data mining & Big Data

60

Example of Classification using Tree Induction

Page 61: Data warehouse,data mining & Big Data

61

Used to estimate a continuous numeric value that is associated with a database record.

Uses the traditional statistical techniques of linear regression and nonlinear regression.

Relatively easy-to-use and understand.

Predictive Modeling - Value Prediction

Page 62: Data warehouse,data mining & Big Data

62

Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.

The problem is that the technique only works well with linear data and is sensitive to the presence of outliers (that is, data values that do not conform to the expected norm).

Predictive Modeling - Value Prediction
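
The sensitivity to outliers is easy to demonstrate with an ordinary least squares fit written out by hand. The points below are invented: four lie exactly on y = 2x, and a single outlier is then added.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


clean = [(1, 2), (2, 4), (3, 6), (4, 8)]  # exactly y = 2x
with_outlier = clean + [(5, 40)]          # one invented outlier

clean_fit = fit_line(clean)
outlier_fit = fit_line(with_outlier)
print(clean_fit, outlier_fit)
```

One extra point moves the fitted slope from 2 to 8, so the line no longer describes the other four observations well.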

Page 63: Data warehouse,data mining & Big Data

63

Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.

Applications of value prediction include credit card fraud detection and target mailing list identification.

Predictive Modeling - Value Prediction

Page 64: Data warehouse,data mining & Big Data

64

Aim is to partition a database into an unknown number of segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.

Database Segmentation
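
As a hedged sketch of unsupervised segmentation, the code below runs a one-dimensional k-means (Lloyd's algorithm) on invented customer spending figures. Note the assumption: k-means needs the number of clusters and starting centroids to be supplied, whereas the slide describes an unknown number of segments, so this is only one possible technique and not necessarily the one the slides have in mind.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


annual_spend = [10, 12, 11, 50, 52, 49]  # invented figures
centroids, clusters = k_means_1d(annual_spend, centroids=[0, 100])
print(centroids, clusters)
```

No labels are supplied; the two groups of similar records emerge from the data alone, which is the sense in which the learning is unsupervised.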

Page 65: Data warehouse,data mining & Big Data

65

Less precise than other operations thus less sensitive to redundant and irrelevant features.

Applications of database segmentation include customer profiling, direct marketing, and cross selling.

Database Segmentation

Page 66: Data warehouse,data mining & Big Data

66

Example of Database Segmentation using a Scatterplot

Page 67: Data warehouse,data mining & Big Data

67

Aims to establish links (associations) between records, or sets of records, in a database.

There are three specializations:
◦ Associations discovery
◦ Sequential pattern discovery
◦ Similar time sequence discovery

Applications include product affinity analysis, direct marketing, and stock price movement.

Link Analysis

Page 68: Data warehouse,data mining & Big Data

68

Finds items that imply the presence of other items in the same event.

Affinities between items are represented by association rules.
◦ e.g. ‘When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties’.

Link Analysis - Associations Discovery
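
The two percentages in an association rule are usually called confidence and support. The sketch below computes both for a rule "antecedent implies consequent" using the common textbook definitions, which are an assumption here because the slides do not spell them out: support is the share of all records containing both sides, and confidence is the share of records containing the antecedent that also contain the consequent. The basket data is invented.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / len(records)
    # Confidence is undefined when the antecedent never occurs; report 0.0.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Invented market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence = rule_metrics(baskets, {"bread"}, {"milk"})
print(support, confidence)
```

Here three of five baskets contain both items (support 0.6), and three of the four baskets with bread also contain milk (confidence 0.75).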

Page 69: Data warehouse,data mining & Big Data

69

Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time.
◦ e.g. Used to understand long term customer buying behavior.

Link Analysis - Sequential Pattern Discovery

Page 70: Data warehouse,data mining & Big Data

70

Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.
◦ e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.

Link Analysis - Similar Time Sequence Discovery

Page 71: Data warehouse,data mining & Big Data

71

Relatively new operation in terms of commercially available data mining tools.

Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm.

Deviation Detection

Page 72: Data warehouse,data mining & Big Data

72

Can be performed using statistics and visualization techniques or as a by-product of data mining.

Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.

Deviation Detection
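
A simple statistical form of deviation detection flags values that lie far from the mean, measured in standard deviations (a z-score). This is a minimal sketch on invented claim amounts; the threshold of 2.0 is an arbitrary assumption, and a single extreme value inflates the standard deviation, so robust measures are often preferred in practice.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


claim_amounts = [10, 11, 9, 10, 12, 10, 50]  # invented figures
outliers = find_deviations(claim_amounts)
print(outliers)
```

Only the claim of 50 is flagged; whether it is fraud, a data entry error, or a genuine large claim still has to be investigated.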

Page 73: Data warehouse,data mining & Big Data

73

Example of Database Segmentation using a Visualization

Page 74: Data warehouse,data mining & Big Data

74

Introduction to Big Data

What is Big Data?

What makes data “Big” Data?

Page 75: Data warehouse,data mining & Big Data

75

Big Data Definition

No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Page 76: Data warehouse,data mining & Big Data

76

Characteristics of Big Data: 1-Scale (Volume)

Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 zettabytes

Data volume is increasing exponentially

Exponential increase in collected/generated data
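
The figures on this slide can be checked with a little arithmetic. The sketch assumes constant compound growth between 2009 and 2020, which is a simplification introduced here, not a claim made by the slides.

```python
import math

start_zb, end_zb = 0.8, 35.0  # zettabytes, from the slide
years = 2020 - 2009           # 11 years

growth_factor = end_zb / start_zb                 # overall multiple
annual_rate = growth_factor ** (1 / years) - 1    # implied yearly growth
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"{growth_factor:.2f}x overall")
print(f"{annual_rate:.1%} per year")
print(f"doubles about every {doubling_years:.1f} years")
```

The ratio is 43.75, which the slide rounds to 44x. Under the constant-growth assumption this works out to roughly 41% growth per year, or a doubling about every two years.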

Page 77: Data warehouse,data mining & Big Data

77

Characteristics of Big Data: 2-Complexity (Variety)

Various formats, types, and structures

Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…

Static data vs. streaming data

A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together

Page 78: Data warehouse,data mining & Big Data

78

Characteristics of Big Data: 3-Speed (Velocity)

Data is being generated fast and needs to be processed fast

Online Data Analytics

Late decisions mean missed opportunities

Examples
◦ E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
◦ Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction

Page 79: Data warehouse,data mining & Big Data

79

Big Data: 3V’s

Page 80: Data warehouse,data mining & Big Data

80

Some Make it 4V’s

Page 81: Data warehouse,data mining & Big Data

81

Harnessing Big Data

OLTP: Online Transaction Processing (DBMSs)

OLAP: Online Analytical Processing (Data Warehousing)

RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Page 82: Data warehouse,data mining & Big Data

82

Who’s Generating Big Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Page 83: Data warehouse,data mining & Big Data

83

The Model Has Changed…

The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

Page 84: Data warehouse,data mining & Big Data

84

What’s driving Big Data

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

Page 85: Data warehouse,data mining & Big Data

85

Value of Big Data Analytics

Big data is more real-time in nature than traditional DW applications

Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps

Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
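
The shared-nothing idea can be sketched on one machine: records are hash-partitioned so that each "node" owns a disjoint set of keys, each node aggregates only its own shard, and the partial results are merged. The node count, data, and function names are invented, and the shards are processed in a plain loop here, where a real system would run them in parallel on separate machines.

```python
import zlib
from collections import Counter


def partition(records, nodes):
    """Hash-partition (key, value) records so each node owns disjoint keys."""
    shards = [[] for _ in range(nodes)]
    for key, value in records:
        shards[zlib.crc32(key.encode()) % nodes].append((key, value))
    return shards


def local_aggregate(shard):
    """Work done by one node using only its own data."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def scale_out_sum(records, nodes):
    """Merge the per-node partial sums into one result."""
    merged = Counter()
    for shard in partition(records, nodes):
        merged.update(local_aggregate(shard))  # could run on its own machine
    return dict(merged)


clicks = [("page_a", 1), ("page_b", 2), ("page_a", 3), ("page_c", 4)]
single_node = scale_out_sum(clicks, nodes=1)
four_nodes = scale_out_sum(clicks, nodes=4)
print(four_nodes)
```

Because no node needs another node's data to do its part, adding nodes spreads the work without changing the answer.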

Page 86: Data warehouse,data mining & Big Data

86

Challenges in Handling Big Data

The bottleneck is in technology
◦ New architecture, algorithms, techniques are needed

Also in technical skills
◦ Experts in using the new technology and dealing with big data

Page 87: Data warehouse,data mining & Big Data

87

What Technology Do We Have For Big Data?

Page 88: Data warehouse,data mining & Big Data

88

Page 89: Data warehouse,data mining & Big Data

89

Big Data Technology