Data-Intensive Distributed Computingcs451/slides/big... · 2020-02-13 · OLAP (online analytical...


Data-Intensive Distributed Computing

Part 5: Analyzing Relational Data (1/3)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 431/631 451/651 (Winter 2020)

Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451


Structure of the Course

“Core” framework features and algorithm design

Analyzing Text

Analyzing Graphs

Analyzing Relational Data

Data Mining


Evolution of Enterprise Architectures

Next two sessions: techniques, algorithms, and optimizations for relational processing

[Diagram: users, Monolithic Application]

[Diagram: users, Frontend, Backend]

Edgar F. Codd

• Inventor of the relational model for DBs

• SQL was created based on his work

• Turing award winner in 1981

[Diagram: users, Frontend, Backend, database]

Business Intelligence

An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example through market analysis, strategic planning, and decision making.

[Diagram: users, Frontend, Backend, database; analysts, BI tools]

[Diagram: users, Frontend, Backend, database; analysts, BI tools]

Why is my application so slow?

Why does my analysis take so long?


Database Workloads

OLTP (online transaction processing)

Typical applications: e-commerce, banking, airline reservations

User facing: real-time, low latency, highly concurrent

Tasks: relatively small set of “standard” transactional queries

Data access pattern: random reads, updates, writes (small amounts of data)

OLAP (online analytical processing)

Typical applications: business intelligence, data mining

Back-end processing: batch workloads, less concurrency

Tasks: complex analytical queries, often ad hoc

Data access pattern: table scans, large amounts of data per query
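
The two access patterns can be contrasted with a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical and not taken from the slides. The OLTP-style function touches a single row located by its primary key; the OLAP-style function scans the whole table and aggregates.

```python
import sqlite3


def build_orders_db():
    """Create a tiny in-memory orders table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    rows = [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 5.0), (4, "carol", 40.0)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def oltp_update_amount(conn, order_id, new_amount):
    """OLTP-style: a small transaction that reads and writes one row by key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE orders SET amount = ? WHERE order_id = ?", (new_amount, order_id)
        )
    row = conn.execute(
        "SELECT amount FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0]


def olap_revenue_by_customer(conn):
    """OLAP-style: scan every row and aggregate per customer."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()
```

On a table this small both calls are instant. The point of the sketch is only the shape of the work: the first query's cost depends on one row, the second on the size of the table.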


OLTP and OLAP Together?

[Diagram label: users and analysts]

Downsides of co-existing OLTP and OLAP workloads: poor memory management; conflicting data access patterns; variable latency

Solution?


Source: Wikipedia (Warehouse)

Build a data warehouse!

[Diagram: users, Frontend, Backend, OLTP database; ETL (Extract, Transform, and Load); Data Warehouse, BI tools, analysts]

OLTP database for user-facing transactions

OLAP database for data warehousing

A Simple OLTP Schema

[Schema diagram: Customer, Billing, Order, Inventory, OrderLine]

A Simple OLAP Schema

[Schema diagram: Fact_Sales, Dim_Customer, Dim_Date, Dim_Product, Dim_Store]
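
A sketch of how such a schema might be declared and queried, again with Python's sqlite3 module. Only the five table names come from the slide; every column, key, and sample row below is an assumption made for illustration. The query joins the fact table to two dimension tables and aggregates, which is the typical shape of an analytical query over a schema like this.

```python
import sqlite3

# Table names follow the slide; all columns are hypothetical.
SCHEMA = """
CREATE TABLE Dim_Customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dim_Date     (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE Dim_Product  (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Dim_Store    (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE Fact_Sales (
    customer_id INTEGER REFERENCES Dim_Customer(customer_id),
    date_id     INTEGER REFERENCES Dim_Date(date_id),
    product_id  INTEGER REFERENCES Dim_Product(product_id),
    store_id    INTEGER REFERENCES Dim_Store(store_id),
    amount      REAL
);
"""


def build_star_schema():
    """Create the tables in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Dim_Customer VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany("INSERT INTO Dim_Date VALUES (?, ?, ?)", [(1, 2019, 12), (2, 2020, 1)])
    conn.executemany("INSERT INTO Dim_Product VALUES (?, ?)", [(1, "books"), (2, "toys")])
    conn.executemany("INSERT INTO Dim_Store VALUES (?, ?)", [(1, "Waterloo"), (2, "Toronto")])
    conn.executemany(
        "INSERT INTO Fact_Sales VALUES (?, ?, ?, ?, ?)",
        [(1, 1, 1, 1, 10.0), (2, 2, 1, 2, 15.0), (1, 2, 2, 1, 7.5), (2, 2, 2, 2, 2.5)],
    )
    conn.commit()
    return conn


def sales_by_year_and_category(conn):
    """Join the fact table to two dimensions, then group and sum."""
    query = """
        SELECT d.year, p.category, SUM(f.amount)
        FROM Fact_Sales AS f
        JOIN Dim_Date AS d ON f.date_id = d.date_id
        JOIN Dim_Product AS p ON f.product_id = p.product_id
        GROUP BY d.year, p.category
        ORDER BY d.year, p.category
    """
    return conn.execute(query).fetchall()
```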


ETL

Extract

Transform: data cleaning and integrity checking; schema conversion; field transformations

Load

When does ETL happen?
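
The three stages can be sketched in plain Python. The row layout, field names, and validity rules below are hypothetical and chosen only to give each Transform sub-step something to do; a real pipeline would read from an OLTP database and write to a warehouse rather than use in-memory lists.

```python
from datetime import date

# Hypothetical rows as they might come out of an OLTP "Order" table.
OLTP_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "placed": "2020-02-13", "total_cents": 1999},
    {"order_id": 2, "customer": "bob", "placed": "2020-02-14", "total_cents": 500},
    {"order_id": 3, "customer": "", "placed": "2020-02-14", "total_cents": 700},
    {"order_id": 4, "customer": "carol", "placed": "2020-02-15", "total_cents": -1},
]


def extract(source):
    """Extract: read the raw rows (here, copy them from an in-memory list)."""
    return [dict(row) for row in source]


def transform(rows):
    """Transform: clean fields, check integrity, convert to a fact-table shape."""
    facts = []
    for row in rows:
        customer = row["customer"].strip().lower()      # data cleaning
        if not customer or row["total_cents"] < 0:      # integrity checking
            continue
        placed = date.fromisoformat(row["placed"])      # field transformation
        facts.append({                                  # schema conversion
            "customer": customer,
            "year": placed.year,
            "month": placed.month,
            "amount": row["total_cents"] / 100,
        })
    return facts


def load(facts, warehouse):
    """Load: append the transformed rows to the warehouse table (a list here)."""
    warehouse.extend(facts)
    return len(facts)


def run_etl(source, warehouse):
    """Run the three stages in order and return the number of rows loaded."""
    return load(transform(extract(source)), warehouse)
```

In this sketch rows 3 and 4 are dropped by the integrity check, so only two facts reach the warehouse.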

[Diagram: users, Frontend, Backend, OLTP database; ETL (Extract, Transform, and Load); Data Warehouse, BI tools, analysts]

My data is a day old… Meh.

[Diagram: three applications (two serving users, one serving external APIs), each with a Frontend, a Backend, and an OLTP database; ETL (Extract, Transform, and Load); Data Warehouse, BI tools, analysts]


What do you actually do?

Dashboards

Report generation

Ad hoc analyses


OLAP Cubes

Common operations

slice and dice

roll up/drill down

pivot
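
A small sketch of these operations over an in-memory list of fact rows. The three dimensions (year, city, category), the sample rows, and the function names are hypothetical. Drill down is not shown separately: in this sketch it is simply a roll up that keeps more dimensions.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, city, category, amount).
DIMS = ("year", "city", "category")
SALES = [
    (2019, "Waterloo", "books", 10.0),
    (2019, "Toronto", "books", 4.0),
    (2020, "Waterloo", "books", 6.0),
    (2020, "Waterloo", "toys", 3.0),
    (2020, "Toronto", "toys", 8.0),
]


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


def dice(rows, allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] in values for d, values in allowed.items())
    ]


def roll_up(rows, keep):
    """Roll up: sum the measure over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[i] for i in idx)] += r[-1]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Pivot: arrange totals as {row value: {column value: total}}."""
    table = defaultdict(dict)
    for (row_value, col_value), total in roll_up(rows, (row_dim, col_dim)).items():
        table[row_value][col_value] = total
    return dict(table)
```

For example, `roll_up(SALES, ("year",))` collapses city and category, and `pivot(SALES, "city", "category")` lays the same totals out with cities as rows and categories as columns.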


OLAP Cubes: Challenges

Fundamentally, lots of joins, group-bys and aggregations

How to take advantage of schema structure to avoid repeated work?

Cube materialization

Realistic to materialize the entire cube?

If not, how/when/what to materialize?
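
To make the materialization question concrete: with n dimensions there are 2^n possible group-bys, one per subset of the dimensions. The sketch below enumerates and materializes all of them for a tiny, hypothetical fact table, and shows one way repeated work could be avoided, by deriving a coarser group-by from an already computed finer one instead of rescanning the raw rows. The data, dimension names, and function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (year, city, category, amount).
DIMS = ("year", "city", "category")
SALES = [
    (2019, "Waterloo", "books", 10.0),
    (2019, "Toronto", "books", 4.0),
    (2020, "Waterloo", "books", 6.0),
    (2020, "Waterloo", "toys", 3.0),
    (2020, "Toronto", "toys", 8.0),
]


def all_group_bys(dims):
    """Every subset of the dimensions is one group-by: 2**len(dims) in total."""
    return [subset for k in range(len(dims) + 1) for subset in combinations(dims, k)]


def materialize(rows, dims):
    """Compute every group-by from the raw rows (the brute-force full cube)."""
    cube = {}
    for group_by in all_group_bys(dims):
        idx = [dims.index(d) for d in group_by]
        totals = defaultdict(float)
        for r in rows:
            totals[tuple(r[i] for i in idx)] += r[-1]
        cube[group_by] = dict(totals)
    return cube


def coarsen(totals, group_by, keep):
    """Derive a coarser group-by from a finer one without touching raw rows."""
    idx = [group_by.index(d) for d in keep]
    coarser = defaultdict(float)
    for key, total in totals.items():
        coarser[tuple(key[i] for i in idx)] += total
    return dict(coarser)
```

Three dimensions give only 8 group-bys, but the count doubles with each added dimension, and the number of cells in each group-by grows with the number of distinct values, which is why materializing the whole cube is often unrealistic.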


Fast forward…


“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009.
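
For a sense of scale, a back-of-the-envelope calculation of the sustained rate needed just to keep up with one day of data per day. It assumes decimal units (1 GB = 1000 MB) and treats the quoted "more than 400 gigabytes" as exactly 400 GB, so the result is a lower bound; the function name is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_throughput_mb_per_s(gigabytes_per_day):
    """Sustained MB/s needed to process one day's data within one day."""
    return gigabytes_per_day * 1000 / SECONDS_PER_DAY


# 400 GB per day works out to roughly 4.6 MB/s, end to end,
# across loading, indexing, and aggregation.
rate = required_throughput_mb_per_s(400)
```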

[Diagram: users, Frontend, Backend, OLTP database; ETL (Extract, Transform, and Load); Data Warehouse, BI tools, analysts]

Facebook context?

[Diagram: users, Frontend, Backend, “OLTP”; ETL (Extract, Transform, and Load); Data Warehouse, BI tools, analysts]

Adding friends, updating profiles, likes, comments…

Feed ranking, friend recommendation, demographic analysis…

[Diagram: users, Frontend, Backend, “OLTP” (PHP/MySQL); ETL (Extract, Transform, and Load), or ELT?; Hadoop; analysts, data scientists (✗)]


What’s changed?

Dropping cost of disks: cheaper to store everything than to figure out what to throw away

Rise of social media and user-generated content: large increase in data volume

Growing maturity of data mining techniques: demonstrates value of data analytics

Types of data collected: from data that’s obviously valuable to data whose value is less apparent

Virtuous Product Cycle

[Cycle diagram: a useful service; analyze user behavior to extract insights; transform insights into action; $ (hopefully)]

Google. Facebook. Twitter. Amazon. Uber.


What do you actually do?

Dashboards

Report generation

Ad hoc analyses

“Descriptive”

“Predictive”

Data products

Virtuous Product Cycle

[Cycle diagram: a useful service; analyze user behavior to extract insights; transform insights into action; $ (hopefully). Labels: data science, data products]

Google. Facebook. Twitter. Amazon. Uber.


[Diagram: users, Frontend, Backend, “OLTP”; ETL (Extract, Transform, and Load); Hadoop, data scientists]

The Irony…

[Diagram: users, Frontend, Backend, “OLTP”; ETL (Extract, Transform, and Load); Hadoop, data scientists]

Wait, so why not use a database to begin with?


Why not just use a database?

Scalability. Cost.

SQL is awesome


Databases are great…

If your data has structure (and you know what the structure is)

If you know what queries you’re going to run ahead of time

If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure)

If you don’t know what you’re looking for

If your data is messy and noisy


“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld

Source: Wikipedia


One who knows and knows that he knows

His horse of wisdom will reach the skies

One who doesn't know, but knows that he doesn't know

His limping mule will eventually get him home

One who doesn't know and doesn't know that he doesn't know

He will be eternally lost in his hopeless ignorance!

Ibn Yamin (1286-1368)


Advantages of Hadoop dataflow languages

Don’t need to know the schema ahead of time

Many analyses are better formulated imperatively

Raw scans are the most common operations

Much faster data ingest rate
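
A plain-Python sketch of the first three points (it does not use Hadoop or any dataflow language). The raw lines are stored as they arrive, a structure is imposed only when they are read, and the analysis is written as an imperative pass over every record. The tab-separated log layout, field names, and sample lines are hypothetical.

```python
from collections import Counter

# Hypothetical raw clickstream lines, stored exactly as they arrived.
RAW_LOG = [
    "2020-02-13T10:00:00\talice\t/home",
    "2020-02-13T10:00:05\tbob\t/profile",
    "garbage line without tabs",
    "2020-02-13T10:01:00\talice\t/home\textra-field",
]


def parse(line):
    """Interpret a raw line at read time; return None if it does not fit."""
    parts = line.split("\t")
    if len(parts) < 3:
        return None
    timestamp, user, page = parts[:3]  # extra trailing fields are ignored
    return {"timestamp": timestamp, "user": user, "page": page}


def clicks_per_page(lines):
    """A raw scan written imperatively: parse, filter, and count in one pass."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record is None:  # messy input is skipped at read time
            continue
        counts[record["page"]] += 1
    return dict(counts)
```

Because nothing is validated or indexed when the lines are written, ingest is just an append; the cost of interpreting the data is paid later, by each scan.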


[Diagram: three applications (two serving users, one serving external APIs), each with a Frontend, a Backend, and an OLTP database; ETL (Extract, Transform, and Load); Data Warehouse / “Data Lake”; SQL on Hadoop, “Traditional” BI tools, other tools; data scientists]


What’s Next?


[Diagram: three applications (two serving users, one serving external APIs), each with a Frontend, a Backend, and an OLTP database; ETL (Extract, Transform, and Load); Data Warehouse / “Data Lake”; SQL on Hadoop, “Traditional” BI tools, other tools; data scientists]

My data is a day old… I refuse to accept that!

[Diagram: OLTP, ETL, OLAP]

What if you didn’t have to do this?


HTAP

Hybrid Transactional/Analytical Processing (HTAP)


[Diagram: three applications (two serving users, one serving external APIs), each with a Frontend, a Backend, and an HTAP database; Analytics tools and data scientists alongside the HTAP databases; ETL (Extract, Transform, and Load); Data Warehouse / “Data Lake”; SQL on Hadoop, “Traditional” BI tools, other tools; data scientists]

Everything in the cloud!

[Diagram: three applications (two serving users, one serving external APIs), each with a Frontend, a Backend, and an OLTP database; ETL (Extract, Transform, and Load); Data Warehouse / “Data Lake”; SQL on Hadoop, “Traditional” BI tools, other tools; data scientists. Cloud annotations: IaaS / Load balance aaS; DBaaS (e.g., RDS); DBaaS (e.g., Redshift); S3; “Cloudified” tools; ELT aaS]