Data-Intensive Distributed Computing

Part 5: Analyzing Relational Data (1/3)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 451/651 (Fall 2018)

Jimmy Lin

David R. Cheriton School of Computer Science

University of Waterloo

October 16, 2018

These slides are available at http://lintool.github.io/bigdata-2018f/

Structure of the Course

“Core” framework features and algorithm design

Evolution of Enterprise Architectures

Next two sessions: techniques, algorithms, and optimizations for relational processing

Monolithic Application

Frontend

Backend

Source: Wikipedia

Frontend

Backend

database

Why is this a good idea?

An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc.

Business Intelligence

Frontend

Backend

database

BI tools

analysts

Frontend

Backend

database

BI tools

analysts

Why is my application so slow?

Why does my analysis take so long?

Database Workloads

OLTP (online transaction processing)

Typical applications: e-commerce, banking, airline reservations

User facing: real-time, low latency, highly concurrent

Tasks: relatively small set of “standard” transactional queries

Data access pattern: random reads, updates, writes (small amounts of data)

OLAP (online analytical processing)

Typical applications: business intelligence, data mining

Back-end processing: batch workloads, less concurrency

Tasks: complex analytical queries, often ad hoc

Data access pattern: table scans, large amounts of data per query
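To make the two access patterns concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the two function names are assumptions made up for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0), (4, 12, 60.0)],
)


def oltp_update_order(order_id, new_amount):
    """OLTP-style access: touch a single row by key in a short transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE orders SET amount = ? WHERE order_id = ?", (new_amount, order_id)
        )
    row = conn.execute(
        "SELECT amount FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0]


def olap_revenue_by_customer():
    """OLAP-style access: scan the whole table and aggregate."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return dict(rows)
```

The first function reads and writes one small row; the second reads every row to answer one question. Running many of each against the same database is where the two workloads start to get in each other's way.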

OLTP and OLAP Together?

Downsides of co-existing OLTP and OLAP workloads

Poor memory management

Conflicting data access patterns

Variable latency

Solution?

users and analysts

Source: Wikipedia (Warehouse)

Build a data warehouse!

Frontend

Backend

BI tools

analysts

ETL (Extract, Transform, and Load)

Data Warehouse

OLTP database

OLTP database for user-facing transactions

OLAP database for data warehousing

What’s special about OLTP vs. OLAP?

Customer Billing

Order

Inventory

OrderLine

A Simple OLTP Schema

Dim_Customer

Dim_Date

Dim_Product

Fact_Sales

Dim_Store

Stars and snowflakes, oh my!

A Simple OLAP Schema
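As a rough illustration of a star schema, the sketch below builds a central fact table surrounded by dimension tables in an in-memory SQLite database. The table names follow the slide; every column name and data value is an assumption invented for the example, and Dim_Customer is left out to keep it short.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema: one fact table referencing dimension tables.

    Column names and rows are hypothetical; only the table names come
    from the slide.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Dim_Date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE Dim_Store   (store_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE Dim_Product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE Fact_Sales (
            date_id    INTEGER REFERENCES Dim_Date(date_id),
            store_id   INTEGER REFERENCES Dim_Store(store_id),
            product_id INTEGER REFERENCES Dim_Product(product_id),
            units      INTEGER,
            revenue    REAL
        );
        INSERT INTO Dim_Date    VALUES (1, 2018, 9), (2, 2018, 10);
        INSERT INTO Dim_Store   VALUES (1, 'Waterloo'), (2, 'Toronto');
        INSERT INTO Dim_Product VALUES (1, 'books'), (2, 'music');
        INSERT INTO Fact_Sales  VALUES
            (1, 1, 1, 2, 30.0),
            (1, 2, 2, 1, 10.0),
            (2, 1, 1, 3, 45.0),
            (2, 2, 1, 1, 15.0);
        """
    )
    return conn


def revenue_by_city_and_category(conn):
    """A typical analytical query: join facts to dimensions, then aggregate."""
    query = """
        SELECT s.city, p.category, SUM(f.revenue)
        FROM Fact_Sales f
        JOIN Dim_Store s   ON f.store_id = s.store_id
        JOIN Dim_Product p ON f.product_id = p.product_id
        GROUP BY s.city, p.category
        ORDER BY s.city, p.category
    """
    return conn.execute(query).fetchall()
```

The fact table holds only keys and measures, so most analytical queries take the same shape: join the facts to a few dimensions, group by dimension attributes, and aggregate a measure.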

Transform

Data cleaning and integrity checking

Schema conversion

Field transformations
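The transform step might look something like the following sketch. The input field names (`customer_id`, `amount`, `ordered_at`), the output field names, and the specific checks are all assumptions chosen for illustration, not a description of any particular ETL tool.

```python
from datetime import datetime


def transform(raw_rows, known_customer_ids):
    """Turn raw order rows (dicts of strings) into warehouse-style fact rows.

    All field names here are hypothetical. Returns (facts, rejected).
    """
    facts, rejected = [], []
    for row in raw_rows:
        # Data cleaning: strip stray whitespace from text fields.
        customer = (row.get("customer_id") or "").strip()
        amount_text = (row.get("amount") or "").strip()

        # Integrity checking: values must parse, the customer must exist,
        # and the amount must not be negative.
        try:
            amount = float(amount_text)
            ordered_at = datetime.strptime(row["ordered_at"], "%Y-%m-%d %H:%M:%S")
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
            continue
        if customer not in known_customer_ids or amount < 0:
            rejected.append(row)
            continue

        # Schema conversion and field transformations: rename fields,
        # derive a date key, and store money as integer cents.
        facts.append(
            {
                "customer_key": customer,
                "date_key": ordered_at.strftime("%Y%m%d"),
                "revenue_cents": round(amount * 100),
            }
        )
    return facts, rejected
```

Keeping the rejected rows, rather than silently dropping them, is one possible design choice; it makes it easier to see how much data the cleaning rules are discarding.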

When does ETL happen?

Extract

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

My data is a day old… Meh.

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

Frontend

Backend

Frontend

Backend

external APIs

OLTP database

What do you actually do?

Dashboards

Report generation

Ad hoc analyses

slice and dice

Common operations

roll up/drill down
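These operations can be sketched in a few lines of plain Python over a toy fact table. The `SALES` rows, the dimension names, and the helper functions below are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, city, category, revenue).
SALES = [
    (2018, 9, "Waterloo", "books", 30.0),
    (2018, 9, "Toronto", "music", 10.0),
    (2018, 10, "Waterloo", "books", 45.0),
    (2018, 10, "Toronto", "books", 15.0),
]
DIMS = {"year": 0, "month": 1, "city": 2, "category": 3}


def aggregate(rows, dims):
    """Sum revenue grouped by the given dimensions.

    Fewer dimensions is a roll up; more dimensions is a drill down.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        r for r in rows
        if all(r[DIMS[d]] in values for d, values in allowed.items())
    ]
```

For example, `aggregate(SALES, ["year", "month"])` drills down from yearly to monthly totals, and `aggregate(slice_(SALES, "city", "Waterloo"), ["category"])` summarizes a single slice of the cube.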

OLAP Cubes

OLAP Cubes: Challenges

Fundamentally, lots of joins, group-bys and aggregations

How to take advantage of schema structure to avoid repeated work?

Cube materialization

Realistic to materialize the entire cube?

If not, how/when/what to materialize?
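One way to see why full materialization gets expensive: a cube over n dimensions has one group-by (cuboid) per subset of dimensions, so 2^n in total. The sketch below, with made-up function names and data, materializes all of them naively and also shows how a coarser cuboid can be derived from a finer one rather than from the base table.

```python
from collections import defaultdict
from itertools import combinations


def group_by(rows, positions):
    """Sum the trailing measure, grouped by the dimensions at `positions`."""
    totals = defaultdict(float)
    for *dims, measure in rows:
        totals[tuple(dims[i] for i in positions)] += measure
    return dict(totals)


def materialize_cube(rows, num_dims):
    """Naively compute every cuboid: one scan per subset, 2**num_dims scans."""
    cube = {}
    for k in range(num_dims + 1):
        for positions in combinations(range(num_dims), k):
            cube[positions] = group_by(rows, positions)
    return cube


def roll_up_from(cuboid, keep):
    """Derive a coarser cuboid from an already materialized finer one.

    `keep` lists positions within the finer cuboid's key tuple. This avoids
    rescanning the base table, which is one way to reuse earlier work.
    """
    totals = defaultdict(float)
    for key, value in cuboid.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)
```

With 3 dimensions this is only 8 cuboids, but with 20 dimensions it is over a million, which is why deciding what to materialize (and when) matters.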

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

Frontend

Backend

Frontend

Backend

external APIs

OLTP database

Fast forward…

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009.

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

Facebook context?

Frontend

Backend

BI tools

analysts

Data Warehouse

“OLTP”

Adding friends

Updating profiles

Likes, comments

…

Feed ranking

Friend recommendation

Demographic analysis

…

Frontend

Backend

analysts

“OLTP” PHP/MySQL

data scientists

✗

Hadoop

or ELT?

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

5 MB hard drive in 1956

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

Rise of social media and user-generated content

Large increase in data volume

Growing maturity of data mining techniques

Demonstrates value of data analytics

Types of data collected

From data that’s obviously valuable to data whose value is less apparent

a useful service

analyze user behavior to extract insights

transform insights into action

$(hopefully)

Google. Facebook. Twitter. Amazon. Uber.

Virtuous Product Cycle

Dashboards

Report generation

Ad hoc analyses

“Descriptive”

“Predictive”

Data products

a useful service

analyze user behavior to extract insights

transform insights into action

$(hopefully)

Google. Facebook. Twitter. Amazon. Uber.

data science

data products

Virtuous Product Cycle

Frontend

Backend

data scientists

“OLTP”

Hadoop

Frontend

Backend

Hadoop

(Diagram: jobs made up of many map and reduce tasks)

Wait, so why not use a database to begin with?

The Irony…

“OLTP”

data scientists

Why not just use a database?

Scalability. Cost.

SQL is awesome

Databases are great…

If your data has structure (and you know what the structure is)

If you know what queries you’re going to run ahead of time

If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure)

If you don’t know what you’re looking for

If your data is messy and noisy

“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld

Source: Wikipedia

Databases are great… (Known unknowns!)

If your data has structure (and you know what the structure is)

If you know what queries you’re going to run ahead of time

If your data is reasonably clean

Databases are not so great… (Unknown unknowns!)

If your data has little structure (or you don’t know the structure)

If you don’t know what you’re looking for

If your data is messy and noisy

Don’t need to know the schema ahead of time

Many analyses are better formulated imperatively

Raw scans are the most common operations

Much faster data ingest rate

Advantages of Hadoop dataflow languages
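A small sketch of what "no schema ahead of time" and "raw scans" can look like in practice: the code below scans raw log lines, interprets each one only at read time, and tolerates messy records. The log format, field names, and the decision to skip bad lines are all assumptions made for illustration; this is plain Python, not Hadoop code.

```python
import json
from collections import Counter

# Hypothetical raw log: records vary in shape and one line is not JSON at all.
RAW_LOG = """\
{"user": "alice", "action": "like", "ts": 1539700000}
{"user": "bob", "action": "comment", "extra": {"len": 42}}
not even json
{"user": "alice", "action": "like"}
{"action": "share"}
"""


def count_actions_per_user(lines):
    """Scan raw lines imperatively, applying structure only while reading."""
    counts = Counter()
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # messy, noisy data: skip it rather than fail the scan
            continue
        if not isinstance(record, dict) or "user" not in record:
            skipped += 1  # record lacks the one field this analysis needs
            continue
        counts[(record["user"], record.get("action", "unknown"))] += 1
    return counts, skipped
```

Nothing had to be declared up front: fields that this analysis does not care about (`ts`, `extra`) are simply ignored, and a different analysis could read the same raw lines with a different interpretation.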

Dashboards

Report generation

Ad hoc analyses

“Descriptive”

“Predictive”

Data products

Which are known unknowns and unknown unknowns?

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

Frontend

Backend

Frontend

Backend

external APIs

OLTP database

Frontend

Backend

Frontend

Backend

Frontend

Backend

external APIs

“Traditional” BI tools

SQL on Hadoop

Other tools

Data Warehouse

“Data Lake”

data scientists

OLTP database

Twitter’s data warehousing architecture (circa 2012)

circa 2010

~150 people total

~60 Hadoop nodes

~6 people use analytics stack daily

circa 2012

~1400 people total

10s of Ks of Hadoop nodes, multiple DCs

10s of PBs total Hadoop DW capacity

~100 TB ingest daily

dozens of teams use Hadoop daily

10s of Ks of Hadoop jobs daily

How does ETL actually happen?

Twitter’s data warehousing architecture (circa 2012)

Importing Log Data

Scribe Daemons (Production Hosts)

Scribe Aggregators (one set per datacenter)

Staging Hadoop Cluster

Main Hadoop DW (Main Datacenter)

What’s Next?

Two developing trends…

Frontend

Backend

database

BI tools

analysts

Frontend

Backend

BI tools

analysts

Data Warehouse

OLTP database

Frontend

Backend

Frontend

Backend

external APIs

OLTP database

Frontend

Backend

Frontend

Backend

Frontend

Backend

external APIs

SQL on Hadoop

Other tools

data scientists

OLTP database

My data is a day old… I refuse to accept that!

OLTP

ETL

OLAP

What if you didn’t have to do this?

Hybrid Transactional/Analytical Processing (HTAP)

Coming back full circle?

Frontend

Backend

Frontend

Backend

Frontend

Backend

external APIs

SQL on Hadoop

Other tools

data scientists

OLTP database

Frontend

Backend

Frontend

Backend

Frontend

Backend

external APIs

SQL on Hadoop

Other tools

data scientists

HTAP database

Analytics tools

data scientists

Analytics tools

data scientists

Frontend

Backend

Frontend

Backend

Frontend

Backend

external APIs

SQL on Hadoop

Other tools

data scientists

Everything in the cloud!

IaaS / Load balance aaS

OLTP database

DBaaS (e.g., RDS)

DBaaS (e.g., Redshift)

“Cloudified” tools

ELT aaS

Source: Wikipedia (Japanese rock garden)
