Technologies for Data Analytics Platform

Technologies for Data Analytics Platform YAPC::Asia Tokyo 2015 - Aug 22, 2015

Transcript of Technologies for Data Analytics Platform

Page 1: Technologies for Data Analytics Platform

Technologies for Data Analytics Platform YAPC::Asia Tokyo 2015 - Aug 22, 2015

Page 2: Technologies for Data Analytics Platform

Who are you?

• Masahiro Nakagawa • github: @repeatedly

• Treasure Data Inc. • Fluentd / td-agent developer • https://jobs.lever.co/treasure-data

• I love OSS :) • D Language, MessagePack, The organizer of several meetups, etc…

Page 3: Technologies for Data Analytics Platform

Why do we analyze data?

Page 4: Technologies for Data Analytics Platform

• Reporting • Monitoring • Exploratory data analysis • Confirmatory data analysis • etc…

Page 5: Technologies for Data Analytics Platform

Need data, data, data!

Page 6: Technologies for Data Analytics Platform

It means we need a data analysis platform that fits our own requirements

Page 7: Technologies for Data Analytics Platform

Data Analytics Flow

Data source → Collect → Store → Process → Visualize → Reporting / Monitoring

Page 8: Technologies for Data Analytics Platform

Let’s launch platform!

Page 9: Technologies for Data Analytics Platform

RDBMS

• Easy to use and maintain • Single server • RDBMS is popular and has a huge ecosystem

[Diagram: ETL (Extract + Transformation + Load) into the RDBMS, Query against the RDBMS]

Page 10: Technologies for Data Analytics Platform

Oops! An RDBMS is not good at analytics over large data volumes. We need more speed and scalability!

Page 11: Technologies for Data Analytics Platform

Let’s consider Parallel RDBMS instead!

Page 12: Technologies for Data Analytics Platform

Parallel RDBMS

• Optimized for OLAP workload • Columnar storage, Shared nothing, etc… • Netezza, Teradata, Vertica, Greenplum, etc…

[Diagram: Query → Leader Node → three Compute Nodes]

Page 13: Technologies for Data Analytics Platform

Columnar Storage

• Good data format for analytics workloads • Reads only the selected columns, compresses efficiently • Not good for insert / update

time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …

[Diagram: the same table shown twice. Row layout: each row is the storage unit. Columnar layout: each column is the storage unit.]
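The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration using the log rows from the slide; the names `row_store`, `column_store` and `run_length_encode` are made up for this example and do not come from any particular database.

```python
# Hypothetical access-log rows matching the table on the slide.
rows = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]

# Row layout: each record is stored as one unit.
row_store = list(rows)

# Columnar layout: each column is stored as one unit.
column_store = {
    "time": [r[0] for r in rows],
    "code": [r[1] for r in rows],
    "method": [r[2] for r in rows],
}


def count_errors_row(store):
    # Touches every full record even though only `code` is needed.
    return sum(1 for _time, code, _method in store if code >= 400)


def count_errors_columnar(store):
    # Reads only the `code` column.
    return sum(1 for code in store["code"] if code >= 400)


def run_length_encode(values):
    # Repeated values inside one column compress well.
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


errors_row = count_errors_row(row_store)
errors_columnar = count_errors_columnar(column_store)
encoded_methods = run_length_encode(column_store["method"])
```

Both functions return the same count, but the columnar version never reads `time` or `method`. Appending one new record, on the other hand, means touching every column list, which is why the layout is weak for insert / update.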

Page 14: Technologies for Data Analytics Platform

Okay, query is now processed normally.

[Diagram: parallel RDBMS with a leader node (L) and three compute nodes (C)]

Page 15: Technologies for Data Analytics Platform

No silver bullet

• Performance depends on data modeling and query • distkey and sortkey are important

• should reduce data transfer and IO Cost • query should take advantage of these keys

• There are some problems • Cluster scaling, metadata management, etc…
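Why the distribution key matters can be shown with a toy model. The sketch below assumes a three-node cluster and a simple byte-sum hash, both invented for illustration; real parallel RDBMSs use their own hash functions and storage. When two tables share the same distkey, matching rows land on the same node, so the join needs no data transfer between nodes.

```python
from collections import defaultdict

NUM_NODES = 3  # assumed cluster size, for illustration only


def node_for(key):
    # Simple deterministic hash; a real system uses its own function.
    return sum(key.encode()) % NUM_NODES


def distribute(rows, distkey):
    # Place each row on the node chosen by its distribution key.
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[distkey])].append(row)
    return nodes


users = [{"user_id": "u1", "name": "a"}, {"user_id": "u2", "name": "b"}]
events = [
    {"user_id": "u1", "code": 200},
    {"user_id": "u2", "code": 404},
    {"user_id": "u1", "code": 200},
]

users_by_node = distribute(users, "user_id")
events_by_node = distribute(events, "user_id")


def local_join(node):
    # Both tables share the distkey, so every match is on this node.
    names = {u["user_id"]: u["name"] for u in users_by_node[node]}
    return [(names[e["user_id"]], e["code"]) for e in events_by_node[node]]


joined = [pair for node in range(NUM_NODES) for pair in local_join(node)]
```

If `events` were distributed on a different column, `local_join` would fail to find some users locally and rows would have to be shipped between nodes first.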

Page 16: Technologies for Data Analytics Platform

Performance is good :) But we often want to change the schema for new workloads. Now it is hard to maintain the schema and its data…

[Diagram: parallel RDBMS with a leader node (L) and three compute nodes (C)]

Page 17: Technologies for Data Analytics Platform

Okay, let’s separate data sources into multiple layers for a reliable platform

Page 18: Technologies for Data Analytics Platform

Schema on Write (RDBMS)

• Write data using a schema to improve query performance

• Pros: • Minimum query overhead

• Cons: • Need to design the schema and workload beforehand • Data load is an expensive operation

Page 19: Technologies for Data Analytics Platform

Schema on Read (Hadoop)

• Write data without a schema and map the schema at query time

• Pros: • Robust against schema and workload changes • Data load is a cheap operation

• Cons: • High overhead at query time
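The two approaches can be contrasted with a small Python sketch. The `SCHEMA` dictionary and the function names are hypothetical and exist only for this example; lists stand in for a table and for raw log storage.

```python
import json

# Hypothetical schema for the access log used earlier in the slides.
SCHEMA = {"time": str, "code": int, "method": str}


def write_with_schema(table, record):
    # Schema on write: validate and convert at load time.
    # A record with a missing column raises KeyError here.
    row = tuple(SCHEMA[col](record[col]) for col in SCHEMA)
    table.append(row)


def write_raw(store, line):
    # Schema on read: keep the raw line untouched, so loading is cheap.
    store.append(line)


def read_with_schema(store, columns):
    # The schema is mapped at query time; every query pays the parse cost.
    for line in store:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


table = []
write_with_schema(table, {"time": "2015-12-01 10:02:36", "code": "200", "method": "GET"})

raw = []
write_raw(raw, '{"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"}')
write_raw(raw, '{"time": "2015-12-01 10:22:09", "code": 404, "method": "GET", "agent": "curl"}')
result = list(read_with_schema(raw, ["code", "agent"]))
```

The second raw record carries a new `agent` field. The raw store accepts it without any change, while the schema-on-write table would need a schema change before it could keep that field.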

Page 20: Technologies for Data Analytics Platform

Data Lake

• Schema management is hard • Volume keeps increasing and formats change often • There are lots of log types

• A feasible approach is to store raw data and convert it before analysis

• A Data Lake is a single storage for any logs • Note that there is no clear definition for now

Page 21: Technologies for Data Analytics Platform

Data Lake Patterns

• Use a DFS, e.g. HDFS, for log storage • ETL or data processing by the Hadoop ecosystem • Can convert logs via ingestion tools beforehand

• Use Data Lake storage and related tools • These storages support Hadoop ecosystem

Page 22: Technologies for Data Analytics Platform

Apache Hadoop

• Distributed computing framework

• First implementation based on Google MapReduce

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

Page 23: Technologies for Data Analytics Platform

HDFS

http://nosqlessentials.com/

Page 24: Technologies for Data Analytics Platform

MapReduce
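As a rough illustration of the programming model, here is a single-process Python sketch of the map, shuffle and reduce phases, counting requests per HTTP method in the sample log. It assumes the four-field log format shown earlier and says nothing about how Hadoop distributes the work across machines.

```python
from collections import defaultdict


def map_phase(line):
    # Emit (key, 1) for the HTTP method of each log line.
    _date, _time, _code, method = line.split()
    yield method, 1


def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Aggregate the grouped values for one key.
    return key, sum(values)


def map_reduce(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


logs = [
    "2015-12-01 10:02:36 200 GET",
    "2015-12-01 10:22:09 404 GET",
    "2015-12-01 10:36:45 200 GET",
    "2015-12-01 10:49:21 200 POST",
]
counts = map_reduce(logs)
```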

Page 25: Technologies for Data Analytics Platform

Cool! Data load becomes robust!

[Diagram: EL (Extract + Load) stores raw data, then T (Transform) turns raw data into transformed data]

Page 26: Technologies for Data Analytics Platform

Apache Tez

• Low level framework for YARN applications • Hive, Pig, new query engines and more

• Task and DAG based processing flow

[Diagram: a Task consists of Input → Processor → Output; a DAG connects tasks]

Page 27: Technologies for Data Analytics Platform

MapReduce vs Tez

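[Diagram: MapReduce vs Tez. MapReduce runs the query as several map (M) / reduce (R) jobs and writes intermediate results to HDFS between jobs. Tez runs the same query as one DAG of M and R tasks without the intermediate HDFS writes.]

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;

MapReduce stages: GROUP a BY a.x, GROUP b BY b.x → JOIN (a, b) → ORDER BY

Tez stages: GROUP BY x → GROUP BY a.x, JOIN (a, b) → ORDER BY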

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9

Page 28: Technologies for Data Analytics Platform

Superstition

• HDFS and YARN have a SPOF • Recent versions have no SPOF in either MapReduce 1 or MapReduce 2

• Can’t build from scratch • Really? Treasure Data builds Hadoop on CircleCI. Cloudera, Hortonworks and MapR do too. • They also check its dependent toolchain.

Page 29: Technologies for Data Analytics Platform

Which Hadoop package should we use?

• A distribution from a Hadoop distributor is better • CDH by Cloudera • HDP by Hortonworks • MapR distribution by MapR

• If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option. • For example, Treasure Data has patches and wants to use the patched version.

Page 30: Technologies for Data Analytics Platform

Good :) In addition, we want to collect data in an efficient way!

Page 31: Technologies for Data Analytics Platform

Ingestion tools

• There are two execution models!

• Bulk load: • For high throughput • Most tools transfer data in batch and in parallel

• Streaming load: • For low latency • Most tools transfer data in micro-batches
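The micro-batch idea used by streaming loaders can be sketched as a small buffer that forwards events in chunks. `MicroBatchBuffer` and its `flush_size` parameter are hypothetical names for this example, not the API of any tool mentioned here; real collectors typically also flush on a time interval and handle retries.

```python
class MicroBatchBuffer:
    """Collect events and hand them to a sink in small batches."""

    def __init__(self, flush_size, sink):
        self.flush_size = flush_size
        self.sink = sink  # callable that receives one list of events
        self.buffer = []

    def emit(self, event):
        # Buffer the event; forward a batch once enough have accumulated.
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, then start a new batch.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []
buf = MicroBatchBuffer(flush_size=2, sink=batches.append)
for i in range(5):
    buf.emit({"id": i})
buf.flush()  # flush the remainder, e.g. on shutdown
```

A larger `flush_size` moves the behavior toward bulk load (higher throughput), a smaller one toward lower latency.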

Page 32: Technologies for Data Analytics Platform

Bulk load tools

• Embulk • Pluggable bulk data loader for various inputs and outputs • Write plugins using Java and JRuby

• Sqoop • Data transfer between Hadoop and RDBMS • Included in some distributions

• Or the dedicated bulk loader of each data store

Page 33: Technologies for Data Analytics Platform

Streaming load tools

• Fluentd • Pluggable, JSON based streaming collector • Lots of plugins on RubyGems

• Flume • Mainly for the Hadoop ecosystem: HDFS, HBase, … • Included in some distributions

• Or Logstash, Heka, Splunk, etc…

Page 34: Technologies for Data Analytics Platform

Data ingestion also becomes robust and efficient!

Raw data Transformed data

Page 35: Technologies for Data Analytics Platform

It works! But… we want to issue ad-hoc queries against the entire data set. We can’t wait for the data to be loaded into a database.

Page 36: Technologies for Data Analytics Platform

You can use MPP query engine for data stores.

Page 37: Technologies for Data Analytics Platform

MPP query engine

• It doesn’t have its own storage, unlike a parallel RDBMS

• Follows the “Schema on Read” approach • Data distribution depends on the backend • Data schema also depends on the backend

• Some products are called “SQL on Hadoop” • Presto, Impala, Apache Drill, etc… • They have their own execution engines and don’t use MapReduce.

Page 38: Technologies for Data Analytics Platform

Presto

• Distributed query engine for interactive queries against various data sources and large data

• Pluggable connectors for joining multiple backends • You can join MySQL and HDFS data in one query

• Lots of useful functions for data analytics • Window functions, approximate queries, machine learning, etc…

Page 39: Technologies for Data Analytics Platform

[Diagram: a batch analysis platform (HDFS → Hive, daily/hourly batch) feeds a visualization platform (PostgreSQL, etc., interactive query) behind the Dashboard and commercial BI tools]

Page 40: Technologies for Data Analytics Platform

[Diagram: a batch analysis platform (HDFS → Hive, daily/hourly batch) feeds a visualization platform (PostgreSQL, etc., interactive query) behind the Dashboard and commercial BI tools]

Problems with this layout: ✓ Less scalable ✓ Extra cost ✓ More work to manage 2 platforms ✓ Can’t query against “live” data directly

Page 41: Technologies for Data Analytics Platform

[Diagram: two layouts. One: HDFS → Hive (daily/hourly batch) → PostgreSQL, etc. → Dashboard (interactive query). The other: HDFS → Hive (daily/hourly batch), with Presto serving interactive queries to the Dashboard.]

Page 42: Technologies for Data Analytics Platform

[Diagram: data analysis platform. Presto provides SQL on any data set: HDFS / Hive (daily/hourly batch), Cassandra, MySQL, commercial DBs. Interactive queries come from the Dashboard and commercial BI tools (✓ IBM Cognos ✓ Tableau ✓ ...)]

Page 43: Technologies for Data Analytics Platform

[Diagram: Presto architecture with Client, Coordinator, Connector Plugin, three Workers, Storage / Metadata, and Discovery Service]

Page 44: Technologies for Data Analytics Platform

Execution Model

[Diagram: MapReduce writes the output of each map and reduce stage to disk and waits between stages; Presto tasks pass data directly to each other]

MapReduce: • Writes data to disk • Waits between stages

Presto: • All stages are pipelined ✓ No wait time ✓ No fault-tolerance • Memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
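The contrast between staged and pipelined execution can be imitated with Python generators. This is only an analogy under simple assumptions: a `list` stands in for the intermediate data that a staged engine writes to disk, and generator chaining stands in for memory-to-memory transfer. The function names are invented for this example.

```python
def parse(lines):
    # First stage: turn a log line into (status code, method).
    for line in lines:
        _date, _time, code, method = line.split()
        yield int(code), method


def only_errors(records):
    # Second stage: keep only error responses.
    for code, method in records:
        if code >= 400:
            yield code, method


def staged(lines):
    # Each stage finishes and materializes its output before the next starts.
    stage1 = list(parse(lines))
    stage2 = list(only_errors(stage1))
    return stage2


def pipelined(lines):
    # Records flow through both stages one at a time; nothing is materialized.
    return only_errors(parse(lines))


logs = [
    "2015-12-01 10:02:36 200 GET",
    "2015-12-01 10:22:09 404 GET",
    "2015-12-01 10:36:45 200 GET",
    "2015-12-01 10:49:21 200 POST",
]

consumed = []


def source():
    # Record how much input has been read so far.
    for line in logs:
        consumed.append(line)
        yield line


first_error = next(pipelined(source()))
lines_read_for_first_result = len(consumed)
```

The pipelined version returns its first result after reading only two input lines, while `staged` reads everything before producing anything. The price is the one named on the slide: if a pipelined stage fails midway, there is no materialized intermediate result to restart from.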

Page 45: Technologies for Data Analytics Platform

Okay, we now have a combination of low-latency and batch processing.

Raw data

Page 46: Technologies for Data Analytics Platform

Resolved our concern! But… we also need quick estimation.

Page 47: Technologies for Data Analytics Platform

Currently, we have several stream processing tools. Let’s try!!

Page 48: Technologies for Data Analytics Platform

Apache Storm

• Distributed realtime processing framework

• Low latency: one tuple at a time • Trident mode uses micro-batches

https://storm.apache.org/

Page 49: Technologies for Data Analytics Platform

Norikra

• Schema-less CEP engine for stream processing

• Uses SQL-like Esper EPL • Not distributed for now, unlike Storm

[Diagram: event stream in, calculated result out]

Page 50: Technologies for Data Analytics Platform

Great! We can get insight from both streaming and batch processing :)

Page 51: Technologies for Data Analytics Platform

One more. We can make data transfer more reliable for multiple data streams with a distributed queue

Page 52: Technologies for Data Analytics Platform

Apache Kafka

• Distributed messaging system • Producer - Broker - Consumer pattern • Pull model, replication, etc…

[Diagram: App pushes to Kafka; consumers pull from Kafka]

Page 53: Technologies for Data Analytics Platform

Push vs Pull

• Push: • Easy to transfer data to multiple destinations • Hard to control stream ratio in multiple streams

• Pull: • Easy to control stream ratio • Should manage consumers correctly
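The pull model can be sketched with an append-only log and a per-consumer read position. The `Broker` class below is a toy invented for this example and is not Kafka's API; as a simplification it keeps the offsets inside the broker object. Each consumer decides how many messages it takes per poll, which is what makes the rate easy to control, and each consumer's position has to be tracked correctly.

```python
class Broker:
    """Toy append-only log with per-consumer read offsets."""

    def __init__(self):
        self.log = []      # messages in arrival order
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, message):
        # Producers push messages to the broker.
        self.log.append(message)

    def poll(self, consumer, max_messages):
        # Consumers pull at their own pace and advance their own offset.
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch


broker = Broker()
for i in range(5):
    broker.publish(i)

fast = broker.poll("fast", max_messages=10)
slow_first = broker.poll("slow", max_messages=2)
slow_second = broker.poll("slow", max_messages=2)
```

The slow consumer falls behind without affecting the fast one, and no message is lost as long as its offset is managed correctly.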

Page 54: Technologies for Data Analytics Platform

This is a modern analytics platform

Page 55: Technologies for Data Analytics Platform

Seems complex and hard to maintain? Let’s use useful services!

Page 56: Technologies for Data Analytics Platform

Amazon Redshift

• Parallel RDBMS on AWS

• Re-use traditional Parallel RDBMS know-how • Scaling is easier than in traditional systems

• Combination with Amazon EMR is popular: 1. Store data in S3 2. EMR processes the S3 data 3. Load the processed data into Redshift

• EMR has the Hadoop ecosystem

Page 57: Technologies for Data Analytics Platform

Using AWS Services

Page 58: Technologies for Data Analytics Platform

Google BigQuery

• Distributed query engine and scalable storage • Tree model, columnar storage, etc… • Separates storage from workers

• High performance queries on Google infrastructure • Lots of workers • Storage / IO layer on Colossus

• Can’t manage Parallel RDBMS properties like distkey, but it works well in most cases.

Page 59: Technologies for Data Analytics Platform

BigQuery architecture

Page 60: Technologies for Data Analytics Platform

Using GCP Services

Page 61: Technologies for Data Analytics Platform

Treasure Data

• Cloud based end-to-end data analytics service • Hive, Presto, Pig and Hivemall on one big repository • Many ingestion and output methods, scheduling, etc… • No stream processing for now

• Service concept is Data Lake • JSON based schema-less storage

• Execution model is similar to BigQuery • Separates storage from workers • Can’t specify Parallel RDBMS properties

Page 62: Technologies for Data Analytics Platform

Using Treasure Data Service

Page 63: Technologies for Data Analytics Platform

Resource Model Trade-off

Model | Pros | Cons
Fully guaranteed | Stable execution; easy to control resources | No boost mechanism
Guaranteed with multi-tenancy | Stable execution; good scalability | Less controllable resources
Fully multi-tenanted | Boosted performance; great scalability | Unstable execution

Page 64: Technologies for Data Analytics Platform

MS Azure also has useful services: DataHub, SQL DWH, DataLake, Stream Analytics, HDInsight…

Page 65: Technologies for Data Analytics Platform

Use service or build a platform?

• Consider using a service first • AWS, GCP, MS Azure, Treasure Data, etc… • The important factor is data analytics, not the platform

• Do you have enough resources to maintain it?

• If a specific analytics platform is a differentiator, building a platform is better • Use state-of-the-art technologies • Hard to implement on existing platforms

Page 66: Technologies for Data Analytics Platform

Conclusion

• Many software products and services exist for data analytics • Lots of trade-offs: performance, complexity, connectivity, execution model, etc

• SQL is a primary language for data analytics

• Focus on your goal! • Is the data analytics platform your business core? If not, consider using services first.

Page 67: Technologies for Data Analytics Platform

Cloud service for entire data pipeline!

Page 68: Technologies for Data Analytics Platform

Appendix

Page 69: Technologies for Data Analytics Platform

Apache Spark

• Another distributed computing framework

• Mainly for in-memory computing with DAG • RDD and DataFrame based clean API

• Combination with Hadoop is popular

http://slidedeck.io/jmarin/scala-talk

Page 70: Technologies for Data Analytics Platform

Apache Flink

• Streaming based execution engine

• Supports batch and pipelined processing • Hadoop and Spark are batch based

https://ci.apache.org/projects/flink/flink-docs-master/

Page 71: Technologies for Data Analytics Platform

Batch vs Pipelined

[Diagram: Batch (Staged) runs tasks in stage1, stage2 and stage3 with disk in between; Pipelined connects tasks directly]

Batch (Staged): • Waits between stages

Pipelined: • All stages are pipelined ✓ No wait time ✓ Fault-tolerance with checkpointing • Memory-to-memory data transfer ✓ Uses disk if needed

Page 72: Technologies for Data Analytics Platform

Visualization

• Tableau • Popular BI tool in many areas • Awesome GUI, easy to use, lots of charts, etc

• Metric Insights • Dashboard for many metrics • Scheduled query, custom handler, etc

• Chartio • Cloud based BI tool

Page 73: Technologies for Data Analytics Platform

How to manage job dependency? We want to issue Job X after Job A and Job B are finished.

Page 74: Technologies for Data Analytics Platform

Data pipeline tool

• There are some important features

• Manage job dependency • Handle job failure and retry • Easy to define topology • Separate tasks into sub-tasks

• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
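The core of these features, running Job X only after Job A and Job B and retrying on failure, can be sketched with the Python standard library (`graphlib`, Python 3.9+). `run_pipeline`, `run_job` and `max_retries` are hypothetical names for this example; the tools listed above add scheduling, history, UIs and distributed execution on top.

```python
from graphlib import TopologicalSorter

# Job X depends on Job A and Job B (job name -> set of prerequisites).
dependencies = {"X": {"A", "B"}, "A": set(), "B": set()}


def run_pipeline(dependencies, run_job, max_retries=2):
    """Run jobs in dependency order, retrying each failed job."""
    finished = []
    for job in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                run_job(job)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last retry
        finished.append(job)
    return finished


# A job that fails once before succeeding, to exercise the retry path.
failures_left = {"A": 1}


def flaky_job(job):
    if failures_left.get(job, 0) > 0:
        failures_left[job] -= 1
        raise RuntimeError(f"{job} failed")


order = run_pipeline(dependencies, run_job=flaky_job)
```

`order` always ends with `"X"`; the relative order of `"A"` and `"B"` is not defined because they do not depend on each other.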

Page 75: Technologies for Data Analytics Platform

Luigi

• Python module for building job pipelines • Write Python code and run it. • A task is defined as a Python class • Easy to manage with a VCS

• Needs some extra tools • Scheduled jobs, job history, etc…

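import luigi

class T1(luigi.Task):
    def requires(self):
        return []  # dependencies: upstream Task instances

    def output(self):
        return luigi.LocalTarget("t1.txt")  # store result (hypothetical path)

    def run(self):
        with self.output().open("w") as f:  # task body
            f.write("done\n")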

Page 76: Technologies for Data Analytics Platform

Airflow

• Python and DAG based workflow • Write Python code, but it only defines a DAG • A task is defined by an Operator

• There are good features • Management web UI • Task information is stored in a database • Celery based distributed execution

dag = DAG('example')
t1 = Operator(..., dag=dag)
t2 = Operator(..., dag=dag)
t2.set_upstream(t1)