Technologies for Data Analytics Platform
YAPC::Asia Tokyo 2015 - Aug 22, 2015
Who are you?
• Masahiro Nakagawa • github: @repeatedly
• Treasure Data Inc. • Fluentd / td-agent developer • https://jobs.lever.co/treasure-data
• I love OSS :) • D Language, MessagePack, organizer of several meetups, etc…
Why do we analyze data?
Reporting Monitoring Exploratory data analysis Confirmatory data analysis etc…
Need data, data, data!
This means we need a data analytics platform for our own requirements
Data Analytics Flow
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring
Let’s launch a platform!
• Easy to use and maintain • Single server • An RDBMS is popular and has a huge ecosystem
[Diagram: ETL (Extract + Transform + Load) into an RDBMS, which also serves the queries]
Oops! An RDBMS is not good for data analytics over large data volumes. We need more speed and scalability!
Let’s consider Parallel RDBMS instead!
Parallel RDBMS
• Optimized for OLAP workloads • Columnar storage, shared-nothing architecture, etc… • Netezza, Teradata, Vertica, Greenplum, etc…
[Diagram: queries go to a Leader Node, which distributes the work across multiple Compute Nodes]
time                 code  method
2015-12-01 10:02:36  200   GET
2015-12-01 10:22:09  404   GET
2015-12-01 10:36:45  200   GET
2015-12-01 10:49:21  200   POST
…                    …     …
Columnar Storage
• Good data format for analytics workloads • Read only selected columns, efficient compression • Not good for insert / update
The same sample table as above, now stored column by column: in row-oriented storage the unit of storage is a whole row, while in columnar storage the unit is a single column, so a query reads only the columns it needs (see the toy sketch below).
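To make the row vs columnar distinction concrete, here is a toy sketch in plain Python (not from the talk; the data matches the sample table above): reading one column in the columnar layout touches only that column, while the row layout has to walk every record.

# Row layout: each record is stored together, so reading one column still scans whole rows
rows = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]

# Columnar layout: each column is stored (and compressed) separately
columns = {
    "time":   [r[0] for r in rows],
    "code":   [r[1] for r in rows],
    "method": [r[2] for r in rows],
}

# "SELECT code FROM access_log" reads only one list in the columnar layout...
print(columns["code"])                 # [200, 404, 200, 200]
# ...but has to touch every field of every row in the row layout
print([code for _, code, _ in rows])   # [200, 404, 200, 200]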
Okay, queries now run fine on large data.
No silver bullet
• Performance depends on data modeling and queries • distkey and sortkey are important
• They should reduce data transfer and IO cost • Queries should take advantage of these keys (a hedged sketch follows below)
• There are still some problems • Cluster scaling, metadata management, etc…
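As a hedged illustration of distkey / sortkey (Amazon Redshift syntax is used here because it exposes them directly; cluster endpoint, credentials and table name are made up), issued from Python via psycopg2:

import psycopg2

# Connection details are illustrative
conn = psycopg2.connect(host="example-cluster", port=5439,
                        dbname="analytics", user="admin", password="...")
cur = conn.cursor()

# DISTKEY decides how rows are spread across compute nodes (less data transfer for
# joins/aggregations on that key); SORTKEY decides on-disk order (less IO for
# range scans, e.g. time-based filters).
cur.execute("""
    CREATE TABLE access_log (
        time   TIMESTAMP,
        code   INT,
        method VARCHAR(8)
    )
    DISTKEY (code)
    SORTKEY (time);
""")
conn.commit()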
Performance is good :) But we often want to change the schema for new workloads, and then it becomes hard to maintain the schema and its data…
Okay, let’s separate data sources into multiple layers for a more reliable platform
Schema on Write (RDBMS) • Write data using a schema to improve query performance
• Pros: • minimum query overhead
• Cons: • Need to design the schema and workload in advance • Data loading is an expensive operation
Schema on Read (Hadoop) • Write data without a schema and map the schema at query time (see the toy sketch after this slide)
• Pros: • Robust against schema and workload changes • Data loading is a cheap operation
• Cons: • High overhead at query time
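A toy Python sketch (not from the talk) of the trade-off: Schema on Write pays the conversion cost once at load time, Schema on Read stores raw lines and pays it inside every query.

import json

raw_events = [
    '{"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"}',
    '{"time": "2015-12-01 10:22:09", "code": 404, "method": "GET"}',
]

# Schema on Write: parse/validate once at load time; queries then touch typed tuples
table = [(e["time"], int(e["code"]), e["method"])
         for e in (json.loads(line) for line in raw_events)]     # expensive load
errors = sum(1 for _, code, _ in table if code >= 400)           # cheap query

# Schema on Read: keep the raw lines as-is; every query re-applies the schema
errors_raw = sum(1 for line in raw_events if json.loads(line)["code"] >= 400)

print(errors, errors_raw)   # 1 1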
Data Lake • Schema management is hard
• Volume is increasing and formats change often • There are lots of log types
• A feasible approach is to store raw data and convert it before analysis
• A Data Lake is a single storage for all logs • Note that there is no clear definition for now
Data Lake Patterns
• Use a DFS, e.g. HDFS, for log storage • ETL or data processing with the Hadoop ecosystem • Logs can be converted via ingestion tools beforehand
• Use a Data Lake storage and related tools • These storages support the Hadoop ecosystem
Apache Hadoop • Distributed computing framework
• First implementation was based on Google's MapReduce (a minimal word-count sketch follows below)
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
MapReduce
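As a reminder of the programming model, here is the classic word count written for Hadoop Streaming in Python (a generic sketch, not code from the talk): the mapper emits (word, 1) pairs, Hadoop sorts them by key between the stages, and the reducer sums the counts per word.

import sys

def mapper():
    # mapper.py: emit "word\t1" for every word on stdin
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # reducer.py: input arrives sorted by key, so counts for a word are adjacent
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

Submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -input /logs -output /wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (paths illustrative).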
Cool! Data load becomes robust!
[Diagram: E+L then T. Raw data is loaded first, then transformed into transformed data]
Apache Tez • Low-level framework for YARN applications
• Hive, Pig, new query engines and more
• Task and DAG based processing flow
[Diagram: a Tez Task consists of Input, Processor and Output; tasks are wired together into a DAG]
MapReduce vs Tez
[Diagram: with MapReduce, each job writes its intermediate results to HDFS and the next job's mappers read them back; with Tez, map and reduce tasks are connected directly in one DAG, avoiding the intermediate HDFS writes]
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;
[Diagram: execution plan for the query above. As separate MapReduce jobs: GROUP a BY a.x, GROUP b BY b.x, then JOIN (a, b), then ORDER BY. As a single Tez DAG: the GROUP BY, JOIN and ORDER BY stages are connected directly]
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
Superstitions • "HDFS and YARN have a SPOF"
• Recent versions have no SPOF, in both MapReduce 1 and MapReduce 2
• "You can't build Hadoop from scratch"
• Really? Treasure Data builds Hadoop on CircleCI; Cloudera, Hortonworks and MapR do too • They also check the dependent toolchain
Which Hadoop package should we use? • A distribution from a Hadoop vendor is better
• CDH by Cloudera • HDP by Hortonworks • MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option • For example, Treasure Data has its own patches and wants to use the patched version
Good :) In addition, we want to collect data in an efficient way!
Ingestion tools • There are two execution models!
• Bulk load: • For high throughput • Most tools transfer data in batches and in parallel
• Streaming load: • For low latency • Most tools transfer data in micro-batches
Bulk load tools • Embulk
• Pluggable bulk data loader for various inputs and outputs
• Write plugins using Java and JRuby
• Sqoop • Data transfer between Hadoop and RDBMS • Included in some distributions
• Or a dedicated bulk loader for each data store
Streaming load tools • Fluentd
• Pluggable, JSON-based streaming collector • Lots of plugins on RubyGems (a minimal sketch follows after this slide)
• Flume • Mainly for the Hadoop ecosystem: HDFS, HBase, … • Included in some distributions
• Or Logstash, Heka, Splunk, etc…
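A minimal sketch of the streaming model from the application side, using the fluent-logger Python package (tag, host and fields are illustrative; a matching Fluentd in_forward source and output plugin are assumed to be configured):

from fluent import event, sender

# Point the logger at a local Fluentd / td-agent in_forward input (default port 24224)
sender.setup("app.access", host="localhost", port=24224)

# Each call becomes one event on the "app.access.request" tag; Fluentd buffers the
# events and routes them to outputs (file, HDFS, Treasure Data, ...) in micro-batches
event.Event("request", {"code": 200, "method": "GET", "path": "/index.html"})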
Data ingestion also becomes robust and efficient!
It works! But… we also want to issue ad-hoc queries over the entire data set, and we can't wait for it to be loaded into a database.
You can use an MPP query engine on top of your data stores.
MPP query engine • It doesn't have its own storage, unlike a parallel RDBMS
• Follows the "Schema on Read" approach • Data distribution depends on the backend • The data schema also depends on the backend
• Some products are called "SQL on Hadoop" • Presto, Impala, Apache Drill, etc… • They have their own execution engines and don't use MapReduce
Presto • Distributed query engine for interactive queries against various data sources and large data
• Pluggable connectors for joining multiple backends • You can join MySQL and HDFS data in one query (a hedged sketch follows after this slide)
• Lots of useful functions for data analytics • Window functions, approximate queries, machine learning, etc…
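A hedged sketch of an interactive, cross-backend query from Python with the PyHive Presto client (coordinator host, catalogs and table names are illustrative; the same SQL works from the Presto CLI):

from pyhive import presto

# Connect to the Presto coordinator
cursor = presto.connect(host="presto-coordinator", port=8080,
                        catalog="hive", schema="default").cursor()

# One query joining an HDFS-backed Hive table with a MySQL table via connectors
cursor.execute("""
    SELECT u.name, COUNT(*) AS hits
    FROM hive.default.access_log a
    JOIN mysql.app.users u ON a.user_id = u.id
    GROUP BY u.name
    ORDER BY hits DESC
    LIMIT 10
""")
print(cursor.fetchall())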
[Diagram: a batch analysis platform (Hive on HDFS, daily/hourly batch) feeding a separate visualization platform (PostgreSQL etc., interactive queries, commercial BI tools, dashboards). Problems: ✓ Less scalable ✓ Extra cost ✓ More work to manage 2 platforms ✓ Can't query against "live" data directly]
[Diagram: one data analysis platform. Hive on HDFS keeps handling the daily/hourly batch, while Presto serves interactive queries from dashboards and commercial BI tools (✓ IBM Cognos ✓ Tableau ✓ ...), issuing SQL on any data set: HDFS, PostgreSQL, Cassandra, MySQL, commercial DBs]
[Diagram: Presto architecture. A Client sends a query to the Coordinator, which uses a Discovery Service and Connector Plugins and schedules work across Workers that read from Storage / Metadata]
Execution Model
[Diagram: MapReduce vs Presto execution model. MapReduce: map and reduce tasks write data to disk and wait between stages. Presto: all stages are pipelined with memory-to-memory data transfer: ✓ No wait time ✓ No disk IO, but ✓ No fault tolerance and each data chunk must fit in memory]
Okay, now we have a combination of low-latency and batch processing.
That resolves our concern! But… we also need quick estimation.
There are currently several stream processing frameworks. Let's try them!
Apache Storm • Distributed realtime processing framework
• Low latency: processes one tuple at a time • Trident mode uses micro-batches
https://storm.apache.org/
Norikra • Schema-less CEP engine for stream processing
• Uses SQL-like queries via Esper EPL • Not distributed, unlike Storm, for now
Great! We can get insight in both streaming and batch ways :)
One more thing: we can make data transfer more reliable across multiple data streams with a distributed queue.
Apache Kafka • Distributed messaging system • Producer - Broker - Consumer pattern • Pull model, replication, etc…
[Diagram: apps push messages to the Kafka broker; consumers pull from it]
Push vs Pull
• Push: • Easy to transfer data to multiple destinations • Hard to control the flow rate across multiple streams
• Pull: • Easy to control the flow rate • Consumers must be managed correctly (a minimal sketch follows below)
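To make the pull model concrete, a minimal sketch with the kafka-python package (broker address, topic and message are illustrative): the producer pushes to the broker, and each consumer pulls at its own pace instead of being pushed to.

from kafka import KafkaConsumer, KafkaProducer

# Producer: push events to the broker, not to consumers directly
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("access_log", b'{"code": 200, "method": "GET"}')
producer.flush()

# Consumer: pull from the broker at its own pace; a slow consumer just lags
# behind on its offset instead of back-pressuring the producer
consumer = KafkaConsumer("access_log", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)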
This is a modern analytics platform
Seems complex and hard to maintain? Let’s use useful services!
Amazon Redshift • Parallel RDBMS on AWS
• Re-use traditional parallel RDBMS know-how • Scaling is easier than with traditional systems
• Combining it with Amazon EMR is popular: 1. Store data into S3 2. EMR processes the S3 data 3. Load the processed data into Redshift (a hedged sketch follows below)
• EMR provides the Hadoop ecosystem
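A hedged sketch of steps 1 and 3 of that flow from Python, with boto3 for S3 and psycopg2 for the Redshift COPY (bucket, table, IAM role and cluster endpoint are all made up; step 2, the EMR job, is left out):

import boto3
import psycopg2

# 1. Store data into S3
s3 = boto3.client("s3")
s3.upload_file("access_log.csv", "my-analytics-bucket", "raw/access_log.csv")

# 2. ...an EMR job would read raw/ and write processed/ here...

# 3. Load the processed data into Redshift with COPY
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="admin", password="...")
cur = conn.cursor()
cur.execute("""
    COPY access_log
    FROM 's3://my-analytics-bucket/processed/'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
    FORMAT AS CSV;
""")
conn.commit()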
Using AWS Services
Google BigQuery • Distributed query engine and scalable storage
• Tree model, columnar storage, etc… • Storage is separated from workers
• High performance queries on Google's infrastructure • Lots of workers • Storage / IO layer on Colossus
• You can't tune parallel RDBMS properties like distkey, but it works well in most cases (a hedged sketch follows below)
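A hedged sketch with the current google-cloud-bigquery Python client (the client library looked different back in 2015; project, dataset and table names are illustrative): you only submit SQL, there is no distkey/sortkey to manage.

from google.cloud import bigquery

# Assumes GCP credentials are configured in the environment
client = bigquery.Client(project="my-project")

query_job = client.query("""
    SELECT method, COUNT(*) AS hits
    FROM `my-project.logs.access_log`
    GROUP BY method
    ORDER BY hits DESC
""")
for row in query_job.result():
    print(row.method, row.hits)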
BigQuery architecture
Using GCP Services
Treasure Data • Cloud based end-to-end data analytics service
• Hive, Presto, Pig and Hivemall over one big repository • Lots of ingestion and output options, scheduling, etc… • No stream processing for now
• The service concept is a Data Lake • JSON based schema-less storage
• The execution model is similar to BigQuery • Storage is separated from workers • You can't specify parallel RDBMS properties
Using Treasure Data Service
Resource Model Trade-off
Resource model                  Pros                                           Cons
Fully guaranteed                Stable execution, easy to control resources   No boost mechanism
Guaranteed with multi-tenancy   Stable execution, good scalability            Less controllable resources
Fully multi-tenanted            Boosted performance, great scalability        Unstable execution
MS Azure also has useful services: DataHub, SQL DWH, DataLake, Stream Analytics, HDInsight…
Use service or build a platform?
• You should consider using a service first • AWS, GCP, MS Azure, Treasure Data, etc… • The important factor is the data analytics, not the platform
• Do you have enough resources to maintain it?
• If a specific analytics platform is a differentiator for you, building a platform is better • You can use state-of-the-art technologies that are hard to implement on existing platforms
Conclusion • There are many software products and services for data analytics
• Lots of trade-offs: performance, complexity, connectivity, execution model, etc
• SQL is a primary language for data analytics
• Focus on your goal! • Is the data analytics platform your business core?
If not, consider using services first.
Cloud service for entire data pipeline!
Appendix
Apache Spark • Another distributed computing framework
• Mainly for in-memory computing with DAGs • Clean RDD and DataFrame based APIs
• Combination with Hadoop is popular
http://slidedeck.io/jmarin/scala-talk
Apache Flink • Streaming based execution engine
• Supports batch and pipelined processing • Hadoop and Spark are batch based
https://ci.apache.org/projects/flink/flink-docs-master/
Batch vs Pipelined
[Diagram: Batch (staged) vs pipelined execution. Batch: tasks run stage by stage (stage1 → stage2 → stage3), writing data to disk and waiting between stages. Pipelined: all stages are pipelined with memory-to-memory data transfer (disk is used only if needed) ✓ No wait time ✓ Fault tolerance with checkpointing]
Visualization • Tableau
• Popular BI tool in many areas • Awesome GUI, easy to use, lots of charts, etc
• Metric Insights • Dashboard for many metrics • Scheduled query, custom handler, etc
• Chartio • Cloud based BI tool
How do we manage job dependencies? We want to run Job X after Job A and Job B have finished.
Data pipeline tool • There are some important features
• Manage job dependency • Handle job failure and retry • Easy to define topology • Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
Luigi
• Python module for building job pipelines • Write Python code and run it
• A task is defined as a Python class • Easy to manage with a VCS
• Needs some extra tooling • Scheduled jobs, job history, etc…
import luigi

class T1(luigi.Task):
    def requires(self):
        return []  # dependencies (other Task instances)
    def output(self):
        return luigi.LocalTarget("t1_result.txt")  # store result (target name illustrative)
    def run(self):
        with self.output().open("w") as f:  # task body
            f.write("done")
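Assuming the class above is saved as tasks.py (file name illustrative), it can be run with Luigi's built-in local scheduler:

python tasks.py T1 --local-scheduler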
Airflow • Python and DAG based workflow
• Write Python code, but it is for defining a DAG • A task is defined by an Operator
• There are good features • Management web UI • Task information is stored in a database • Celery based distributed execution
# Pseudo-code from the slide; in practice you use a concrete Operator
# subclass (e.g. BashOperator) and give each task a task_id
dag = DAG('example')
t1 = Operator(..., dag=dag)
t2 = Operator(..., dag=dag)
t2.set_upstream(t1)  # t2 runs after t1