Treasure Data and OSS

Masahiro Nakagawa Feb 7, 2015 dots. Summit 2015 Treasure Data and OSS

Transcript of Treasure Data and OSS

Page 1: Treasure Data and OSS

Masahiro Nakagawa, Feb 7, 2015

dots. Summit 2015

Treasure Data and OSS

Page 2: Treasure Data and OSS

Who are you?

> Masahiro Nakagawa > github/twitter: @repeatedly

> Treasure Data, Inc. > Senior Software Engineer > Fluentd / td-agent developer

> I love OSS :) > D language - Phobos committer > Fluentd - Main maintainer > MessagePack / RPC - D and Python (only RPC) > The organizer of several meetups (Presto, DTM, etc) > etc…

Page 3: Treasure Data and OSS

Company background

•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis

•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported

•  Treasure Data has 100+ customers in production
–  Incl. Fortune 500 companies
–  400k new records / second
–  Almost 9 trillion records loaded
–  Variety of use cases and verticals

The Treasure Data Team

Hiro Yoshikawa – CEO: Open source business veteran

Kaz Ohta – CTO: Founder of the world’s largest Hadoop group

Sada Furuhashi – Software Architect: MessagePack / Fluentd author

Notable Investors

Othman Laraki Ex-VP of Growth at Twitter

Jerry Yang Founder of Yahoo!

Yukihiro “Matz” Matsumoto Creator of “Ruby” programming language

James Lindenbaum Founder of Heroku

Sierra Ventures - Tim Guleri Leading venture capital firm in Big Data

Page 4: Treasure Data and OSS

TD Service Architecture

Time to Value

Send query result Result Push

Acquire Analyze Store

Plazma DB Flexible, Scalable, Columnar Storage

Web Log

App Log

Sensor

CRM

ERP

RDBMS

Treasure Agent(Server) SDK(JS, Android, iOS, Unity)

Streaming Collector

Batch / Reliability

Ad-hoc / Low latency

KPI$

KPI Dashboard

BI Tools

Other Products

RDBMS, Google Docs, AWS S3, FTP Server, etc.

Metric Insights

Tableau, Motion Board, etc.

POS

REST API, ODBC / JDBC (SQL, Pig)

Bulk Uploader

Embulk, TD Toolbelt

SQL-based query

@AWS or @IDCF

Connectivity

Economy & Flexibility Simple & Supported

Page 5: Treasure Data and OSS

Data Acquisition

Page 6: Treasure Data and OSS

Log collecting in TD

> Treasure Agent > Fluentd based log collector

> Embulk > JavaScript SDK > Mobile SDK (iOS, Android, Unity)

Page 7: Treasure Data and OSS

Structured logging

Reliable forwarding

Pluggable architecture

http://fluentd.org/

Page 8: Treasure Data and OSS

Fluentd

> Data collector for unified logging layer > Streaming data transfer based on JSON > Written in Ruby

> Various gem-based plugins > http://www.fluentd.org/plugins

> Working in production > http://www.fluentd.org/testimonials
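A minimal sketch of what "structured logging" means here, assuming an event is modeled as a tag, a timestamp, and a JSON-style record. The helper `to_event` and the `key=value` line format are hypothetical and only for illustration; Fluentd itself does this step with its parser plugins.

```python
import json

def to_event(tag, line):
    """Parse a 'key=value key=value' log line into a (tag, time, record) event.

    Both the line format and this helper are assumptions for illustration.
    """
    fields = dict(pair.split("=", 1) for pair in line.split())
    time = int(fields.pop("time"))  # the timestamp travels next to the record
    return tag, time, fields

tag, time, record = to_event("web.access", "time=1423267200 path=/index status=200")
print(tag, time, json.dumps(record, sort_keys=True))
```

Once every source emits events in one shape like this, later stages can route on the tag and filter on record fields without knowing the original log format.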

Page 9: Treasure Data and OSS

Data Analytics Flow

Collect Store Process Visualize

Data source

Reporting

Monitoring

Page 10: Treasure Data and OSS

Data Analytics Flow

Store Process

Cloudera

Hortonworks

Treasure Data

Collect Visualize

Tableau

Excel

R

easier & shorter time

???

Page 11: Treasure Data and OSS

Divide & Conquer & Retry

[Diagram: Batch, Stream, and Other stream inputs are split into chunks; a chunk that hits an error is retried]

Page 12: Treasure Data and OSS

Core Plugins

> Divide & Conquer

> Buffering & Retrying

> Error handling

> Message routing

> Parallelism

> read / receive data > from API, database, command, etc…

> write / send data > to API, database, alert, graph, etc…
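The "Buffering & Retrying" idea can be sketched roughly as follows. This is an illustration under assumptions, not Fluentd's implementation: `flush_with_retry`, its parameters, and the flaky destination are all hypothetical names, and exponential backoff is one common way to space the retries.

```python
import time

def flush_with_retry(chunk, send, max_retries=5, base_wait=0.01, sleep=time.sleep):
    """Try to send a buffered chunk, doubling the wait after each failure.

    Returns the number of attempts used. Re-raises the last error when every
    attempt fails, so the caller can keep the chunk or route it elsewhere.
    """
    for attempt in range(1, max_retries + 1):
        try:
            send(chunk)
            return attempt
        except IOError:
            if attempt == max_retries:
                raise
            sleep(base_wait * 2 ** (attempt - 1))

# Hypothetical flaky destination: fails twice, then accepts the chunk.
delivered = []
failures = [IOError("timeout"), IOError("timeout")]

def flaky_send(chunk):
    if failures:
        raise failures.pop(0)
    delivered.append(chunk)

waits = []  # record the waits instead of sleeping, to keep the demo fast
attempts = flush_with_retry(["event-1", "event-2"], flaky_send, sleep=waits.append)
print(attempts, waits)
```

Because the chunk stays in the buffer until `send` succeeds, a temporary outage at the destination delays delivery rather than losing events.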

Page 13: Treasure Data and OSS

Architecture (v0.12 or later)

Engine (not pluggable)

Input > Forward > File tail > ...

Parser

Filter > grep > record_transformer > …

Buffer > File > Memory

Formatter

Output > Forward > File > ...
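A toy version of the Input, Filter, Buffer, Output flow, to make the data path concrete. All function names here are hypothetical; the two filters are only loosely modeled on what grep and record_transformer do, and real Fluentd plugins are written in Ruby against a different API.

```python
def grep(key, expected):
    """Filter: keep only records whose field equals the expected value."""
    def apply(record):
        return record if record.get(key) == expected else None
    return apply

def add_field(key, value):
    """Filter: add a field to each record."""
    def apply(record):
        return {**record, key: value}
    return apply

def run_pipeline(events, filters, chunk_size):
    """Pass events through the filters, buffer them, and yield chunks."""
    buffer = []
    for record in events:
        for f in filters:
            record = f(record)
            if record is None:  # dropped by a filter
                break
        if record is None:
            continue
        buffer.append(record)
        if len(buffer) == chunk_size:  # chunk is full: hand it to the output
            yield buffer
            buffer = []
    if buffer:  # flush the partial chunk that remains at the end
        yield buffer

events = [
    {"level": "error", "msg": "a"},
    {"level": "info", "msg": "b"},
    {"level": "error", "msg": "c"},
    {"level": "error", "msg": "d"},
]
filters = [grep("level", "error"), add_field("host", "web01")]
chunks = list(run_pipeline(events, filters, chunk_size=2))
print(chunks)
```

The output side only ever sees chunks, which is what lets buffering and retrying be handled in one place instead of in every output plugin.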

Page 14: Treasure Data and OSS

Before (M x N)

Page 15: Treasure Data and OSS

After (M + N)

or Embulk
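The arithmetic behind "M x N" versus "M + N", as a small sketch. The function name and the example numbers (6 sources, 4 destinations) are made up for illustration.

```python
def connector_counts(num_sources, num_destinations):
    """Count integrations needed with and without a common collector layer."""
    point_to_point = num_sources * num_destinations  # every source to every destination
    via_collector = num_sources + num_destinations   # each side talks to the collector only
    return point_to_point, via_collector

print(connector_counts(6, 4))  # (24, 10)
```

Adding a seventh source costs 4 new integrations in the first model and 1 in the second.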

Page 16: Treasure Data and OSS

Other Fluentd related OSS

> Treasure Agent > https://github.com/treasure-data/omnibus-td-agent

> Fluentd Forwarder > https://github.com/fluent/fluentd-forwarder > Simple forwarder for Windows / Leaf node

> Fluentd UI > https://github.com/fluent/fluentd-ui > Management web UI

Page 17: Treasure Data and OSS

Other OSS products

> Scribed (C++) > Developed by Facebook > No longer maintained

> Apache Flume (Java) > Mainly for Hadoop HDFS / HBase

> Logstash (JRuby) > Mainly for Elasticsearch

Page 18: Treasure Data and OSS

Embulk

> Bulk Loader version of Fluentd > Pluggable architecture

> JRuby, JVM languages (TBD) > High performance parallel processing

> Share your script as a plugin > https://github.com/embulk

http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed

Page 19: Treasure Data and OSS

HDFS

MySQL

Amazon S3

Embulk

CSV Files

SequenceFile

Salesforce.com

Elasticsearch

Cassandra

Hive

Redis

✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behaviour ✓ Idempotent retrying

Plugins Plugins

bulk load
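"Parallel execution" and "idempotent retrying" can be illustrated with a small sketch. This is not Embulk's code or API: `bulk_load`, `load_one`, and the file names are hypothetical, and the idea shown is simply that finished tasks are recorded so a retry re-runs only what failed.

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_load(tasks, load_one, done, workers=4):
    """Run unfinished tasks in parallel; `done` maps task -> result.

    Calling this again with the same `done` dict skips finished tasks, so a
    retry after a partial failure does not load the same data twice.
    Returns the list of tasks that failed in this run.
    """
    pending = [t for t in tasks if t not in done]

    def attempt(task):
        try:
            return task, load_one(task), None
        except Exception as exc:
            return task, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, pending))

    failed = []
    for task, result, error in outcomes:
        if error is None:
            done[task] = result
        else:
            failed.append(task)
    return failed

# Hypothetical loader: one file fails on the first run only.
calls = []
broken = {"part-2.csv"}

def load_one(name):
    calls.append(name)
    if name in broken:
        raise IOError("network error")
    return len(name)

files = ["part-1.csv", "part-2.csv", "part-3.csv"]
done = {}
first_failed = bulk_load(files, load_one, done)
broken.clear()  # the network problem goes away
second_failed = bulk_load(files, load_one, done)
print(first_failed, second_failed, sorted(done))
```

The second run touches only `part-2.csv`, which is the property that makes "just run it again" a safe recovery step.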

Page 20: Treasure Data and OSS

Computing Framework

Page 21: Treasure Data and OSS

3 query engines in TD

> Hive (HiveQL, Batch) > for ETL and large jobs > Hivemall for machine learning

> Pig (Pig Latin, Batch) > DataFu for data mining and statistics

> Presto (SQL, Short batch) > for Ad hoc queries

Page 22: Treasure Data and OSS

Hadoop

> Distributed computing framework > Consists of many components…

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

Page 23: Treasure Data and OSS

http://nosqlessentials.com/

Page 24: Treasure Data and OSS

http://nosqlessentials.com/

Page 25: Treasure Data and OSS

Apache Tez

> Low level framework for YARN applications > New Query Engine > Provide good IR for Hive, Pig and more

> Task and DAG based pipelining

[Diagram: a Task (Input, Processor, Output) and a DAG of tasks]

http://tez.apache.org/
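A rough sketch of "Task and DAG based pipelining": each vertex is a processor that runs once its upstream vertices have produced output. This is not the Tez API. `run_dag`, the vertex names, and the stand-in data are assumptions, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, edges):
    """Run each processor after its inputs, passing outputs along the edges.

    vertices: name -> function taking a list of upstream outputs
    edges:    name -> list of upstream vertex names
    """
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[up] for up in edges[name]]
        results[name] = vertices[name](inputs)
    return results

# Stand-in processors: two grouped inputs, a join on the key, then a sort.
vertices = {
    "group_a": lambda inputs: {"x1": 5.0, "x2": 2.0},  # key -> average
    "group_b": lambda inputs: {"x1": 3, "x2": 1},      # key -> count
    "join": lambda inputs: [
        (x, inputs[0][x], inputs[1][x]) for x in inputs[0] if x in inputs[1]
    ],
    "order_by": lambda inputs: sorted(inputs[0], key=lambda row: row[1]),
}
edges = {
    "group_a": [],
    "group_b": [],
    "join": ["group_a", "group_b"],
    "order_by": ["join"],
}
results = run_dag(vertices, edges)
print(results["order_by"])
```

The whole query is one graph here, so intermediate results are handed from vertex to vertex rather than being written out between separate jobs.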

Page 26: Treasure Data and OSS

Hive on MR vs. Hive on Tez

MapReduce Tez

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9

[Diagram: map (M) and reduce (R) task graphs for the query below. The MapReduce panel has HDFS between jobs; the Tez panel connects the tasks directly.]

Avoid unnecessary HDFS write!

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;

[Plan labels. MapReduce: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), ORDER BY. Tez: GROUP BY x, GROUP BY a.x, JOIN (a, b), ORDER BY]
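To see what the query on this slide computes, it can be tried against SQLite. This is only a sketch: SQLite stands in for Hive, the sample rows are invented, and the aliases `avg` and `cnt` plus the `AVG` spelling are assumptions about what the slide intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x TEXT, y REAL);
    CREATE TABLE b (x TEXT, y REAL);
    INSERT INTO a VALUES ('p', 1.0), ('p', 3.0), ('q', 10.0);
    INSERT INTO b VALUES ('p', 7.0), ('q', 8.0), ('q', 9.0);
""")

# Two independent GROUP BYs, one JOIN, one ORDER BY: the four plan steps.
rows = conn.execute("""
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
      ON (g1.x = g2.x)
    ORDER BY avg
""").fetchall()
print(rows)
```

Each of the four steps would be a separate job boundary on MapReduce, which is where the extra HDFS writes come from.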

Page 27: Treasure Data and OSS

Other OSS products

> Apache Spark > Mainly for in-memory processing > Spark ecosystem is now growing

> Apache Flink > Mainly for iterative processing

> Microsoft’s Dryad > This was too early for humankind…

Page 28: Treasure Data and OSS

Presto

A distributed SQL query engine for interactive data analysis against GBs to PBs of data.

Page 29: Treasure Data and OSS

Presto overview

> Open sourced by Facebook > http://prestodb.io/ > written in Java

> Built-in useful features > Connectors > Machine Learning > Window function > Approximate query > etc…

> Used by Netflix, Dropbox, Treasure Data, Qubole, Airbnb, LINE, GREE, Scaleout, etc

Page 30: Treasure Data and OSS

HDFS

Hive

PostgreSQL, etc.

Daily/Hourly Batch

Interactive query

Commercial BI Tools

Batch analysis platform Visualization platform

Dashboard

Page 31: Treasure Data and OSS

HDFS

Hive

Daily/Hourly Batch

Interactive query

✓ Less scalable ✓ Extra cost

Commercial BI Tools

Dashboard

✓ More work to manage 2 platforms

✓ Can’t query against “live” data directly

Batch analysis platform Visualization platform

PostgreSQL, etc.

Page 32: Treasure Data and OSS

HDFS

Hive Dashboard

Presto

PostgreSQL, etc.

Daily/Hourly Batch

HDFS

Hive Dashboard

Daily/Hourly Batch

Interactive query

Interactive query

Page 33: Treasure Data and OSS

Presto

HDFS

Hive Dashboard

Daily/Hourly Batch

Interactive query

Cassandra MySQL Commercial DBs

SQL on any data sets

Commercial BI Tools

✓ IBM Cognos ✓ Tableau ✓ ...

Data analysis platform

Page 34: Treasure Data and OSS

MapReduce vs. Presto

MapReduce: map and reduce stages with disk between them

> Write data to disk

> Wait between stages

Presto: tasks connected directly

> All stages are pipelined ✓ No wait time ✓ No fault-tolerance

> memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
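The difference between the two execution models can be sketched with plain Python. This is an analogy under assumptions, not how either engine is implemented: `staged` materializes every stage before the next one starts, while `pipelined` chains generators so rows flow through all stages one at a time.

```python
def staged(rows, stages):
    """MapReduce style: each stage finishes and is materialized before the next."""
    materialized = []  # sizes of intermediate results, standing in for disk writes
    data = list(rows)
    for stage in stages:
        data = list(stage(data))
        materialized.append(len(data))
    return data, materialized

def pipelined(rows, stages):
    """Presto style: stages are chained, and nothing is stored in between."""
    stream = iter(rows)
    for stage in stages:
        stream = stage(stream)
    return stream

def keep_even(rows):
    return (r for r in rows if r % 2 == 0)

def square(rows):
    return (r * r for r in rows)

result, writes = staged(range(6), [keep_even, square])
print(result, writes)

# The pipelined version yields its first row without reading the whole input.
first = next(pipelined(range(10**9), [keep_even, square]))
print(first)
```

The trade-off matches the slide: the pipelined form has no wait time and no intermediate storage, but if a stage fails midway there is no saved intermediate result to restart from.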

Page 35: Treasure Data and OSS

Other OSS products

> Cloudera Impala > Mainly for HDFS / HBase

> Apache Drill > More flexible architecture

> Apache Tajo > For building data warehouse

Page 36: Treasure Data and OSS

Visualization

Page 37: Treasure Data and OSS

Hmm…

> There are no popular OSS products

> We don’t focus on developing a visualization tool for now

> Commercial BI tools are popular > Tableau, Motion Board, etc.

> Maybe the next presentation will cover this area in depth

Page 38: Treasure Data and OSS

Treasure Data resources

> https://github.com/treasure-data > perfectqueue, perfectsched, etc

> https://sql.treasuredata.com/ > HiveQL syntax checker

> https://examples.treasuredata.com/ > Query catalog

http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/

Page 39: Treasure Data and OSS

Check: treasuredata.com

Cloud service for the entire data pipeline