Treasure Data Exciting Coding!
www.treasuredata.com
Nov 2013
Presented by
Masahiro Nakagawa, Senior Software Engineer
1
• Masahiro Nakagawa
– @repeatedly
– [email protected] or d@
• Treasure Data, Inc
– Senior Software Engineer
• Fluentd / Client libraries / etc...
– Since 2012/11
• Open Source projects
– D Programming Language
– MessagePack: D, Python, etc…
– Fluentd: Core, Mongo, Logger, etc…
– Etc…
Who are you
2
Board Meeting Presentation
August 15th, 2013 - 3:30PM PDT
Presented by
Hironobu Yoshikawa – CEO
Kazuki Ohta – CTO
Rich Ghiossi – VP, Marketing
Keith Goldstein – VP, Sales
Kengo Hirouchi – Director, Japan
Ankush Rustagi – Director, Marketing
Company & Service Introduction
3
• Founded 2011 in Mountain View, CA
– The first cloud service for the entire data pipeline
– Including: Acquisition, Storage, & Analysis
• Provide a “Cloud Data Service”
– Fast Time to Value
– Cloud Flexibility and Economics
– Simple and Well Supported
• Treasure Data has 100+ customers in production
– Incl. Fortune 500 companies
– 500+ Billion new records / month
– Around 2 Trillion records under management
– Variety of use cases and verticals
Company Background
4
The Treasure Data Team
Hiro Yoshikawa – CEO: Open source business veteran
Kaz Ohta – CTO: Founder of world’s largest Hadoop Group
Jeff Yuan – Director, Engineering: LinkedIn, MIT / Michael Stonebraker Lab
Keith Goldstein – VP Sales & Bus Dev: VP of Bus Dev from Tibco and Talend
Rich Ghiossi – VP Marketing: VP of Marketing from ParAccel
Notable Investors
Othman Laraki: Ex-VP of Growth at Twitter
Jerry Yang: Founder of Yahoo!
Yukihiro “Matz” Matsumoto: Creator of “Ruby” programming language
James Lindenbaum: Founder of Heroku
• Lots of companies today produce Big Data by having “New Data Sources” (Sensor, Weblog, etc.)
– But few have the resources to build a Big Data Analytics system
• 60-70% of a company’s Big Data time & budget is consumed by:
– Infrastructure setup & Maintenance
– Building Collection & Storage Flows
– Hiring/Training Hadoop Expertise
• On average, it takes 6 months to get a Hadoop environment into production
Problem Statement
5
BI Tools: Tableau, Metric Insights, QlikView, Excel, etc.
Treasure Data Service: Overview
9
Data sources: Web logs, App logs, Sensor, CRM, ERP, RDBMS
Acquire: Treasure Agent (Streaming Log Collector, JSON); Bulk Import (Parallel Upload from CSV, MySQL, etc.)
Store: Treasure Data Cloud (Flexible, Scalable, Columnar Storage)
Analyze: BI Connectivity (REST API, SQL, Pig, JDBC / ODBC); Result Push (REST API, SQL, Pig) to Dashboards, Custom App, Local DB, FTP Server, etc.
Value: Time to Value, Economy & Flexibility, Simple & Supported
10
Our Value Propositions
• Faster time to value
On-demand cloud infrastructure & versatile streaming data collection agent
– Instantly provision a fully tuned & managed infrastructure
– Go live into production on average in 14 days (collection, analytics, & BI)
• Cloud flexibility and economics
Fraction of the cost of traditional solutions by leveraging cloud storage and processing, which scales to meet your needs
– Leverage the cost-advantage of the cloud
– Leverage the elasticity of the cloud – scale on demand
– Predictable monthly subscription fee
– No upfront costs & no long-term commitment
• Simple and well supported
We are passionate about simplicity, and customer support excellence
– Focus your time on analyzing your data
– Rely on us to keep your data secure & online
– We love making customers successful & building long-term relationships
Initial Setup & Onboarding – Two Weeks
11
1. Data Collection 2. Data Storage
3. Data Analysis 4. Service & Support
• Setup, tuning, and monitoring of Treasure Agent
• Embed Treasure Agent code into applications
• Basic log templates (register, pay, login, etc.)
• Basic KPI queries (DAU, MAU, ARPU, etc.)
• Setup dashboards with basic KPIs
• Training on creating customized reports and ad-hoc querying
• Assigned a dedicated technical account manager
• Real‐time support via email, online chat, and call
12
Solutions Accelerators
Treasure Data Platform
Out‐of‐the Box Reporting
Configured Treasure Agent
Solution Components:
- Treasure Data Platform
- Event Collection Template
- Pre-configured Treasure Agent Configuration
- BI Dashboard with KPIs
…
Treasure Data Platform: Architecture Overview
14
Treasure Data Cloud
Data Acquisition – Streaming Capture
15
# Application Code
...
# Post event to Treasure Data
TD.event.post('access', {:uid=>123})
...
Treasure Data Library: Java, Ruby, PHP, Perl, Python, Scala, Node.js
Application Server
Treasure Agent (local)
• Automatic Micro-batching
• Local buffering Fall-back
• Network Tolerance
Open‐Sourced as Fluentd Project ( http://fluentd.org/ )
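The micro-batching and local-buffering behaviour listed for the local agent can be illustrated with a small sketch. This is not Treasure Agent's or Fluentd's actual implementation; the `MicroBatcher` class, its `batch_size` parameter, and the `uploader` callable are hypothetical names chosen for illustration. The idea shown is only that events accumulate locally, are sent in batches, and stay buffered when the network is unavailable.

```ruby
# Hypothetical micro-batching buffer (illustrative only, not the td-agent API).
class MicroBatcher
  attr_reader :pending

  # uploader: any callable that takes an array of events and raises on failure.
  def initialize(batch_size:, uploader:)
    @batch_size = batch_size
    @uploader = uploader
    @pending = [] # local buffer: events wait here until a flush succeeds
  end

  # Queue one event; flush automatically once a full batch has accumulated.
  def post(tag, record, time: Time.now.to_i)
    @pending << { 'tag' => tag, 'time' => time, 'record' => record }
    flush if @pending.size >= @batch_size
  end

  # Try to upload everything buffered. On failure the events stay in the
  # buffer so that a later flush can retry them.
  def flush
    return true if @pending.empty?
    @uploader.call(@pending.dup)
    @pending.clear
    true
  rescue StandardError
    false
  end
end
```

An application would call `post('access', { 'uid' => 123 })` as on the slide; if the uploader raises while the network is down, the events remain in `pending` and go out on the next successful `flush`.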
Data Acquisition – Bulk Loader
16
Treasure Data Cloud
Sources: RDBMS, App, SaaS, FTP
Supported inputs: CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP
Bulk Loader steps: Prepare, Upload, Perform, Commit
Data Storage
17
Treasure Data Cloud
• Stored “schema-less” as JSON
– Schema can be applied/updated AFTER storage
• Compressed & columnar format
– For higher query performance
• Optimized for time-based filtering
• Quickly scale-up processing power
– WITHOUT reloading/redistributing the data
Default (schema-less):
time        v
1384160400  {"ip":"135.52.211.23", "code":"0"}
1384162200  {"ip":"45.25.38.156", "code":"-1"}
1384164000  {"ip":"97.12.76.55", "code":"99"}
SELECT v['ip'] as ip, v['code'] as code …

Schema applied:
time        ip : string     code : int
1384160400  135.52.211.23   0
1384162200  45.25.38.156    -1
1384164000  97.12.76.55     99
SELECT ip, code …
~30% Faster
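As a rough illustration of applying a schema after storage, the sketch below projects the schema-less rows from the slide into typed columns at read time. The `apply_schema` function and the `CASTS` table are hypothetical, and the sketch says nothing about how the columnar storage is actually implemented; it only shows that the stored JSON does not need to change when a schema is added.

```ruby
require 'json'

# Rows as stored by default: a time column plus a schema-less JSON value "v".
ROWS = [
  [1384160400, '{"ip":"135.52.211.23", "code":"0"}'],
  [1384162200, '{"ip":"45.25.38.156", "code":"-1"}'],
  [1384164000, '{"ip":"97.12.76.55", "code":"99"}']
].freeze

# Hypothetical casts for the two column types used on the slide.
CASTS = {
  'string' => ->(value) { value.to_s },
  'int'    => ->(value) { Integer(value) }
}.freeze

# Project schema-less rows into typed columns without modifying the stored data.
def apply_schema(rows, schema)
  rows.map do |time, json|
    v = JSON.parse(json)
    typed = schema.map { |column, type| [column, CASTS.fetch(type).call(v[column])] }
    { 'time' => time }.merge(typed.to_h)
  end
end

typed_rows = apply_schema(ROWS, { 'ip' => 'string', 'code' => 'int' })
```

Changing the schema later (for example, treating `code` as a string again) only means calling `apply_schema` with a different mapping; `ROWS` is never rewritten.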
Data Analysis
18
Treasure Data Cloud
Scripted Processing (Pig):
- DataFu (LinkedIn)
- Piggybank (Apache)
Heavy Lifting SQL (Hive):
- Hive’s Built-in UDFs
- TD Added Functions:
  - Time Functions
  - First, Last, Rank
  - Sessionize
JDBC Connectivity:
- Custom Java Apps
- Standards-based
- BI Tool Integration
Tableau ODBC connector
- Leverages Impala
Push Query Results:
- MySQL, PostgreSQL
- Google Spreadsheet
- Web, FTP, S3
- Leftronic, Indicee
- Treasure Data Table
Interactive SQL
- Treasure Query Accelerator (Impala)
Scheduled Jobs
- SQL, Pig Scripts
- Data Pushes
REST API
Treasure Data: General Use Cases
19
20
A case: “14 Days” from Signup to Success
1. Europe’s largest mobile ad exchange.
2. Serving >60 billion impressions/month for >30,000 mobile apps (Q4 2013)
3. Immediate need for analytics infrastructure: ASAP!
4. With TD, MobFox got into production in only 14 days, with one engineer.
"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."
Julian Zehetmayr, CEO & Founder
21
A case: “Replace” in-house Hadoop with TD
1. Global “Hulu” - Online Video Service with millions of users
2. Video content is distributed in over 150 languages.
3. Had a hard time maintaining their Hadoop cluster
4. With TD, Viki deprecated their in-house Hadoop cluster and now uses their engineers for the core business.
"Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service."
Huy Nguyen, Software Engineer
22
A case: Treasure Data with BI Tool (Tableau)
1. World’s largest Android application market
2. Serving >3 billion app downloads for >100 million users
3. Only one engineer managing the data infrastructure
4. With TD, the data engineer can focus on analyzing data with their existing BI tool
"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."
Simon Dong, Principal Architect - Data Engineering
Treasure Data Platform: Fluentd Overview
23
• Open sourced log collector written in Ruby
– Easy to use, reliable, and performant
– Streaming event processing
• Uses the RubyGems ecosystem to distribute plugins
What is Fluentd?
24
fluentd.org
Data processing pipeline
26
Collect Store Process Visualize
Data source
Reporting Monitoring
Important, but no de facto middleware!
Fluentd general example
27
tail
insert
event buffering
127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
...
Fluentd
Web Server
2012-02-04 01:33:51
apache.log
{ "host": "127.0.0.1", "method": "GET", ...}
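The step from a tailed access-log line to a structured event (time, tag, record) might look roughly like the sketch below. It is not Fluentd's built-in Apache parser: the `LINE` pattern and `parse_line` function are hypothetical, the request tail after `GET /` is an assumed completion of the line elided on the slide, and timestamps are assumed to be UTC because the slide shows no timezone.

```ruby
require 'time'

# Simplified pattern for the access-log lines on the slide. Everything after
# the request path (protocol, status, size) is optional and ignored here.
LINE = /\A(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*"/

# Turn one tailed line into [tag, time, record], or nil when it does not match.
def parse_line(line, tag: 'apache.log')
  m = LINE.match(line)
  return nil unless m
  # Assumption: the log carries no timezone, so treat the timestamp as UTC.
  time = Time.strptime("#{m[:time]} +0000", '%d/%b/%Y:%H:%M:%S %z').to_i
  [tag, time, { 'host' => m[:host], 'method' => m[:method], 'path' => m[:path] }]
end

# Hypothetical complete line; the slide truncates it after "GET /".
event = parse_line('127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1" 200 777')
```

Returning `nil` for lines that do not match lets the caller decide whether to drop or log them, rather than raising in the middle of a tail loop.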
Pluggable Architecture
28
Input: Forward, HTTP, File tail, dstat, ...
Engine
Buffer: File, Memory
Output: Forward, File, MongoDB, ...
Output: rewrite, ...
Pluggable: Input, Buffer, Output
Resolve your requirements by writing a plugin
29
Sources: Access logs (Apache), App logs (Frontend, Backend), System logs (syslogd), Databases
Fluentd: filter / buffer / routing
Destinations: Alerting (Nagios), Analysis (MongoDB, MySQL, Hadoop), Archiving (Amazon S3)
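The pluggable routing idea can be sketched as a tiny engine that hands each event to the first output whose tag pattern matches. The `Engine` and `MemoryOutput` classes below are hypothetical and are not Fluentd's plugin API; a real plugin would subclass Fluentd's own input and output base classes. Glob matching via `File.fnmatch` is also a simplification of Fluentd's tag patterns.

```ruby
# Hypothetical output plugin that simply keeps events in memory.
class MemoryOutput
  attr_reader :events

  def initialize
    @events = []
  end

  def emit(tag, time, record)
    @events << [tag, time, record]
  end
end

# Hypothetical mini-engine: routes events to outputs by tag.
class Engine
  def initialize
    @routes = [] # [pattern, output] pairs, checked in registration order
  end

  # Register an output for tags matching a glob such as "apache.*".
  def match(pattern, output)
    @routes << [pattern, output]
  end

  # Deliver an event to the first output whose pattern matches the tag.
  # Returns false when no output is registered for the tag.
  def emit(tag, time, record)
    _, output = @routes.find { |pattern, _| File.fnmatch(pattern, tag) }
    return false unless output
    output.emit(tag, time, record)
    true
  end
end
```

Supporting a new destination then means writing one more class with an `emit` method and registering it with `match`; the engine itself does not change.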
• Open sourced distribution package of Fluentd
– ETL part of Treasure Data
– deb / rpm / homebrew
• Including useful components
– Ruby, jemalloc, fluentd
– 3rd party gems: td, mongo, webhdfs, etc…
– Init script
Treasure Agent (td-agent)
30
http://packages.treasure-data.com/
Treasure Data Platform: Backend Overview
32
• RDS
– Store user information, job, status, etc…
– Queue Worker / Scheduler
• EC2
– API Server, Hadoop Cluster, Job Worker / Scheduler
• S3
– Columnar storage
  • Realtime / Archive storage
  • MessagePack columnar
• ELB
AWS components
33
Plazma (Hadoop, Storage, Queue and Workers)
34
Components: Frontend, Queue, Worker, Hadoop
Applications push metrics to Fluentd (via local Fluentd)
Fluentd sums up data per minute (partial aggregation)
Librato Metrics for realtime analysis
Treasure Data for historical analysis
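The per-minute partial aggregation step can be sketched as follows. The `[time, name, value]` event shape and the `aggregate_by_minute` function are assumptions made for illustration, not the configuration TD actually runs; the point is only that raw metric events are summed into one value per metric per minute before being forwarded.

```ruby
# Sum metric values into one-minute buckets (partial aggregation).
# Each event is assumed to be [unix_time, metric_name, numeric_value].
def aggregate_by_minute(events)
  sums = Hash.new(0)
  events.each do |time, name, value|
    minute = time - (time % 60) # floor the timestamp to the start of its minute
    sums[[minute, name]] += value
  end
  sums.map { |(minute, name), total| { 'time' => minute, 'name' => name, 'sum' => total } }
end
```

Forwarding one record per metric per minute keeps the volume sent downstream small, while the totals can still be summed again into hourly or daily figures.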
Treasure Data: Development Philosophy
35
• TD prefers engineers who contribute to OSS products
– MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc.
– https://github.com/treasure-data?tab=members
• Reasons
– Fixing & improving other people’s code is crucial for a distributed team.
– TD’s engineering workflow is very similar to an OSS project workflow.
– A+ OSS engineers will bring in other A+ OSS engineers!
Open-Source Culture
36
• OSS Everything on the Client Side
– http://github.com/treasure-data/
– http://fluentd.org/
• TD is helping the world to collect more data in an analytics-ready format
• 2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) are using it as an OSS product. 3-4% of the users are TD’s customers.
• We also leverage other OSS products as much as possible.
• Closed Source on the Cloud Side
– The core value must be proprietary to sustain the business.
– Individual components can be OSS, but most of the system will remain proprietary to create the value chain.
OSS vs. Proprietary
37
• Solving the Customer Pain is the #1 Priority
– Developers directly provide support for customers, spending 30%-40% of their development time talking with customers
– Developers are the BEST people to come up with the solution.
– # of code lines != value
• Suffering Oriented Development
– First, make it possible
– Then, make it beautiful
– Then, make it fast
• The Largest Customer Pain is NOT always applicable to other customers.
– Need to be brave to say NO. NO. NO. NO. NO….
• TD doesn’t have a 1-year Product Roadmap. Having a 3-month roadmap accelerates development, and the other teams (marketing / sales), too.
How to decide Product Roadmap?
38
• 13 Engineers as of Nov. 2013
– 5 Engineers in Tokyo, Japan
– 8 Engineers in Mountain View, USA
– 40% of the whole company
• Asynchronous Communication
– Use async communication tools as much as possible: Chat, JIRA, Email, Github, etc.
– Use video conferencing for weekly sync-up
• English is the primary communication language
– If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.
Distributed Team (International)
39
• Predictable Deployment Cycle
– Weekly Deployment
• Continuous Deployment didn’t fit our B2B SaaS application; our customers want predictability of changes.
• As a distributed team, it’s hard to track every change + deployment status.
– Track every change in JIRA; the QA engineer is responsible for the deployment too.
• Continuous Deployment for Staging
– Single branch, always automatically deployed to the staging environment
– Monitoring is continuous testing
• On-Call Alert Schedule, based on the Timezone
– No need to get up around 3am
Distributed Team (Deployment)
40
• Use Cloud Services as Much as Possible
– Don’t hire people, use cloud services.
– Outsource everything, except your core value.
– Developers tend to forget their own cost. If you spend 1 hour, it already costs around $50 as a company.
• Examples
– EC2 (IaaS)
– CopperEgg (Infrastructure Monitoring)
– NewRelic (Application Performance Management)
– Hosted Chef (Configuration Management)
– Librato Metrics (Application Metrics)
– Pager Duty (Alerting)
– Logentries (Log Search)
– CircleCI, TravisCI (Continuous Integration)
– HipChat, JIRA, Confluence (Development Tool)
– Etc….
Leverage Cloud Services
41
Treasure Data: Conclusion
42