Post on 14-Apr-2017
Qubole Click to Query your Big Data on the Cloud
A company like Facebook provides data infrastructure as a service (created by the founders of Qubole):
- More than 30% of the company uses this infrastructure every month
- Users range from developers and analysts to business analysts and business users
- Manages over an Exabyte of data
- Has made the company more data driven and agile with data use
- It took the founders a team of over 30 people to create this infrastructure, and the team currently managing it has more than 100 people
Operations Analyst
Marketing Ops
Analyst
Data Architect
Business Users
Product Support
Customer Support
Developer
Sales Ops
Product Managers
Data Infrastructure
QUBOLE VISION: DATA FOR ALL, CLICK-TO-QUERY
~ 170+ PB of data processed per month
10 – 3000 node clusters on a daily basis
300,000 machines per month
20,000 jobs on a daily basis
AGILITY, TIME-TO-INSIGHT, CLICK-TO-QUERY
CONFIDENTIAL. SUBJECT TO NDA PROVISIONS.
Industries and Use Cases
Media & Advertising
Oil & Gas
Retail
Life Sciences
Financial Services
Security
Social Networking & Gaming
Targeted Advertising
Seismic Analysis
Image and Video Processing
Customer Profile
Transaction Analysis
Genome Analysis
Monte Carlo Simulations
Risk Analysis
Fraud Detection
Anti-virus
Image Recognition
In-game Metrics
Usage Analysis
User Demographics
Predefined Reporting
Ad Hoc Analytics
Statistical Analytics
Predictive Analytics
Machine Learning
MapReduce
Streaming
Workload Classifications
Match Your Processing Engines to Your Workload Parameters
SQL
Data Pipeline
MapReduce
Spark
NoSQL Store
- 10-1000+ nodes in <5 min
- Flexible: different nodes for different loads
- Data For All: usable by many
- Low TCO: only ON when needed

- Extensive planning required: inflexible and static
- Not built for cloud
- Need Hadoop experts to install, maintain and use
- High TCO: always ON
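The TCO contrast between "always ON" and "only ON when needed" comes down to simple arithmetic on node-hours. The sketch below illustrates it with entirely hypothetical figures (node count, hourly rate and busy hours are assumptions chosen for illustration, not numbers from this deck or from any price list).

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(nodes: int, hourly_rate: float, hours_on: float) -> float:
    """Cost of running `nodes` machines for `hours_on` hours at `hourly_rate` each."""
    return nodes * hourly_rate * hours_on


# Hypothetical inputs, for illustration only.
NODES = 100          # cluster size (assumed)
HOURLY_RATE = 0.50   # USD per node-hour (assumed)
BUSY_HOURS = 8 * 22  # cluster needed 8 hours a day, 22 days a month (assumed)

always_on = monthly_cost(NODES, HOURLY_RATE, HOURS_PER_MONTH)
on_demand = monthly_cost(NODES, HOURLY_RATE, BUSY_HOURS)
savings = 1 - on_demand / always_on

print(f"always on: ${always_on:,.0f}/month")
print(f"on demand: ${on_demand:,.0f}/month")
print(f"savings:   {savings:.0%}")
```

Under these assumptions the saving is driven entirely by the ratio of busy hours to total hours; a cluster that is busy around the clock would see no benefit from this effect alone.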
[Architecture diagram: Qubole deployment on AWS. Labels recovered from the figure:]
- User access: Qubole UI via browser, SDK, ODBC, REST API (HTTPS)
- Accounts shown: Qubole's AWS account and the customer's AWS account
- Ephemeral web tier: web servers, encrypted result cache (a)
- RDS: Qubole user and account configurations (encrypted credentials)
- Default Hive metastore; (optional) custom Hive metastore
- Ephemeral Hadoop clusters, managed by Qubole: master and slave nodes with encrypted HDFS (b), reached over SSH
- Amazon S3 with S3 Server Side Encryption (c); no HDFS load
- (optional) Other RDS, Redshift
- Data flow within the customer's AWS account

Encryption options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
BUILT FOR CLOUD, PERFORMANCE, COST-EFFICIENT
Ephemeral Clusters:
- Auto-scaling: both up and down
- Spot instances: data management and back-fill
- VMs deployed with awareness of time
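To make "auto-scaling, both up and down" concrete, here is a minimal sketch of a scaling decision. It is not Qubole's algorithm; the function name `target_nodes`, its parameters and the "tasks per node" heuristic are all assumptions made for illustration.

```python
import math


def target_nodes(pending_tasks: int, tasks_per_node: int,
                 min_nodes: int, max_nodes: int) -> int:
    """Return the cluster size a simple load-based policy would aim for.

    Hypothetical policy: request enough nodes to run all pending tasks at
    once, then clamp the result to the configured [min_nodes, max_nodes] range.
    """
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# The same policy scales up under load and back down when the queue drains.
for pending in (0, 250, 5000):
    print(pending, "pending tasks ->", target_nodes(pending, 10, 2, 100), "nodes")
```

A real scheduler would also have to account for the other bullets above, such as replacing reclaimed spot instances and timing node removal, which this sketch leaves out.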
Demo
Why Qubole?
“Qubole has enabled more users within Pinterest to get to the data and has made the data platform lot more scalable and stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because of stability and rapidly expanded big data usage by giving access to data to users beyond developers.
Rapid expansion of big data beyond developers (240 users in a 600-person company)
Use Cases
User and Query Growth
Rapid expansion in use cases, ranging from ETL and search to ad hoc querying and product analytics
Rock-solid infrastructure sees 50% fewer failures than Amazon Elastic MapReduce
Enterprise scale processing and data access
Why Qubole?
“We needed something that was reliable and easy to learn, setup, use and put into production without the risk and high expectations that comes with committing millions of dollars in upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data on the cloud (from internal Oracle clusters) because getting to analysis was much quicker than operating the infrastructure themselves. Qubole is used to answer client queries and power client dashboards.
Use Cases
[Chart: # Commands Per Month, Aug-13 through Feb-14; y-axis: number of queries, 0 to 5000]
Segment audiences based on their behavior, including such topics as user pathway and multi-dimensional recency analysis
Build customer profiles (both uni/multivariate) across thousands of first party (i.e., client CRM files) and third party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper funnel prospecting on lower funnel remarketing media strategies