The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

© 2014 MapR Technologies 1 © 2014 MapR Technologies The Future of Hadoop: Data Agility Tomer Shiran VP Product Management, MapR Technologies Co-Founder and PMC Member, Apache Drill June 22, 2014

Transcript of The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Page 1: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

The Future of Hadoop: Data Agility

Tomer Shiran
VP Product Management, MapR Technologies
Co-Founder and PMC Member, Apache Drill

June 22, 2014

Page 2: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Data is doubling in size every two years

Page 3: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

IDC estimates that in 2020, there will be 44 zettabytes of data in the world

2011: 1.8 zettabytes

2013: 4.4 zettabytes

2020: 44 zettabytes

Source: IDC Digital Universe
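
As a rough check on these figures, the sketch below (Python, not part of the original slides) computes the doubling time implied by the IDC numbers, assuming smooth exponential growth between the quoted years. The function name is illustrative.

```python
import math

def implied_doubling_time(start_size, end_size, years):
    """Years needed for the data volume to double, assuming exponential growth."""
    growth_factor = end_size / start_size
    return years * math.log(2) / math.log(growth_factor)

# Figures quoted on the slide (IDC Digital Universe), in zettabytes.
doubling_2013_2020 = implied_doubling_time(4.4, 44.0, 2020 - 2013)
doubling_2011_2013 = implied_doubling_time(1.8, 4.4, 2013 - 2011)

print(f"2013-2020: doubles every {doubling_2013_2020:.1f} years")
print(f"2011-2013: doubles every {doubling_2011_2013:.1f} years")
```

Tenfold growth over seven years works out to a doubling time of a little over two years, which is consistent with the "doubling every two years" claim.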

Page 4: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Unstructured data will account for more than 80% of the data collected by organizations

[Chart: total data stored, structured data vs. unstructured data, 1980 to 2020]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Page 5: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Unstructured Data is Ubiquitous

Social Media

Messages

Audio

Sensors

Mobile Data

Email

Clickstream

Page 6: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Hadoop Adoption is Exploding

[Chart: job trends from Indeed.com, Jan '06 to Jan '14]

Page 7: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

The MapR Distribution for Hadoop

Best Product Exponential Growth

3X bookings Q1 ‘13 – Q1 ‘14

80% of accounts expand 3X

90% software licenses

< 1% lifetime churn

> $1B in incremental revenue generated by 1 customer

500+ Customers

Big Data

Riding the Wave with Hadoop

The Big Data Platform of Choice

Page 8: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

360° Customer View

5 PB CUSTOMER DATA

Page 9: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

1.2B PEOPLE

Largest Biometric Database in the World

Page 10: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

The Future of Hadoop: Data Agility

Page 11: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Distance to Data

Existing approaches require a middleman (IT)

MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data

Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data

Page 12: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Real-World Data Modeling and Transformations

Page 13: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

“We just can’t continue to manage data the ‘old way’ by throwing more DBAs at the problem and waiting for data to be accessible.” – Fortune 100 CIO

“Our data and business needs are constantly changing. Traditional data management processes simply don’t work in this new world.” – Large Web 2.0 Hadoop user

“If source data is not easy to access, self-service BI won’t happen.” – TDWI

Page 14: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Distance to Data

Existing approaches require a middleman (IT)

MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data

Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data

Data Agility: Business (analysts, developers) -> Data

Page 15: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Why Improve Distance to Data?

• Enable rapid data exploration and application development

• IT should provide a valuable service without “getting in the way”

• Can’t add DBAs to keep up with the exponential data growth

• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

Reduce the burden on IT

Improve time to value

Page 16: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

APACHE DRILL

• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications

40+ contributors

150+ years of experience building databases and distributed systems

Page 17: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Evolution Towards Self-Service Data Exploration

Approach                                             Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS                              IT-driven                          IT-driven
Self-Service BI w/ RDBMS                             IT-driven                          Self-service
SQL-on-Hadoop                                        IT-driven                          Self-service
Self-Service Data Exploration (zero-day analytics)   Not needed                         Self-service

Page 18: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

(1) Self-Describing Data is Ubiquitous

Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
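
To make "self-describing" concrete, here is a minimal Python sketch (not from the slides) that reads the two example records and lists the fields they carry. The records are rewritten as valid JSON, since the slide omits quotes, and the helper name `field_paths` is hypothetical.

```python
import json

# The two records from the slide, written as valid JSON (the slide omits quotes).
raw_records = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]

def field_paths(value, prefix=""):
    """Yield a dotted path for every leaf field in a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix

records = [json.loads(line) for line in raw_records]

# No schema was declared anywhere: the field names come from the data itself.
discovered = sorted({path for record in records for path in field_paths(record)})
print(discovered)
```

The two records share `name` and `hobbies` but differ in `district` and `preschool`, which is exactly the situation a fixed, pre-declared table schema handles poorly.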

Page 19: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

(2) Drill’s Data Model is Flexible

[Diagram: data sources arranged along two axes of flexibility, flat to complex and fixed schema to schema-less: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]

RDBMS/SQL-on-Hadoop table

Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Page 20: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

(3) Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
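
The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Drill or the Hive Metastore work internally: the records, `DECLARED_SCHEMA`, and both function names are hypothetical.

```python
# Hypothetical records; the second one introduces a field the first lacks.
records = [
    {"name": "Michael", "age": 6},
    {"name": "Jennifer", "age": 3, "preschool": "CCLC"},
]

DECLARED_SCHEMA = {"name": str, "age": int}  # fixed before any data is read

def read_with_declared_schema(record, schema):
    """Schema declared in advance: fields outside the schema are rejected."""
    unknown = set(record) - set(schema)
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")
    return record

def discover_schema(records):
    """Schema on the fly: the schema grows as each record is read."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

rejected = []
for record in records:
    try:
        read_with_declared_schema(record, DECLARED_SCHEMA)
    except ValueError as error:
        rejected.append(str(error))

print(rejected)
print(discover_schema(records))
```

With a declared schema, the new `preschool` field blocks the second record until someone updates the schema. With discovery, the field simply appears in the result.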

Page 21: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Quick Tour: Self-Service Data Exploration with Apache Drill

Page 22: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Page 23: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Zero to Results in 2 Minutes (3 Commands)

Install

    $ tar xzf apache-drill.tar.gz

Launch shell (embedded mode)

    $ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query

    0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
    FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
    GROUP BY columns[1]
    ORDER BY incidents DESC;

Results

    +------------+------------+
    | incidents  | category   |
    +------------+------------+
    | 8372       | LARCENY/THEFT |
    | 4247       | OTHER OFFENSES |
    | 3765       | NON-CRIMINAL |
    | 2502       | ASSAULT    |
    ...
    35 rows selected (0.847 seconds)
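
For readers unfamiliar with the `columns[1]` notation, the following Python sketch (not from the slides) performs the same grouping on a few made-up rows. The sample data is hypothetical and far smaller than the real SFPD file; only the category names are taken from the results on the slide.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real SFPD file has more columns and rows.
sample_csv = io.StringIO(
    "140001,LARCENY/THEFT,Monday\n"
    "140002,ASSAULT,Monday\n"
    "140003,LARCENY/THEFT,Tuesday\n"
    "140004,NON-CRIMINAL,Tuesday\n"
    "140005,LARCENY/THEFT,Wednesday\n"
)

# Each row is a list of strings, so columns[1] is the second field,
# just as the query addresses a headerless CSV file by position.
counts = Counter(columns[1] for columns in csv.reader(sample_csv))

# most_common() sorts by count, descending, like ORDER BY incidents DESC.
for category, incidents in counts.most_common():
    print(f"{incidents:>9} | {category}")
```

The point of the slide is that no table definition or load step precedes the query: the file is addressed by path and its fields by position.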

Page 24: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Data Source is in the Query

SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2

The three parts of the name in the FROM clause, in order:

A storage engine instance
- DFS
- HBase
- Hive Metastore/HCatalog

A workspace
- Sub-directory
- Hive database

A table
- pathnames
- HBase table
- Hive table
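
The three-part name can be taken apart mechanically. The Python sketch below is illustrative only: `DataSource` and `parse_data_source` are hypothetical names, and the parser handles just the backtick-quoted, three-part form shown in the query, not Drill's full identifier syntax.

```python
from typing import NamedTuple

class DataSource(NamedTuple):
    storage_engine: str
    workspace: str
    table: str

def parse_data_source(name: str) -> DataSource:
    """Split an 'engine.workspace.`table`' name into its three parts.

    Illustrative only: assumes exactly two unquoted parts followed by a
    backtick-quoted table name, as in the query on the slide.
    """
    prefix, _, quoted = name.partition("`")
    engine, workspace = prefix.rstrip(".").split(".")
    return DataSource(engine, workspace, quoted.rstrip("`"))

source = parse_data_source("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(source)
```

Here `dfs1` names the storage engine instance, `logs` the workspace, and the quoted path the table.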

Page 25: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Query Directory Trees

# Query file: How many errors per level in Jan 2014?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;

# Query directory sub-tree: How many errors per level?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;

# Query some partitions: How many errors per level by month from 2012?

SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
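
The partition query treats directory names as data. A minimal Python sketch of that idea follows, assuming `dirs` holds the directory components of each file's path so that `dirs[1]` is the year and `dirs[2]` the month, as the slide's indices suggest. The file list, error levels, and the helper `dirs_for` are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical files under the workspace root, each with its error levels.
files = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [1, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [2, 2, 5],
    "/AppServerLogs/2014/Jan/part0001.parquet": [2, 4],
}

def dirs_for(path):
    """Directory components of a file path, indexed as on the slide."""
    return list(PurePosixPath(path).parent.parts[1:])  # drop the leading "/"

errors = Counter()
for path, error_levels in files.items():
    dirs = dirs_for(path)            # e.g. ["AppServerLogs", "2014", "Jan"]
    if int(dirs[1]) >= 2012:         # WHERE dirs[1] >= 2012
        for level in error_levels:   # GROUP BY errorLevel, dirs[2]
            errors[(level, dirs[2])] += 1

print(dict(errors))
```

Because the grouping key is the month directory alone, January 2012 and January 2014 land in the same group in this sketch. Adding `dirs[1]` to the grouping key would keep the years apart.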

Page 26: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Works with HBase and Embedded Blobs

# Query an HBase table directly (no schemas)

SELECT cf1.month, cf1.year
FROM hbase.table1;

# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users

SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
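
What the embedded-blob query does can be sketched in Python: a byte value stored in a column is decoded as JSON and then queried like any other nested record. The row layout and data below are hypothetical, `convert_from_json` is an illustrative stand-in for the SQL function, and reading `count(profile.children)` as "number of children per profile" is an assumption about the slide's intent.

```python
import json

# Hypothetical HBase-style rows: row key -> {column family -> {column -> bytes}}.
users = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bo"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Jennifer", "children": []}'}},
}

def convert_from_json(blob: bytes) -> dict:
    """Decode a UTF-8 JSON blob into a nested record."""
    return json.loads(blob.decode("utf-8"))

# Inner query: decode the blob. Outer query: read fields of the decoded record.
profiles = [convert_from_json(row["cf1"]["profileBlob"]) for row in users.values()]
summary = [(profile["name"], len(profile["children"])) for profile in profiles]
print(summary)
```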

Page 27: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Combine Data Sources on the Fly

# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.

SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND logs.errorLevel > 5;

# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user

SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets, hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;
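
The first join can be mimicked in Python to show what the query returns. This is a sketch on made-up rows, not a description of how Drill executes joins; the data, the `example.com` addresses, and the choice of a dictionary lookup are all assumptions.

```python
# Hypothetical rows standing in for the log directory and the profiles file.
logs = [
    {"uid": 1, "errorLevel": 7, "message": "disk full"},
    {"uid": 2, "errorLevel": 3, "message": "slow query"},
    {"uid": 1, "errorLevel": 9, "message": "node down"},
]
users = [
    {"id": 1, "name": "Michael", "emails": {"work": "michael@example.com"}},
    {"id": 2, "name": "Jennifer", "emails": {"work": "jennifer@example.com"}},
]

# Index users by id so each log row can be matched with one lookup.
users_by_id = {user["id"]: user for user in users}

# WHERE logs.uid = users.id AND logs.errorLevel > 5, then SELECT DISTINCT
# (a set keeps each name and email pair only once).
matches = {
    (users_by_id[log["uid"]]["name"], users_by_id[log["uid"]]["emails"]["work"])
    for log in logs
    if log["uid"] in users_by_id and log["errorLevel"] > 5
}
print(sorted(matches))
```

Note that `users.emails.work` reaches into a nested field of the JSON profile, which is what lets the two sources be combined without flattening either one first.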

Page 28: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Summary

• Enable rapid data exploration and application development while reducing the burden on IT

• Apache Drill beta coming soon
  – Email [email protected]

• Get involved
  – Download and play: http://incubator.apache.org/drill/
  – Ask questions: [email protected]
  – Contribute: http://github.com/apache/incubator-drill/

Page 29: The Future of Hadoop: MapR VP of Product Management, Tomer Shiran

Thank You

@mapr

maprtech

[email protected]

Tomer Shiran, VP Product Management

MapRTechnologies

maprtech

mapr-technologies