Experiences Evolving a New Analytical Platform: What Works and What's Missing

Saturday, June 12, 2010

Evolving a New Analytical Platform: What Works and What’s Missing

Jeff Hammerbacher
Chief Scientist, Cloudera
June 8, 2010

My Background: Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers

▪ Founder of Cloudera
▪ Chief Scientist
▪ Also, check out the book “Beautiful Data”

Presentation Outline

▪ BI: Science for Profit
▪ Need tools for whole research cycle
▪ SQL Server 2008 R2: defining the platform

▪ State of the Platform Ecosystem
▪ New Foundations: Hadoop
▪ Boiling the Frog
▪ Future developments

▪ Questions and Discussion

BI is looking more like science (for profit)

Jim Gray: Science entering Fourth Paradigm
“We have to do better at producing tools to support the whole research cycle”

RDBMS only a small part of this tool set

Example: SQL Server 2008 R2

RDBMS: SQL Server
ETL: SQL Server Integration Services
Reporting: SQL Server Reporting Services
Analysis: SQL Server Analysis Services
Search: Full-Text Search
CEP: StreamInsight
OLAP: PowerPivot
MDM: Master Data Services
Collaboration: SharePoint

What do we call this unified suite?

For today: Analytical Data Platform

Who makes up the platform ecosystem?

Platform Providers
Infrastructure Providers
Application Developers
End Users
Content Providers

What is new about the ecosystem today?

Content Providers
1. > 95% of enterprise data is unstructured
2. Data volumes growing rapidly

Infrastructure Providers
1. Cloud
2. Warehouse-Scale Computers

Platform Providers
1. Open source
2. Driven by consumer web properties

Application Developers
1. Data Scientists
2. Diversity of languages

End Users
1. Move beyond reporting to analytics
2. Make use of all enterprise data

New foundations: HDFS and MapReduce

(This is what boiling a frog feels like)

2005: Doug/Mike start project inside Nutch

2006: Doug joins Yahoo!

2007: Make Hadoop scale
Jim Gray’s “Fourth Paradigm” lecture
Yahoo! makes Pig open source
Randy Bryant’s “DISC” lecture
Powerset makes HBase open source

2008: Make Hadoop fast
First Hadoop Summit
Yahoo! wins Daytona terabyte sort benchmark
Yahoo! builds production webmap with Hadoop
Facebook makes Hive open source
“MapReduce: A Major Step Backwards”

2009: Insert Hadoop into the enterprise
Cloudera releases CDH
First Hadoop World NYC
Yahoo! sorts a petabyte with Hadoop
Cloudera adds training, support, services
“The Unreasonable Effectiveness of Data”

2010: Integrate Hadoop into the enterprise
IBM announces InfoSphere BigInsights
Yahoo! completes enterprise-class security
Datameer and Karmasphere funded
Teradata, Pentaho, and others integrate
Hive adds JDBC and ODBC

Hadoop will be an Analytical Data Platform

What’s Next?

Capture: Log collection and CEP

Curate: Workflow and Scheduling

Curate: Secondary and Full-Text Indexing

Curate: Learn Structure from Data

Analyze: Mesos-enabled frameworks

Analyze: Link local and global data

All behind a single pane of glass

Cloudera Desktop: Making Many Computers Feel Like One

(c) 2010 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.