Big Data in the Microsoft Platform

Building Big Data Solutions in the Microsoft Platform

Jesus Rodriguez, Co-Founder, Tellago, Inc

Co-Founder Tellago Studios, Inc

About Me….

• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com


Agenda

• Big Data Overview
• MS HDInsight
  – Map Reduce
  – HDFS
  – Hive
  – Pig
  – Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future….

Big Data?

A Crowded Ecosystem

Or Worse...

Big Data?

• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?

A Clue?

• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

We Love Data!

But...

Processing Large Amounts of Data is Complicated....

Successful Big Data = Scalable Computing + Large Storage

A Trivial Model

Not So Fast....

Parallel Data Computing is Complicated

So Is Large Data Storage

Enter the World of Hadoop...

Hadoop Design Principles

• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible

Hadoop History

• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives over 100 EC2 instances
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo does fastest sort of a TB, 3.5 minutes over 910 nodes
• May 2009:
  – Yahoo does fastest sort of a TB, 62 seconds over 1460 nodes
  – Yahoo sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera

Hadoop Ecosystem

• HDFS (Hadoop Distributed File System)
• HBase (key-value store)
• MapReduce (Job Scheduling/Execution System)
• Streaming/Pipes APIs
• Pig (Data Flow)
• Hive (SQL)
• BI Reporting
• ETL Tools
• Avro (Serialization)
• ZooKeeper (Coordination)
• Sqoop
• RDBMS

Microsoft & Hadoop

HDInsight

HDFS Is…

• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System

HDFS at a Glance

Block Size = 64 MB
Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
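
To make the block size and replication factor concrete, here is a minimal Python sketch of the storage arithmetic, assuming the 64 MB block size and replication factor of 3 quoted on this slide. The helper name `hdfs_footprint` and the 1 GB example file are illustrative assumptions, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    Hypothetical helper, for illustration only.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the cluster.
    block_replicas = blocks * replication
    # A partial last block only occupies the space its data needs,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    # A 1 GB (1024 MB) file
    print(hdfs_footprint(1024))
```

Under these assumptions a 1 GB file becomes 16 blocks, 48 stored block replicas, and 3 GB of raw disk usage.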

HDInsight HDFS Demo

Map Reduce

Map Reduce Is…

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop
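
The programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count example; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["we love data", "big data is big"]))
```

In a real cluster the map and reduce calls run in parallel on many nodes and the framework performs the shuffle; only the two user-supplied functions look like the ones above.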

Map Reduce At a Glance

HDInsight Map Reduce Demo

Hive Is…

• A system for managing and querying structured data built on top of Hadoop
  – Map-Reduce for execution
  – HDFS for storage
  – Metadata on raw files
• Key Building Principles:
  – SQL as a familiar data warehousing tool
  – Extensibility – Types, Functions, Formats, Scripts
  – Scalability and Performance
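
The "metadata on raw files" idea can be illustrated with a small Python sketch: the data stays as raw delimited text, and a separate table definition is applied only when a query reads it. This is not how Hive is implemented; the `RAW_LINES` sample, the `TABLE` layout, and both function names are hypothetical and exist only for this example.

```python
# Raw delimited lines, standing in for files stored in HDFS.
RAW_LINES = [
    "alice,search,120",
    "bob,checkout,300",
    "alice,checkout,80",
]

# Table metadata kept apart from the data (hypothetical layout).
TABLE = {
    "delimiter": ",",
    "columns": [("user", str), ("action", str), ("millis", int)],
}


def read_rows(lines, table):
    """Apply the table metadata to the raw lines at read time."""
    for line in lines:
        fields = line.split(table["delimiter"])
        yield {name: cast(value)
               for (name, cast), value in zip(table["columns"], fields)}


def total_millis_by_user(lines, table):
    """Roughly: SELECT user, SUM(millis) FROM t GROUP BY user."""
    totals = {}
    for row in read_rows(lines, table):
        totals[row["user"]] = totals.get(row["user"], 0) + row["millis"]
    return totals


if __name__ == "__main__":
    print(total_millis_by_user(RAW_LINES, TABLE))
```

The point of the design is that the files are never rewritten to fit the schema; changing the metadata changes how the same bytes are interpreted.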

Hive Architecture

HDInsight Hacking with Hive

Pig Is…

Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop
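
A data-flow program in the Pig style is a chain of named relations, each derived from the previous one. The Python sketch below mirrors a four-step Pig Latin script (shown in the comments) on in-memory data. It only illustrates the flow; it is not Pig's execution engine, and the input format and the `load` and `run` names are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter

# Pig Latin equivalent of the steps below (illustrative script):
#   logs    = LOAD 'logs' USING PigStorage(',') AS (user:chararray, bytes:int);
#   big     = FILTER logs BY bytes > 100;
#   grouped = GROUP big BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(big.bytes);


def load(lines):
    """LOAD: parse 'user,bytes' lines into (user, bytes) tuples."""
    for line in lines:
        user, size = line.split(",")
        yield user, int(size)


def run(lines):
    logs = load(lines)
    # FILTER: keep only records with more than 100 bytes.
    big = (record for record in logs if record[1] > 100)
    # GROUP: groupby needs its input sorted by the grouping key.
    ordered = sorted(big, key=itemgetter(0))
    grouped = groupby(ordered, key=itemgetter(0))
    # FOREACH ... GENERATE: one output tuple per group.
    return [(user, sum(size for _, size in rows)) for user, rows in grouped]


if __name__ == "__main__":
    print(run(["alice,50", "alice,200", "bob,150", "alice,300"]))
```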

Pig Architecture

Parser (PigLatin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)

Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

Pig Context

Hadoop

Grunt (Interactive shell) PigServer (Java API)

HDInsight

Rocking Data Processing with Pig

Sqoop Is…

• Easy import of data from many databases to HDFS
• Generates code for use in MapReduce applications
• Integrates with Hive
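
As a rough picture of what a table import produces, the Python sketch below reads every row of a relational table and writes it out as one delimited text line, the kind of record that ends up in HDFS files. It is a conceptual stand-in only: `sqlite3` plays the role of the source database, a Python list plays the role of the output file, and `export_table` is a hypothetical helper, not a Sqoop command or API.

```python
import sqlite3


def export_table(conn, table, delimiter=","):
    """Return every row of `table` as one delimited text line.

    Hypothetical helper. The table name is interpolated directly into
    the SQL, so it must come from a trusted source.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    return [delimiter.join(str(value) for value in row) for row in cursor]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the source RDBMS
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", 9.5), (2, "bob", 20.0)],
    )
    for line in export_table(conn, "orders"):
        print(line)
```

Sqoop itself runs this kind of transfer as parallel jobs on the cluster rather than in a single process.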

Sqoop Architecture

HDInsight

Bulk Data Loading Using Sqoop

HDInsight Service

HDInsight Service Architecture

HDInsight

HDInsight Service Overview

Hadoop Considerations

Super Crowded Ecosystem

The Hadoop Ecosystem

Hadoop is not a silver bullet...

Some Challenges

• Hadoop doesn’t power big data applications
  – Not a transactional datastore. Slosh back and forth via ETL
• Processing latency
  – Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
  – Bare metal interface, data import
• Graphs
  – Only a handful of graph problems amenable to MR

Beyond Hadoop

• Percolator (incremental processing): http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries): http://research.google.com/pubs/pub36632.html
• Pregel (big graphs): http://dl.acm.org/citation.cfm?id=1807184


In the Meantime...

Takeaways

• Hadoop provides the foundation of big data solutions
• Computing and storage are the fundamental components of Hadoop
• HDInsight Server and Service are Microsoft’s distributions of Hadoop
• HDInsight is just one component of Microsoft’s BI strategy

http://www.tellagostudios.com
http://www.tellago.com/
http://twitter.com/#!/jrodthoughts
http://jrodthoughts.com
http://weblogs.asp.net/gsusx
