Big Data in the Microsoft Platform
![Page 1: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/1.jpg)
Building Big Data Solutions in the Microsoft Platform
Jesus Rodriguez
Co-Founder Tellago, Inc
Co-Founder Tellago Studios, Inc
![Page 2: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/2.jpg)
About Me….
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com
![Page 3: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/3.jpg)
Agenda
• Big Data Overview
• MS HDInsight
– Map Reduce
– HDFS
– Hive
– Pig
– Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future….
![Page 4: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/4.jpg)
Big Data?
![Page 5: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/5.jpg)
A Crowded Ecosystem
![Page 6: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/6.jpg)
Or Worse...
![Page 7: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/7.jpg)
![Page 8: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/8.jpg)
Big Data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?
![Page 9: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/9.jpg)
A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
![Page 10: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/10.jpg)
We Love Data!
![Page 11: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/11.jpg)
But...
![Page 12: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/12.jpg)
Processing Large Amounts of Data is Complicated....
![Page 13: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/13.jpg)
Successful Big Data = Scalable Computing + Large Storage
![Page 14: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/14.jpg)
A Trivial Model
![Page 15: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/15.jpg)
Not So Fast....
![Page 16: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/16.jpg)
Parallel Data Computing is Complicated
![Page 17: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/17.jpg)
So Is Large Data Storage
![Page 18: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/18.jpg)
Enter the World of Hadoop...
![Page 19: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/19.jpg)
Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible
![Page 20: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/20.jpg)
Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives over 100 EC2s
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
• May 2009:
– Yahoo does fastest sort of a TB, 62 secs over 1460 nodes
– Yahoo sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera
![Page 21: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/21.jpg)
Hadoop Ecosystem
HDFS (Hadoop Distributed File System)
HBase (key-value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow)
Hive (SQL)
BI Reporting
ETL Tools
Avro (Serialization)
ZooKeeper (Coordination)
Sqoop
RDBMS
(Streaming/Pipes APIs)
![Page 22: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/22.jpg)
Microsoft & Hadoop
![Page 23: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/23.jpg)
HDInsight
![Page 24: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/24.jpg)
HDFS
![Page 25: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/25.jpg)
HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
![Page 26: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/26.jpg)
HDFS at a Glance
Block Size = 64 MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
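The two numbers above determine how much raw disk a file consumes. A minimal sketch of that arithmetic, assuming the slide's values of 64 MB blocks and a replication factor of 3; the function name and the 1 GB example file are hypothetical:

```python
import math

BLOCK_SIZE_MB = 64        # block size quoted on the slide
REPLICATION_FACTOR = 3    # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Estimate how one file is laid out under the slide's settings."""
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        # Every block is stored REPLICATION_FACTOR times across the cluster.
        "block_replicas": blocks * REPLICATION_FACTOR,
        # Raw disk consumed is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


footprint = hdfs_footprint(1024)  # a hypothetical 1 GB file
print(footprint)
```

Under these assumptions a 1 GB file becomes 16 blocks, 48 block replicas, and 3 GB of raw storage, which is why the low cost per GB matters.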
![Page 27: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/27.jpg)
HDInsight HDFS Demo
![Page 28: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/28.jpg)
Map Reduce
![Page 29: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/29.jpg)
Map Reduce Is…
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
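The programming model can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop's API: the function names are hypothetical, and the `shuffle` step stands in for the grouping that the execution framework would perform across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


print(word_count(["big data is big", "data is everywhere"]))
```

The developer writes only the map and reduce functions; partitioning the input, moving intermediate pairs, and retrying failed tasks are the framework's job.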
![Page 30: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/30.jpg)
Map Reduce At a Glance
![Page 31: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/31.jpg)
HDInsight Map Reduce Demo
![Page 32: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/32.jpg)
Hive
![Page 33: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/33.jpg)
Hive Is…
• A system for managing and querying structured data built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility: Types, Functions, Formats, Scripts
– Scalability and Performance
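"Metadata on raw files" means the table definition lives apart from the data and is applied when the files are read. A rough Python sketch of that idea, not Hive itself; the file contents, the schema, and the `logs` table named in the comment are all hypothetical:

```python
from collections import defaultdict

# Hypothetical raw file contents: tab-delimited lines with no embedded schema.
RAW_LINES = [
    "2013-01-01\tUS\t120",
    "2013-01-01\tUK\t80",
    "2013-01-02\tUS\t45",
]

# Hypothetical table metadata, kept separately from the data itself.
SCHEMA = [("day", str), ("country", str), ("hits", int)]


def read_rows(lines, schema, delimiter="\t"):
    """Apply the schema at read time, turning each raw line into a typed row."""
    for line in lines:
        fields = line.split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_hits_by_country(rows):
    """Roughly: SELECT country, SUM(hits) FROM logs GROUP BY country."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["hits"]
    return dict(totals)


print(total_hits_by_country(read_rows(RAW_LINES, SCHEMA)))
```

In Hive the grouping would be expressed in SQL and compiled into Map-Reduce jobs over files in HDFS; here it runs in one process purely for illustration.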
![Page 34: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/34.jpg)
Hive Architecture
![Page 35: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/35.jpg)
HDInsight Hacking with Hive
![Page 36: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/36.jpg)
Pig
![Page 37: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/37.jpg)
Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Ease of programming
• Optimization opportunities
• Extensibility
• Built upon Hadoop
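A data analysis program in this style is a sequence of named transformation steps. The sketch below mimics that data-flow shape in plain Python; it is not Pig Latin, the input records are hypothetical, and the comments map each step to the Pig Latin operator it loosely resembles.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records: (user, url, seconds_on_page).
visits = [
    ("ann", "/home", 12),
    ("bob", "/home", 3),
    ("ann", "/docs", 40),
    ("bob", "/docs", 25),
]

# Step 1, like FILTER: keep visits longer than 5 seconds.
long_visits = [v for v in visits if v[2] > 5]

# Step 2, like GROUP ... BY: collect the remaining visits per user.
# groupby needs its input sorted on the grouping key.
by_user = groupby(sorted(long_visits, key=itemgetter(0)), key=itemgetter(0))

# Step 3, like FOREACH ... GENERATE: compute one output record per group.
time_per_user = {user: sum(v[2] for v in group) for user, group in by_user}

print(time_per_user)
```

Because each step only declares what happens to the data, an engine like Pig is free to reorder or combine steps before running them on Hadoop, which is where the optimization opportunities come from.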
![Page 38: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/38.jpg)
Pig Architecture
Parser (PigLatin → LogicalPlan)
Optimizer (LogicalPlan → LogicalPlan)
Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)
ExecutionEngine
Pig Context
Hadoop
Grunt (Interactive shell)
PigServer (Java API)
![Page 39: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/39.jpg)
HDInsight
Rocking Data Processing with Pig
![Page 40: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/40.jpg)
Sqoop
![Page 41: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/41.jpg)
Sqoop Is…
• Easy import of data from many databases to HDFS
• Generates code for use in MapReduce applications
• Integrates with Hive
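A bulk import of this kind can be pictured as splitting a table by key range and writing each range out as delimited text. The sketch below is an illustration under that assumption, not Sqoop's implementation: it uses an in-memory SQLite database in place of a real RDBMS, returns lists of lines instead of writing HDFS files, and every table, column, and function name is hypothetical.

```python
import sqlite3


def split_ranges(lo, hi, num_splits):
    """Divide the inclusive key range [lo, hi] into contiguous sub-ranges."""
    size = (hi - lo + 1) / num_splits
    bounds = [lo + round(i * size) for i in range(num_splits)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]


def import_table(conn, table, key, num_splits):
    """Return one list of comma-delimited lines per split, like part files."""
    lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
    parts = []
    for start, end in split_ranges(lo, hi, num_splits):
        # Each range could be fetched by a separate parallel task.
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {key} BETWEEN ? AND ? ORDER BY {key}",
            (start, end),
        )
        parts.append([",".join(str(col) for col in row) for row in rows])
    return parts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 7)])
parts = import_table(conn, "orders", "id", num_splits=2)
print(parts)
```

Splitting on a key lets independent tasks pull disjoint slices of the table at the same time, which is what makes the load a bulk operation rather than a single long query.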
![Page 42: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/42.jpg)
Sqoop Architecture
![Page 43: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/43.jpg)
HDInsight
Bulk Data Loading Using Sqoop
![Page 44: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/44.jpg)
HDInsight Service
![Page 45: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/45.jpg)
HDInsight Service Architecture
![Page 46: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/46.jpg)
HDInsight
HDInsight Service Overview
![Page 47: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/47.jpg)
Hadoop Considerations
![Page 48: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/48.jpg)
Super Crowded Ecosystem
![Page 49: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/49.jpg)
The Hadoop Ecosystem
![Page 50: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/50.jpg)
Hadoop is not a silver bullet...
![Page 51: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/51.jpg)
Some Challenges
• Hadoop doesn’t power big data applications
– Not a transactional datastore. Slosh back and forth via ETL
• Processing latency
– Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
– Bare metal interface, data import
• Graphs
– Only a handful of graph problems amenable to MR
![Page 52: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/52.jpg)
Beyond Hadoop
• Percolator (incremental processing) http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries) http://research.google.com/pubs/pub36632.html
• Pregel (Big graphs) http://dl.acm.org/citation.cfm?id=1807184
![Page 53: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/53.jpg)
In the Meantime...
![Page 54: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/54.jpg)
Takeaways
• Hadoop provides the foundation of big data solutions
• Computing and storage are the fundamental components of Hadoop
• HDInsight Server and Service are Microsoft’s distributions of Hadoop
• HDInsight is just one component of Microsoft’s BI strategy
![Page 55: Big Data in the Microsoft Platform](https://reader037.fdocuments.us/reader037/viewer/2022103110/5491ecb9b47959324b8b4980/html5/thumbnails/55.jpg)
http://www.tellagostudios.com http://twitter.com/#!/jrodthoughts
http://jrodthoughts.com http://weblogs.asp.net/gsusx