Post on 30-Jul-2015
Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A
“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…” - Wikipedia
What is Big Data?
[Figure: data sources grouped as ERP / CRM, WEB 2.0, and Internet of things, along the dimensions Volume, Velocity, and Variety]
• Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
• Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Common Scenarios
• Clickstream Analysis
• Sensor/Machine
• Time and Place
• Server Logs
• Sentiment
• Text
Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage
Attribute    TRADITIONAL RDBMS          HADOOP
Data Size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and Batch      Batch
Updates      Read / Write many times    Write once, Read many times
Structure    Static Schema              Dynamic Schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA Ratio    1:40                       1:3000
Reference: Tom White’s Hadoop: The Definitive Guide
HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
HDFS ≠ Database
MapReduce
• MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Processing functions:
- Mapping
- Reducing
Second, take the processing to the data…
// MapReduce functions in JavaScript (word count)

// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the counts emitted for one word and write the total.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
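To try the two functions without a cluster, the sketch below runs them against a small in-memory input. It assumes only the interface the slide code already uses: a context with write(), and a values iterator with hasNext() and next(). The runLocally helper and both context objects are hypothetical stand-ins for the runtime, not part of Hadoop or HDInsight.

```javascript
// Word-count map and reduce functions, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical in-memory stand-in for the runtime (illustration only).
function runLocally(lines) {
    // Map phase: collect every emitted (key, value) pair, grouped by key.
    var groups = Object.create(null);
    var mapContext = {
        write: function (k, v) {
            if (!groups[k]) { groups[k] = []; }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    // Reduce phase: call reduce once per key with an iterator over its values.
    var result = Object.create(null);
    var reduceContext = {
        write: function (k, v) { result[k] = v; }
    };
    Object.keys(groups).forEach(function (k) {
        var vals = groups[k];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return result;
}

var counts = runLocally(["Big data is big", "Hadoop is for big data"]);
console.log(counts.big, counts.data, counts.hadoop); // 3 2 1
```

On a real cluster the grouping step is done by the framework across many servers; here it is a single object in memory, which is enough to check the logic of map and reduce.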
How does it work?
[Figure: Code, Runtime, and four Servers, illustrating the processing being taken to the data]
Hadoop Ecosystem
• Distributed Storage (HDFS)
• Query (Hive)
• Distributed Processing (MapReduce)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Pipeline / workflow (Oozie)
• Azure Storage Vault (ASV)
• APS | PolyBase
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
Now what?
Working with your HDInsight cluster: running jobs, importing/exporting data, viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
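As a rough illustration of what "SQL-like" means, the sketch below holds a small HiveQL query as a string and, next to it, a plain JavaScript function that computes the same result over an in-memory array. The table name words, the column word, and the topWords helper are assumptions made for this example. Nothing here connects to Hive, and the alphabetical tie-break in the JavaScript version is an addition for repeatable output (the query itself does not specify one).

```javascript
// HiveQL text only; the table `words` and column `word` are hypothetical.
var hiveql =
    "SELECT word, COUNT(*) AS cnt\n" +
    "FROM words\n" +
    "GROUP BY word\n" +
    "ORDER BY cnt DESC\n" +
    "LIMIT 3";

// Plain-JavaScript equivalent of the query above, for illustration only.
function topWords(rows, limit) {
    // GROUP BY word, COUNT(*)
    var counts = Object.create(null);
    rows.forEach(function (r) {
        counts[r.word] = (counts[r.word] || 0) + 1;
    });
    // ORDER BY cnt DESC (ties broken alphabetically), then LIMIT
    return Object.keys(counts)
        .map(function (w) { return { word: w, cnt: counts[w] }; })
        .sort(function (a, b) {
            return b.cnt - a.cnt || a.word.localeCompare(b.word);
        })
        .slice(0, limit);
}

var rows = ["big", "data", "big", "hadoop", "data", "big"].map(function (w) {
    return { word: w };
});
var top = topWords(rows, 3);

console.log(hiveql);
console.log(top);
```

With Hive, the query text is all a user writes; Hive turns it into jobs that run on the cluster, so the grouping and sorting logic does not have to be hand-coded as it is in topWords.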
What is Pig?
• Write complex MapReduce jobs using a simple scripting language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop and relational datastores
http://sqoop.apache.org
Capabilities
• Machine Learning
• Graph Processing
• Distributed Compute
• Extract Load Transform
• Predictive Analysis
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F