Introduction to HDInsight
James Serra
Big Data Evangelist
[email protected]
About Me
• Microsoft, Big Data Evangelist
• In IT for 30 years, worked on many BI and DW projects
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm employee, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
• Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?
What if you could handle big data?
[Figure: data volume (megabytes, gigabytes, terabytes, petabytes) plotted against data complexity (variety and velocity), with four groups of data sources]
• ERP/CRM: payables, payroll, inventory, contacts, deal tracking, sales pipeline
• Web logs: digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce
• Web 2.0
• Big data: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video
Introducing Apache Hadoop
• Apache Open Source Project
• Highly scalable distributed file system (HDFS)
• Distributed processing on data nodes
Data volume
• Hadoop stores files in a distributed file system
• Storage and computation are distributed across many servers
• Files can be spread out over multiple nodes
• Hadoop can store very large amounts of data
• Combined storage resource can grow with demand from a few nodes to thousands of nodes
• Scales out linearly
• Very large files supported, including those larger than the capacity of a single node
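To make the idea of spreading one large file over many nodes concrete, here is a minimal Python sketch. It is not HDFS code: the node names and the round-robin placement are illustrative assumptions (real HDFS placement is decided by the NameNode and is rack-aware), and the block size and replication factor are parameters chosen to resemble common HDFS settings.

```python
from math import ceil

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes, round-robin. Illustrative only."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    n_blocks = ceil(file_size_mb / block_size_mb)
    placement = {}
    for b in range(n_blocks):
        # Consecutive nodes (wrapping around) hold the replicas of block b
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
layout = place_blocks(file_size_mb=1000, nodes=nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

Because each block lives on several nodes, a file can be larger than any single node's disk, and adding nodes adds both storage and places to run computation.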
Data variety
• Hadoop stores files (non-relational store)
• Files could have a variety of semi-structured or unstructured data
• Previously, these files may not have been seen as providing value or insights
• Today, new business questions and insights are being uncovered through data science
Sentiment: Understand how your customers feel about your brand and products, right now
Clickstream: Capture and analyze website visitors’ data trails and optimize your website
Sensors: Discover patterns in data streaming automatically from remote sensors and machines
Geographic: Analyze location-based data to manage operations where they occur
Server logs: Research logs to diagnose process failures and prevent security breaches
Unstructured: Understand patterns in files across millions of web pages, emails, and documents
Data velocity
• Hadoop can stream live data and process it in real time
• Hadoop can provide scalable event stream ingestion
• Hadoop can do near real-time in-stream processing
[Figure: data input (applications, devices, HTTP; incoming) → event broker → stream processing → outgoing]
Governance and integration
• Data workflow, lifecycle and governance: Falcon, Atlas
• Sqoop, Flume, NFS, WebHDFS
Data access (YARN: data operating system)
• Batch: MapReduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: Spark, in-memory, ISV engines
Data management
• HDFS (Hadoop Distributed File System)
Security
• Authentication, authorization, accounting, data protection: Ranger, Knox, Atlas, HDFS Encryption
Operations
• Provision, manage, and monitor: Ambari, ZooKeeper, Cloudbreak
• Scheduling: Oozie
[Figure: the stack runs across cluster nodes 1 … N]
Hadoop is a platform with a portfolio of projects
• Governed by Apache Software Foundation (ASF)
• Comprises core services of MapReduce, HDFS, and YARN
• In addition to the core, includes functions across:
  • Data services which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)
  • Operational services which help manage the cluster (Ambari, Falcon, and Oozie)
A Hadoop distribution is a package of projects
• Tested for consistency across entire package

[Table: component versions by HDP release. Components shown: Knox, Tez, Pig, Hive, Phoenix, Accumulo, Storm, Mahout, Solr, Falcon, Sqoop, Flume, Ambari, Oozie, ZooKeeper, HBase, Hadoop and YARN. Component groups: data management, data access, governance and integration, operations, security.]
HDP 2.4 (May 2016): 0.7.0 0.15.0 0.2.1 1.1.2 4.4.0 1.7.0 0.10.0 0.9.0+ 5.2.1 0.6.1 1.4.6 1.5.2 2.2.1 4.2.0 3.4.6 0.6.0 2.7.1
HDP 2.3 (July 2015): 0.7.0 0.15.0 0.2.1 1.1.1 4.4.0 1.7.0 0.10.0 0.9.0+ 5.2.1 0.6.1 1.4.6 1.5.2 2.1.0 4.2.0 3.4.6 0.6.0 2.7.1
HDP 2.2 (Dec 2014): .5.2 0.14.0 0.2.1 .98.4 4.2.0 1.6.1 0.9.3 0.9.0+ 4.10.2 0.6.0 1.4.5 1.5.2 2.0.0 4.1.0 3.4.6 0.5.0 2.6.0
HDP 2.1 (April 2014): 0.4.0 0.12.1 0.13.0 0.98.0 4.0.0 1.5.1 0.9.1 0.9.0 4.7.2 0.5.0 1.4.4 1.4.0 1.5.1 4.0.0 3.4.5 .0.4.0 2.4.0
HDP 2.0 (October 2013): 2.2.0 0.12.0 0.12.0 0.96.1 0.8.0 1.4.4 1.3.0 1.4.4 3.3.2 3.4.5 .0.4.0
HDP 1.3 (May 2013): 1.1.2 011.0 0.11.0 0.94.6 0.7.0 1.4.3 1.3.1 1.2.5 3.3.2 3.4.5 .0.4.0
Business applications of Hadoop
• Retail: 360° view of the customer; analyze brand sentiment; localized, personalized promotions; website optimization; optimal store layout
• Financial services: new account risk screens; fraud prevention; trading risk; maximize deposit spread; insurance underwriting; accelerate loan processing
• Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development
• Utilities, oil, and gas: smart meter stream analysis; slow oil well decline curves; optimize lease bidding; compliance reporting; proactive equipment repair; seismic image processing
• Public sector: analyze public sentiment; protect critical networks; prevent fraud and waste; crowd source reporting for repairs to infrastructure; fulfill open records requests
• Manufacturing: supplier consolidation; supply chain and logistics; assembly line quality assurance; proactive maintenance; crowd source quality assurance
• Healthcare: genomic data for medical trials; monitor patient vitals; reduce re-admittance rates; store medical research data; recruit cohorts for pharmaceutical trials
New analytic applications from new data
[Table: industry use cases against data types. Data type columns: sentiment and web; clickstream and behavior; machine and sensor; geographic; server logs; structured and unstructured. Each ✔ marks one applicable data type.]
Financial services
• New account risk screens ✔ ✔
• Trading risk ✔
• Insurance underwriting ✔ ✔ ✔
Telecom
• Call detail records (CDR) ✔ ✔
• Infrastructure investment ✔ ✔
• Real-time bandwidth allocation ✔ ✔ ✔
Retail
• 360° view of the customer ✔ ✔ ✔
• Localized, personalized promotions ✔
• Website optimization ✔
Manufacturing
• Supply chain and logistics ✔
• Assembly line quality assurance ✔
• Crowd-sourced quality assurance ✔
Healthcare
• Use genomic data in medical trials ✔ ✔ ✔
• Monitor patient vitals in real-time
Pharmaceuticals
• Recruit and retain patients for drug trials ✔ ✔
• Improve prescription adherence ✔ ✔ ✔ ✔
Oil and gas
• Unify exploration and production data ✔ ✔ ✔ ✔
• Monitor rig safety in real-time ✔ ✔ ✔
Government
• ETL offload/federal budgetary pressures ✔ ✔
• Sentiment analysis for government programs ✔
Main differences vs RDBMS/NoSQL
Pros
• Not a type of database, but rather an open-source software ecosystem that allows for massively parallel computing
• No inherent structure (no conversion to relational or JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access
• Great for large, distributed data processing tasks where time isn’t a constraint (e.g. end-of-day reports, scanning months of historical data)
• Tradeoff: In order to make deep connections between many data points, the technology sacrifices speed
• Some NoSQL databases such as HBase are built on top of HDFS

Main differences vs RDBMS/NoSQL
Cons
• File system, not a database
• Not good for millions of users, random access, fast individual record lookups or updates (OLTP)
• Not so great for real-time analytics
• Lacks: indexing, metadata layer, query optimizer, memory management
• Same cons as non-relational: no ACID support, data integrity, limited indexing, weak SQL, etc.
• Security limitations
Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?
Challenges with implementing Hadoop
• Up-front HW costs
• Capacity planning
• Hadoop expertise

Why Cloud + Big Data?
• Speed, scale, economics
• Always up, always on
• Open and flexible
• Time to value
• Data of all volume, variety, velocity
• Massive compute and storage
• Deployment expertise

Why Hadoop in the Cloud?
• No HW costs ($0)
• Unlimited scale
• Pay what you need
• Deployed in minutes
Scenarios For Deploying Hadoop As Hybrid
[Figure: on-premises Hadoop (software, appliances) combined with the cloud for three scenarios: develop/POC, bursting, and backup/archive]
Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?
Introducing Azure HDInsight
Microsoft contributions to Hadoop
• Hadoop 2.2 and 2.4
• 80% data compression with ORC
• Hadoop on Windows
• Hive 100x query speed up
• 30,000+ code line contributions
• HDFS in Cloud (Azure)
• REEF for Machine Learning
• 10,000+ engineering hours
• Committers to Hadoop

Microsoft + Hortonworks
• Promoting Open Hadoop
• Engineering alignment
• Corporate alignment
• Field alignment
HDInsight Built for Windows or Linux
• Customer choice
• Managed & supported by Microsoft
• Familiarity of Windows
• Re-use common tools, documentation, samples from Hadoop/Linux ecosystem
• Add Hadoop projects that were authored on Linux to HDInsight
• Easier transition from on-premises to cloud
HDInsight Supports Hive
SQL-like queries on Hadoop data in HDInsight
• HDInsight provides an easy-to-use graphical query interface for Hive
• HiveQL is a SQL-like language (subset of SQL)
• Hive structures include well-understood database concepts such as tables, rows, columns, partitions
• Queries are compiled into MapReduce jobs that are executed on Hadoop
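As a rough illustration of how a SQL-like aggregation can be expressed as a MapReduce job, the sketch below simulates the map, shuffle, and reduce phases in plain Python. It is not Hive's actual query plan; the table name `clicks`, its rows, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a "clicks" table: (user, page)
clicks = [("ann", "/home"), ("bob", "/home"), ("ann", "/cart"), ("cy", "/home")]

def map_phase(rows):
    # Emit (key, 1) for each row, keyed by the GROUP BY column
    for _user, page in rows:
        yield page, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group: COUNT(*) becomes a sum of ones
    return {key: sum(values) for key, values in groups.items()}

# Roughly: SELECT page, COUNT(*) FROM clicks GROUP BY page
result = reduce_phase(shuffle(map_phase(clicks)))
print(result)
```

On a cluster the map and reduce steps would run in parallel on many nodes, each working on the file blocks stored locally.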
Dramatic performance gains with Stinger/Tez
• Stinger is a Microsoft, Hortonworks and OSS driven initiative to bring interactive queries with Hive
• Brings query execution engine technology from Microsoft SQL Server to Hive
• Performance gains up to 100x
[Figure: sample query runtime by release. Hive 10: 1400s; HDP 1.3/Hive 11: 44.3s (32x speedup); HDP 2.0: 35.1s (40x speedup); HDP 2.1: 15s (100x speedup). Callouts: Microsoft contribution to Apache code; Hadoop 2.0]
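The speedup labels can be checked with simple arithmetic. The sketch below assumes the chart pairs the runtimes with releases in the order they appear (1400s for Hive 10, then 44.3s, 35.1s, and 15s); that pairing is an inference from the slide, not something it states outright.

```python
baseline_s = 1400.0  # Hive 10 runtime for the sample query (from the slide)

# Assumed pairing of runtimes with releases, in slide order
runtimes_s = {"HDP 1.3 / Hive 11": 44.3, "HDP 2.0": 35.1, "HDP 2.1": 15.0}

# Speedup = baseline runtime / new runtime
speedups = {name: baseline_s / t for name, t in runtimes_s.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
```

Under that pairing the first two ratios round to the 32x and 40x labels, while 15s works out to roughly 93x, so the "100x" label reads as a rounded headline figure.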
HDInsight Supports HBase
NoSQL database on data in HDInsight
• Columnar, NoSQL database
• Runs on top of the Hadoop Distributed File System (HDFS)
• Provides flexibility in that new columns can be added to column families at any time
[Figure: cluster layout. Name Node and Job Tracker; HMaster with coordination; four worker nodes, each with a Data Node, Task Tracker, and Region Server]
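The point about adding columns at any time is easier to see with a toy model. The sketch below is not the HBase API; `TinyTable` and its methods are hypothetical names that only mimic the row key → column family → column qualifier layout, using the `family:qualifier` notation.

```python
from collections import defaultdict

class TinyTable:
    """Toy column-family store: row key -> family -> qualifier -> value.
    Families are fixed when the table is created; qualifiers are not."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, column):
        family, qualifier = column.split(":", 1)
        # Missing rows or columns simply return None
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

users = TinyTable(families=["info", "stats"])
users.put("user1", "info:name", "Ann")
users.put("user2", "info:name", "Bob")
users.put("user2", "info:twitter", "@bob")  # new column, no schema change
print(users.get("user2", "info:twitter"))
print(users.get("user1", "info:twitter"))
```

Rows that never stored the new column take up no space for it, which is why sparse, evolving data fits this model well.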
HDInsight Supports Mahout
Machine learning library
• A library of machine learning algorithms to execute on data in HDFS
• Algorithms are not dependent on size of data and can scale with large datasets
• Library includes: Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Topic Models
HDInsight Supports Storm
Stream analytics for near-real-time processing
• Consumes millions of real-time events from a scalable event broker (e.g. Apache Kafka, Azure Event Hubs)
• Performs time-sensitive computation
• Output to persistent stores, dashboards or devices
• Customizable with Java + .NET
• Deeply integrated with Visual Studio
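Storm topologies are built from spouts (event sources) and bolts (processing steps). The following sketch imitates that shape with Python generators; it does not use the Storm API, and the sensor names, readings, and threshold are made up for illustration.

```python
from collections import Counter

def sensor_spout():
    """Spout: source of events (here, a fixed hypothetical sample)."""
    events = [("s1", 20.5), ("s2", 99.0), ("s1", 21.0), ("s2", 101.5), ("s1", 20.0)]
    for event in events:
        yield event

def threshold_bolt(stream, limit):
    """Bolt: pass through only readings above a limit."""
    for sensor, value in stream:
        if value > limit:
            yield sensor, value

def count_bolt(stream):
    """Bolt: running count of alerts per sensor, emitted after every alert."""
    counts = Counter()
    for sensor, _value in stream:
        counts[sensor] += 1
        yield dict(counts)

# Wire the topology: spout -> threshold bolt -> count bolt
snapshots = list(count_bolt(threshold_bolt(sensor_spout(), limit=50.0)))
print(snapshots[-1])
```

Each event flows through the chain as soon as it arrives, so results are available continuously instead of after a batch completes; a real topology would spread spout and bolt instances across the cluster.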
[Figure: event processing pipeline]
• Event producers: applications, devices, sensors, web and social
• Collection: cloud gateways (web APIs), field gateways
• Event queuing system: Event Hubs, Kafka/RabbitMQ/ActiveMQ
• Transformation (stream processing): Apache Storm on HDInsight, Azure Stream Analytics
• Long-term storage (storage adapters): HDFS, Azure DBs, Azure storage, HBase
• Presentation and action: search and query, data analytics (Excel), web/thick client dashboards, live dashboards, devices to take action
Spark for Azure HDInsight
In Memory Processing on Multiple Workloads
[Figure: Azure HDInsight workloads. Script: Pig; SQL: Hive; NoSQL: HBase; Streaming: Storm; Batch: MapReduce; In memory: Spark. Spark components on the core engine: Spark SQL, Spark Streaming, Machine Learning, Graph]
• Single execution model for multiple tasks
• Up to 100x faster processing
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tableau, Qlik, SAP)
• Notebook experience (Jupyter/iPython, Zeppelin)
…
HDInsight Allows You To Add Hadoop Projects
Add Hadoop Projects to HDInsight
• Modify HDInsight clusters with custom script
• Add Apache Hadoop projects to HDInsight
• Documented for Spark, R, Giraph, Solr
Easy for Developers
Deep Visual Studio Integration
• Debug Hive jobs through YARN logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense
IntelliJ Integration
• Integration with Spark
• Remote debugging
• Native authoring support for Scala and Java
Easy for Data Scientists
Out-of-the-box notebook integration
• Most popular OSS notebook, Jupyter, out-of-the-box
• Worked with Jupyter community to enhance kernel to allow Spark execution through REST endpoint
Designed for Data Scientists
• Combine code, statistical equations and visualizations
• Tell a story with the data
Easy for Business Analysts
Integration with BI tools
• Power BI, Tableau, SAP Lumira and Qlik have integration with Spark
• Power BI offers streaming connector with Spark Streaming
• Do interactive BI with big data
R Server for HDInsight
Only managed cloud solution for running R
• Familiarity of R (most popular language for data scientists)
• Scalability of Hadoop and Spark
• Up to 7x faster using Spark engine
• Train and run ML models on datasets of any size
• Cloud managed solution (easy setup, elastic, SLA)
Introducing Azure HDInsight
Hyper scale Infrastructure is the enabler
32 Regions Worldwide, 24 Generally Available…
• 100+ datacenters
• Top 3 networks in the world
• 2.5x AWS, 7x Google DC Regions
• G Series – Largest VM in World, 32 cores, 448GB RAM, SSD…
[Map: Azure regions, marked as operational or announced/not operational]
• Americas: Central US (Iowa); West US (California); East US (Virginia); East US 2 (Virginia); North Central US (Illinois); South Central US (Texas); US Gov Virginia; US Gov Iowa; US DoD East (TBD); US DoD West (TBD); Canada East (Quebec City); Canada Central (Toronto); Brazil South (Sao Paulo State)
• Europe: North Europe (Ireland); West Europe (Netherlands); Germany North East ** (Magdeburg); Germany Central ** (Frankfurt); United Kingdom regions
• Asia Pacific: China North * (Beijing); China South * (Shanghai); Japan East (Tokyo, Saitama); Japan West (Osaka); India South (Chennai); India Central (Pune); India West (Mumbai); East Asia (Hong Kong); SE Asia (Singapore); Australia South East (Victoria); Australia East (New South Wales); Korea (Seoul) (2)
* Operated by 21Vianet
** Data Stewardship by Deutsche Telekom
Why Microsoft Azure?
[Figure: Azure data services alongside on-premises Hadoop (software, appliances): Azure Storage, HDInsight, Data Factory, ML, Stream Analytics, Database, DocumentDB, Search, Event Hubs, Azure Blob Storage, Azure Data Lake Store]
Azure Facts
• >4 trillion objects in Azure
• 300,000-1M+ requests per second
• Double compute and storage every 6 months
No hardware challenges
HDInsight in the Cloud bypasses hardware costs
• Hardware acquisition
• Hardware maintenance
• Performance tuning
HDInsight in the Cloud bypasses capacity planning
• Spin up any number of Hadoop nodes on-demand
• Go from tens of nodes to thousands of nodes
[Figure callouts: No HW costs ($0); Unlimited scale]
Deployed in minutes
HDInsight in the Cloud bypasses deployment expertise
• Hadoop is non-trivial to install and get up and running on multi-nodes
• Education gap in IT community regarding Hadoop
HDInsight is deployed in minutes
• Spin up any number of Hadoop nodes on-demand
• Up and running in a few clicks (and within minutes)
Mission Critical Hadoop
Mission Critical, Enterprise Ready
• Managed Hadoop, backed by an SLA
• Three nines of availability: 99.9% uptime
HDInsight Auto Replicates Data
• Automatic geo-replication of data
• Data only replicates within the same geopolitical area (i.e., country or region)
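To put the 99.9% figure in perspective, the small calculation below converts an availability percentage into the downtime it still permits. The function name and the 30-day month are illustrative assumptions; the actual SLA terms define how downtime is measured.

```python
def allowed_downtime_minutes(sla_percent, period_days):
    """Maximum downtime (in minutes) permitted by an availability SLA over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

per_month = allowed_downtime_minutes(99.9, 30)   # 30-day month assumed
per_year = allowed_downtime_minutes(99.9, 365)
print(f"{per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

Three nines therefore still allows roughly 43 minutes of downtime in a 30-day month, or about 8.76 hours over a year.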
Maintenance done for you
Minimal IT resources for upgrades/patching
• OS patching and security updates done automatically
Minimal IT resources to update Hadoop versions
• Hadoop versions are rapidly releasing throughout the year
• Always be on the latest version of Hadoop with no effort
HDInsight adds latest version of Hadoop for you
• HDInsight on Hadoop 1.1.2: Oct 2013
• HDInsight on Hadoop 2.2: April 2014
• HDInsight on Hadoop 2.4: June 2014
• HDInsight on Hadoop 2.6: Feb 2015
• HDInsight on Hadoop 2.7.1: March 2016
[Figure callouts: O/S upgrades; O/S patching]
Low Cost
HDInsight is billed by usage
• Billed for usage
• Clusters can be deleted when no longer used
No additional price for support
• Azure Support includes Hadoop support
• What usually costs thousands of dollars per node is included
63% Lower Total Cost of Ownership*
• 418% 5 year ROI*
• 3.9 month payback period*
• 63% TCO savings versus on-premises Hadoop*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Introducing Azure HDInsight
Scalable, manageable, trusted
Bringing Hadoop to a billion people
• 1 Billion Microsoft Office users
• Office 365 is our fastest-growing commercial product ever
• Excel as the BI tool for everyone
• Power BI for collaboration & new experiences
• Connect to HDInsight, analyze, visualize
• Share, ask, access
[Figure labels: Devices, Applications, Dashboards]
Making advanced analytics accessible to Hadoop
Microsoft Azure Machine Learning
[Figure: Azure Machine Learning workflow. Data sources: HDInsight, SQL DB, blobs & tables, SQL Server VM, desktop, cloud. ML Studio workspace (web): historical data, test model types, run & refine, results; easily make changes. Microsoft Azure Portal: publish API in minutes as an ML API service]
HDInsight vs HDP on Azure VM

HDInsight                                  HDP on Azure VM
PaaS (setup, scale, manage, patch, etc)    IaaS
Managed by Microsoft                       Managed by customer
Storage separate (Blob or ADLS)            Storage in VM (local disk), but can also have storage in Azure blob or ADLS
Delete VM keeps data                       Delete VM deletes data (unless external)
Up to 30-days behind latest HDP version    Latest HDP version
Limited Hadoop projects                    Unlimited Hadoop projects
Microsoft supports VM and Hadoop           Microsoft: VM, HDP: Hadoop
No on-prem version                         On-prem version
Azure Data Lake Analytics
• Distributed, parallel analytics framework
• U-SQL (based on C# and SQL)
• Dial for scale
• Hides infrastructure complexity
• Visual Studio integration
• Instant scale on demand
• Reduced learning curve
Azure Services for big data analytics
[Figure: Azure Data Lake. Sources: clickstream, sensors, video, social, web, devices, relational, applications. Store (HDFS); Analytics Service (U-SQL) and HDInsight on YARN; partners]
Agenda
• What Is Hadoop?
• Why Deploy To the Cloud?
• Microsoft’s Solution
• How Do I Get Started?
Get Started
• Read documentation: http://azure.microsoft.com/en-us/documentation/services/hdinsight/
• Learning Map: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map/
• Microsoft Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-big-data
• Channel 9 Data Exposed Show: http://channel9.msdn.com/Shows/Data-Exposed
• Try 30 day trial: http://azure.microsoft.com/en-us/pricing/free-trial/
Azure getting started
• Free Azure account, $200 in credit, https://azure.microsoft.com/en-us/free/
• Startups: BizSpark, $750/month free Azure, BizSpark Plus - $120k/year free Azure, https://www.microsoft.com/bizspark/
• MSDN subscription, Data Platform MVP, $150/month free Azure, https://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/
• Microsoft Educator Grant Program, faculty - $250/month free Azure for a year, students - $100/month free Azure for 6 months, https://azure.microsoft.com/en-us/pricing/member-offers/msdn-benefits/
• Microsoft Azure for Research Grant, http://research.microsoft.com/en-us/projects/azure/default.aspx
• DreamSpark for students, https://www.dreamspark.com/Student/Default.aspx
• DreamSpark for academic institutions: https://www.dreamspark.com/Institution/Subscription.aspx
• Various Microsoft funds so you can learn the technologies or build a client solution
Pricing for HDInsight
Capabilities by tier: Standard, Premium (Preview)
Big Data Workloads
• Standard Hadoop and Open Source Projects (Core Hadoop & YARN, Hive & HCatalog, Tez, Pig, Sqoop, Oozie, Zookeeper, Phoenix)
• Columnar NoSQL (HBase)
• Stream processing (Storm)
• Interactive processing, real-time stream processing & ML (Spark)
• Big Data statistics, predictive modeling, and machine learning with R Server
Enterprise Readiness
• Administration: manage, monitor & troubleshoot clusters
• Hadoop version upgrades and patching: automatic patching and upgrades
• Encryption of data at rest
Price
• Standard: Standard price per node
• Premium (Preview): HDInsight Standard price + $0.02/core-hour for each core used in the cluster during preview (75% discount)
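The premium surcharge is easy to estimate once the cluster shape is known. The sketch below is only an illustration: the $0.02 per core-hour comes from the pricing slide, but the standard per-node price, the assumption that it is charged per node-hour, and the cluster size are all made-up values.

```python
PREMIUM_SURCHARGE_PER_CORE_HOUR = 0.02  # USD, preview price from the slide

def cluster_cost(nodes, cores_per_node, hours, standard_price_per_node_hour, premium=False):
    """Estimate cluster cost: standard per-node price, plus the per-core
    premium surcharge when premium=True."""
    cost = nodes * hours * standard_price_per_node_hour
    if premium:
        cost += nodes * cores_per_node * hours * PREMIUM_SURCHARGE_PER_CORE_HOUR
    return cost

# Hypothetical cluster: 4 nodes x 8 cores for 10 hours at a made-up $0.50 per node-hour
standard = cluster_cost(4, 8, 10, 0.50)
premium = cluster_cost(4, 8, 10, 0.50, premium=True)
print(f"standard ${standard:.2f}, premium ${premium:.2f}")
```

Because billing follows usage, deleting the cluster when the job is done stops both the standard charge and the surcharge.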
Resources
• What is HDInsight? http://bit.ly/1WpS0at
• Hadoop and Microsoft http://bit.ly/20Cg2hA
• Introduction to Hadoop http://bit.ly/1WpTstq
Q & A ?
James Serra, Big Data Evangelist
• Email me at: [email protected]
• Follow me at: @JamesSerra
• Link to me at: www.linkedin.com/in/JamesSerra
• Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)