© 2014 IBM Corporation
Getting started with Hadoop on the Cloud
Mangesh [email protected] Platform Technical Leader – India and South Asia
Welcome
Goal: Get Started With BigData
• Hadoop: What technical problem is it helping solve? BIG DATA. What is Hadoop? BigInsights (IBM’s Hadoop distro)
• Bluemix (IBM’s PaaS cloud solution): What technical problem is it helping solve? Analytics for Hadoop in the Cloud
• Social Good Challenge: 40000 USD prizes to be won. You can participate
• Demo & Get hands-on
  Bluemix: bluemix.net
  A4H Tutorial: https://developer.ibm.com/hadoop/docs/tutorials/analytics-hadoop-bluemix/
What is Big Data?
A way to describe data problems that are unsolvable using traditional tools
More Analytics on More Data for More People
What Data?
• Transactional & Application Data
• Machine Data
• Social Data
• Enterprise Content
Welcome to the Instrumented Interconnected World!
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
6,000,000 users on Twitter pushing out 300,000 tweets per day
500,000,000 users on Twitter pushing out 400,000,000 tweets per day
Users: 83x. Tweets per day: 1333x.
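The two growth factors follow directly from the figures on the slide. A minimal sketch of the arithmetic is below; the names `early` and `later` are illustrative only, since the slide does not date the two snapshots.

```python
# Figures as stated on the slide. The labels "early" and "later" are
# assumptions for illustration; the slide gives no dates.
early = {"users": 6_000_000, "tweets_per_day": 300_000}
later = {"users": 500_000_000, "tweets_per_day": 400_000_000}


def growth_factor(before: int, after: int) -> int:
    """Return how many times larger `after` is than `before`, truncated."""
    return after // before


user_growth = growth_factor(early["users"], later["users"])
tweet_growth = growth_factor(early["tweets_per_day"], later["tweets_per_day"])

print(f"users: {user_growth}x, tweets per day: {tweet_growth}x")
# prints "users: 83x, tweets per day: 1333x"
```

Tweet volume grew far faster than the user base, which is the point the slide is making about data volume.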
We’ve Moved into a New Era of Computing
• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.
Imagine the Possibilities of Harnessing Your Data Resources
• Retailer reduces time to run queries by 80% to optimize inventory
• Stock Exchange cuts queries from 26 hours to 2 minutes on 2 PB
• Government cuts acoustic analysis from hours to 70 milliseconds
• Utility avoids power failures by analyzing 10 PB of data in minutes
• Telco analyses streaming network data to reduce hardware costs by 90%
• Hospital analyses streaming vitals to detect illness 24 hours earlier
Big data challenges exist in every organization today
Every Industry can Leverage Big Data and Analytics
• Insurance: 360˚ View of Domain or Subject; Catastrophe Modeling; Fraud & Abuse; Producer Performance Analytics; Analytics Sandbox
• Banking: Optimizing Offers and Cross-sell; Customer Service and Call Center Efficiency; Fraud Detection & Investigation; Credit & Counterparty Risk
• Telco: Pro-active Call Center; Network Analytics; Location Based Services
• Energy & Utilities: Smart Meter Analytics; Distribution Load Forecasting/Scheduling; Condition Based Maintenance; Create & Target Customer Offerings
• Media & Entertainment: Business process transformation; Audience & Marketing Optimization; Multi-Channel Enablement; Digital commerce optimization
• Retail: Actionable Customer Insight; Merchandise Optimization; Dynamic Pricing
• Travel & Transport: Customer Analytics & Loyalty Marketing; Predictive Maintenance Analytics; Capacity & Pricing Optimization
• Consumer Products: Shelf Availability; Promotional Spend Optimization; Merchandising Compliance; Promotion Exceptions & Alerts
• Government: Civilian Services; Defense & Intelligence; Tax & Treasury Services
• Healthcare: Measure & Act on Population Health Outcomes; Engage Consumers in their Healthcare
• Automotive: Advanced Condition Monitoring; Data Warehouse Optimization; Actionable Customer Intelligence
• Life Sciences: Increase visibility into drug safety and effectiveness
• Chemical & Petroleum: Operational Surveillance, Analysis & Optimization; Data Warehouse Consolidation, Integration & Augmentation; Big Data Exploration for Interdisciplinary Collaboration
• Aerospace & Defense: Uniform Information Access Platform; Data Warehouse Optimization; Airliner Certification Platform; Advanced Condition Monitoring (ACM)
• Electronics: Customer/Channel Analytics; Advanced Condition Monitoring
Big Data Myths
Big Data is primarily about large datasets
We will have to replace all older systems
Older transactional data does not matter anymore
Data warehouses are a thing of the past
Big Data is only for internet savvy customers
We do not have the need, budget or skills
Big Data Hadoop
>“There’s a belief that if you want big data, you need to go out and buy Hadoop and then you’re pretty much set. People shouldn’t get ideas about turning off their relational systems and replacing them with Hadoop.”
Ken Rudin, Head of Analytics at Facebook
Big Data Methodologies: Leverage more of the data being captured
• TRADITIONAL APPROACH: Analyze small subsets of information (only part of all available information is analyzed)
• BIG DATA APPROACH: Analyze all information (all available information is analyzed)
Big Data Methodologies: Reduce effort required to leverage data
• TRADITIONAL APPROACH: Carefully cleanse information before any analysis (small amount of carefully organized information)
• BIG DATA APPROACH: Analyze information as is, cleanse as needed (large amount of messy information)
Big Data Methodologies: Data leads the way
• TRADITIONAL APPROACH: Start with hypothesis and test against selected data (Hypothesis → Question → Data → Answer)
• BIG DATA APPROACH: Explore all data and identify correlations (Data → Exploration → Correlation → Insight)
Big Data Methodologies: Leverage data as it is captured
• TRADITIONAL APPROACH: Analyze data after it’s been processed and landed in a warehouse or mart (Data → Repository → Analysis → Insight)
• BIG DATA APPROACH: Analyze data in motion as it’s generated, in real-time (Data → Analysis → Insight)
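The contrast between analyzing landed data and analyzing data in motion can be sketched in a few lines. This is an illustrative sketch only, not how InfoSphere Streams or any warehouse works internally; the function names and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator, List


def analyze_at_rest(repository: List[float]) -> float:
    """Traditional approach: the data has landed, so analyze it afterwards."""
    return sum(repository) / len(repository)


def analyze_in_motion(stream: Iterable[float]) -> Iterator[float]:
    """Big data approach: refresh the insight as each reading arrives."""
    count = 0
    mean = 0.0
    for reading in stream:
        count += 1
        # Incremental mean: no need to store earlier readings.
        mean += (reading - mean) / count
        yield mean


# Hypothetical sensor readings used only for illustration.
readings = [10.0, 20.0, 30.0, 40.0]

print(analyze_at_rest(readings))  # one answer, after all data is stored
for insight in analyze_in_motion(iter(readings)):
    print(insight)                # an updated answer after every reading
```

The streaming version yields an answer after every reading and keeps only two numbers in memory, which is what makes analysis "as it is captured" feasible.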
The Information Supply Chain
[Diagram labels: Data types (Transaction and application data); Operational systems; Staging area; Enterprise Warehouse; Archive; Reporting and analysis; Reporting & interactive analysis; Predictive analytics and modeling; Actionable insight]
The Modernised Environment
Building the business transformation
the right architecture for business and IT
value to business leaders through pilot programs
by expanding to additional use cases
to a data-driven culture
high-value opportunities
What is Hadoop?
• Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
• Hides underlying system details and complexities from the user
• Developed in Java
• Core sub projects:
  − MapReduce
  − Hadoop Distributed File System, a.k.a. HDFS
• Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, Flume, etc.
• Meant for heterogeneous commodity hardware
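To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It does not use the Hadoop API (which is Java); the function names and the two input splits are assumptions for illustration, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits; on a cluster each would live on a different node.
splits = ["big data on hadoop", "hadoop on the cloud"]
counts = word_count(splits)
print(counts)
```

Because each map call sees only its own line and each reduce call sees only one key, both steps can be distributed without the programmer managing the parallelism.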
Design Principles of Hadoop
New way of storing and processing the data: let the system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Relatively inexpensive hardware
Bring processing to Data!
Hadoop = HDFS + MapReduce infrastructure + …
Optimized to handle
• Massive amounts of data through parallelism
• A variety of data (structured, unstructured, semi-structured)
• Using inexpensive commodity hardware
Reliability provided through replication
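The "reliability through replication" principle can be illustrated with a toy model: split a file into blocks, store each block on several distinct nodes, and check whether every block is still readable after some nodes fail. This is a sketch under simplifying assumptions. The round-robin placement, the tiny block size, and the node names are hypothetical; real HDFS uses much larger blocks and a rack-aware placement policy, with a default replication factor of 3.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a simplification chosen for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]],
                           failed: Set[str]) -> bool:
    """A file is readable if every block has at least one surviving replica."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]          # hypothetical cluster
blocks = split_into_blocks(b"x" * 10, block_size=4)   # 3 blocks: 4 + 4 + 2 bytes
placement = place_replicas(len(blocks), nodes, replication=3)

print(readable_after_failure(placement, {"node1", "node2"}))           # True
print(readable_after_failure(placement, {"node1", "node2", "node3"}))  # False
```

With three replicas per block, any two node failures leave at least one copy of every block, which is why commodity hardware is acceptable.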
IBM Enriches Hadoop
• Scalable: New nodes can be added on the fly
• Affordable: Massively parallel computing on commodity servers
• Flexible: Hadoop is schema-less, and can absorb any type of data
• Fault Tolerant: Through MapReduce software framework
Innovation
• Performance & reliability: Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
• Enterprise Hardening of Hadoop
• Productivity Accelerators: Web-based UIs and tools, End-user visualization, Analytic Accelerators, +++
• Enterprise Integration: To extend & enrich your information supply chain
IBM BigInsights – Open Source and IBM Value Adds
• Real-time Analytics: InfoSphere Streams
• Enterprise Performance: Adaptive MapReduce & Big SQL
• Storage Integration: GPFS POSIX Distributed Filesystem
• Data Governance and Security: Data Click, LDAP and Secured Cluster
• Search: BigIndex and Data Explorer
• Data Exploration: BigSheets “schema-on-read” tooling
• Predictive Modeling: BigR “scalable data mining” on R
• Text Analytics: Text processing with AQL
• ANSI SQL: BigSQL Optimized SQL support
• Application Tooling: Toolkits and accelerators
100% based on Apache Open Source Hadoop Components: MapReduce, HDFS, HBase, Flume, Pig, Lucene, Jaql, ZooKeeper, Oozie, Hive, Sqoop, HCatalog
Big SQL
IBM’s SQL engine for Hadoop
Comprehensive, standard SQL
• SELECT: joins, unions, aggregates, subqueries . . .
• GRANT/REVOKE, INSERT … INTO
• PL/SQL
• Stored procs, user-defined functions
• IBM data server JDBC and ODBC drivers
Optimization and performance
• Java MapReduce layer replaced with high performance IBM MPP engine (C++)
• Continuously running daemons (no start up latency)
• Message passing allows data to flow between nodes without persisting intermediate results
• In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
• Cost-based query optimization with 140+ rewrite rules
Various storage formats supported
• Data persisted in DFS, Hive
• No IBM proprietary format required
Integration with RDBMSs via LOAD, query federation
[Diagram labels: SQL-based Application; IBM data server client; Big SQL Engine (SQL MPP Run-time); Data Sources (CSV, Seq, Parquet, RC, ORC, Avro, JSON, Custom); BigInsights]
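The kind of standard SQL listed above (a join, an aggregate, and a subquery in one SELECT) looks like the following. This sketch runs against Python's built-in sqlite3 module purely as a local stand-in, so it says nothing about Big SQL's dialect details or performance; a Big SQL query would instead be submitted through the IBM data server JDBC or ODBC drivers. The table names and rows are hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in engine so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'India'), (2, 'South Asia');
INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 25.0);
""")

# Join + aggregate + subquery: total order value per region, counting only
# orders larger than half the average order amount.
query = """
SELECT c.region, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.amount > (SELECT AVG(amount) FROM orders) / 2
GROUP BY c.region
ORDER BY total DESC
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('India', 150.0)]
conn.close()
```

The point of a standard SQL layer is that a query like this stays the same whether the rows sit in a relational table or in files on a distributed file system.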
Big Data Accelerators Make it Easier than Ever to Build Big Data Applications
• Telecommunications Event Data: CDR streaming analytics, Deep Customer Event Analytics. Ships with InfoSphere Streams
• Social Data Analytics: Sentiment Analytics, Intent to purchase. Ships with InfoSphere BigInsights & Streams
• Machine Data Analytics: Operational data including logs for operations efficiency. Ships with InfoSphere BigInsights
Evolution of Cloud Technologies
• Virtualization: “I want to get more out of my existing hardware”
• Dynamic Hybrid: “I want to strategically use public and private cloud together”
• Cloud Native: “I want to rapidly build new, born on the cloud, engaging applications in a continuous delivery model”
• Business Services (SaaS): “I want to use an app without having to own it”
• Cloud Enabled: “I want to move my existing middleware workloads to the cloud”
PaaS sits at the center of the cloud delivery model
Stack layers, identical in each model: Networking, Storage, Servers, Virtualization, O/S, Middleware, Runtime, Data, Applications
• Infrastructure as a Service (IT Admin): Client Manages / Vendor Manages in Cloud
• Platform as a Service (Developer): Client Manages / Vendor Manages in Cloud
• Software as a Service (Business Person): Vendor Manages in Cloud
From Infrastructure to Software as a Service: Customization; higher costs; slower time to value → Standardization; lower costs; faster time to value
Developers, Developers, Developers!
• Move quickly, see results fast.
• Learn by tinkering and playing.
• Needs to learn new skills through playing and experimenting safely.
• Needs freedom to experiment without worrying about pricing right away.
What is Bluemix?
Bluemix is an open-standard, cloud-based platform for building, managing, and running applications of all types (web, mobile, big data, new smart devices, and so on).
• Go Live in Seconds: The developer can choose any language runtime or bring their own. Zero to production in one command.
• DevOps: Development, monitoring, deployment, and logging tools allow the developer to run the entire application.
• APIs and Services: A catalog of IBM, third party, and open source API services allows the developer to stitch an application together in minutes.
• On-Prem Integration: Build hybrid environments. Connect to on-premise assets plus other public and private clouds.
• Flexible Pricing: Sign up in minutes. Pay as you go and subscription models offer choice and flexibility.
• Layered Security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps.
Create apps quickly with prebuilt services
• Choice: Runtimes, services, and tooling up to you
• Industry Leading IBM Capabilities: Services leveraging the depth of IBM software; Full range of capabilities
• Completeness: Open source platform and services; Third party to enable key use cases
Service categories: Security Services; Web and application services; Cloud Integration Services; Mobile Services; Database services; Big Data services; Internet of Things Services; Watson Services; DevOps Services
A full range of capabilities to suit any great idea.
An Entire Continuum Working Together
• Infrastructure Services: Virtual Appliances, each bundling Metadata, an Application Server or HTTP Server, and an Operating system
• Defined Pattern Services
• Composable Services
[Diagram labels: Systems of Record; Business Services; Analytics]
IBM Analytics for Hadoop Service
Powered by BigInsights 3.0.0.1 & Bluemix
• Get started with Hadoop in Minutes
• Tutorial: https://developer.ibm.com/hadoop/docs/tutorials/
• Dedicated Single Node Env
• BIAdmin Authority
• Access to the Web console
• Secure HTTPS channel powered by SSL certificates
• Bluemix Single Sign On (SSO)
Want to learn more?
Download Quick Start Edition
Test drive the technologies
• Follow online tutorials
• Enroll in online classes
• Watch video demos, read articles, etc.
Links all available from HadoopDev
• https://developer.ibm.com/hadoop/
BigInsights Quick Start Edition
Download: http://ibm.co/QuickStart
THANK YOU