Getting started with Hadoop on the Cloud

Mangesh Surve (mangesh.surve@in.ibm.com)
Analytics Platform Technical Leader – India and South Asia

Source: files.meetup.com/9505932/Hadoop on Cloud.pdf

© 2014 IBM Corporation

Welcome

Goal: Get Started With BigData

Hadoop
• What technical problem is it helping solve? BIG DATA
• What is Hadoop?
• BigInsights (IBM’s Hadoop distro)

Bluemix (IBM’s PaaS cloud solution)
• What technical problem is it helping solve?
• Analytics for Hadoop in the Cloud

Social Good Challenge
• 40000 USD prizes to be won
• You can participate

Demo & Get hands-on
• Bluemix: bluemix.net
• A4H Tutorial: https://developer.ibm.com/hadoop/docs/tutorials/analytics-hadoop-bluemix/

What is Big Data?

A way to describe data problems that are unsolvable using traditional tools

More Analytics on More Data for More People

What Data?

• Transactional & Application Data
• Machine Data
• Social Data
• Enterprise Content

Welcome to the Instrumented Interconnected World!

• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014

6,000,000 users on Twitter pushing out 300,000 tweets per day

500,000,000 users on Twitter pushing out 400,000,000 tweets per day

Growth: 83x in users, 1333x in tweets per day
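The two multipliers are the ratios of the later figures to the earlier ones. A quick check in Python (the variable names are illustrative and not from the slides):

```python
# Figures quoted on the slide.
users_before, users_after = 6_000_000, 500_000_000
tweets_before, tweets_after = 300_000, 400_000_000

# Growth factor = later value divided by earlier value.
user_growth = users_after / users_before
tweet_growth = tweets_after / tweets_before

print(f"users: {user_growth:.0f}x, tweets per day: {tweet_growth:.0f}x")
```

Tweet volume grew far faster than the user base, which is the point of the comparison: data volume does not scale linearly with the number of users.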

We’ve Moved into a New Era of Computing

• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.

Imagine the Possibilities of Harnessing Your Data Resources

• Retailer reduces time to run queries by 80% to optimize inventory
• Stock Exchange cuts queries from 26 hours to 2 minutes on 2 PB
• Government cuts acoustic analysis from hours to 70 milliseconds
• Utility avoids power failures by analyzing 10 PB of data in minutes
• Telco analyses streaming network data to reduce hardware costs by 90%
• Hospital analyses streaming vitals to detect illness 24 hours earlier

Big data challenges exist in every organization today

Every Industry can Leverage Big Data and Analytics

Insurance
• 360˚ View of Domain or Subject
• Catastrophe Modeling
• Fraud & Abuse
• Producer Performance Analytics
• Analytics Sandbox

Banking
• Optimizing Offers and Cross-sell
• Customer Service and Call Center Efficiency
• Fraud Detection & Investigation
• Credit & Counterparty Risk

Telco
• Pro-active Call Center
• Network Analytics
• Location Based Services

Energy & Utilities
• Smart Meter Analytics
• Distribution Load Forecasting/Scheduling
• Condition Based Maintenance
• Create & Target Customer Offerings

Media & Entertainment
• Business process transformation
• Audience & Marketing Optimization
• Multi-Channel Enablement
• Digital commerce optimization

Retail
• Actionable Customer Insight
• Merchandise Optimization
• Dynamic Pricing

Travel & Transport
• Customer Analytics & Loyalty Marketing
• Predictive Maintenance Analytics
• Capacity & Pricing Optimization

Consumer Products
• Shelf Availability
• Promotional Spend Optimization
• Merchandising Compliance
• Promotion Exceptions & Alerts

Government
• Civilian Services
• Defense & Intelligence
• Tax & Treasury Services

Healthcare
• Measure & Act on Population Health Outcomes
• Engage Consumers in their Healthcare

Automotive
• Advanced Condition Monitoring
• Data Warehouse Optimization
• Actionable Customer Intelligence

Life Sciences
• Increase visibility into drug safety and effectiveness

Chemical & Petroleum
• Operational Surveillance, Analysis & Optimization
• Data Warehouse Consolidation, Integration & Augmentation
• Big Data Exploration for Interdisciplinary Collaboration

Aerospace & Defense
• Uniform Information Access Platform
• Data Warehouse Optimization
• Airliner Certification Platform
• Advanced Condition Monitoring (ACM)

Electronics
• Customer/Channel Analytics
• Advanced Condition Monitoring

Big Data Myths

Big Data is primarily about large datasets

We will have to replace all older systems

Older transactional data does not matter anymore

Data warehouses are a thing of the past

Big Data is only for internet savvy customers

We do not have the need, budget or skills

Big Data ≠ Hadoop

“There’s a belief that if you want big data, you need to go out and buy Hadoop and then you’re pretty much set. People shouldn’t get ideas about turning off their relational systems and replacing them with Hadoop.”

Ken Rudin, Head of Analytics at Facebook

Big Data Methodologies: Leverage more of the data being captured

TRADITIONAL APPROACH: Analyze small subsets of information (the analyzed information is a small part of all available information)

BIG DATA APPROACH: Analyze all information (all available information is analyzed)

Big Data Methodologies: Reduce effort required to leverage data

TRADITIONAL APPROACH: Carefully cleanse information before any analysis (small amount of carefully organized information)

BIG DATA APPROACH: Analyze information as is, cleanse as needed (large amount of messy information)

Big Data Methodologies: Data leads the way

TRADITIONAL APPROACH: Start with hypothesis and test against selected data (diagram labels: Hypothesis, Question, Data, Answer)

BIG DATA APPROACH: Explore all data and identify correlations (diagram labels: Data, Exploration, Correlation, Insight)

Big Data Methodologies: Leverage data as it is captured

TRADITIONAL APPROACH: Analyze data after it’s been processed and landed in a warehouse or mart (diagram labels: Data, Repository, Analysis, Insight)

BIG DATA APPROACH: Analyze data in motion as it’s generated, in real-time (diagram labels: Data, Analysis, Insight)
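The difference between analyzing data at rest and data in motion can be sketched with a running average. This is a minimal illustration, not how any particular streaming product works; the function names and sample readings are assumptions made for the example.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def batch_average(readings: list[float]) -> float:
    """Traditional approach: wait until all data has landed, then analyze it once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Data in motion: update the answer each time a new reading arrives."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # an up-to-date insight after every event


if __name__ == "__main__":
    sample = [10.0, 20.0, 30.0]  # hypothetical sensor readings
    print(batch_average(sample))
    print(list(streaming_average(sample)))
```

The streaming version never needs the full data set in a repository; it keeps only the running count and total, so an answer is available as soon as the first reading arrives.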

The Information Supply Chain

Actionable insight

Reporting & interactive analysis

Data types

Transaction and application data

Predictive analytics and modeling

Reporting and analysis

Operational systems

Archive

Enterprise Warehouse

Staging area

The Modernised Environment

Building the business transformation

the right architecture for business and IT

value to business leaders through pilot programs

by expanding to additional use cases

to a data-driven culture

high-value opportunities

What is Hadoop?

• Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
• Hides underlying system details and complexities from the user
• Developed in Java
• Core sub-projects:
  − MapReduce
  − Hadoop Distributed File System, a.k.a. HDFS
• Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, Flume, etc.
• Meant for heterogeneous commodity hardware
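To make the MapReduce idea concrete, here is a minimal single-process sketch of a word count in Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mimic the three stages that the framework would run across many machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key between the two phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop on the cloud"]
    print(word_count(sample))
```

In a real cluster the map calls would run in parallel on the nodes that hold each block of input, and the grouped keys would be partitioned across reducers; the sketch keeps everything in one process so the data flow is easy to follow.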

Design Principles of Hadoop

New way of storing and processing the data: let the system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Relatively inexpensive hardware

Bring processing to Data!

Hadoop = HDFS + MapReduce infrastructure + …

Optimized to handle:
• Massive amounts of data through parallelism
• A variety of data (structured, unstructured, semi-structured)
• Using inexpensive commodity hardware

Reliability provided through replication
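The principle of reliability through replication can be illustrated with a toy placement model. This is a sketch under stated assumptions: the block size, the replication factor of 3, the node names, and the round-robin placement are all chosen for the example and are not taken from the slides. Real HDFS placement also takes rack topology into account, which is ignored here.

```python
from __future__ import annotations

BLOCK_SIZE = 128   # assumed block size in MB, for illustration only
REPLICATION = 3    # assumed number of copies kept of each block


def place_blocks(file_size_mb: int, nodes: list[str]) -> dict[int, list[str]]:
    """Split a file into fixed-size blocks and assign each block to
    REPLICATION distinct nodes, rotating round-robin through the cluster."""
    if len(nodes) < REPLICATION:
        raise ValueError("need at least REPLICATION nodes")
    n_blocks = -(-file_size_mb // BLOCK_SIZE)  # ceiling division
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        for b in range(n_blocks)
    }


def readable_after_failure(placement: dict[int, list[str]], dead: set[str]) -> bool:
    """A file stays readable if every block keeps a replica on a live node."""
    return all(
        any(node not in dead for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4", "node5"]
    placement = place_blocks(500, cluster)
    print(placement)
    print(readable_after_failure(placement, {"node2", "node3"}))
    print(readable_after_failure(placement, {"node1", "node2", "node3"}))
```

With three copies on distinct nodes, losing any two nodes still leaves one copy of every block; the file only becomes unreadable when all nodes holding some block fail together.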

IBM Enriches Hadoop

• Scalable: New nodes can be added on the fly
• Affordable: Massively parallel computing on commodity servers
• Flexible: Hadoop is schema-less, and can absorb any type of data
• Fault Tolerant: Through MapReduce software framework

Innovation
• Performance & reliability: Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
• Enterprise Hardening of Hadoop
• Productivity Accelerators: Web-based UIs and tools, End-user visualization, Analytic Accelerators, +++
• Enterprise Integration: To extend & enrich your information supply chain

IBM BigInsights – Open Source and IBM Value Adds

IBM value adds
• Real-time Analytics: InfoSphere Streams
• Enterprise Performance: Adaptive MapReduce & Big SQL
• Storage Integration: GPFS POSIX Distributed Filesystem
• Data Governance and Security: Data Click, LDAP and Secured Cluster
• Search: BigIndex and Data Explorer
• Data Exploration: BigSheets “schema-on-read” tooling
• Predictive Modeling: BigR “scalable data mining” on R
• Text Analytics: Text processing with AQL
• ANSI SQL: BigSQL Optimized SQL support
• Application Tooling: Toolkits and accelerators

Open source components: MapReduce, HDFS, HBase, Flume, Pig, Lucene, Jaql, ZooKeeper, Oozie, Hive, Sqoop, HCatalog

100% based on Apache Open Source Hadoop Components

Big SQL

IBM’s SQL engine for Hadoop

Diagram labels: SQL-based Application; IBM data server client; Big SQL Engine; SQL MPP Run-time; Data Sources (CSV, Seq, Parquet, RC, ORC, Avro, Custom, JSON); BigInsights

Comprehensive, standard SQL
• SELECT: joins, unions, aggregates, subqueries . . .
• GRANT/REVOKE, INSERT … INTO
• PL/SQL
• Stored procs, user-defined functions
• IBM data server JDBC and ODBC drivers

Optimization and performance
• Java MapReduce layer replaced with high performance IBM MPP engine (C++)
• Continuous running daemons (no start up latency)
• Message passing allows data to flow between nodes without persisting intermediate results
• In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
• Cost-based query optimization with 140+ rewrite rules

Various storage formats supported
• Data persisted in DFS, Hive
• No IBM proprietary format required

Integration with RDBMSs via LOAD, query federation
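As an illustration of the kind of standard SQL listed above (a join, an aggregate, and a subquery in one SELECT), the sketch below runs such a query from Python. It uses an in-memory SQLite database purely as a stand-in so the example is runnable anywhere; it is not a Big SQL connection, and the `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

# SQLite in-memory database used as a stand-in; this is NOT Big SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'South Asia'), (2, 'Europe'), (3, 'South Asia');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 2, 80.0), (3, 3, 200.0), (4, 1, 40.0);
""")

# Join + aggregate + subquery: per-region totals over orders that are
# larger than half the average order amount.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount > (SELECT AVG(amount) FROM orders) / 2
    GROUP BY c.region
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)
conn.close()
```

The appeal of a standard SQL layer is that a query like this could, in principle, stay the same while the engine underneath changes; only the connection and the physical storage differ.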

Big Data Accelerators Make it Easier than Ever to Build Big Data Applications

• Telecommunications Event Data: CDR streaming analytics, Deep Customer Event Analytics. Ships with InfoSphere Streams
• Social Data Analytics: Sentiment Analytics, Intent to purchase. Ships with InfoSphere BigInsights & Streams
• Machine Data Analytics: Operational data including logs for operations efficiency. Ships with InfoSphere BigInsights

Evolution of Cloud Technologies

• Virtualization: “I want to get more out of my existing hardware”
• Cloud Enabled: “I want to move my existing middleware workloads to the cloud”
• Cloud Native: “I want to rapidly build new, born on the cloud, engaging applications in a continuous delivery model”
• Business Services (SaaS): “I want to use an app without having to own it”
• Dynamic Hybrid: “I want to strategically use public and private cloud together”

PaaS sits at the center of the cloud delivery model

Stack layers shown for each delivery model: Networking, Storage, Servers, Virtualization, O/S, Middleware, Runtime, Data, Applications

• Infrastructure as a Service (IT Admin): stack split between “Vendor Manages in Cloud” and “Client Manages”
• Platform as a Service (Developer): stack split between “Vendor Manages in Cloud” and “Client Manages”
• Software as a Service (Business Person): “Vendor Manages in Cloud”

One end of the scale: customization; higher costs; slower time to value
Other end of the scale: standardization; lower costs; faster time to value

Developers, Developers, Developers!

• Move quickly, see results fast.
• Learn by tinkering and playing.
• Needs to learn new skills through playing and experimenting safely.
• Needs freedom to experiment without worrying about pricing right away.

What is Bluemix?

Bluemix is an open-standard, cloud-based platform for building, managing, and running applications of all types (web, mobile, big data, new smart devices, and so on).

• Go Live in Seconds: The developer can choose any language runtime or bring their own. Zero to production in one command.
• DevOps: Development, monitoring, deployment, and logging tools allow the developer to run the entire application.
• APIs and Services: A catalog of IBM, third party, and open source API services allows the developer to stitch an application together in minutes.
• On-Prem Integration: Build hybrid environments. Connect to on-premise assets plus other public and private clouds.
• Flexible Pricing: Sign up in minutes. Pay as you go and subscription models offer choice and flexibility.
• Layered Security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps.

Create apps quickly with prebuilt services

Runtimes, services, and tooling up to you

Choice

Industry Leading IBM Capabilities

Services leveraging the depth of IBM software

Full range of capabilities

Completeness

Open source platform and services

Third party to enable key use cases

Service categories:
• Security Services
• Web and application services
• Cloud Integration Services
• Mobile Services
• Database services
• Big Data services
• Internet of Things Services
• Watson Services
• DevOps Services

A full range of capabilities to suit any great idea.

An Entire Continuum Working Together

Diagram labels: Infrastructure Services; Defined Pattern Services; Composable Services; Systems of Record; Business Services; Analytics

Virtual Appliance (shown three times): Metadata; Application Server or HTTP Server; Operating system

IBM Analytics for Hadoop Service

Powered by BigInsights 3.0.0.1 & Bluemix

• Get started with Hadoop in Minutes. Tutorial: https://developer.ibm.com/hadoop/docs/tutorials/
• Dedicated Single Node Env
• BIAdmin Authority
• Access to the Web console
• Secure HTTPS channel powered by SSL certificates
• Bluemix Single Sign On (SSO)

Want to learn more?

Download Quick Start Edition

Test drive the technologies
• Follow online tutorials
• Enroll in online classes
• Watch video demos, read articles, etc.

Links all available from HadoopDev
• https://developer.ibm.com/hadoop/

BigInsights Quick Start Edition

Download: http://ibm.co/QuickStart

THANK YOU