Introduction to Big Data


Transcript of Introduction to Big Data

Page 1: Introduction to Big Data

Introduction to Big Data

Page 2: Introduction to Big Data

What is Big Data?

"Big Data are high-volume, high-velocity, and/or high-variety

information assets that require new forms of processing to enable

enhanced decision making, insight discovery and process

optimization.“ – Gartner, 2012

Page 3: Introduction to Big Data

Evolution of Big Data

Data explosion!

48 hours of equities market data ~ 5 TB

3.3 months of OPRA feeds ~ 5 PB

Semi-structured, unstructured, and real-time data

Google processes petabytes per hour

Bioinformatics – large datasets of genetics & drug formulations

Money laundering / terror funding, spatial data

"By 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage." – Gartner

Already producing more than 1.9 zettabytes of data – analyst firm IDC Australia

Page 4: Introduction to Big Data

4 V's of Big Data

Volume
  Big Data comes in one size – large
  Terabytes / petabytes / exabytes

Velocity
  Streaming data, time-sensitive data
  Batch, near-real-time, and real-time streams

Variety
  Structured, semi-structured, unstructured
  Text, audio/video, click streams, log files, etc.

Veracity
  Data in doubt – the truthfulness, authenticity, and correctness of data

Each V ultimately ties back to value to the business.

Page 5: Introduction to Big Data

Value for Business

Finance – deeper analysis to avoid credit risk; predict the future and make lower-risk investments

Healthcare – targeted medicines with fewer complications and side effects

Telecommunication – predict failures and keep networks reliable

Media – targeted and focused content

Retail – focus on what the customer wants

Government – services based on facts, not on fiction

Page 6: Introduction to Big Data

Big Data - Use Cases

Page 7: Introduction to Big Data

Potential Use Cases for Big Data

Page 8: Introduction to Big Data

Big Data - Use Cases: Telecommunication

Telecom vendors ask:
  What plans should I offer to my customers?
  24x7 service – can I predict the failures?
  Why am I losing customers?
  Can I offer something better?

Subscribers ask:
  Which plans are good for me?
  Is the network reliable?
  Better/best deals?

Page 9: Introduction to Big Data

Big Data - Use Cases: Financial Services

The merchant: "I am launching some new offers"
  How do I attract the relevant/interested customers?

The customer: "I want to buy some products"
  Are there any offers? Ah! There are so many offers… hard to find the relevant ones…
  How do I make the best use of them?
  Are these offers good for me?
  Which stores are providing these offers?

Page 10: Introduction to Big Data

Big Data - Use Cases: Web & Digital Media

A merchant portal paying $$$ for ad promotions asks:
  Are customers really coming to my site?
  How many visitors are turned into buyers?
  Are there any returning buyers?
  Am I getting any value?
  Market and customer segmentation?

Click-stream data feeds the analytical reports / web analytics that answer these questions.

Page 11: Introduction to Big Data

Big Data Challenges

Data processing:
  Processing & analyzing large data – terabytes and beyond
  Massively scalable and parallel
  Moving computation is cheaper than moving data
  Support for partial failure

Data storage:
  Doesn't fit on one node – requires a cluster
  Flexible, schema-less structure
  Data replication, partitioning, and sharding

Page 12: Introduction to Big Data

What Do We Need? Solution: Data Processing

Distributed / grid / parallel computing

Distributed data processing
  Scales linearly and is fault tolerant
  Leverages NoSQL storage systems as well
  Takes computation near the data to reduce IO
  Merges processed results and serves them
  Stores results for further analytics

Page 13: Introduction to Big Data

What Do We Need? Solution: Data Storage

NoSQL
  NoSQL starts where the RDBMS becomes dysfunctional
  Flexible data structure
  Scales linearly on commodity boxes – no SPOF
  Real-time and bi-directional replication

Sharding and partitioning of data
  Partition the data into smaller chunks
  Store the chunks on a distributed file system
  Replicate them to enable recovery from a node failure

Page 14: Introduction to Big Data

Technology Landscape

Page 15: Introduction to Big Data

Typical Hadoop Based Solution

Data sources: call data records, web clickstreams, network logs, satellite feeds, GPS data, sensor readings, sales data, emails

Input formats: XML, CSV, JSON, binary, log

Big data lands in HDFS on commodity servers, MapReduce jobs process it, and information comes out the other end.

Page 16: Introduction to Big Data

Deep Dive into Big Data Challenges

Page 17: Introduction to Big Data

Data Storage

NoSQL

Page 18: Introduction to Big Data

Background

RDBMS
  Ruled the world for the last three decades
  The Internet changed the world and the technology around us
  Scaling up does not work beyond a certain limit
  Scaling out is not very attractive either
  Sharding scales, but you lose all the useful features of an RDBMS
  Sharding is operationally difficult
  Web 2.0 apps have different requirements than enterprise apps

Page 19: Introduction to Big Data

Today’s Requirement - Data

Data does not fit on one node

Data may not fit in one rack

SANs are too expensive

Data partitioning – across multiple nodes / racks / datacenters

Evolving schema

Page 20: Introduction to Big Data

Today’s Requirement - Reliability

Must be highly available

Commodity nodes - they may crash

Data must survive disk / node failure

Data must survive datacenter failure

Page 21: Introduction to Big Data

Introduction to NoSQL

A different thought process

RDBMS vs. NoSQL

How do we store vs. How do we use

Referencing vs. Embedding

Fixed schema vs. Evolving schema

Depth of functionality vs. Scalability + performance

Compute on read vs. Compute on write

Page 22: Introduction to Big Data

Introduction to NoSQL

{
  "_id": "some unique string that is assigned to the contact",
  "type": "contact",
  "name": "contact's name",
  "birth_day": "a date in string form",
  "address": "the address in string form",
  "phone_number": "phone number in string form"
}

Common NoSQL data models:

  Document (the JSON example above)

  Graph

  Column-based, e.g. [ key = "name" ; value = "vinod" ; timestamp = Friday July 22, 2011 ]

Page 23: Introduction to Big Data

Data Model (Column Based)

Column (Cell)

In an RDBMS the closest analogue is a single cell; in NoSQL a column carries:

  name – emailAddress
  value – [email protected]
  timestamp – 1311150988226
  time to live – 3600
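A minimal Java sketch of the column (cell) abstraction above – the class and field names are illustrative, not any particular NoSQL driver's API:

    // A column is just a named value plus a timestamp and an optional TTL.
    public class Column {
        private final String name;        // e.g. "emailAddress"
        private final byte[] value;       // the stored bytes
        private final long timestamp;     // e.g. 1311150988226 (ms since epoch)
        private final int timeToLiveSecs; // e.g. 3600; 0 = never expires

        public Column(String name, byte[] value, long timestamp, int ttlSecs) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
            this.timeToLiveSecs = ttlSecs;
        }

        public String getName() { return name; }
    }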

Page 24: Introduction to Big Data

Data Model (Column Based)

Table / Column Family

RDBMS: a table is a fixed grid – every row has the same columns (Column 1, Column 2, … Column n).

NoSQL: a column family is sparse – each row key holds its own set of columns:
  rowKey 1 → Column 1
  rowKey 2 → Column 1, Column 2, Column 3
  rowKey n → Column z, … Column m

Page 25: Introduction to Big Data

Data Model (Column Based)

Super Column

NoSQL only – no matching concept in RDBMS.

A super column (e.g. "address") groups related columns (Column 1, Column 2, … Column n), each holding a name/value pair.
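Continuing the earlier illustrative sketch (again, not a real driver API), a super column is just a named container of sub-columns:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A super column such as "address" groups related sub-columns.
    public class SuperColumn {
        private final String name; // e.g. "address"
        private final Map<String, Column> columns = new LinkedHashMap<String, Column>();

        public SuperColumn(String name) { this.name = name; }

        // Add or overwrite a sub-column, keyed by its name.
        public void put(Column column) { columns.put(column.getName(), column); }
    }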

Page 26: Introduction to Big Data

Data Model (Column Based)

Super Column Family

NoSQL only – no matching concept in RDBMS.

Like a column family, but each row key holds super columns rather than simple columns:
  rowKey 1 → Super Column 1
  rowKey 2 → Super Column 1, Super Column 2, Super Column 3
  rowKey n → Super Column z, … Super Column m

Page 27: Introduction to Big Data

Use Cases: When to Use?

Huge amounts of data – distributed across the network

High query load – results must come back quickly

Evolving schema
  Changes should happen without a restart
  Migration is not an option with large amounts of data

Page 28: Introduction to Big Data

Use Cases: When to Avoid?

Complex transactions, such as in finance & accounting

ACID transactions are a must

Small data size

Page 29: Introduction to Big Data

NoSQL pros

Massive scalability

High Availability

Lower cost with predictable elasticity

Flexible data structure

Page 30: Introduction to Big Data

NoSQL cons

Limited data-query possibilities

Lower level of consistency, a.k.a. eventual consistency

No support for multi-object transactions

No standardization

Ad hoc data fixing & reporting – no query language available

Page 31: Introduction to Big Data

Curious case of NoSQL: How To…?

Scale, when data growth is ~50%?

Migrate massive relational-schema data to NoSQL?

Integrate existing application(s) with NoSQL?

Reduce the effort of learning in the NoSQL arena?

Page 32: Introduction to Big Data

Kundera

An open-source project, available at https://github.com/impetus-opensource/Kundera

An OGM (Object – Grid / NoSQL-Datastore) mapping tool

JPA 2.0 compliant (ZERO diversion) – developers don't need to unlearn (and relearn)

Easy to use, less boilerplate code – drop-dead simple and fun

Relieves developers from the diversity and complexity that come with NoSQL datastores

Up and running in 5 minutes – for Cassandra, for MongoDB, for HBase … and for any RDBMS

Page 33: Introduction to Big Data

Kundera Architecture

Page 34: Introduction to Big Data

Setting up Kundera

Download the jar: you can download the latest executable Kundera jar from here:
https://github.com/downloads/impetus-opensource/Kundera/kundera-cassandra-2.0.7-jar-with-dependencies.jar

Using Kundera with any Maven project – add the repository and dependency below

Building Kundera from source:

  git clone git@github.com:impetus-opensource/Kundera.git
  mvn clean install

<repository>
  <id>sonatype-nexus</id>
  <name>Kundera Public Repository</name>
  <url>https://oss.sonatype.org/content/repositories/releases</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

<dependency>
  <groupId>com.impetus</groupId>
  <artifactId>kundera</artifactId>
  <version>2.0.7</version>
</dependency>

Page 35: Introduction to Big Data

Tweet-store app

Problem statement:

User information (master data) and the corresponding tweets are stored in an RDBMS (Oracle/MySQL). Data growth for tweets (in the hundreds of TB) does not scale with an RDBMS. How do we scale and perform with this big data problem?

Solution: migrate the tweets to NoSQL while keeping the user master data in the RDBMS – polyglot persistence.

Page 36: Introduction to Big Data

Tweet-store app

Entity Definition and Configuration:
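The slide's code screenshot did not survive transcription; below is a minimal sketch of what the entity definition typically looks like with Kundera's JPA annotations. The User class, its fields, and the TweetStore@cassandra_pu schema/persistence-unit names are illustrative assumptions, not taken from the slide:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // Plain JPA annotations; Kundera reads the schema attribute as keyspace@persistence-unit.
    @Entity
    @Table(name = "users", schema = "TweetStore@cassandra_pu")
    public class User {
        @Id
        private String userId;

        @Column(name = "name")
        private String name;

        public String getUserId() { return userId; }
        public void setUserId(String userId) { this.userId = userId; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }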

Page 37: Introduction to Big Data

Tweet-store app contd…

Initialize the EntityManagerFactory

Create a User object

Find by key

Find by query
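As with the previous slide, the code itself is missing from the transcript; here is a minimal sketch of those four steps using plain JPA calls and the illustrative entity from the previous page:

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;
    import javax.persistence.Query;

    public class TweetStoreExample {
        public static void main(String[] args) {
            // 1. Initialize the EntityManagerFactory from persistence.xml
            EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
            EntityManager em = emf.createEntityManager();

            // 2. Create a User object and persist it
            User user = new User();
            user.setUserId("u1");
            user.setName("vinod");
            em.persist(user);

            // 3. Find by key
            User byKey = em.find(User.class, "u1");

            // 4. Find by query (JPQL)
            Query query = em.createQuery("SELECT u FROM User u WHERE u.name = :name");
            query.setParameter("name", "vinod");
            List<?> results = query.getResultList();

            em.close();
            emf.close();
        }
    }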

Page 38: Introduction to Big Data

Kundera-core

Kundera-engine

Kundera client extension framework – each datastore plugs in its own:
  NoSQL Client
  NoSQL ClientFactory
  NoSQL EntityReader
  NoSQL QueryImplementor

Page 39: Introduction to Big Data

Features

Supports Cassandra, HBase, MongoDB, and any RDBMS

Stronger query support (e.g., super-column-based search)

CRUD / query support across datastores

Object relationship handling across datastores

Caching

Connection pooling

Datastore-optimized persistence and query approach

Pluggable architecture – allows developers to create a library specific to a particular datastore, plug it into Kundera, and persist data into that datastore

Flexibility to choose Lucene-based or datastore-provided secondary indexing

Auto schema generation for Cassandra, MongoDB, HBase, and RDBMS

Page 40: Introduction to Big Data

Data Processing

Hadoop

Page 41: Introduction to Big Data

Beyond Multithreading

Ever-increasing computing requirements

Scaling – horizontal vs. vertical

Parallel vs. distributed

Fault tolerance

Grids / loosely coupled systems
  Built from commodity systems
  Aggregation of distributed systems
  Centralized or decentralized management

Page 42: Introduction to Big Data

Challenges of Distributed Processing

Production deployments need to be carefully planned

Unavailability of one node should not impact the whole system

Need for high-speed networks

Data replication involves data conflicts

Troubleshooting and diagnosing

Geographically distributed

Consistency & reliability

Page 43: Introduction to Big Data

What is Hadoop?

A batch-processing framework for distributed processing of large data sets on a network of commodity hardware

Designed to scale out

Fault tolerant – at the application level

Open source + commodity hardware = reduction in cost

Page 44: Introduction to Big Data

Components of Hadoop

NameNode

SNN - Secondary NameNode

JobTracker

TaskTracker

DataNode

HDFS

Page 45: Introduction to Big Data

Architecture of Hadoop

Typical Hadoop cluster – master/slave architecture:

Machine 1 (master): NameNode, JobTracker, Secondary NameNode – plus a TaskTracker and DataNode over HDFS

Machine 2: TaskTracker, DataNode, HDFS

Machine 3: TaskTracker, DataNode, HDFS

Page 46: Introduction to Big Data

Hadoop Distributed File System

Large Distributed File System

10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware

Failure is expected, rather than exceptional

Streaming Data Access

Write-once, read-many pattern (see the client sketch after this list)

Batch processing

Node failure - Replication
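A minimal sketch of the write-once, read-many pattern through the HDFS Java client API (the path and payload are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnce {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/data/events.log"); // illustrative path
            FSDataOutputStream out = fs.create(path); // written once...
            out.writeUTF("hello hdfs");
            out.close();
            // ...then read many times; HDFS files are not rewritten in place
        }
    }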

Page 47: Introduction to Big Data

Hadoop Distributed File System – NameNode Metadata

Metadata in memory
  The entire metadata is held in main memory
  No demand paging of metadata

Types of metadata
  List of files
  List of blocks for each file
  List of DataNodes for each block
  File attributes, e.g. creation time, replication factor

A transaction log
  Records file creations, file deletions, etc.

Page 48: Introduction to Big Data

Hadoop Distributed File System – DataNode

A Block Server

Stores data in the local file system.

Stores meta-data of a block.

Serves data and meta-data to Clients

Block Report

Periodically sends a report of all existing blocks to the NameNode

Facilitates Pipelining of Data

Forwards data to other specified DataNodes

Page 49: Introduction to Big Data

Hadoop Distributed File System – Architecture

The NameNode (the master) holds the metadata, e.g.:
  name: /users/joeYahoo/myFile – copies: 2, blocks: {1, 3}
  name: /users/bobYahoo/someData.gzip – copies: 3, blocks: {2, 4, 5}

The DataNodes (the slaves) store the blocks, each replicated across several nodes.

The client exchanges metadata with the NameNode but performs I/O directly against the DataNodes.

Page 50: Introduction to Big Data

Data Integrity

Use checksums (CRC32) to validate data

File creation
  The client computes a checksum per 512 bytes
  The DataNode stores the checksums

File access
  The client retrieves the data and checksums from the DataNode
  If validation fails, the client tries other replicas (see the sketch below)
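A small Java sketch of the same idea – one CRC32 per 512-byte chunk (java.util.zip.CRC32 here stands in for Hadoop's internal checksum classes):

    import java.util.zip.CRC32;

    public class ChunkChecksums {
        static final int BYTES_PER_CHECKSUM = 512;

        // Compute one CRC32 per 512-byte chunk, as done at file-creation time.
        static long[] checksums(byte[] data) {
            int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int from = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - from);
                crc.reset();
                crc.update(data, from, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        // On read, recompute and compare; a mismatch means try another replica.
        static boolean valid(byte[] data, long[] expected) {
            return java.util.Arrays.equals(checksums(data), expected);
        }
    }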

Page 51: Introduction to Big Data

Data Compression

Reduces the number of bytes written to / read from HDFS

Improves the efficiency of network bandwidth and disk space

Reduces the size of the data that needs to be read

Page 52: Introduction to Big Data

Data Compression

LZO Compression – https://github.com/toddlipcon/hadoop-lzo

Hadoop Snappy - http://code.google.com/p/snappy/
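A sketch of enabling compression for intermediate map output (MR1-era property names; verify the exact keys against your Hadoop version):

    import org.apache.hadoop.conf.Configuration;

    public class CompressionConfig {
        public static Configuration withCompressedMapOutput() {
            Configuration conf = new Configuration();
            // Compress the intermediate map output to cut network and disk IO
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");
            return conf;
        }
    }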

Page 53: Introduction to Big Data

Map/ Reduce Definition

Map –
  Takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.

Reduce –
  The master node then collects the answers to all the sub-problems and combines them in some way to form the output (see the word-count sketch below).
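The canonical illustration is word count; here is a minimal sketch against Hadoop's Java MapReduce API (not from the slides):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in this task's input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups the (word, 1) pairs; sum them per word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }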

Page 54: Introduction to Big Data

Map/ Reduce in Hadoop

Master-Slave architecture

Master: JobTracker

Accepts MR jobs submitted by users

Assigns Map and Reduce tasks to TaskTrackers

Monitors task and TaskTracker status, re-executes tasks upon failure

Slaves: TaskTrackers

Run Map and Reduce tasks upon instruction from the JobTracker

Manage storage and transmission of intermediate output

Page 55: Introduction to Big Data

Map/ Reduce Job Submission

1. The user copies the input files to the DFS
2. The user submits the job to the client
3. The client reads the input files
4. The client creates/gets the splits
5. The client uploads the job information (Job.xml, Job.jar) to the DFS
6. The client submits the job to the JobTracker

A minimal driver that performs this submission is sketched below.
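Assuming the word-count mapper and reducer from the earlier sketch (the driver class name is illustrative; input/output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);  // locates the Job.jar to upload
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input already in DFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Computes the splits, uploads Job.xml/Job.jar, and submits to the JobTracker
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }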

Page 56: Introduction to Big Data

Map/ Reduce Job Initialization

6. The client submits the job to the JobTracker, which places it in the job queue
7. The JobTracker initializes the job
8. The JobTracker reads the job files (input splits, Job.xml, Job.jar) from the DFS
9. It creates the map and reduce tasks – as many maps as there are splits

Page 57: Introduction to Big Data

Map/ Reduce Job Scheduling

10. TaskTrackers on hosts H1-H4 send periodic heartbeats to the JobTracker
11. The JobTracker picks a task from the job queue (data-local if possible)
12. The JobTracker assigns the task to a TaskTracker
13. The TaskTracker launches the task

Page 58: Introduction to Big Data

Map/ Reduce Task Execution

The JobTracker assigns a task to a TaskTracker for execution

The TaskTracker reads Job.xml and Job.jar from the DFS onto its local disk

A TaskTracker runs up to MAX_MAP_SLOTS map tasks and up to MAX_REDUCE_SLOTS reduce tasks concurrently

Page 59: Introduction to Big Data

Map/ Reduce Map Task

The JobTracker assigns the task and the TaskTracker launches it, executing the user code

The user calls output.collect, which writes into a buffer

The buffer is spilled to an intermediate output file on local disk

A map-completion event is reported back

Page 60: Introduction to Big Data

Map/ Reduce Reduce Task

The JobTracker assigns the task and the TaskTracker launches it

The reduce collects all map outputs (M1, M2, M3, M4) – waiting until it has every map's output

Map outputs are sorted and merged as they arrive (based on some criteria)

The user code then executes, and the output file is written to the DFS

Page 61: Introduction to Big Data

How Does it Scale?

Software – Apache Hadoop
  Designed for scaling and failures
  Scale out – add nodes at any time

Hardware – commodity boxes?

DataNode / TaskTracker:
  Dual-processor / dual-core
  4-8 GB RAM with ECC memory
  4 x 500 GB SATA drives

NameNode – do not compromise:
  Server-class box with 32-48 GB RAM
  4 x 1 TB SATA drives with RAID

Page 62: Introduction to Big Data

How Does it Scale?

Yahoo –
  100,000 CPUs in >40,000 computers running Hadoop
  Biggest cluster: 4,500 nodes (2x4-core CPUs, 4x1 TB disks & 16 GB RAM)

eBay –
  532-node cluster (8 x 532 cores, 5.3 PB)

Facebook –
  A 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage

LinkedIn –
  1,200 nodes with 2x6 cores, 24 GB RAM, and 6x2 TB SATA drives

Page 63: Introduction to Big Data

Strategies for handling SPOF

Run the NameNode and Secondary NameNode on different servers

The Secondary NameNode periodically creates a checkpoint:
  Downloads the FSImage and EditLog from the NameNode and merges them
  Uploads the new image back to the NameNode

Page 64: Introduction to Big Data

Strategies for handling SPOF

AvatarNode @ Facebook

Commercial versions:
  MapR
  Hortonworks

Some geek solutions:
  Replace HDFS with MySQL Cluster for the NameNode

Page 65: Introduction to Big Data

Hadoop Ecosystem

The ecosystem around core Hadoop spans: Backup & Recovery, Deployment, Security, Management, and Monitoring

Page 66: Introduction to Big Data

Thank You

Q & A
