Introduction to Big Data

What is Big Data?

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." – Gartner, 2012

Impetus Proprietary

Evolution of Big Data

Data explosion!!

48 hours of equities market data ~ 5 TB

3.3 months of OPRA feeds ~ 5 PB

Semi-structured/unstructured and real-time data

Google processes PB/hour

Bioinformatics – large datasets of genetics & drug formulations

Money laundering / terror funding, Spatial Data

By 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage. – Gartner

Already producing more than 1.9 zettabytes of data – analyst firm IDC Australia.


4 V's of Big Data

Volume: Big Data comes in one size: large. Terabytes/ Petabytes/ Exabytes

Velocity: Streaming data, time-sensitive data. Batch, near-real-time and real-time streams

Variety: Structured, semi-structured, unstructured. Text, audio/video, click streams, log files, etc.

Veracity: Data in doubt. Truthfulness, authenticity or correctness of data

Value to Business

Value for Business

Finance: Deeper analysis to avoid credit risk. Predict the future and have risk-free investments

Healthcare: Targeted medicines with fewer complications and side effects

Telecommunication: Predict failures and build reliable networks

Media: Targeted and focused content

Retail: Focus on what the customer wants

Government: Services based on facts and not on fiction

BigData - Use Cases


Potential Use Cases for Big Data


Big Data - Use Cases Telecommunication

Telecom Vendors:

What plans should I offer to my customers?

24x7 service: can I predict the failures?

Why am I losing customers?

Can I offer something better?

Subscribers:

Which plans are good for me?

Is the network reliable?

Better/ Best deals?

Big Data - Use Cases Financial Services

"I am launching some new offers"

• How to attract the relevant/interested customers?

"Wanted to buy some products"

• Are there any offers?

"Ah! There are so many offers… hard to find the relevant ones…"

• How do I make the best use of it?

• Are these offers good for me?

• Which stores are providing these offers?

Big Data - Use Cases Web & Digital Media


Ad Promotions

Paying $$$

• How many visitors are turned into buyers?

• Am I getting any value?

• Are customers really coming to my site?

• Are there any returning Buyers?

• Market and Customer Segmentation?

Merchant Portal

Click Stream Data

Analytical Reports/ Web Analytics

BigData Challenges

Data Processing

Processing & Analyzing large data – Terabyte++

Massively Scalable and Parallel

Moving computation is easier than moving data

Support Partial Failure

Data Storage

Doesn’t fit on 1 node, requires a cluster

Flexible and Schema-less Structure

Data Replication, Partitioning and Sharding


What do we need? Solution: Data Processing

Distributed/ Grid/ Parallel Computing

Distributed data processing

Scales Linearly and Fault tolerant

Leverage NoSQL storage systems as well

Take computation near data to reduce IO

Merge processed results and serve

Store results for further analytics


What do we need? Solution: Data Storage

NoSQL

NoSQL starts from where RDBMS becomes dysfunctional

Flexible Data structure

Linear scaling on commodity Boxes – No SPOF

Real Time and bi-directional Replication

Scales Linearly

Sharding and Partitioning of Data

Partition the data in smaller chunks

Store chunks on distributed file system

Replicate them to enable recovery from a node failure
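The partition, store and replicate steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the node names, the chunk size and the round-robin placement are hypothetical and do not describe how any particular datastore places replicas.

```python
from typing import Dict, List, Set


def partition(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each chunk to `replication` distinct nodes, round-robin (assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def surviving_chunks(placement: Dict[int, List[str]], failed_node: str) -> Set[int]:
    """Chunks that still have at least one replica after one node fails."""
    return {c for c, holders in placement.items()
            if any(node != failed_node for node in holders)}


# Hypothetical cluster of three nodes, two replicas per chunk.
chunks = partition(b"x" * 1000, chunk_size=256)
placement = place_replicas(len(chunks), ["node-a", "node-b", "node-c"], replication=2)
```

With two replicas per chunk, losing any single node still leaves every chunk readable, which is the recovery property the slide refers to.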


Technology Landscape


Typical Hadoop Based Solution

Data sources: Call Data Records, Web Clickstreams, Network Logs, Satellite Feeds, GPS Data, Sensor Readings, Sales Data, Emails

Formats: XML, CSV, JSON, BINARY, LOG

Flow: Big Data → HDFS and MapReduce jobs on commodity servers → Information

Deep Dive into BigData Challenges


Data Storage

NoSQL


Background

RDBMS

Ruled the world for the last 3 decades.

Internet changed the world and technology around us.

Scaling up does not work after a certain limit.

Scaling out is not very attractive either

Sharding scales, but you lose all the useful features of RDBMS

Sharding is operationally difficult

Web 2.0 apps have different requirements than enterprise apps


Today’s Requirement - Data

Data does not fit on one node

Data may not fit in one rack

SANs are too expensive

Data partitioning - across multiple nodes / racks / datacenters

Evolving schema


Today’s Requirement - Reliability

Must be highly available

Commodity nodes - they may crash

Data must survive disk / node failure

Data must survive datacenter failure


Introduction to NoSQL

A different thought process

RDBMS vs. NoSQL

How do we store vs. How do we use

Referencing vs. Embedding

Fixed schema vs. Evolving schema

Depth of functionality vs. Scalability + performance

Compute on read vs. Compute on write
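The "referencing vs. embedding" contrast can be illustrated with plain dictionaries. This is a minimal sketch, not any database's API; the user and tweet records are made-up sample data.

```python
# Relational style (referencing): two "tables" linked by a foreign key.
users = {1: {"name": "vinod"}}
tweets = [
    {"tweet_id": 10, "user_id": 1, "body": "hello"},
    {"tweet_id": 11, "user_id": 1, "body": "big data"},
]


def tweets_by_reference(user_id):
    """Join at read time: scan the tweets for the matching foreign key."""
    return [t["body"] for t in tweets if t["user_id"] == user_id]


# Document style (embedding): the tweets live inside the user document,
# so a single lookup by key returns everything the application needs.
user_documents = {
    1: {"name": "vinod", "tweets": [{"body": "hello"}, {"body": "big data"}]},
}


def tweets_by_embedding(user_id):
    """One key lookup; the join was effectively done when the document was written."""
    return [t["body"] for t in user_documents[user_id]["tweets"]]
```

Both functions return the same tweets. The embedded form trades storage layout flexibility for a cheaper read, which is the "how do we use" side of the comparison.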


Introduction to NoSQL

Document:

{
  "_id" : "some unique string that is assigned to the contact",
  "type" : "contact",
  "name" : "contact's name",
  "birth_day" : "a date in string form",
  "address" : "the address in string form",
  "phone_number" : "phone number in string form"
}

Graph: [figure: graph whose nodes are labeled क, ख, ग, घ, च, छ, ज, झ]

Column Based: [ Key = "name" ; value = "vinod" ; timestamp = Friday July 22, 2011 ]

Data Model (Column Based)

Column (NoSQL) / Cell (RDBMS)

name: emailAddress

value: [email protected]

timestamp: 1311150988226

time to live: 3600
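A column as described above (name, value, timestamp, time to live) can be modeled as a small value object. The sketch below is illustrative only: the class name, the millisecond timestamp unit, the placeholder email value and the "newest timestamp wins" rule are assumptions made for the example, not a description of a specific datastore.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int              # assumed: milliseconds since the epoch, set by the writer
    ttl: Optional[int] = None   # seconds to live; None means the column never expires

    def is_expired(self, now_ms: int) -> bool:
        """True once ttl seconds have passed since the write timestamp."""
        return self.ttl is not None and now_ms >= self.timestamp + self.ttl * 1000


def resolve(a: Column, b: Column) -> Column:
    """Assumed conflict rule: the column with the newer timestamp wins."""
    return a if a.timestamp >= b.timestamp else b


# Placeholder value; the slide's actual value is not recoverable.
col = Column("emailAddress", "user@example.com", 1311150988226, ttl=3600)
newer = Column("emailAddress", "other@example.com", 1311150988227)
```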


Table / Column Family

RDBMS table: Column 1, Column 2, ... Column n

NoSQL column family: each row key holds its own set of columns

rowKey 1: Column 1

rowKey 2: Column 1, Column 2, Column 3

rowKey n: Column m, Column z


Super Column

NoSQL

No matching concept in RDBMS

name: address

value: Column 1, Column 2, ... Column n


Super Column Family

NoSQL

No matching concept in RDBMS

Super columns: Super Column 1, Super Column 2, ... Super Column n

rowKey 1: Super Column 1

rowKey 2: Super Column 1, Super Column 2, Super Column 3

rowKey n: Super Column m, Super Column z
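One way to picture the column family and super column family layouts is as nested maps: row key, then super column name, then column name, then value. The sketch below uses hypothetical row keys and values purely to show the shape; rows are free to carry different super columns and columns.

```python
# row key -> super column name -> column name -> value (all sample data)
super_column_family = {
    "rowKey1": {
        "address": {"city": "city-1", "zip": "zip-1"},
    },
    "rowKey2": {
        "address": {"city": "city-2"},
        "phone": {"home": "555-0100", "work": "555-0101"},
    },
}


def get_column(family, row_key, super_column, column, default=None):
    """Walk the three levels; any missing level yields the default."""
    return family.get(row_key, {}).get(super_column, {}).get(column, default)


def column_names(family, row_key, super_column):
    """Columns present under one super column of one row, in sorted order."""
    return sorted(family.get(row_key, {}).get(super_column, {}))
```

A plain column family is the same structure with the middle level removed.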

Use Cases: When to use?

Huge amount of data – distributed across network

High query load – give results quickly

Evolving schema

Changes should happen without restart

Migration is not an option with large amount of data


Use Cases: When to avoid?

Complex transactions, such as in finance and accounting

ACID transactions are a must

Small data size


NoSQL pros

Massive scalability

High Availability

Lower cost with predictable elasticity

Flexible data structure


NoSQL cons

Limited data query possibilities

Lower level of consistency aka Eventual consistency

No support for multi-object transactions

No standardization

Ad hoc data fixing & reporting - no query language available


Curious case of NoSQL: How To..?

Scale when data growth is ~50%?

Migrate massive relational schema data to NoSQL?

Integrate existing application(s) with NoSQL?

Reduce the effort of learning the NoSQL arena?


Kundera

An Open-source project available on

https://github.com/impetus-opensource/Kundera

An OGM (Object – Grid / NoSQL-Datastore) Mapping Tool

JPA 2.0 Compliance (ZERO Diversion)

Developers don’t need to unlearn (and learn)

Easy to use, Less boilerplate code

Drop-dead simple and fun

Relieves the developer from the diversity and complexity that comes with NoSQL datastores

Up and running in 5 minutes

For Cassandra

For MongoDB

For HBase

… And For Any RDBMS








Kundera Architecture


Setting up Kundera

Download jar: you can download the latest executable Kundera jar from here:

https://github.com/downloads/impetus-opensource/Kundera/kundera-cassandra-2.0.7-jar-with-dependencies.jar

Building Kundera from source

git clone https://github.com/impetus-opensource/Kundera.git

mvn clean install

Using Kundera with any Maven project

<repository>
  <id>sonatype-nexus</id>
  <name>Kundera Public Repository</name>
  <url>https://oss.sonatype.org/content/repositories/releases</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

<dependency>
  <groupId>com.impetus</groupId>
  <artifactId>kundera</artifactId>
  <version>2.0.7</version>
</dependency>






Tweet-store app

Problem statement:

User information (master data) and corresponding tweets are stored in an RDBMS (Oracle/MySQL).

Data growth for tweets (in 100s of TB) is not scalable with an RDBMS.

How to scale and perform with this big data problem?

Migrate tweets to NoSQL and keep user master data in the RDBMS: polyglot persistence

Tweet-store app

Entity Definition and Configuration:


Tweet-store app Contd…

Initialize Entity manager factory

Create User object

Find by Key

Find by Query


[Figure: Kundera architecture. Labels: Kundera-core, Kundera-engine, Kundera Client extension framework (NOSQL Client, NOSQL EntityReader, NOSQL QueryImplementor, NOSQL Client Factory)]

Features

Supports – Cassandra, HBase, MongoDB and Any RDBMS

Stronger Query Support (e.g. Super column based search)

CRUD / Query Support Across Datastores

Object Relationships Handling Across Datastores

Caching

Connection Pool

Datastore-Optimized Persistence and Query Approach

Pluggable architecture (allows developers to create a library specific to a particular datastore, plug it into Kundera and persist data into that datastore)

Flexibility for choosing Lucene-based or datastore-provided secondary indexing

Provides auto schema generation feature for Cassandra, MongoDB, HBase and RDBMS

Data Processing

Hadoop


Beyond Multithreading

Ever increasing Computing Requirements

Scaling – Horizontal vs Vertical

Parallel vs Distributed

Fault-tolerance

Grids/ Loosely Coupled Systems

Built using commodity systems

Aggregation of distributed systems

Centralized or Decentralized management


Challenges of Distributed Processing

Production deployments need to be carefully planned

Unavailability of one node should not impact the system

Need High Speed Networks

Data Replication involves data conflicts

Troubleshooting and diagnosing

Geographically Distributed

Consistency & Reliability


What is Hadoop?

A batch processing framework for distributed processing of large data sets on a network of commodity hardware.

Designed to scale out

Fault-tolerant at the application level

Open source + Commodity hardware = Reduction in Cost


Components of Hadoop

NameNode

SNN - Secondary NameNode

JobTracker

TaskTracker

DataNode

HDFS


Architecture of Hadoop

Typical Hadoop Cluster – Master/ Slave Architecture

Machine 1: NameNode, JobTracker, Secondary NameNode, TaskTracker, DataNode, HDFS

Machine 2: TaskTracker, DataNode, HDFS

Machine 3: TaskTracker, DataNode, HDFS

Hadoop Distributed File System

Large Distributed File System

10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware

Failure is expected, rather than exceptional

Streaming Data Access

Write-Once, Read-Many pattern

Batch processing

Node failure - Replication


Hadoop Distributed File System Namenode Metadata

Meta-data in Memory

The entire metadata is in main memory

No demand paging of meta-data

Types of Metadata

List of files

List of Blocks for each file

List of DataNodes for each block

File attributes, e.g. creation time, replication factor

A Transaction Log

Records file creations, file deletions, etc.
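The metadata kept by the NameNode can be pictured as a few in-memory maps plus an append-only log. The class below is a toy model for illustration; its names and methods are hypothetical and are not Hadoop's API.

```python
class NameNodeMetadata:
    """Toy in-memory model of the metadata types listed above."""

    def __init__(self):
        self.files = {}            # path -> {"replication": int, "blocks": [block ids]}
        self.block_locations = {}  # block id -> set of DataNode names
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, path, replication, blocks):
        self.files[path] = {"replication": replication, "blocks": list(blocks)}
        self.transaction_log.append(("create", path))

    def delete_file(self, path):
        del self.files[path]
        self.transaction_log.append(("delete", path))

    def report_block(self, datanode, block_id):
        """Record that a DataNode holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """For each block of the file, list the DataNodes that hold it."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path]["blocks"]]


meta = NameNodeMetadata()
meta.create_file("/users/example/myFile", replication=2, blocks=[1, 3])
for node, block in [("dn1", 1), ("dn2", 1), ("dn2", 3), ("dn3", 3)]:
    meta.report_block(node, block)
```

Because every lookup is a dictionary access, the whole structure has to fit in memory, which matches the "entire metadata is in main memory" point above.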


Hadoop Distributed File System DataNode

A Block Server

Stores data in the local file system.

Stores meta-data of a block.

Serves data and meta-data to Clients

Block Report

Periodically sends a report of all existing blocks to the NameNode

Facilitates Pipelining of Data

Forwards data to other specified DataNodes


Hadoop Distributed File System Architecture

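Namenode (the master) holds the metadata:

name:/users/joeYahoo/myFile - copies:2, blocks:{1,3}

name:/users/bobYahoo/someData.gzip, copies:3, blocks:{2,4,5}

Datanodes (the slaves) hold the block replicas: blocks 1 and 3 are stored twice; blocks 2, 4 and 5 are stored three times

Client: metadata requests go to the Namenode, I/O goes to the Datanodes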

Data Integrity

Use checksums (CRC32) to validate data

File Creation

Client computes a checksum per 512 bytes

DataNode stores the checksum

File access

Client retrieves the data and checksum from DataNode

If validation fails, Client tries other replicas
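The checksum flow above can be sketched with Python's standard `zlib.crc32`. The function names and the replica list are hypothetical; only the per-512-byte CRC32 idea and the "try other replicas" fallback come from the slide.

```python
import zlib
from typing import List, Sequence

BYTES_PER_CHECKSUM = 512  # chunk size named on the slide


def compute_checksums(data: bytes) -> List[int]:
    """CRC32 of every 512-byte chunk (the last chunk may be shorter)."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]


def verify(data: bytes, checksums: Sequence[int]) -> bool:
    """True when the data reproduces exactly the stored checksums."""
    return compute_checksums(data) == list(checksums)


def read_with_fallback(replicas: Sequence[bytes], checksums: Sequence[int]) -> bytes:
    """Return the first replica that validates; fail if none does."""
    for replica in replicas:
        if verify(replica, checksums):
            return replica
    raise IOError("all replicas failed checksum validation")


# File creation: the client computes checksums, the DataNode stores them.
original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
stored_checksums = compute_checksums(original)

# File access: the first replica is corrupted, the second one is intact.
corrupted = bytearray(original)
corrupted[0] ^= 0xFF
recovered = read_with_fallback([bytes(corrupted), original], stored_checksums)
```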


Data Compression

Reduces the number of bytes written to/read from HDFS

Makes more efficient use of network bandwidth and disk space

Reduces the size of data needed to be read
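The effect is easy to see on repetitive data such as log lines. The sketch below uses Python's built-in `zlib` purely as a stand-in codec; it is not LZO or Snappy, and the sample log line is made up.

```python
import zlib

# Repetitive text, typical of log files, compresses very well.
log_data = b"2011-07-22 INFO request served in 12 ms\n" * 1000

compressed = zlib.compress(log_data)
restored = zlib.decompress(compressed)

# Fraction of the original bytes that would actually be written or read.
ratio = len(compressed) / len(log_data)
```

Fewer bytes on disk also means fewer bytes crossing the network when blocks are replicated or shipped between tasks.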


Data Compression

LZO Compression – https://github.com/toddlipcon/hadoop-lzo

Hadoop Snappy - http://code.google.com/p/snappy/



Map/ Reduce Definition

Map -

Takes input and divides it into smaller sub-problems, and distributes them to worker nodes.

A worker node may do this again in turn, leading to a multi-level tree structure.

The worker node processes the smaller problem, and passes the answer back to its master node.

Reduce –

The master node then collects the answers to all the sub-problems

Combines them in some way to form the output
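A word count is the usual small example of this pattern. The sketch below runs in a single process and only mirrors the shape of map, group and reduce; it does not use Hadoop, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Data big"])
```

Each call to `map_phase` is independent of the others, which is what lets a real framework run the map step on many worker nodes at once.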


Map/ Reduce Hadoop

Master-Slave architecture

Master: JobTracker

Accepts MR jobs submitted by users

Assigns Map and Reduce tasks to TaskTrackers

Monitors task and TaskTracker status, re-executes tasks upon failure

Slaves: TaskTrackers

Run Map and Reduce tasks upon instruction from the JobTracker

Manage storage and transmission of intermediate output


Map/ Reduce Job Submission

1. User copies input files to DFS

2. User submits job to the client

3. Client reads input files

4. Client creates/gets splits

5. Client uploads job information (Job.xml, Job.jar) to DFS

6. Client submits job to the JobTracker

Map/ Reduce Job Initialization

6. Client submits job to the JobTracker; the job enters the job queue

7. Initialize job

8. Read job files from DFS (input splits, Job.xml, Job.jar)

9. Create maps and reduces (as many maps as splits)
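"As many maps as splits" means the number of map tasks follows directly from how the input is cut up. The sketch below assumes fixed-size splits over a single file, which is a simplification; the function name and sizes are hypothetical.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover the whole file."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


# A 300-unit file with 128-unit splits needs three map tasks.
splits = compute_splits(300, 128)
num_map_tasks = len(splits)
```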

Map/ Reduce Job Scheduling

10. Heartbeat: TaskTrackers on H1, H2, H3 and H4 send heartbeats to the JobTracker

11. JobTracker picks a task from the job queue (data local if possible)

12. Assign task

13. Launch task
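Step 11, "pick a task (data local if possible)", can be sketched as a simple preference rule. This is an illustration only: the task dictionaries and the fallback to the oldest pending task are assumptions, not the JobTracker's actual scheduling code.

```python
def pick_task(pending, tracker_host):
    """Prefer a task whose input data lives on the requesting host.

    `pending` is a list of dicts with an 'id' and the set of 'hosts'
    that hold the task's input split. The chosen task is removed.
    """
    for task in pending:
        if tracker_host in task["hosts"]:
            pending.remove(task)   # safe: we return right after removing
            return task
    # No data-local task: fall back to the oldest pending one, if any.
    return pending.pop(0) if pending else None


queue = [
    {"id": "t1", "hosts": {"H2"}},
    {"id": "t2", "hosts": {"H1"}},
]
first = pick_task(queue, "H1")    # data-local choice
second = pick_task(queue, "H3")   # nothing local, takes what is left
third = pick_task(queue, "H1")    # queue is empty
```

Running the task where its data already sits avoids moving the split across the network, in line with "take computation near data".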

Map/ Reduce Task Execution

JobTracker assigns the task to a TaskTracker for execution

TaskTracker reads Job.xml, Job.jar from DFS into local disk

Up to MAX_MAP_SLOTS maps concurrently

Up to MAX_REDUCE_SLOTS reduces concurrently

Map/ Reduce Map Task

JobTracker assigns the task; TaskTracker launches the task

Execute user code

User calls output.collect

Buffer

Intermediate output file to local disk

Map completion event

Map/ Reduce Reduce Task

JobTracker assigns the task; TaskTracker launches the task

Fetch all map outputs (M1, M2, M3, M4)

Sort and merge as we get map outputs (based on some criteria)

Have all map outputs? Then execute user code

Write output file to DFS
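The sort-and-merge step of a reduce task can be sketched with the standard library. The example assumes each map output is already sorted by key and that the merge criterion is the key itself; the function name and the sample outputs are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_task(map_outputs, reducer):
    """Merge key-sorted map outputs, then call the reducer once per key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


# Two map outputs, each sorted by key (sample data).
m1 = [("a", 1), ("c", 1)]
m2 = [("a", 1), ("b", 2)]
result = reduce_task([m1, m2], lambda key, values: (key, sum(values)))
```

`heapq.merge` streams the inputs instead of loading them all at once, which is why merging sorted runs suits data that does not fit in memory.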

How Does it Scale?

Software – Apache Hadoop

Designed for Scaling and Failures

Scale out

Add nodes at any time

Hardware – Commodity Boxes?

DN/ TT:

Dual processor/dual core

4-8 GB RAM with ECC memory

4 x 500 GB SATA drives

NameNode:

Do not compromise.

Server-class box with 32-48 GB RAM

4 x 1 TB SATA drives with RAID


How Does it Scale?

Yahoo –

100,000 CPUs in >40,000 computers running Hadoop

Biggest cluster: 4500 nodes (2*4 cpu, 4*1TB disk & 16GB RAM)

eBay

532 nodes cluster (8 * 532 cores, 5.3PB)

Facebook

A 1100-machine cluster with 8800 cores and about 12 PB raw storage

LinkedIn

1200 nodes, with 2x6 cores, 24GB RAM, 6x2TB SATA


Strategies for handling SPOF

Run on Different Servers

Primary and Secondary Node

SN periodically creates checkpoint

Download FSImage and EditLog from NN and

merge them

Upload new Image to NN
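The checkpoint performed by the Secondary NameNode amounts to replaying the edit log on top of the last image. The sketch below models the image as a dictionary and the log as a list of tuples; these structures and operation names are simplifications for illustration, not the real FSImage or EditLog formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply logged operations to a copy of the image and return the new image."""
    new_image = dict(fsimage)          # leave the downloaded image untouched
    for op, path, attrs in edit_log:
        if op == "create":
            new_image[path] = attrs
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


# Downloaded from the NameNode (sample content).
old_image = {"/a": {"replication": 3}}
edits = [
    ("create", "/b", {"replication": 2}),
    ("delete", "/a", None),
]

# Merged image that would be uploaded back to the NameNode.
new_image = checkpoint(old_image, edits)
```

After the upload the NameNode can start a fresh, short edit log, which keeps its restart time bounded.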


Strategies for handling SPOF

Avatar Node@Facebook

Commercial versions:

MapR

Hortonworks

Some geek solutions:

Replace HDFS with MySQL Cluster for Namenode


Hadoop Ecosystem

[Figure: Hadoop ecosystem. Recoverable labels: Backup & Recovery, Deployment, Security, Management, Monitoring]

Thank You

Q & A

