Introduction to Big Data
Uploaded by: nasscom | Category: Technology
What is Big Data?
"Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization." - Gartner, 2012
2 Impetus Proprietary
Evolution of Big Data
Data explosion!!
48 hours of equities market data ~ 5 TB
3.3 months of OPRA feeds ~ 5 PB
Semi/ Unstructured and real time Data
Google processes PB/hour
Bioinformatics – large datasets of genetics & drug formulations
Money laundering / terror funding, Spatial Data
By 2015, more than 85 percent of Fortune 500 organizations
will fail to effectively exploit big data for competitive advantage.
– Gartner
Already producing more than 1.9 zettabytes of data - Analyst
firm, IDC Australia.
4 V's of Big Data
Volume: BigData comes in one size, large. Terabytes / petabytes / exabytes
Velocity: Streaming data, time-sensitive data. Batch, near-real-time and real-time streams
Variety: Structured, semi-structured, unstructured. Text, audio/video, click streams, log files, etc.
Veracity: Data in doubt. Truthfulness, authenticity or correctness of data
Value to business
Value for Business
Finance: Deeper analysis to avoid credit risk. Predict the future and make risk-free investments
Healthcare: Targeted medicines with fewer complications and side effects
Retail: Focus on what the customer wants
Telecommunication: Predict failures; reliable networks
Media: Targeted and focused content
Government: Services based on facts and not on fiction
BigData - Use Cases
Potential Use Cases for Big Data
Big Data - Use Cases Telecommunication
Telecom vendors:
• What plans should I offer to my customers?
• 24x7 service: can I predict the failures?
• Why am I losing customers?
• Can I offer something better?
Subscribers:
• Which plans are good for me?
• Is the network reliable?
• Better/best deals?
Big Data - Use Cases Financial Services
Customer: "I want to buy some products." "Ah! There are so many offers... hard to find the relevant ones..."
• Are there any offers?
• How do I make the best use of them?
• Are these offers good for me?
• Which stores are providing these offers?
Merchant: "I am launching some new offers."
• How do I attract the relevant/interested customers?
Big Data - Use Cases Web & Digital Media
Merchant portal: paying $$$ for ad promotions
• How many visitors are turned into buyers?
• Am I getting any value?
• Are customers really coming to my site?
• Are there any returning buyers?
• Market and customer segmentation?
Click stream data; analytical reports / web analytics
BigData Challenges
Data processing:
Processing and analyzing large data (terabytes and beyond)
Massively scalable and parallel
Moving computation is easier than moving data
Support partial failure
Data storage:
Doesn't fit on 1 node, requires a cluster
Flexible and schema-less structure
Data replication, partitioning and sharding
What do we need? Solution: Data Processing
Distributed/ Grid/ Parallel Computing
Distributed data processing
Scales linearly and is fault tolerant
Leverage NoSQL storage systems as well
Take computation near the data to reduce I/O
Merge processed results and serve
Store results for further analytics
What do we need? Solution: Data Storage
NoSQL
NoSQL starts from where RDBMS becomes dysfunctional
Flexible data structure
Linear scaling on commodity boxes, no single point of failure (SPOF)
Real-time and bi-directional replication
Scales linearly
Sharding and partitioning of data
Partition the data in smaller chunks
Store chunks on distributed file system
Replicate them to enable recovery from a node failure
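The three steps above (partition, store, replicate) can be sketched in a few lines of Python. This is an illustrative toy, not how any particular datastore or file system does it: the chunk size, the node names and the round-robin placement policy are all assumptions made for the example.

```python
def partition(data: bytes, chunk_size: int) -> list:
    """Split the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place(chunks, nodes, replicas):
    """Store each chunk on `replicas` distinct nodes, round-robin (assumed policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: {} for node in nodes}
    for chunk_id, chunk in enumerate(chunks):
        for k in range(replicas):
            node = nodes[(chunk_id + k) % len(nodes)]
            placement[node][chunk_id] = chunk
    return placement

def recover(placement, failed_node, n_chunks):
    """Rebuild the original data from the surviving nodes after one node fails."""
    survivors = [c for n, c in placement.items() if n != failed_node]
    parts = []
    for chunk_id in range(n_chunks):
        # Any surviving replica of the chunk will do.
        parts.append(next(c[chunk_id] for c in survivors if chunk_id in c))
    return b"".join(parts)

data = b"big data does not fit on one node"
chunks = partition(data, chunk_size=8)
placement = place(chunks, nodes=["node-a", "node-b", "node-c"], replicas=2)
restored = recover(placement, failed_node="node-b", n_chunks=len(chunks))
```

With two replicas per chunk, losing any single node still leaves one copy of every chunk, which is why `recover` can rebuild the data.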
Technology Landscape
Typical Hadoop Based Solution
Data sources: call data records, web clickstreams, network logs, satellite feeds, GPS data, sensor readings, sales data, emails
Formats: XML, CSV, JSON, BINARY, LOG
Big Data -> HDFS and MapReduce jobs on commodity servers -> Information
Deep Dive into BigData Challenges
Data Storage
NoSQL
Background
RDBMS
Ruled the world for the last 3 decades.
The Internet changed the world and the technology around us.
Scaling up does not work beyond a certain limit.
Scaling out is not very attractive either:
Sharding scales, but you lose all the useful features of an RDBMS
Sharding is operationally difficult
Web 2.0 apps have different requirements from enterprise apps
Today’s Requirement - Data
Data does not fit on one node
Data may not fit in one rack
SANs are too expensive
Data partitioning across multiple nodes / racks / datacenters
Evolving schema
Today’s Requirement - Reliability
Must be highly available
Commodity nodes - they may crash
Data must survive disk / node failure
Data must survive datacenter failure
Introduction to NoSQL
A different thought process
RDBMS vs. NoSQL
How do we store vs. How do we use
Referencing vs. Embedding
Fixed schema vs. Evolving schema
Depth of functionality vs. Scalability + performance
Compute on read vs. Compute on write
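The "Referencing vs. Embedding" contrast can be made concrete with plain Python dictionaries. This is only a sketch of the two modelling styles; the table names, field names and phone numbers are made up for illustration and do not come from any real schema.

```python
# Relational style: separate "tables", linked by a reference (foreign key).
users = {1: {"name": "vinod"}}
phones = [
    {"user_id": 1, "number": "555-0100"},
    {"user_id": 1, "number": "555-0101"},
]

def phones_by_reference(user_id):
    """Join at read time: scan the phones table for matching rows."""
    return [p["number"] for p in phones if p["user_id"] == user_id]

# Document style: the phone numbers are embedded in the user document,
# so one lookup by key returns everything the application needs.
user_docs = {
    1: {"name": "vinod", "phones": ["555-0100", "555-0101"]},
}

def phones_embedded(user_id):
    """Single key lookup, no join."""
    return user_docs[user_id]["phones"]
```

Both functions return the same list; the difference is where the work happens. The referencing model computes the answer on read, while the embedding model pays that cost when the document is written.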
Introduction to NoSQL
Document:
{
"_id" : "some unique string that is assigned to the contact",
"type" : "contact",
"name" : "contact's name",
"birth_day" : "a date in string form",
"address" : "the address in string form",
"phone_number" : "phone number in string form"
}
Graph:
[figure: graph whose nodes are labeled क, ख, ग, घ, च, छ, ज, झ]
Column Based:
[ Key = "name" ; value = "vinod" ; timestamp = Friday July 22, 2011 ]
Data Model (Column Based)
Column (NoSQL) corresponds to a cell (RDBMS)
name • emailAddress
value • [email protected]
timestamp • 1311150988226
time to live • 3600
Data Model (Column Based)
Column family (NoSQL) corresponds to a table (RDBMS)
RDBMS table: Column 1, Column 2, ..., Column n
NoSQL column family:
rowKey 1: Column 1
rowKey 2: Column 1, Column 2, Column 3
rowKey n: Column m, ..., Column z
Data Model (Column Based)
Super Column
NoSQL; no matching concept in RDBMS
name: address
value: Column 1, Column 2, ..., Column n
Data Model (Column Based)
Super Column Family
NoSQL; no matching concept in RDBMS
[figure: rows rowKey 1 to rowKey n, each holding a varying number of super columns (Super Column 1, Super Column 2, Super Column 3, ..., Super Column m, Super Column z)]
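The column-based data model from the last few slides maps naturally onto nested dictionaries. The sketch below is one possible in-memory picture, not the storage format of any real datastore. The `Column` class, the row keys and the sample values are hypothetical, and it assumes the timestamp is in milliseconds and the time to live in seconds.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Column:
    name: str
    value: str
    timestamp: int              # assumed: milliseconds since the epoch
    ttl: Optional[int] = None   # assumed: seconds; None means never expires

    def is_expired(self, now_ms: int) -> bool:
        return self.ttl is not None and now_ms >= self.timestamp + self.ttl * 1000

TS = 1311150988226  # timestamp value shown on the Column slide

# Column family: row key -> {column name -> Column}; rows need not share columns.
users = {
    "rowKey1": {"name": Column("name", "vinod", TS)},
    "rowKey2": {
        "name": Column("name", "example-name", TS),
        "city": Column("city", "example-city", TS, ttl=3600),
    },
}

# Super column family: row key -> {super column name -> {column name -> Column}}.
contacts = {
    "rowKey1": {
        "address": {
            "street": Column("street", "example-street", TS),
            "zip": Column("zip", "00000", TS),
        }
    }
}
```

Note how `rowKey1` and `rowKey2` carry different sets of columns, which a fixed-schema table would not allow without adding nullable columns.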
Use Cases When to use?
Huge amount of data – distributed across network
High query load – give results quickly
Evolving schema
Changes should happen without restart
Migration is not an option with large amount of data
Use Cases When to avoid?
Complex transactions such as in financial & accounting
ACID transactions are a must
Small data size
NoSQL pros
Massive scalability
High Availability
Lower cost with predictable elasticity
Flexible data structure
NoSQL cons
Limited data query possibilities
Lower level of consistency aka Eventual consistency
No support for multi object transactions
No standardization
Ad hoc data fixing & reporting - no query language available
Curious case of NoSQL: How To..?
Scale when data growth is ~50%?
Migrate massive relational schema data to NoSQL?
Integrate existing application(s) with NoSQL?
Reduce the effort of learning the NoSQL arena?
Kundera
An Open-source project available on
https://github.com/impetus-opensource/Kundera
An OGM (Object – Grid / NoSQL-Datastore) Mapping Tool
JPA 2.0 Compliance (ZERO Diversion)
Developers don’t need to unlearn (and learn)
Easy to use, Less boilerplate code
Drop-dead simple and fun
Relieves developer from the diversity and complexity that
comes with NoSQL Datastores
Up and running in 5 minutes
For Cassandra
For MongoDB
For HBase
… And For Any RDBMS
Kundera Architecture
Setting up Kundera
Download Jar: You can download the latest executable Kundera jar from here:
https://github.com/downloads/impetus-opensource/Kundera/kundera-cassandra-2.0.7-jar-with-dependencies.jar
Using Kundera with any Maven project:
<repository>
  <id>sonatype-nexus</id>
  <name>Kundera Public Repository</name>
  <url>https://oss.sonatype.org/content/repositories/releases</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
<dependency>
  <groupId>com.impetus</groupId>
  <artifactId>kundera</artifactId>
  <version>2.0.7</version>
</dependency>
Building Kundera from source:
git clone [email protected]:impetus-opensource/Kundera.git
mvn clean install
Tweet-store app
Problem statement:
User information (master data) and the corresponding tweets are stored in an RDBMS (Oracle/MySQL).
Data growth for tweets (in 100s of TB) is not scalable with an RDBMS.
How to scale and perform with this big data problem?
Migrate tweets to NoSQL and keep user master data in the RDBMS: polyglot persistence.
Tweet-store app
Entity Definition and Configuration:
Tweet-store app Contd…
Initialize Entity manager factory
Create User object
Find by Key
Find by Query
Kundera Client extension framework
[figure: Kundera-core and Kundera-engine, extended through a NoSQL Client, NoSQL Client Factory, NoSQL EntityReader and NoSQL QueryImplementor]
Features
Supports Cassandra, HBase, MongoDB and any RDBMS
Stronger query support (e.g. super-column-based search)
CRUD / query support across datastores
Object relationship handling across datastores
Caching
Connection pool
Datastore-optimized persistence and query approach
Pluggable architecture (allows developers to create a library specific to a particular datastore, plug it into Kundera and persist data into that datastore)
Flexibility to choose Lucene-based or datastore-provided secondary indexing
Auto schema generation for Cassandra, MongoDB, HBase and RDBMS
Data Processing
Hadoop
Beyond Multithreading
Ever increasing Computing Requirements
Scaling – Horizontal vs Vertical
Parallel vs Distributed
Fault-tolerance
Grids/ Loosely Coupled Systems
Built using commodity systems
Aggregation of distributed systems
Centralized or Decentralized management
Challenges of Distributed Processing
Production deployments need to be carefully planned
Unavailability of one node should not impact the rest of the system
Need for high-speed networks
Data replication involves data conflicts
Troubleshooting and diagnosing
Geographically distributed
Consistency and reliability
What is Hadoop?
A Batch processing Framework for distributed processing of
large data sets on a network of commodity hardware.
Designed to scale out
Fault-tolerant at the application level
Open source + commodity hardware = reduction in cost
Components of Hadoop
NameNode
SNN - Secondary NameNode
JobTracker
TaskTracker
DataNode
HDFS
Architecture of Hadoop
Typical Hadoop cluster: master/slave architecture
Machine 1: NameNode, JobTracker, Secondary NameNode, TaskTracker, DataNode (HDFS)
Machine 2: TaskTracker, DataNode (HDFS)
Machine 3: TaskTracker, DataNode (HDFS)
Hadoop Distributed File System
Large Distributed File System
10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
Failure is expected, rather than exceptional
Streaming Data Access
Write-Once, Read-Many pattern
Batch processing
Node failure - Replication
Hadoop Distributed File System Namenode Metadata
Meta-data in Memory
The entire metadata is in main memory
No demand paging of meta-data
Types of Metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
A Transaction Log
Records file creations, file deletions, etc.
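The kinds of metadata listed above can be pictured as a handful of in-memory maps plus an append-only log. The class below is a hypothetical toy, not the NameNode's actual data structures: the class and method names are invented for illustration, and the DataNode names `dn1` to `dn3` are assumptions.

```python
import time

class NameNodeMetadata:
    """Toy in-memory namespace: file -> blocks, block -> DataNodes, plus a log."""

    def __init__(self):
        self.files = {}            # path -> blocks, replication factor, creation time
        self.block_locations = {}  # block id -> set of DataNode names
        self.edit_log = []         # append-only record of namespace changes

    def create(self, path, blocks, replication):
        self.files[path] = {"blocks": list(blocks), "replication": replication,
                            "ctime": time.time()}
        self.edit_log.append(("create", path))

    def delete(self, path):
        for block in self.files.pop(path)["blocks"]:
            self.block_locations.pop(block, None)
        self.edit_log.append(("delete", path))

    def report_blocks(self, datanode, blocks):
        """Record which blocks a DataNode says it holds (additions only, for brevity)."""
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[path]["blocks"]]

nn = NameNodeMetadata()
nn.create("/users/joeYahoo/myFile", blocks=[1, 3], replication=2)
nn.report_blocks("dn1", [1, 3])
nn.report_blocks("dn2", [1])
nn.report_blocks("dn3", [3])
locations = nn.locate("/users/joeYahoo/myFile")
```

Because everything lives in ordinary dictionaries, lookups are fast, but the amount of metadata is bounded by the memory of a single machine.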
Hadoop Distributed File System DataNode
A Block Server
Stores data in the local file system.
Stores meta-data of a block.
Serves data and meta-data to Clients
Block Report
Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
Forwards data to other specified DataNodes
Hadoop Distributed File System Architecture
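Namenode (the master), metadata:
name:/users/joeYahoo/myFile - copies:2, blocks:{1,3}
name:/users/bobYahoo/someData.gzip, copies:3, blocks:{2,4,5}
Datanodes (the slaves):
[figure: block replicas spread across the datanodes; blocks 1 and 3 appear twice, blocks 2, 4 and 5 three times]
Client: metadata from the Namenode, I/O with the Datanodes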
Data Integrity
Use checksums (CRC32) to validate data
File Creation
Client computes a checksum per 512 bytes
DataNode stores the checksum
File access
Client retrieves the data and checksum from DataNode
If validation fails, Client tries other replicas
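The write-time and read-time checks above can be sketched with the CRC32 function from Python's standard `zlib` module. This is a minimal illustration of the idea rather than HDFS's own code: the function names are hypothetical, and replicas are modelled simply as `(data, stored_checksums)` pairs.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as on the slide

def checksums(data: bytes) -> list:
    """CRC32 of every 512-byte chunk of the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_verified(replicas):
    """Return the data from the first replica whose checksums validate.

    `replicas` is a list of (data, stored_checksums) pairs, one per DataNode.
    """
    for data, stored in replicas:
        if checksums(data) == stored:
            return data
    raise IOError("all replicas failed checksum validation")

original = bytes(range(256)) * 5      # 1280 bytes, so 3 chunks
stored = checksums(original)          # computed by the client at file creation
corrupt = b"\xff" + original[1:]      # first byte damaged on one replica

# The first replica fails validation, so the client falls back to the second.
data = read_verified([(corrupt, stored), (original, stored)])
```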
Data Compression
Reduces the number of bytes written to/read from HDFS
Efficiency of network bandwidth and disk space
Reduces the size of data needed to be read
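A quick way to see the effect is to compress some repetitive, log-like data. The sketch below uses `zlib` only because it ships with Python; it is a stand-in, not one of the codecs used with Hadoop, and the sample log line is made up. Real data will compress by a different ratio.

```python
import zlib

# Repetitive, log-like input (hypothetical sample line).
raw = b"2011-07-22 INFO request served in 12 ms\n" * 1000
compressed = zlib.compress(raw)

ratio = len(compressed) / len(raw)      # fraction of the original size
restored = zlib.decompress(compressed)  # lossless: identical to the input
```

Fewer bytes on disk also means fewer bytes to ship across the network, at the price of CPU time spent compressing and decompressing.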
Data Compression
LZO Compression – https://github.com/toddlipcon/hadoop-lzo
Hadoop Snappy - http://code.google.com/p/snappy/
Map/ Reduce Definition
Map -
Takes input and divides it into smaller sub-problems, and distributes them to worker nodes.
A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem, and passes the answer back to its master node.
Reduce –
The master node then collects the answers to all the sub-problems
Combines them in some way to form the output
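Word counting is the usual first example of this pattern. The sketch below runs in a single Python process purely to show the shape of the computation; the function names and input strings are illustrative and this is not the Hadoop API.

```python
from collections import defaultdict

def map_phase(split: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data is big", "data is everywhere"]
intermediate = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(intermediate))
# counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```

Each call to `map_phase` depends only on its own split, which is what lets a real cluster run the map tasks on different machines in parallel.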
Map/ Reduce Hadoop
Master-Slave architecture
Master: JobTracker
Accepts MR jobs submitted by users
Assigns Map and Reduce tasks to TaskTrackers
Monitors task and TaskTracker status, re-executes tasks upon failure
Slaves: TaskTrackers
Run Map and Reduce tasks upon instruction from the JobTracker
Manage storage and transmission of intermediate output
Map/ Reduce Job Submission
1. User copies input files to DFS
2. User submits job to the Client
3. Client reads input files
4. Client creates/gets splits
5. Client uploads job information (Job.xml, Job.jar) to DFS
6. Client submits job to the JobTracker
Map/ Reduce Job Initialization
6. Client submits job to the JobTracker, which places it in the job queue
7. JobTracker initializes the job
8. JobTracker reads job files from DFS (input splits, Job.xml, Job.jar)
9. JobTracker creates maps and reduces (as many maps as splits)
Map/ Reduce Job Scheduling
10. TaskTrackers (on hosts H1 to H4) send heartbeats to the JobTracker
11. JobTracker picks a task from the job queue (data local if possible)
12. JobTracker assigns the task to the TaskTracker
13. TaskTracker launches the task
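The "data local if possible" rule can be sketched as a small function that answers each heartbeat. This is a simplified illustration under stated assumptions, not the JobTracker's actual scheduler: the task dictionaries, the `split_hosts` field and the fallback to the oldest pending task are all hypothetical.

```python
def pick_task(pending, host):
    """Pick a pending task for `host`, preferring one whose input split is local.

    `pending` is a list of dicts: {"id": str, "split_hosts": set of host names}.
    The chosen task is removed from `pending`; returns None if nothing is left.
    """
    if not pending:
        return None
    for i, task in enumerate(pending):
        if host in task["split_hosts"]:
            return pending.pop(i)
    return pending.pop(0)  # no local task: fall back to the oldest pending one

pending = [
    {"id": "map-0", "split_hosts": {"H1", "H2"}},
    {"id": "map-1", "split_hosts": {"H3"}},
    {"id": "map-2", "split_hosts": {"H1"}},
]

# Heartbeats arrive from the TaskTrackers; each one is answered with a task.
assignments = {host: pick_task(pending, host)["id"] for host in ["H3", "H4", "H1"]}
```

Here H3 and H1 receive tasks whose input is stored locally, while H4 holds no relevant data and has to read its split over the network.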
Map/ Reduce Task Execution
JobTracker assigns a task to the TaskTracker for execution
TaskTracker reads Job.xml and Job.jar from DFS onto local disk
TaskTracker runs up to MAX_MAP_SLOTS maps concurrently
TaskTracker runs up to MAX_REDUCE_SLOTS reduces concurrently
Map/ Reduce Map Task
JobTracker assigns the task; TaskTracker launches it
Execute user code; user calls output.collect
Output is buffered, then written as an intermediate output file to local disk
Map completion event
Map/ Reduce Reduce Task
JobTracker assigns the task; TaskTracker launches it
Fetch all map outputs (M1, M2, M3, M4)
Sort and merge as we get map outputs (based on some criteria)
Have all map outputs? Then execute user code
Write output file to DFS
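The sort-and-merge step of a reduce task can be sketched with the standard library's `heapq.merge`, which lazily merges runs that are already sorted. This assumes each map output arrives sorted by key and uses the key itself as the sort criterion. The sample data and the `reduce_task` helper are illustrative, not Hadoop code.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each map task's output, already sorted by key (M1 to M4 on the slide).
map_outputs = [
    [("big", 1), ("data", 1)],      # M1
    [("big", 1), ("hadoop", 1)],    # M2
    [("data", 1), ("data", 1)],     # M3
    [("hadoop", 1)],                # M4
]

def reduce_task(sorted_runs, reducer):
    """Merge the sorted runs by key, then call the user's reducer once per key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    }

# The "user code" here is a reducer that sums the values of each key.
result = reduce_task(map_outputs, lambda key, values: sum(values))
```

Merging sorted runs keeps all values for one key adjacent, so the reducer sees each key exactly once without the whole data set having to be re-sorted.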
How Does it Scale?
Software: Apache Hadoop
Designed for scaling and failures
Scale out
Add nodes at any time
Hardware: commodity boxes?
DataNode/TaskTracker (DN/TT):
Dual processor / dual core
4-8 GB RAM with ECC memory
4 x 500 GB SATA drives
NameNode:
Do not compromise.
Server-class box with 32-48 GB RAM
4 x 1 TB SATA drives with RAID
How Does it Scale?
Yahoo:
100,000 CPUs in >40,000 computers running Hadoop
Biggest cluster: 4500 nodes (2*4 cpu, 4*1TB disk & 16GB RAM)
eBay:
532-node cluster (8 * 532 cores, 5.3 PB)
Facebook:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage
LinkedIn:
1200 nodes, with 2x6 cores, 24GB RAM, 6x2TB SATA
Strategies for handling SPOF
Run on different servers
Primary and Secondary NameNode (NN and SN)
SN periodically creates a checkpoint:
Downloads FSImage and EditLog from the NN and merges them
Uploads the new image to the NN
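The checkpoint merge can be sketched as replaying a log of edits onto a snapshot. The real FSImage and EditLog are on-disk formats; the dictionary and tuples below are stand-ins invented for this example, as are the paths and operation names.

```python
def checkpoint(fsimage: dict, edit_log: list) -> dict:
    """Apply the logged edits to a copy of the image and return the new image."""
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]   # args[0]: the file's block list
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image

# State "downloaded" from the NameNode.
fsimage = {"/logs/a": [1, 2]}
edit_log = [
    ("create", "/logs/b", [3]),
    ("delete", "/logs/a"),
    ("create", "/logs/c", [4, 5]),
]

new_image = checkpoint(fsimage, edit_log)  # would be uploaded back to the NameNode
```

After the merge, the edit log can start over from empty, which keeps it short and keeps a NameNode restart from having to replay a long history.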
Strategies for handling SPOF
Avatar Node@Facebook
Commercial Versions: -
MapR
Hortonworks
Some Geek Solutions : -
Replace HDFS with MySQL Cluster for Namenode
Hadoop Ecosystem
[figure: Hadoop ecosystem; cross-cutting concerns: Backup & Recovery, Deployment, Security, Management, Monitoring]
Thank You
Q & A