Post on 08-Jan-2017
Hadoop Enhancements Using Next-Gen Intel® Platform Technologies
Anoop Sam John – PMC member for Apache HBase
Rakesh R – Committer for Apache ZooKeeper and PMC member for Apache BookKeeper
About Us
• Anoop Sam John
  • PMC member for Apache HBase and Phoenix
  • anoopsamjohn@apache.org
  • https://www.linkedin.com/in/anoopsamjohn
• Rakesh R
  • Committer for Apache ZooKeeper and BookKeeper
  • Apache Hadoop contributor
  • rakeshr@apache.org
  • https://www.linkedin.com/in/rakeshadr
Agenda
• Intel enhancements on the Hadoop platform
• HDFS
  • Erasure coding using the ISA-L library
  • Encryption using AES-NI
• HBase
  • Go Big Cache
HDFS – Distributed FileSystem
"Between the birth of the world and 2003, there were 5 Exabytes of information created. We now create 5 Exabytes every two days."
– Eric Schmidt, Executive Chairman of Alphabet, Inc.

Massive amounts of data from many sources.
HDFS – Current Replication Strategy
• Inherits 3-way replication from the Google File System to increase data availability
  • 3x storage overhead
• Expensive for:
  • Massive amounts of data
  • Geo-distributed data recovery

[Diagram: a DFSClient writes data to Datanode1 in Rack-1; the block is then replicated to Datanode2 in Rack-2 and a third datanode – replicas r1, r2, r3 give 3x replication]
HDFS – Erasure Coding
• k data blocks + m parity blocks (k + m)
  • Example: Reed-Solomon 6 + 3
• Codec library
• Saves disk space
  • 1.5x storage overhead

Sample codec library (XOR based):

  X  Y  X⊕Y
  0  0   0
  0  1   1
  1  0   1
  1  1   0
  (data bits → parity bit)

[Diagram: 6 data blocks (b1–b6) and 3 parity blocks (b7–b9) striped across datanodes D1–D9]
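The XOR table above can be sketched in a few lines of Java. This is a hypothetical illustration of the XOR-based codec idea only, not HDFS's actual coder: the parity block is the bytewise XOR of all data blocks, so any single lost data block can be rebuilt by XOR-ing the survivors with the parity. Real HDFS erasure coding uses Reed-Solomon (e.g. RS-(6,3)), which tolerates multiple simultaneous failures, whereas plain XOR tolerates only one.

```java
import java.util.Arrays;

// Minimal sketch of an XOR-based codec (illustrative names, not Hadoop APIs).
public class XorCodec {
    // Parity block = bytewise XOR of all data blocks.
    public static byte[] parity(byte[][] dataBlocks) {
        byte[] p = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks)
            for (int i = 0; i < p.length; i++)
                p[i] ^= block[i];
        return p;
    }

    // Rebuild the block at index 'lost' from the surviving blocks and the parity.
    public static byte[] recover(byte[][] dataBlocks, int lost, byte[] parity) {
        byte[] r = parity.clone();
        for (int b = 0; b < dataBlocks.length; b++)
            if (b != lost)
                for (int i = 0; i < r.length; i++)
                    r[i] ^= dataBlocks[b][i];
        return r;
    }

    public static void main(String[] args) {
        byte[][] data = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        byte[] p = parity(data);
        byte[] rebuilt = recover(data, 1, p); // pretend data[1] was lost
        System.out.println(Arrays.equals(rebuilt, data[1])); // true
    }
}
```

The same XOR identity (x ⊕ x = 0) that makes the truth table work is what lets the lost block cancel out of the parity.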
Durability & Efficiency

                     3-way data replication   Erasure coding: RS-(6,3)
Data durability      2                        3
Storage efficiency   1/3 (33.33%)             6/9 (67%)

Data durability = how many simultaneous failures can be tolerated.
Storage efficiency = what portion of the raw storage holds useful data (the rest is redundant data).

[Diagram: 3-way replication keeps full replicas on Datanode1–Datanode3; RS-(6,3) erasure coding stripes 6 data blocks (b1–b6) and 3 parity blocks (b7–b9) across D1–D9]
• Released version – Apache Hadoop 3.0.0-alpha1
Microbenchmark: Codec Calculation
[Chart: codec calculation throughput in MB per second. Image courtesy Cloudera]

High Performance
• New Intel architecture solutions for storage (ISA-L)
• The Intel® Intelligent Storage Acceleration Library provides a solution to deploy EC with better performance.
  https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version
HDFS – Encryption
• Sensitivity of the data and managing its privacy are very important for big data analytics
• Encryption is a regulatory requirement for many business sectors
  • Finance
  • Government
  • Healthcare, etc.

[Diagram: the DFSClient performs per-file encryption key operations against the KMS, and reads/writes encrypted data to the HDFS cluster; encryption covers both data at rest (on disk, via the encryption library) and data in transit]
• Released version – Apache Hadoop 2.6.0
Encryption Algorithm
• Data encryption/decryption is costly
• Encryption ciphers
  • AES-CTR (Advanced Encryption Standard – Counter Mode) is the most popular
  • Uses 128-, 192-, or 256-bit keys
Encryption with AES-CTR
• Two implementations of AES-CTR:
  1. JCE (Java Cryptography Extension) software implementation
  2. OpenSSL hardware-accelerated AES-NI (Intel® Advanced Encryption Standard New Instructions) implementation
• AES-NI is available in Westmere (2010) and newer Intel CPUs
• AES-NI was further optimized in Haswell (2013)
https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni
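As a minimal sketch of the JCE software path, the standard `Cipher` API supports the `AES/CTR/NoPadding` transformation directly. The all-zero key and IV below are for illustration only; HDFS derives per-file keys via the KMS, and the hardware-accelerated path swaps the cipher provider rather than changing this calling pattern.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AesCtrDemo {
    // AES-CTR is symmetric: the same operation encrypts and decrypts,
    // controlled here by the JCE mode constant.
    public static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] input) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(input);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // 128-bit all-zero key: illustration only, never for real data
        byte[] iv = new byte[16];  // initial counter block
        byte[] plain = "hello hdfs encryption".getBytes(StandardCharsets.UTF_8);

        byte[] cipherText = crypt(Cipher.ENCRYPT_MODE, key, iv, plain);
        byte[] roundTrip = crypt(Cipher.DECRYPT_MODE, key, iv, cipherText);
        System.out.println(Arrays.equals(plain, roundTrip)); // true
    }
}
```

CTR mode turns AES into a stream cipher, which is why HDFS can seek into the middle of an encrypted file without decrypting everything before it.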
Microbenchmark: Encrypt/Decrypt a 1 GB Byte Array
Test environment:
• Run locally on a single Haswell machine
• Single-threaded; excludes HDFS overheads (checksumming, network, copies)
[Chart: image courtesy Cloudera]
Apache Commons Crypto
• The cryptographic layer was spun out as a new Apache Commons component:
  http://commons.apache.org/proper/commons-crypto/
• Apache Commons Crypto has also been integrated with Apache Spark, for shuffle encryption.
HBase
HBase• NoSQL database in Hadoop eco• Accumulates writes in memory and flushes to HDFS• Caches data • Reduced read latency• Better read throughput
•Memory hungry processes
Big Memory
• Hadoop platforms are no longer only for commodity hardware
• Systems are moving toward faster CPUs and bigger memories
• Big Data => Big Storage + Big Memory
• Non-volatile memory technology
  • 3D XPoint™ DIMMs from Intel®
  • Higher memory capacity
  • Lower cost vs. DDR
HBase – Go Big Cache

[Diagram: clients issue reads to HBase, which serves data from a cache in JVM off-heap memory, backed by HDFS]

• JVM GC tuning continues to be a challenge with larger heaps (new GC algorithms)
• A much bigger cache in off-heap memory enables faster random reads
• Better, more predictable latency
• Building blocks for supporting 3D XPoint™ products
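The core idea of the off-heap cache can be sketched with direct `ByteBuffer`s. This is a hypothetical illustration of the technique, not HBase's actual BucketCache code: data cached in memory allocated by `ByteBuffer.allocateDirect` lives outside the Java heap, so a multi-gigabyte cache does not inflate GC pause times.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an off-heap block cache (illustrative, not HBase code).
// Only the small key map lives on the heap; block bytes live off-heap.
public class OffHeapBlockCache {
    private final Map<String, ByteBuffer> blocks = new HashMap<>();

    public void cacheBlock(String key, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length); // off-heap allocation
        buf.put(data);
        buf.flip(); // make the buffer readable from position 0
        blocks.put(key, buf);
    }

    public byte[] getBlock(String key) {
        ByteBuffer buf = blocks.get(key);
        if (buf == null) return null;
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the cached buffer's position untouched
        return out;
    }

    public static void main(String[] args) {
        OffHeapBlockCache cache = new OffHeapBlockCache();
        cache.cacheBlock("region1/block42", "row data".getBytes());
        System.out.println(new String(cache.getBlock("region1/block42"))); // row data
    }
}
```

The trade-off is an extra copy on read (off-heap bytes back into a heap array), which is why predictable latency, rather than raw speed per read, is the headline win.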
HBase – Go Big Cache

[Charts: performance before and after off-heaping. Image courtesy Alibaba Inc.]

• Alibaba adopted this feature for their 1600-node cluster
• Used during the "Double 11" online sale
Questions

Thank You
https://software.intel.com/en-us/bigdata/apache-big-data-stack
Backup slides
[Charts: Microbenchmark – Codec Calculation, throughput in MB per second]