Hadoop intro

INTRODUCTION TO HADOOP

Explaining a complex product in 20 minutes or less…

INTRODUCTION

Keith R. Davis

Data Architect – NEMSIS Project

University of Utah, School of Medicine

WHAT IS HADOOP?

Hadoop is an open source Apache software project that enables the distributed processing of large data sets across clusters of commodity servers.

A QUICK BIT OF HISTORY…

• (2004) Google publishes the GFS and MapReduce papers

• (2005) Apache Nutch search project rewritten to use MapReduce

• (2006) Hadoop was factored out of the Apache Nutch project

• (2006) Development was sponsored by Yahoo

• (2008) Becomes a top-level Apache project

• (Trivia) Why is it called Hadoop?

• It was named after the principal architect’s son's toy elephant!

WHO IS USING HADOOP?

And more…

HOW IS HADOOP DIFFERENT FROM A TRADITIONAL RDBMS?

• Data is not stored in tables

• Hadoop supports only forward parsing

• Hadoop doesn’t guarantee ACID properties

• Hadoop takes code to the data

• Scales horizontally vs. vertically

WHAT’S THE BIG DEAL?

Hadoop is:

• Easily scalable – New cluster nodes can be added as needed

• Cost effective – Hadoop brings massively parallel computing to commodity servers

• Flexible – Hadoop is schema-less and can absorb any type of data

• Fault tolerant – Shared-nothing architecture helps prevent data loss and process failure

WHEN SHOULD I USE HADOOP?

Use Hadoop when:

• You need to process terabytes of unstructured data

• Running batch jobs is acceptable

• You have access to a lot of cheap hardware

DO NOT use Hadoop when you need to:

• Perform calculations with little or no data (Pi to one million places)

• Process data in a transactional manner

• Get interactive, ad-hoc results (this is changing)

BASIC ARCHITECTURE

Hadoop consists of two primary services:

1. Reliable storage through HDFS (Hadoop Distributed File System)

2. Parallel data processing using a technique known as MapReduce

HOW IT WORKS: HDFS WRITE STEP #1 (FILE SPLITS)

[Diagram: the input data (CSV) is split into Block #1, Block #2, and Block #3]
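The file-split step can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function name `split_into_blocks`, the toy CSV payload, and the 10-byte block size are all assumptions chosen so that the result is three blocks, as on the slide. Real HDFS blocks are far larger (tens of megabytes or more).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive chunks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical 24-byte CSV payload, used only for illustration.
csv_data = b"id,state\n1,UT\n2,CO\n3,NV\n"

# A 10-byte block size yields three blocks; the last one is shorter.
blocks = split_into_blocks(csv_data, block_size=10)

for number, block in enumerate(blocks, start=1):
    print(f"Block #{number}: {len(block)} bytes")
```

Note that the split is by byte count, so a block boundary can fall in the middle of a CSV row. Joining the blocks back together in order reproduces the original file.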

HOW IT WORKS: HDFS WRITE STEP #2 (REPLICATION)

[Diagram: Block #1, Block #2, and Block #3 each appear twice, spread across Node #1, Node #2, and Node #3]
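The replication step can be sketched as follows. The round-robin placement used here is an assumption made for illustration and is not HDFS's actual placement policy; the function names `place_replicas` and `surviving_blocks` are hypothetical. The replication factor of 2 matches the slide, where each block appears twice.

```python
def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[str, list[int]]:
    """Assign each block to `replication` distinct nodes, round-robin (illustrative policy)."""
    if replication > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for block in range(1, num_blocks + 1):
        for r in range(replication):
            # Consecutive nodes, wrapping around, so copies never share a node.
            node = nodes[(block - 1 + r) % len(nodes)]
            placement[node].append(block)
    return placement


def surviving_blocks(placement: dict[str, list[int]], failed_node: str) -> set[int]:
    """Return the blocks still readable after one node is lost."""
    return {
        block
        for node, held in placement.items()
        if node != failed_node
        for block in held
    }


nodes = ["Node #1", "Node #2", "Node #3"]
placement = place_replicas(num_blocks=3, nodes=nodes, replication=2)
print(placement)
print(surviving_blocks(placement, failed_node="Node #2"))
```

With three blocks, three nodes, and two copies of each block, every node holds two blocks, and losing any single node still leaves at least one copy of every block.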

HOW IT WORKS: MAP/REDUCE

[Diagram: a client submits a job to the job scheduler, which runs mappers and reducers on the data nodes. The mappers read from the HDFS file system (input) and the reducers write to the HDFS file system (output).]
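The flow in the diagram (map, then group by key, then reduce) can be imitated in a single Python process. This is a minimal sketch of the technique using a word count, not the Hadoop API: the names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical, and a real job would run many mappers and reducers in parallel on separate data nodes.

```python
from collections import defaultdict


def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["hadoop stores data", "hadoop processes data"])
print(counts)
```

Because each mapper call looks at only one line and each reducer call looks at only one key, the work can be spread across nodes without the pieces needing to talk to each other.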

LOOKS COMPLICATED!

Not to worry, there are many ways to access the power of MapReduce:

• Hadoop Java API (If you like Java and low level stuff)

• Pig (If you are a script wiz and LINQ doesn’t scare you)

• Hive (You know some SQL and coding isn’t your thing)

• RHadoop (If R is your thing)

• SAS/ACCESS (If SAS is your thing)

HIVE: THE EASY WAY TO GET DATA OUT

• Supports the concepts of databases, tables, and partitions through the use of metadata (think of views over delimited text files)

• Supports a restricted version of SQL (no updates or deletes)

• Supports joins between tables - INNER, OUTER (FULL, LEFT, and RIGHT)

• Supports UNION to combine multiple SELECT statements

• Provides a rich set of data types and predefined functions

• Allows the user to create custom scalar and aggregate functions

• Executes queries via MapReduce

• Provides JDBC and ODBC drivers for integration with other applications

• Hive is NOT a replacement for a traditional RDBMS as it is not ACID compliant
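The idea of "views over delimited text files" can be sketched in Python: the data stays as plain delimited text, and the table definition is just metadata applied when the file is read. This is an analogy, not Hive itself. The schema, column names, sample rows, and the `read_table` helper are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical table metadata: column names and the type applied on read.
TABLE_SCHEMA = [("incident_id", int), ("state", str), ("response_minutes", float)]


def read_table(text: str, schema, delimiter: str = ","):
    """Yield one dict per row, casting each field according to the schema."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# The "table" is nothing more than delimited text (made-up sample data).
raw = "1,UT,7.5\n2,CO,12.0\n3,UT,9.5\n"
rows = list(read_table(raw, TABLE_SCHEMA))

# Rough equivalent of a query such as (table name is hypothetical):
#   SELECT state, avg(response_minutes) FROM incidents GROUP BY state
by_state = defaultdict(list)
for row in rows:
    by_state[row["state"]].append(row["response_minutes"])
averages = {state: sum(vals) / len(vals) for state, vals in by_state.items()}
print(averages)
```

In Hive the grouping and averaging would be expressed in SQL and executed as MapReduce jobs; the point here is only that the schema lives beside the text file rather than inside it.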

HIVE: MATH AND STATS FUNCTIONS

If you use HIVE to create sample sets for your analysis, here are a few standard functions you may find useful:

round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(), bin(), hex(), unhex(), conv(), abs(), pmod(), sin(),

asin(), cos(), acos(), tan(), atan(), degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(), min(), max(),

variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop(), covar_samp(), corr(), percentile(), percentile_approx(),

histogram_numeric(), collect_set()
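When building sample sets, it helps to keep the population (`_pop`) and sample (`_samp`) variants apart. The sketch below uses Python's standard `statistics` module on made-up numbers to show the difference. The mapping to Hive function names in the comments reflects my reading of the Hive documentation, where `variance()` is the population variance, and should be checked against the Hive version in use.

```python
import statistics

# Made-up sample values, used only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population statistics divide by n (compare Hive's variance() and stddev_pop()).
pop_var = statistics.pvariance(sample)
pop_std = statistics.pstdev(sample)

# Sample statistics divide by n - 1 (compare Hive's var_samp() and stddev_samp()).
samp_var = statistics.variance(sample)
samp_std = statistics.stdev(sample)

print(f"population variance: {pop_var:.3f}, sample variance: {samp_var:.3f}")
print(f"population stddev:   {pop_std:.3f}, sample stddev:   {samp_std:.3f}")
```

The sample variants are always somewhat larger for the same data, and the gap shrinks as the number of rows grows.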

RESOURCES

• Cloudera (Easy Setup) - http://www.cloudera.com/content/cloudera/en/home.html

• NoSQL - http://nosql-database.org/

• Emulab - http://www.emulab.net/

• Apache Hadoop - http://hadoop.apache.org/#Getting+Started

• RHadoop - https://github.com/RevolutionAnalytics/RHadoop/wiki

• SAS/ACCESS - http://www.sas.com/software/data-management/access/index.html

THANK YOU!