BIG DATA: Apache Hadoop
A part of the Nordic IT group EVRY
Infopulse
Oleksiy Krotov (Expert Oracle DBA) 19.01.2016
BIG DATA: Apache Hadoop
Apache Hadoop
HADOOP ARCHITECTURE
HADOOP INTERFACE
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HADOOP MAPREDUCE
ORACLE BIG DATA
RESOURCES
Hadoop Architecture
Apache Hadoop is an open-source framework for distributed storage and distributed processing of very large data sets:
the storage part is known as the Hadoop Distributed File System (HDFS)
the processing part is called MapReduce.
Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process the data, Hadoop ships packaged code to the nodes, so that each node processes, in parallel, the data it stores locally.
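As a rough illustration of the splitting step, the sketch below chops a file into fixed-size blocks and spreads them over nodes round-robin. This is not Hadoop code; the block size (128 MB, the HDFS default in Hadoop 2.x) and node names are assumptions for the example.

```python
# Illustrative sketch (not Hadoop source): split a large file into
# fixed-size blocks and distribute them across cluster nodes round-robin.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS default block size (Hadoop 2.x)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_to_nodes(blocks, nodes):
    """Map block index -> node name, round-robin."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = assign_to_nodes(blocks, ["node1", "node2", "node3"])
print(len(blocks))  # 3 blocks: 128 MB + 128 MB + 44 MB
```

In real HDFS the NameNode records this block-to-node mapping as metadata, which is the role described on the next slide.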
Hadoop Architecture
Biggest Hadoop cluster: Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes and storing 455 petabytes of data in Hadoop (2014)
More than half of the Fortune 50 companies run open-source Apache Hadoop based on Cloudera's distribution (2012)
The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data.
Hadoop Architecture
NameNode hosts the metadata (the file-system index of files and blocks)
DataNodes host the data (blocks)
JobTracker is the master that creates jobs and schedules their tasks across the cluster
Hadoop Interface
[training@localhost ~]$ hdfs dfsadmin -report
Configured Capacity: 15118729216 (14.08 GB)
Present Capacity: 10163642368 (9.47 GB)
DFS Remaining: 9228095488 (8.59 GB)
DFS Used: 935546880 (892.21 MB)
DFS Used%: 9.2%
Under replicated blocks: 3
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (localhost.localdomain)
Hostname: localhost.localdomain
Decommission Status : Normal
Configured Capacity: 15118729216 (14.08 GB)
DFS Used: 935546880 (892.21 MB)
Non DFS Used: 4955086848 (4.61 GB)
DFS Remaining: 9228095488 (8.59 GB)
DFS Used%: 6.19%
DFS Remaining%: 61.04%
Last contact: Mon Jan 18 14:05:48 EST 2016
Hadoop Interface
[training@localhost ~]$ hadoop fs -help get
-get [-ignoreCrc] [-crc] <src> ... <localdst>: Copy files that match the file pattern <src>
  to the local name. <src> is kept. When copying multiple files,
  the destination must be a directory.
hadoop fs -ls
hadoop fs -put purchases.txt
hadoop fs -put access_log
hadoop fs -ls
hadoop fs -tail purchases.txt
hadoop fs -get filename
hs {mapper script} {reducer script} {input_file} {output directory}
hs mapper.py reducer.py myinput joboutput
Hadoop Distributed File System (HDFS)
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications
Hadoop Distributed File System (HDFS)
The default replication factor is 3, so data is stored on three nodes: two on the same rack, and one on a different rack.
Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high
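The placement policy described above can be sketched as follows. This is a toy model, not HDFS source: rack and node names are made up, and the real policy has more cases (e.g. when the writer runs outside the cluster).

```python
# Illustrative sketch (not HDFS source) of the default placement policy for
# replication factor 3: one replica on the writer's own node, and two more
# on two distinct nodes of a different rack — so two replicas share a rack
# and the third sits on another, as described above.
def place_replicas(writer_node, racks):
    """racks: dict mapping rack name -> list of node names."""
    placements = [writer_node]
    writer_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = next(r for r in racks if r != writer_rack)
    # Two replicas on two distinct nodes of one remote rack.
    placements += racks[remote_rack][:2]
    return placements

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # ['n1', 'n3', 'n4']
```

Losing a whole rack then still leaves at least one live replica, which is the point of spreading copies across racks.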
Apache Hadoop can work with additional file systems:
FTP, Amazon S3, Windows Azure Storage Blobs (WASB)
Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
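The three phases just described — independent map tasks, a sort/shuffle of their outputs, then reduce tasks per key — can be shown with a toy in-memory word count. Real Hadoop runs these steps across many nodes; here everything runs in one process purely to illustrate the data flow.

```python
# Toy single-process illustration of the MapReduce phases:
# map each input chunk to (key, value) pairs, sort/group by key, reduce per key.
from itertools import groupby
from operator import itemgetter

def mapper(chunk):
    # Emit (word, 1) for every word in this chunk.
    return [(word, 1) for word in chunk.split()]

def reducer(key, values):
    # Sum all counts emitted for one key.
    return (key, sum(values))

chunks = ["big data apache hadoop", "apache hadoop hadoop"]

# Map phase: each chunk is processed independently (in parallel on a cluster).
mapped = [pair for chunk in chunks for pair in mapper(chunk)]
# Shuffle/sort phase: bring all pairs with the same key together.
mapped.sort(key=itemgetter(0))
# Reduce phase: one reduce call per distinct key.
counts = dict(reducer(k, [v for _, v in group])
              for k, group in groupby(mapped, key=itemgetter(0)))
print(counts)  # {'apache': 2, 'big': 1, 'data': 1, 'hadoop': 3}
```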
Hadoop MapReduce
Usage: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input <path>                 DFS input file(s) for the Map step
  -output <path>                DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>   The streaming command to run
  -combiner <cmd|JavaClassName> The streaming command to run
  -reducer <cmd|JavaClassName>  The streaming command to run
  -file <file>                  File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName    Optional.
  -numReduceTasks <num>         Optional.
  -inputreader <spec>           Optional.
  -cmdenv <n>=<v>               Optional. Pass env.var to streaming commands
  -mapdebug <path>              Optional. To run this script when a map task fails
  -reducedebug <path>           Optional. To run this script when a reduce task fails
  -io <identifier>              Optional.
  -verbose
hs {mapper script} {reducer script} {input_file} {output directory}
hs mapper.py reducer.py myinput joboutput
Oracle Big Data Connectors
Load Data into the Database
Oracle Loader for Hadoop
– A MapReduce job transforms data on Hadoop into Oracle-ready data types
– Uses more Hadoop compute resources
Oracle SQL Connector for HDFS
– Oracle SQL access to data on Hadoop via external tables
– Uses more database compute resources
– Includes an option to query in-place
Oracle Big Data Appliance X5-2
Enterprise-class security for Hadoop through Oracle Big Data SQL, which also makes it possible to quickly explore data across Hadoop, NoSQL, and relational databases with a single SQL query.
Resources
https://hadoop.apache.org/docs/stable/
https://en.wikipedia.org/wiki/Apache_Hadoop
https://developer.yahoo.com/hadoop/tutorial/
http://go.cloudera.com/udacity-lesson-1
http://content.udacity-data.com/courses/ud617/access_log.gz
http://content.udacity-data.com/courses/ud617/purchases.txt.gz
https://www.youtube.com/watch?v=acWtid-OOWM
http://www.oracle.com/technetwork/database/bigdata-appliance/overview/index.html
https://www.udacity.com/courses/ud617
Thank you for your attention!
Contact us!
Address: 24, Polyova Str., Kyiv, Ukraine, 03056
Phone: +38 044 457-88-56
Email: [email protected]