Hadoop introduction

Transcript of Hadoop introduction

  1. Background: the big data challenge (link)
  2. Use Case - Retail: Retail and the Big Data Evolution
  3. Use Case - Retail: Data is providing gains in three main ways: opening new channels, tailoring services, and driving revenue.
  4. Pain Point: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  5. So what is Hadoop? Hadoop was created by Doug Cutting and Mike Cafarella. It provides a reliable system for shared storage and analysis. The storage layer, HDFS, is designed to scale from a single server to thousands of machines with a high degree of fault tolerance. The analysis layer, MapReduce, is a programming model and implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
  6. History
  7. Hadoop Distributed File System (HDFS): A file is broken into blocks (default 64 MB), and each block is replicated across the cluster (default replication factor 3). There is no limit on the number or size of files. HDFS lets you put, get, and delete files, but not update them, following the philosophy "Write Once, Read Multiple Times!" (see the HDFS sketch after this list).
  8. MapReduce Flow (see the WordCount sketch after this list)
  9. Hadoop Ecosystem: The Hadoop Ecosystem Table
  10. Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Each Flume agent has a source, a sink, and a channel (see the Flume client sketch after this list).
  11. Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (see the example import command after this list).
  12. Hive: Apache Hive is a high-level abstraction on top of MapReduce. It uses an SQL-like language called HiveQL and generates MapReduce jobs that run on the Hadoop cluster. Originally developed by Facebook for data warehousing, it is now an open-source Apache project (see the Hive JDBC sketch after this list).
  13. HBase: Apache HBase is the Hadoop database, a distributed, scalable big data store. It offers linear scalability (capable of storing hundreds of terabytes of data), automatic and configurable sharding of tables, automatic failover support, and block cache and Bloom filters for real-time queries. It provides real-time random read/write access to data stored in HDFS (see the HBase client sketch after this list).
  14. Pig: Apache Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs. Its pieces are the data flow language (Pig Latin), the interactive shell where you type Pig Latin statements (Grunt), and the Pig interpreter and execution engine (see the embedded Pig sketch after this list).
  15. Oozie: Oozie is a workflow engine that runs on a server, typically outside the cluster. It runs workflows of Hadoop jobs, including Pig, Hive, and Sqoop jobs, and submits those jobs to the cluster based on a workflow definition (see the Oozie client sketch after this list).
  16. Why?
  17. Why?
  18. Appendix: What is Hadoop? (https://www.youtube.com/watch?v=4DgTLaFNQq0, https://www.youtube.com/watch?v=9s-vSeWej1U); Intro to MapReduce (https://www.youtube.com/watch?v=HFplUBeBhcM); What is Big Data?; Big Data Explained (Hadoop & MapReduce); Big Data University - Hadoop Fundamentals I - v2; Big Data Challenges
  19. Appendix: Hadoop Tutorial 1 - What is Hadoop?; Hadoop Tutorial 2 - Challenges Created by Big Data; Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apac; Hadoop Tutorial 4 - Overview of Hadoop Projects; Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS; Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox; Hadoop Tutorial 7 - Downloading Hadoop Appliance for Oracle VirtualBox
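
HDFS sketch (slide 7). A minimal Java sketch of the put/get/delete operations using the org.apache.hadoop.fs.FileSystem API. The NameNode address and file paths are placeholders; the same operations are also available from the hdfs dfs shell.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutGetDelete {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // "put": copy a local file into HDFS (write once).
            fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/events.log"));

            // "get": copy it back out of HDFS (read many times).
            fs.copyToLocalFile(new Path("/data/events.log"), new Path("/tmp/events.copy.log"));

            // "delete": remove the HDFS copy (second argument = recursive).
            // Note there is no in-place update of an existing file.
            fs.delete(new Path("/data/events.log"), false);

            fs.close();
        }
    }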
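
MapReduce flow sketch (slide 8). The classic WordCount job, close to the version in the Hadoop MapReduce tutorial: the map phase emits (word, 1) pairs, the shuffle groups them by word, and the reduce phase sums the counts. Input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: sum the counts that the shuffle grouped by word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }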
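
Flume client sketch (slide 10). Flume agents themselves (source, channel, sink) are wired together in the agent's properties file, not in code. As a Java-side sketch, the snippet below uses Flume's RPC client API to push events into a hypothetical agent whose Avro source listens on localhost:41414; the host, port, and event body are assumptions.

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeEventSender {
        public static void main(String[] args) throws EventDeliveryException {
            // Assumed agent: an Avro source bound to localhost:41414,
            // connected to a channel and a sink in the agent's properties file.
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
                client.append(event);   // hand the event to the agent's source
            } finally {
                client.close();
            }
        }
    }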
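
Sqoop example (slide 11). Sqoop is driven from the command line rather than from code. A typical import from a relational table into HDFS looks roughly like the command below; the JDBC URL, credentials, table name, and target directory are all placeholders.

    sqoop import \
      --connect jdbc:mysql://dbhost/shop \
      --username report --password-file /user/etl/db.password \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4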
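
Hive sketch (slide 12). A minimal Java sketch that submits a HiveQL query through the standard HiveServer2 JDBC driver; the server host, credentials, and the orders table are assumptions. Hive compiles the query into MapReduce jobs that run on the cluster.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Assumed HiveServer2 endpoint, user, and table; all placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "etl", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL looks like SQL; Hive turns it into MapReduce jobs.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) AS cnt FROM orders GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }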
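
HBase sketch (slide 13). A sketch of HBase's real-time random read/write path using the standard Java client. The table user_profiles and column family info are assumptions and would need to exist already (for example, created beforehand in the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

                // Random write: one row keyed by user id.
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Taipei"));
                table.put(put);

                // Random read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }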
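
Embedded Pig sketch (slide 14). The snippet below embeds a couple of Pig Latin statements in Java via PigServer; the same statements could be typed interactively in the Grunt shell. The input path, field layout, and output path are assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigOrderFilter {
        public static void main(String[] args) throws Exception {
            // MAPREDUCE mode sends the generated jobs to the cluster; LOCAL runs in-process.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Pig Latin: load tab-separated orders, keep the large ones, store the result.
            pig.registerQuery("orders = LOAD '/data/orders' AS (id:chararray, amount:double);");
            pig.registerQuery("big = FILTER orders BY amount > 100.0;");
            pig.store("big", "/data/orders_big");

            pig.shutdown();
        }
    }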
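
Oozie client sketch (slide 15). A sketch of submitting a workflow to an Oozie server with the Java client API. The server URL, the HDFS path that holds the workflow.xml definition, and the job properties are all placeholders.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Assumed Oozie server endpoint (Oozie runs outside the cluster).
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = oozie.createConfiguration();
            // HDFS directory containing workflow.xml (the workflow definition).
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/daily");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "resourcemanager:8032");

            String jobId = oozie.run(conf);  // submit and start the workflow
            System.out.println("Submitted workflow " + jobId);
        }
    }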