2. Use Case - Retail Retail and the Big Data Evolution
3. Use Case - Retail Data is providing gains in three main
ways: opening new channels, tailoring services, and driving
revenue.
4. Pain Point Big data is the term for a collection of data
sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data
processing applications.
5. So what is Hadoop? Hadoop was created by Doug Cutting and
Mike Cafarella. Hadoop provides a reliable shared storage and
analysis system. Storage (HDFS): it is designed to scale up from a
single server to thousands of machines, with a high degree of fault
tolerance. Analysis (MapReduce): it provides a programming model and
an associated implementation for processing and generating large data
sets with a parallel, distributed algorithm on a cluster.
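The MapReduce programming model can be illustrated with a small
single-process word count. This is only a sketch of the map, shuffle
and reduce steps, not Hadoop's implementation: the function names
(map_phase, shuffle, reduce_phase, word_count) are made up for this
example, and a real job would run the map and reduce steps in parallel
on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each map call sees only one line and each reduce call sees only one
key, which is what lets a cluster spread the work across machines.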
6. History
7. Hadoop Distributed FileSystem (HDFS) A given file is broken
down into blocks (default=64MB in Hadoop 1.x, 128MB from Hadoop 2.x),
then the blocks are replicated across the cluster (default=3 copies).
HDFS sets no fixed limit on the number or size of files; in practice
the NameNode's memory bounds how many files it can track.
HDFS allows you to put/get/delete files. (No in-place update!) It
follows the philosophy "Write Once, Read Many Times!"
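The block and replica arithmetic above can be sketched as follows. The
64MB block size and replication factor of 3 are the defaults quoted on
the slide. The round-robin placement is an assumption made for
illustration only and is not HDFS's actual placement policy; the node
names and function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default quoted on the slide
REPLICATION = 3                # default number of copies per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last one may be smaller."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin).

    Illustrative only: real HDFS placement also considers racks and load.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)  # a 200MB file
    print(len(blocks), "blocks, last block", blocks[-1] // (1024 * 1024), "MB")
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

Because every block lives on several nodes, losing one machine does
not lose any part of the file.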
8. MapReduce Flow
9. Hadoop Ecosystem The Hadoop Ecosystem Table
10. Flume Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and moving large
amounts of log data. Each Flume agent has a source, a channel and a
sink: the source puts events into the channel, and the sink takes them
out and delivers them.
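The source, channel and sink roles of an agent can be mimicked in a
few lines. This is a toy, single-threaded model and not Flume's API:
the Agent class, its batch_size parameter and the in-memory deque used
as the channel are all assumptions made for illustration.

```python
from collections import deque


class Agent:
    """Toy agent: a source feeds a channel, a sink drains the channel."""

    def __init__(self, source, sink, batch_size=2):
        self.source = source      # any iterable of events (e.g. log lines)
        self.channel = deque()    # buffer between source and sink
        self.sink = sink          # callable that delivers one event
        self.batch_size = batch_size

    def _drain(self):
        """Deliver every buffered event to the sink, oldest first."""
        while self.channel:
            self.sink(self.channel.popleft())

    def run(self):
        """Buffer incoming events and hand them to the sink in batches."""
        for event in self.source:
            self.channel.append(event)
            if len(self.channel) >= self.batch_size:
                self._drain()
        self._drain()  # flush whatever is left when the source ends


if __name__ == "__main__":
    delivered = []
    Agent(["line 1", "line 2", "line 3"], delivered.append).run()
    print(delivered)
```

The channel decouples the two sides, so a slow sink does not have to
keep pace with the source event by event.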
11. Sqoop Apache Sqoop(TM) is a tool designed for efficiently
transferring bulk data between Apache Hadoop and structured
datastores such as relational databases.
12. Hive Apache Hive is a high-level abstraction on top of
MapReduce. It uses an SQL-like language called HiveQL and generates
MapReduce jobs that run on the Hadoop cluster. It was originally
developed by Facebook for data warehousing and is now an open-source
Apache project.
13. HBase Apache HBase is the Hadoop database, a distributed,
scalable, big data store. It offers linear scalability (capable of
storing hundreds of terabytes of data), automatic and configurable
sharding of tables, automatic failover support, and a block cache and
Bloom filters for real-time queries. It provides real-time random
read/write access to data stored in HDFS.
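To show why Bloom filters help real-time queries, here is a minimal
Bloom filter. It answers "definitely not present" or "possibly
present" for a key, so a store can skip files that cannot contain a
requested row. This is a generic sketch, not HBase's implementation:
the class, its parameters (num_bits, num_hashes) and the use of
SHA-256 are assumptions chosen for clarity.

```python
import hashlib


class BloomFilter:
    """Set-membership filter with no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        """Derive num_hashes bit positions from the key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Mark every position for this key as set."""
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        """False means the key was never added; True means it may have been."""
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-0001")
    print(row_keys.might_contain("row-0001"))
```

A False answer is always correct, so the expensive disk read can be
skipped; a True answer still has to be checked against the data.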
14. Pig Apache Pig is a platform for analyzing large data sets
that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these
programs. Its components are the data flow language (Pig Latin), the
interactive shell where you can type Pig Latin statements (Grunt), and
the Pig interpreter and execution engine.
15. Oozie Oozie is a 'workflow engine' which runs on a server,
typically outside the cluster. It runs workflows of Hadoop jobs,
including Pig, Hive and Sqoop jobs, and submits those jobs to the
cluster based on a workflow definition.
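The idea of submitting jobs according to a workflow definition can be
sketched as a dependency graph walked in order. This is not Oozie's
XML format or API: the run_workflow function, the action names and
the submit callback are hypothetical, and the sketch assumes Python
3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter


def run_workflow(actions, dependencies, submit):
    """Submit each action only after the actions it depends on.

    actions:      {action name: job type}, e.g. {"clean": "pig"}
    dependencies: {action name: names that must run first}
    submit:       callable(name, job_type) standing in for job submission
    """
    graph = {name: dependencies.get(name, ()) for name in actions}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        submit(name, actions[name])
    return order


if __name__ == "__main__":
    actions = {"import": "sqoop", "clean": "pig", "report": "hive"}
    dependencies = {"clean": ["import"], "report": ["clean"]}
    run_workflow(actions, dependencies,
                 lambda name, kind: print("submit", kind, "job:", name))
```

Keeping the ordering in a definition rather than in the jobs
themselves lets the same Pig, Hive or Sqoop job be reused in several
workflows.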
16. Why?
17. Why?
18. Appendix What is Hadoop?
https://www.youtube.com/watch?v=4DgTLaFNQq0
https://www.youtube.com/watch?v=9s-vSeWej1U Intro to MapReduce
https://www.youtube.com/watch?v=HFplUBeBhcM What is Big Data? Big
Data Explained (Hadoop & MapReduce) Big Data University -
Hadoop Fundamentals I - v2 Big Data Challenges
19. Appendix
Hadoop Tutorial 1 - What is Hadoop?
Hadoop Tutorial 2 - Challenges Created by Big Data
Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apac
Hadoop Tutorial 4 - Overview of Hadoop Projects
Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS
Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox
Hadoop Tutorial 7 - Downloading Hadoop Appliance for Oracle VirtualBox