Hadoop Education v1

  • 8/13/2019 Hadoop Education v1

    1/32

    Hadoop Education

    DBA Team

    4/14/2011

    Jan. 16, 2009

    SHI BI Landscape

    SHC Hadoop Landscape

    SHC Hadoop

    Hadoop is a software framework for processing large amounts of data distributed across multiple commodity nodes (servers). The base Hadoop environment contains a distributed file system (HDFS) and a parallel programming framework (MapReduce). Additional projects may be added to the Hadoop software framework.

    Hadoop is not a replacement for an RDBMS (Relational Database Management System).

    SHC Hadoop Projects Overview

    HADOOP CORE

    HDFS

    HDFS (Hadoop Distributed File System) is a distributed, fault-tolerant file system designed to be deployed on low-cost commodity hardware. HDFS provides high-throughput access to large amounts of application data.

    HDFS is not a file system that requires expensive, fast disk drives with RAID (Redundant Array of Independent Disks) to provide high throughput and fault tolerance.
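
    HDFS achieves fault tolerance without RAID by splitting each file into large blocks and keeping copies of every block on several nodes. The sketch below illustrates that idea in plain Python; it is not HDFS code. The block size (64 MB), the replication factor (3), the node names, and the round-robin placement are assumptions chosen for illustration; real HDFS uses its own rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 64   # assumed block size, for illustration only
REPLICATION = 3      # assumed number of copies kept per block


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [nodes holding a replica]} using round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` distinct nodes, rotating the start node per block.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


# Hypothetical four-node cluster storing one 200 MB file.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(200, nodes)
still_readable = surviving_blocks(placement, "node1")
```

    With these assumed values the 200 MB file occupies four blocks, and every block remains readable after any single node fails, which is why worker nodes can use plain disks.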

    MapReduce

    [Figure: MapReduce data flow. Input splits (split0, split1, split2) feed map tasks; map output is sorted, copied, and merged into reduce tasks, which write part0 and part1.]

    MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

    MapReduce is not a replacement for an RDBMS (Relational Database Management System) or SQL (Structured Query Language).
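
    The programming model can be illustrated with a word count that runs in a single Python process. This is only a sketch of the map, sort, copy, merge, and reduce steps, not the Hadoop API; the function names, the sample input, and the CRC32-based partitioner are assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby
import zlib


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def partition(key, num_reducers):
    """Choose the reducer for a key with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # One bucket per reducer; each map task's output is "copied" into them.
    buckets = defaultdict(list)
    for split in splits:
        pairs = sorted(map_phase(split))            # sort on the map side
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))

    parts = {}
    for r in range(num_reducers):
        merged = sorted(buckets[r])                 # merge the copied map outputs
        # Reduce: sum the values that share a key.
        parts[f"part{r}"] = {
            key: sum(value for _, value in group)
            for key, group in groupby(merged, key=lambda kv: kv[0])
        }
    return parts


splits = [
    ["the quick brown fox"],    # split0
    ["the lazy dog"],           # split1
    ["the quick dog barks"],    # split2
]
parts = run_job(splits)
totals = {word: n for part in parts.values() for word, n in part.items()}
print(parts)
```

    Each word lands in exactly one output part, so the parts can be read independently or combined, as `totals` does here.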

    HADOOP PROJECTS & SUBPROJECTS

    AVRO

    Avro is a data serialization system. It provides a means to distribute non-text files, such as .zip, graphics, and binary files (and text files), in a consistent manner across a distributed (Hadoop) environment.

    FLUME

    Flume (Log Flume) is a horizontally scalable data aggregation tool that can support different levels of compression, batching, and reliability for each unique data flow.

    HBase

    HBase is a NoSQL, multi-dimensional, distributed, highly available data store made up of rows and column families, which can support billions of rows and millions of columns.

    HBase is not a SQL database and thus does not have the concepts of joins, data types, SQL, or even a query engine.
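
    The row and column-family layout can be pictured as nested maps. The toy class below is a sketch of that data model only; it is not the HBase client API, and the class name, table contents, and timestamps are hypothetical. Values are stored as raw bytes to reflect the absence of data types, and each cell keeps timestamped versions, which is one of the dimensions of the store.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a table made of rows and column families (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)   # column families are declared up front
        # row key -> family -> qualifier -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][family].setdefault(qualifier, []).append((ts, value))

    def get(self, row, family, qualifier):
        """Return the newest value for a cell, or None if the cell is absent."""
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier)
        if not versions:
            return None
        return max(versions, key=lambda tv: tv[0])[1]


users = TinyTable(families=["info", "stats"])
users.put("user#1", "info", "name", b"Ada", ts=1)
users.put("user#1", "info", "name", b"Ada L.", ts=2)   # newer version of the same cell
users.put("user#2", "stats", "logins", b"7", ts=1)     # rows need not share columns

latest_name = users.get("user#1", "info", "name")
missing = users.get("user#2", "info", "name")
```

    Access is always by row key and column; there is no join or query planner in this model, so anything resembling a join has to be done by the application.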

    Hive

    Hive is a data warehouse environment built on top of Hadoop. It gives SQL programmers and MapReduce programmers a common SQL-like query language, called QL, which can be extended with custom mapper and reducer plug-ins. It is best used for batch jobs over large, immutable sets of data.

    Hive is not designed for online transaction processing (OLTP) and does not offer real-time queries or row-level updates.

    HUE

    Hue (Hadoop User Experience) is a unified web-based UI for interacting with Hadoop. Hue provides an interface to submit jobs, watch running jobs, browse the file system, and interact with Hive. Additional UI applications can be built to be used with Hue, thus providing a single access point into Hadoop.

    LUCENE/SOLR

    www.yonik.com

    Lucene/Solr are two projects that merged into one in March 2010. Lucene is a Java-based indexing and search library that also provides spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. Solr is a high-performance enterprise search server with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and search interfaces.

    PIG

    Apache Pig provides a scripting language (Pig Latin) for exploring large datasets. With a few commands, it can search terabytes of data. Pig programs run in a distributed environment on a cluster (programs are compiled into MapReduce jobs and executed using Hadoop).

    OOZIE (Yahoo)

    http://yahoo.github.com/oozie/design.html

    Oozie is a workflow and coordination server for managing jobs in a distributed (Hadoop) environment. Oozie job execution can be driven on a time basis, a data-availability basis, or both.
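
    The time and data-availability conditions can be sketched as a simple readiness check. This is an illustration of the triggering idea only, not Oozie's configuration format or API; the function name, the paths, and the dates are hypothetical.

```python
from datetime import datetime


def ready_to_run(now, start_time=None, required_inputs=(), available=frozenset()):
    """True when the time condition and the data condition are both satisfied.

    A condition that is not specified (no start_time, no required inputs)
    counts as satisfied, so a job can be time-driven, data-driven, or both.
    """
    time_ok = start_time is None or now >= start_time
    data_ok = all(path in available for path in required_inputs)
    return time_ok and data_ok


# Hypothetical state: only the 2011-04-13 input has arrived so far.
available = {"/data/sales/2011-04-13"}
now = datetime(2011, 4, 14, 2, 0)

time_only = ready_to_run(now, start_time=datetime(2011, 4, 14, 1, 0))
data_missing = ready_to_run(now,
                            required_inputs=["/data/sales/2011-04-14"],
                            available=available)
both = ready_to_run(now,
                    start_time=datetime(2011, 4, 14, 1, 0),
                    required_inputs=["/data/sales/2011-04-13"],
                    available=available)
```

    Here the time-driven job and the combined job are ready, while the job waiting on the 2011-04-14 input is held until that data appears.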

    SQOOP

    [Figure: Sqoop sits between the RDBMS and Hadoop; labels: Generated Record, Datatype Definitions.]

    Sqoop (SQL-to-Hadoop) is a database import tool that provides the capability to easily copy tables or entire databases between SQL databases (RDBMS) and Hadoop files in HDFS (Hadoop Distributed File System).

    ZOOKEEPER

    ZooKeeper enables highly reliable distributed coordination by providing a centralized service for maintaining configuration information, naming, distributed synchronization, and group services for distributed (Hadoop) applications.

    NON-HADOOP PROJECTS

    GANGLIA

    [Figure: Ganglia architecture. Each cluster node (Node #1, Node #2, Node #3, ...) runs gmond; a node running gmetad stores the data in RRD, which is viewed from a browser client.]

    Ganglia is a scalable distributed monitoring system used to monitor clusters and grids. It provides the ability to drill down through standard or custom textual and graphical views at a single-node or cluster level.

    NAGIOS

    Nagios is an open source monitoring, alerting, response, reporting, maintenance, and capacity planning tool for servers and networks. Nagios can be set up to monitor critical infrastructure, such as network protocols, applications, services, servers, and network components. It is very flexible: custom Nagios plugins can be created and shared via the open community to enhance Nagios's features.

    INFOBRIGHT

    Infobright is a columnar, MySQL-compatible analytic database.

    JASPERSOFT

    Jaspersoft is an open source set of BI (Business Intelligence) and ETL (Extract, Transform, and Load) tools, which incorporates R (the project for statistical computing) and supports Hadoop/Hive.

    R

    R is a language and environment for statistical computing and graphics (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.).

    HADOOP HARDWARE

    Hadoop Nodes

    Production Cluster

    DL380 - Master Nodes
      8 x 2.8GHz Intel, 60GB RAM
      4 x 146GB 10k SAS (RAID)
      6 x Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    R415 - Worker/Data Nodes
      12 x 2.6GHz AMD, 32GB RAM
      4 x 2TB SATA (JBOD)
      4 x Gb NICs, Mgmt Onboard
      Single Power Supply

    R515 - Access Nodes
      12 x 2.6GHz AMD, 64GB RAM
      12 x 2TB SATA (RAID)
      4 x 10Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    [Figure: server chassis diagram labeled HP ProLiant DL380 G7.]

    Backup Cluster

    R710 - Master Nodes
      8 x 2.6GHz Intel, 48GB RAM
      4 x 300GB 15k SAS (RAID)
      8 x Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    R310 - Worker/Data Nodes
      4 x 2.4GHz Intel, 8GB RAM
      4 x 2TB SATA (JBOD)
      2 x Gb NICs, No Mgmt
      Single Power Supply

    R515 - Access Nodes
      12 x 2.6GHz AMD, 64GB RAM
      12 x 2TB SATA (RAID)
      4 x 10Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    [Figure: network diagram showing the Primary Network (1GbE), Secondary Network (1GbE), and Management Network.]

    Integration/UAT Cluster

    R415 - Worker/Data Nodes
      12 x 2.6GHz AMD, 32GB RAM
      4 x 2TB SATA (JBOD)
      4 x Gb NICs, Mgmt Onboard
      Single Power Supply

    R515 - Access Nodes
      12 x 2.6GHz AMD, 64GB RAM
      12 x 2TB SATA (RAID)
      4 x 10Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    R710 - Master Nodes
      8 x 2.6GHz Intel, 48GB RAM
      4 x 300GB 15k SAS (RAID)
      8 x Gb NICs, Mgmt Onboard
      Redundant Power Supplies

    [Figure: network diagram showing the Primary Network (1GbE), Secondary Network (1GbE), and Management Network.]
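
    The worker/data node disk layout listed for these clusters (4 x 2TB SATA, JBOD) gives a rough sense of per-node storage. The arithmetic below is a sketch under stated assumptions: a replication factor of 3 is assumed (the slides do not state one), and space for the operating system, temporary data, and file system overhead is ignored. Node counts are not given in the slides, so only a single node is calculated.

```python
def node_capacity_tb(disks, disk_tb, replication=3):
    """Return (raw TB, TB of unique data) for one worker node.

    `replication` is an assumed number of copies kept of every block.
    """
    raw = disks * disk_tb
    return raw, raw / replication


# Worker/data node from the slides: 4 disks of 2 TB each.
raw_tb, unique_tb = node_capacity_tb(disks=4, disk_tb=2)
print(f"raw: {raw_tb} TB, unique data at 3x replication: {unique_tb:.2f} TB")
```

    Under these assumptions each worker node contributes 8 TB of raw disk, or a little under 2.7 TB of unique data once every block is stored three times.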

    Hadoop - Production

    Hadoop Integration/UAT and Backup
