Hadoop Reference Arch v1.0

download Hadoop Reference Arch v1.0

of 32

description

Hadoop Reference Arch v1.0

Transcript of Hadoop Reference Arch v1.0

  • Dell Next Generation Compute Solutions

    Dell | Cloudera Solution for Apache Hadoop

    Reference Architecture

    A Dell Reference Architecture Guide

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    2

    Table of Contents

    Tables 3

    Figures 3

    Overview 4

    Summary 4 Abbreviations 4

    Dell | Cloudera Hadoop Solution 5

    Solution Overview 5

    Dell | Cloudera Hadoop Solution Network Architecture 8

    Dell | Cloudera Network Segmentation 8 Dell | Cloudera Network Design Overview 8 Dell Certified Switch Solution 9 Dell Open Switch Solution 9 Dell | Cloudera NIC Teaming 10 IPv6 Capabilities 11

    Dell | Cloudera Hadoop Solution Hardware Architecture 12

    High-level Architecture 12 High-level Network Architecture 14 Hadoop Network Cable Scheme and Connections 15 Dell | Cloudera Hadoop Solution ComputePowerEdge C2100 18

    Dell | Cloudera Hadoop Solution Software Architecture 20

    Linux File System Configuration Definition 20 Disk Partitioning Recommendation for the Name Node 20 Hadoop Ecosystem Services and Utilities Mapping 20

    Dell | Cloudera Hadoop Solution Software Configuration 21

    Dell | Cloudera Configuration Parameters Recommended Values 21

    Dell | Cloudera Hadoop Solution Deployment Methodology 25

    Site Preparation Needed for the Deployment 25

    Dell | Cloudera Hadoop Solution Hardware Monitoring and Alerting 26

    Nagios 26 Ganglia 26 Cloudera Enterprise 26

    Dell | Cloudera Hadoop Solution Security Design 28 What is available in CDH3 28 Implementing Secure Hadoop 28

    Appendix A: Bill of Materials 29

    Name Node Bill of Materials 29 Slave Node Bill of Materials 29 Edge Node Bill of Materials 30 Network Connectivity Bill of MaterialsTop of Rack 31 Software 31

    Appendix B: Dell | Cloudera Hadoop Solution Components Decoder Ring 32

    Appendix C: External References 32

    To Learn More 32

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    3

    Tables

    Table 1: Dell | Cloudera Hadoop Solution Use Cases 5

    Table 2: Dell | Cloudera Hardware Configurations 13

    Table 3: Dell | Cloudera Software Locations 13

    Table 4: Dell | Cloudera Support Matrix 14

    Table 5: Dell | Cloudera Network Cabling 14

    Table 6: Hadoop Ecosystem Utilities Mapping 20

    Table 7: hdfs-site.xml 21

    Table 8: mapred-site.xml 22

    Table 9: default.xml 22

    Table 10: hadoop-env.sh 23

    Table 11: /etc/fstab 23

    Table 12: core-site.xml 23

    Table 13: /etc/security/limits.conf 23

    Figures

    Figure 1: Dell | Cloudera Hadoop Solution Taxonomy 6

    Figure 2: Dell | Cloudera Network Switch Connectivity 8

    Figure 3: Hadoop Top of Rack Network Connectivity 9

    Figure 4: Dell | Cloudera Node-Level Network Connectivity 10

    Figure 5: Dell | Cloudera Hardware Architecture 12

    Figure 6: Dell | Cloudera Compute Node Network Interconnects 14

    Figure 7: PowerEdge C2100 18

    Figure 8: Kerberos Authentication in Hadoop 28

    THIS PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerEdge are trademarks of Dell Inc. Cloudera, CDH,, Cloudera Enterprise are trademarks of Cloudera and its affiliates in the US and other countries. Intel and Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in trademarks and trade names other than its own. August 2011

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    4

    Overview

    Summary The document presents the reference architecture of the Dell | Cloudera Hadoop Solution that Dell

    designed jointly with Cloudera.

    The reference architecture introduces all the high-level components, hardware, and software that are included

    in the stack. Each high-level component is then described individually.

    Abbreviations

    Abbreviation Definition

    BMC Baseboard management controller

    DBMS Database management system

    EDW Enterprise data warehouse EoR End-of-row switch/router

    HDFS Hadoop Distributed File System

    IPMI Intelligent Platform Management

    Interface LAG Link aggregation group

    NIC Network interface card

    ToR Top-of-rack switch/router

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    5

    Dell | Cloudera Hadoop Solution

    Solution Overview Hadoop is an Apache project being built and used by a global community of contributors, using the Java

    programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively

    across its businesses. Other contributors and users include Facebook, LinkedIn, eHarmony, and eBay. Cloudera

    has created a quality controlled distribution of Hadoop and offers commercial management software, support,

    and consulting services.

    Dell developed a solution for Hadoop that includes optimized hardware, software, and services to streamline

    deployment and improve the customer experience.

    The Dell | Cloudera Hadoop Solution is based on the Cloudera CDH 3 Enterprise distribution of Hadoop. Dells solution

    includes:

    Reference architecture and best practices

    Optimized hardware and network infrastructure

    Cloudera CDH Enterprise software (CDH Community-provided for customer-deployed solutions)

    Hadoop infrastructure management tools

    Dell Crowbar software

    This solution provides Dell a foundation to offer additional solutions as the Hadoop environment evolves and

    expands.

    The solution is designed to address the following use cases:

    Table 1: Dell | Cloudera Hadoop Solution Use Cases

    Use case Description

    Data storage

    The user would like to be able to collect and store unstructured and semi-structured data

    in a fault-resilient scalable data store that can be organized and sorted for indexing and

    analysis.

    Batch processing of

    unstructured data

    The user would like to batch-process (index, analyze, etc.) large quantities of unstructured

    and semi-structured data.

    Data archive The user would like medium-term (1236 months) archival of data from EDW/DBMS to

    increase the length that data is retrained or to meet data retention policies/compliance.

    Integration with data

    warehouse

    The user would like to transfer data stored in Hadoop into a separate DBMS for advanced

    analytics. Also the user may want to transfer the data from the DBMS back to Hadoop.

    Aside from the Hadoop core technology (HDFS, MapReduce, etc.) Dell had designed additional capabilities meant to address

    specific customer needs:

    Monitoring, reporting, and alerting of the hardware and software components

    Infrastructure configuration automation

    The Dell | Cloudera Hadoop Solution lowers the barrier to adoption for organizations looking to use Hadoop in

    production. Dells customer-centered approach is to create rapidly deployable and highly optimized end-to-

    end Hadoop solutions running on commodity hardware. Dell provides all the hardware and software

    components and resources to meet your requirements, and no other supplier need be involved. Cloudera will

    provide support and software updates for the Hadoop software components within the solution.

    The hardware platform for the Dell | Cloudera Hadoop Solution (Figure 1) is the Dell PowerEdge C-series.

    Dell PowerEdge C-series servers are focused on hyperscale and cloud capabilities. Rather than emphasizing

    gigahertz and gigabytes, these servers deliver maximum density, memory, and serviceability while minimizing

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    6

    total cost of ownership. Its all about getting the processing you need in the least amount of space and in an

    energy-efficient package that slashes operational costs.

    Dell recommends Red Hat Enterprise Linux 5.6 for use in Cloudera deployments. You can choose to install

    CentOS 5.6 for user-deployed solutions.

    The recommended Java Virtual Machine (JVM) is the Oracle Sun JVM 1.6u25 or above.

    The hardware platforms, the operating system, and the Java Virtual Machine make up the foundation on which

    the Hadoop software stack runs.

    Figure 1: Dell | Cloudera Hadoop Solution Taxonomy

    The dark blue layer, depicting the Cloudera CDH components (Figure 1) comprises two frameworks:

    1. The Data Storage Framework (HDFS) is the file system that Hadoop uses to store data on the cluster nodes.

    Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system.

    2. The Data Processing Framework (MapReduce) is a massively-parallel compute framework inspired by

    Googles MapReduce papers.

    The next layer of the stack in the Dell | Cloudera Hadoop Solution design is the network layer. Dell recommends implementing

    the Hadoop cluster on a dedicated network for two reasons:

    1. Dell provides network design blueprints that have been tested and qualified.

    2. Network performance predictabilitysharing the network with other applications may have a detrimental

    impact on the performance of the Hadoop jobs.

    The next two frameworksthe Data Access Framework and the Data Orchestration Frameworkcomprise

    utilities that are part of the Hadoop ecosystem.

    Dell listened to its customers and designed a Hadoop solution that is unique in the marketplace. Dells end-to-

    end solution approach means that you can be in production with Hadoop in a shorter time than is traditionally

    possible with homegrown solutions. The Dell | Cloudera Hadoop Solution embodies all the software functions

    and services needed to run Hadoop in a production environment. One of Dells chief contributions to Hadoop

    is a method to rapidly deploy and integrate Hadoop in production. These complementary functions are

    designed and implemented side-by-side with Hadoop core technology.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    7

    Installing and configuring Hadoop is non-trivial. There are different roles and configurations that need to be

    deployed on various nodes. Designing, deploying, and optimizing the network layer to match Hadoops

    scalability requires consideration for the type of workloads that will be running on the Hadoop cluster. The

    deployment mechanism that Dell designed for Hadoop automates the deployment of the cluster from bare-

    metal (no operating system installed) all the way to installing and configuring the Cloudera software

    components to your specific requirements. Intermediary steps include system BIOS update and configuration,

    RAID/SAS configuration, operating system deployment, Hadoop software deployment, Hadoop software

    configuration, and integration with your data center applications (i.e. monitoring and alerting).

    Data backup and recovery is another topic that was brought up during customer roundtables. As Hadoop

    becomes the de facto platform for business-critical applications, the data that is stored in Hadoop is crucial for

    ensuring business continuity. Dells approach is to offer several enterprise-grade backup solutions and let the

    customer choose while providing reference architectures and deployments guides for streamlined, consistent,

    low-risk implementations.

    Lastly, Dells open, integrated approach to enterprise-wide systems management enables you to build a

    comprehensive system management solution based on open standards and integrated with industry-leading

    partners. Instead of building a patchwork of solutions leading to systems management sprawl, Dell integrates

    the management of the Dell hardware running the Hadoop cluster with the traditional Hadoop management

    consoles (Ganglia, Nagios).

    To summarize, Dell has added Hadoop to its data analytics solutions portfolio. Dells end-to-end solution

    approach means that Dell will provide readily available software interfaces for integration between the

    solutions in the portfolio.

    In the current design, the Dell | Cloudera Hadoop Solution contains the core components of a typical Hadoop

    deployment (HDFS, MapReduce, etc.) and auxiliary services (monitoring, reporting, security, etc.) that span the

    entire solution stack.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    8

    Dell | Cloudera Hadoop Solution Network Architecture

    The implementation is a multi-node, multi-rack design that uses Dell PowerEdge C series machines. The

    network infrastructure is implemented with Dell PowerConnect switches.

    Dell | Cloudera Network Segmentation The Dell | Cloudera solution implements at minimum three distinct, separate VLANs:

    Cluster Access LANconnects the compute node NICs into a fabric used for sharing data and distributing work

    tasks among compute nodes

    Cluster Management LANconnects all the BMCs in the cluster nodes

    Cluster Edge LANconnects the cluster to the outside world

    Dell | Cloudera Network Design Overview The network design for the Dell | Cloudera Hadoop Solution is based on two assumptions:

    1. The majority of the traffic will be between the nodes within the cluster (on the Cluster Access LAN).

    2. Since Hadoop implements logical proximity between nodes at the application level, there is no need to segment

    the network at the rack (or group of racks) level; thus the network within the cluster can be flat.

    The easiest way to build a flat network is by stacking the switches. A potential problem with stacking a large

    number of switches together is the potential need for traffic to traverse the ring. As the ring extends the traffic

    must pass through more and more switches to reach its destination. Without top-of-rack switches being

    stacked, the traffic between racks will need to travel up to the core router and then back down to the

    destination rack. As this is only one Layer 2 hop away, the uplinks have to share this traffic with traffic that

    originates outside the environment.

    Additionally, the benefits of the switch stack are retained however the drawbacks are mitigated.

    Improved manageability: All switches in the stack are managed as a single unit.

    Efficient Spanning Tree: The stack is viewed as a single switch by the Spanning Tree Protocol.

    Link aggregation: Stacking multiple switches in a chassis allows a link aggregation group (LAG) across ports on

    different switches in the stack.

    Reduced network traffic: Traffic between the individual switches in a stack is passed across the stacking cable,

    reducing the amount of traffic passed upstream to network distribution switches.

    Higher speed: The stacking module supports a higher data rate than the 10GbE uplink module (supports 12Gb per

    stack port offering 24Gb between switches)

    Lower cost: Uplink ports are shared by all switches in the stack, reducing the number of distribution switch ports

    necessary to connect modular servers to the network.

    Simplified updates: The basic firmware management commands will propagate new firmware versions and boot

    image settings to all switch stack members.

    Figure 2 depicts the connections required to scale above a single stack of switches for the Hadoop cluster

    interconnect. The top-of-rack switches depicted are connected redundantly to the six stacked Gigabit

    Ethernet switches that provide network connectivity to the Hadoop cluster nodes.

    Figure 2: Dell | Cloudera Network Switch Connectivity

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    9

    Dell recommends Dell PowerConnect 6248 switches. The six stacked switches provide Ethernet

    connectivity for three racks of servers. The customer can install up to 20 servers in each rack, which means

    that the six stacked switches can connect up to 60 servers. Large-scale implementations that require more

    than 60 machines need to consider multiple 6-switch stacks, each stack running no more than 60 machines

    each. These large-scale implementations also require a pair of core routers (or end-of-row routers) to connect

    all stacks together. Each switch in each stack need to uplink to the core routers using 10GB links as

    represented in Figure 2. There is no need for core routers for implementations smaller than 60 nodes.

    Dells recommendation for core routers includes Arista Networks 7100 Series or 7500 Series 10Gb switches.

    Next figure shows the connectivity between the Hadoop Name Nodes, Edge Node and the switches. The Edge

    Node acts as gateway between data stored in Hadoop and users connected on the Corporate Network.

    Hence, note that the Edge Node is connected to the cluster network (via stacked switches) and the corporate

    network. You have the option of removing the Edge Nodes and directly connecting the Hadoop network to

    your organizations network.

    Figure 3: Hadoop Top of Rack Network Connectivity

    The BMC of each node connects to the Cluster Management LAN using a dedicated (non-shared) NIC.

    VLAN Number Use Connectivity

    Untagged BMC Ports

    10 Hadoop Data Network Slave Node LAG groups

    Dell Certified Switch Solution This Reference Architecture includes the use of Dell PowerConnect 6248 Gigabit Ethernet switches as the

    top-of-rack connectivity to all Hadoop related nodes. The reference architecture is used to ensure consistency

    in rapid deployments through the minimal differences in the network configuration.

    Dell Open Switch Solution In addition to the Dell switch-based reference architecture, Dell provides an open standard that allows you to

    choose other brands and configurations of switches for your Hadoop environment. The following list of

    requirements will enable other brands of switches to properly operate with the tools and configurations in the

    Dell Hadoop Reference Architecture.

    Support for IEEE 802.1Q VLAN traffic and port tagging

    Ability to provide a minimum of 170 Gigabit Ethernet ports in a non-blocking configuration within VLAN 10

    o Configuration can be a single switch or a combination of stacked switches to meet the additional

    requirements

    o The ability to create link aggregation groups (LAGs) with a minimum of two physical links in each LAG

    Corporate Network

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    10

    If multiple switches are stacked:

    o The ability to create a LAG across stacked switches is required

    o Full-bisection bandwidth

    o Support for VLANs to be available across all switches in the stack

    The ability to provide a minimum 65 10/100 Ethernet ports on the untagged VLAN

    250,000 packets-per-second capability per switch

    The ability to provide 12 10Gb ports for redundant uplinks contained in VLAN 10

    A managed switch that supports SSH and Serial line configuration

    SNMP v3 support

    Figure 4: Dell | Cloudera Node-Level Network Connectivity

    Dell | Cloudera NIC Teaming Since the network must be able to handle the ever-increasing volume of data, Dell recommends using

    network teaming to increase the available bandwidth to each node in the Hadoop cluster. Teaming the NICs

    for bandwidth allows both NICs to be active and the operating system to balance the network traffic across all

    NICs in the team.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    11

    The teaming configuration that Dell recommends is balanced-alb (mode = 6). This configuration setting is

    explained in greater deal in the Hadoop Solution Deployment Guide. The Dell Crowbar deployment software

    automatically configures this setting for Hadoop environments.

    IPv6 Capabilities At this time, the Dell | Cloudera Hadoop Solution does not support or allow for the use of IPv6 for network

    connectivity. All deployments are based on IPv4 with IPv6 explicitly disabled on all nodes within the Hadoop

    environment.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    12

    Dell | Cloudera Hadoop Solution Hardware Architecture

    The Dell | Cloudera hardware contains recommended configurations for the following services within a Hadoop cluster:

    Master Nodesometimes called Name Node, it runs all the services needed to manage the HDFS data storage and

    MapReduce task distribution and tracking.

    Slave Noderuns all the services required to store blocks of data on the local hard drives and execute processing

    tasks against that data

    Edge Nodeprovides the interface between a data and processing capacity available in the Hadoop cluster and a user

    of that capacity

    Admin Nodeprovides cluster deployment and management capabilities

    Figure 5: Dell | Cloudera Hardware Architecture

    High-level Architecture

    Dell | Cloudera Sizing Terms

    The Dell | Cloudera Reference Architecture is organized into three components for sizing as the Hadoop

    environment grows. The smallest designation is rack, moving to a pod, and then into a cluster. Each has

    specific characteristics and sizing considerations documented in this reference architecture. The design goal

    for the Hadoop environment is to enable you to scale the environment by adding the additional capacity as

    needed, without the need to replace any existing components.

    Rack

    A rack is the smallest designation of a Hadoop environment. A rack consists of all the necessary power,

    network cabling, and two Ethernet switches necessary to support up to 20 data nodes. These nodes should

    utilize their own power connectivity and space within the data center, separate from other racks, and be

    treated as a fault zone.

    Pod

    A pod is a single set of stacked Ethernet switches. In the case of this reference architecture, the maximum and

    minimum will be six. A pod consists of the administration and operation infrastructure to support three racks.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    13

    Cluster

    A cluster is a set of greater than one pod and a maximum of 12 pods. A cluster is a set of Hadoop nodes that

    share the same Name Node and management tools for operating the Hadoop environment.

    Table 2: Dell | Cloudera Hardware Configurations

    Machine Function Master Node (Admin Node) Data Node (Slave Node) Edge Node

    Platform PowerEdge C2100 PowerEdge C2100 PowerEdge C2100

    CPU 2x E5645 (6-core) 2x E5606 (4-core)

    (optional 2x E5645 6-core) 2x E5645 (6-core)

    RAM (Minimum) 96GB 24GB 48GB

    Add-in NIC One dual-port Intel 1GigE None One dual-port Intel 10GigE

    DISK 6x 600GB SAS NL3.5 12x 1TB SATA 7.2K 3.5 6x 600GB SAS NL3.5

    Storage Controller PERC H700 LSI2008 PERC H700

    RAID RAID 10 JBOD RAID 10

    Min per Rack 1

    Max Per Rack 20

    Min per Pod 2 3* 1

    Max per Pod 2 60

    Min per cluster 2 36 1

    Max per Cluster 2 720

    * A minimum of five Data Nodes are needed if ZooKeeper will be used in the environment.

    Table 3: Dell | Cloudera Hadoop Solution Software Locations

    Daemon Primary Location Secondary Location

    JobTracker MasterNode02 MasterNode01

    TaskTracker SlaveNode(x)

    SlaveNode SlaveNode(x)

    NameNode MasterNode01 MasterNode02

    Operating System Provisioning MasterNode02 MasterNode01

    Chef MasterNode02 MasterNode01

    Yum Repositories MasterNode02 MasterNode01

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    14

    Table 4: Dell | Cloudera Hadoop Solution Support Matrix

    RA Version OS Version Hadoop Version Available Support Supported JVM

    1.0 Red Hat Enterprise Linux

    5.6

    Cloudera CDH3

    Enterprise

    Dell Hardware Support

    Cloudera Hadoop

    Support

    Red Hat Linux Support

    Sun Oracle JVM 1.6u20, u21,

    u23, u24, u25

    1.0 CentOS 5.6 Cloudera CDH3

    Community Dell Hardware Support

    Sun Oracle JVM 1.6u20, u21,

    u23, u24, u25

    High-level Network Architecture The network interconnects between various hardware components of the cloud solution is depicted in Figure

    6.

    Figure 6: Dell | Cloudera Compute Node Network Interconnects

    The network cabling within the Dell | Cloudera Hadoop Solution is described in the following table.

    Table 5: Dell | Cloudera Hadoop Solution Network Cabling

    Component NICs to Switch Port

    LOM1 LOM2 PCI-NIC1 PCI-NIC2 BMC

    Master Node

    Data Node N/A N/A

    Edge Node

    Legend

    Cluster Production LAN

    Cluster Management LAN

    Cluster Edge LAN

    NI

    NI

    BM

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    15

    Hadoop Network Cable Scheme and Connections

    Node Connection Switch Port

    rNN-n01 LOM1 rNN-sw01 1

    BMC rNN-sw01 25

    LOM2 rNN-sw02 1

    rNN-n02 LOM1 rNN-sw01 2

    BMC rNN-sw01 26

    LOM2 rNN-sw02 2

    rNN-n03 LOM1 rNN-sw01 3

    BMC rNN-sw01 27

    LOM2 rNN-sw02 3

    rNN-n04 LOM1 rNN-sw01 4

    BMC rNN-sw01 28

    LOM2 rNN-sw02 4

    rNN-n05 LOM1 rNN-sw01 5

    BMC rNN-sw01 29

    LOM2 rNN-sw02 5

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    16

    rNN-n06 LOM1 rNN-sw01 6

    BMC rNN-sw01 30

    LOM2 rNN-sw02 6

    rNN-n07 LOM1 rNN-sw01 7

    BMC rNN-sw01 31

    LOM2 rNN-sw02 7

    rNN-n08 LOM1 rNN-sw01 8

    BMC rNN-sw01 32

    LOM2 rNN-sw02 8

    rNN-n09 LOM1 rNN-sw01 9

    BMC rNN-sw01 33

    LOM2 rNN-sw02 9

    rNN-n10 LOM1 rNN-sw01 10

    BMC rNN-sw01 34

    LOM2 rNN-sw02 10

    rNN-n11 LOM1 rNN-sw01 11

    BMC rNN-sw01 35

    LOM2 rNN-sw02 11

    rNN-n12 LOM1 rNN-sw01 12

    BMC rNN-sw01 36

    LOM2 rNN-sw02 12

    rNN-n13 LOM1 rNN-sw01 13

    BMC rNN-sw01 37

    LOM2 rNN-sw02 13

    rNN-n14 LOM1 rNN-sw01 14

    BMC rNN-sw01 38

    LOM2 rNN-sw02 14

    rNN-n15 LOM1 rNN-sw01 15

    BMC rNN-sw01 39

    LOM2 rNN-sw02 15

    rNN-n16 LOM1 rNN-sw01 16

    BMC rNN-sw01 40

    LOM2 rNN-sw02 16

    rNN-n17 LOM1 rNN-sw01 17

    BMC rNN-sw01 41

    LOM2 rNN-sw02 17

    rNN-n18 LOM1 rNN-sw01 18

    BMC rNN-sw01 42

    LOM2 rNN-sw02 18

    rNN-n19 LOM1 rNN-sw01 19

    BMC rNN-sw01 43

    LOM2 rNN-sw02 19

    rNN-n20 LOM1 rNN-sw01 20

    BMC rNN-sw01 44

    LOM2 rNN-sw02 20

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    17

    End-of-Row Switch Port Connectivity

    POD Number ToR Switch ToR Switch Port EoR Switch EoR Switch Port

    1 r01-s01 10GbE1 Eor-row01-sw01 1

    1 r01-s01 10GbE2 Eor-row01-sw02 1

    1 r01-s02 10GbE1 Eor-row01-sw01 2

    1 r01-s02 10GbE2 Eor-row01-sw02 2

    1 r02-s01 10GbE1 Eor-row01-sw01 3

    1 r02-s01 10GbE2 Eor-row01-sw02 3

    1 r02-s02 10GbE1 Eor-row01-sw01 4

    1 r02-s02 10GbE2 Eor-row01-sw02 4

    1 r03-s01 10GbE1 Eor-row01-sw01 5

    1 r03-s01 10GbE2 Eor-row01-sw02 5

    1 r03-s02 10GbE1 Eor-row01-sw01 6

    1 r03-s02 10GbE2 Eor-row01-sw02 6

    2 r01-s01 10GbE1 Eor-row01-sw01 7

    2 r01-s01 10GbE2 Eor-row01-sw02 7

    2 r01-s02 10GbE1 Eor-row01-sw01 8

    2 r01-s02 10GbE2 Eor-row01-sw02 8

    2 r02-s01 10GbE1 Eor-row01-sw01 9

    2 r02-s01 10GbE2 Eor-row01-sw02 9

    2 r02-s02 10GbE1 Eor-row01-sw01 10

    2 r02-s02 10GbE2 Eor-row01-sw02 10

    2 r03-s01 10GbE1 Eor-row01-sw01 11

    2 r03-s01 10GbE2 Eor-row01-sw02 11

    2 r03-s02 10GbE1 Eor-row01-sw01 12

    2 r03-s02 10GbE2 Eor-row01-sw02 12

    3 r01-s01 10GbE1 Eor-row01-sw01 13

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    18

    3 r01-s01 10GbE2 Eor-row01-sw02 13

    3 r01-s02 10GbE1 Eor-row01-sw01 14

    3 r01-s02 10GbE2 Eor-row01-sw02 14

    3 r02-s01 10GbE1 Eor-row01-sw01 15

    3 r02-s01 10GbE2 Eor-row01-sw02 15

    3 r02-s02 10GbE1 Eor-row01-sw01 16

    3 r02-s02 10GbE2 Eor-row01-sw02 16

    3 r03-s01 10GbE1 Eor-row01-sw01 17

    3 r03-s01 10GbE2 Eor-row01-sw02 17

    3 r03-s02 10GbE1 Eor-row01-sw01 18

    3 r03-s02 10GbE2 Eor-row01-sw02 18

    Dell | Cloudera Hadoop Solution ComputePowerEdge C2100 The Dell PowerEdge C2100 server is focused on Hadoop, MapReduce, and high-performance Web

    applications where high spindle count and high memory density are key. The PowerEdge C2100 offers

    multiple backplane and drive configurations to provide you with the flexibility to meet your application needs.

    The Dell Hadoop solution includes hardware and software support using Dells world-class, worldwide services

    organization. When you use a Dell | Cloudera Hadoop Solution, all the components of your solution are tested,

    validated, and supported by Dell.

    Figure 7: PowerEdge C2100

    PowerEdge C2100 feature summary:

    Two-socket Intel Xeon 5600 series processors for ultimate performance

    18 DDR3 memory slots that enable larger memory footprints while utilizing smaller memory modules

    12 3.5-inch SAS/SATA and two 2.5-inch internal SATA/SSD

    Dual redundant or single 750W high-efficiency power supply options

    Intel 82576 dual-port embedded Gigabit Ethernet NIC with industry leading virtualization performance

    IPMI 2.0-compliant BMC management

    6Gb SAS hard drive and controller support

    Two x8 PCI-E Gen2 slots and two x4 PCI-E Gen2 dedicated mezzanine slots

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    19

    Rack installation, Dell Basic and ProSupport options, next-business-day support only

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    20

    Dell | Cloudera Hadoop Solution Software Architecture

    Linux File System Configuration Definition Dell recommends and supports the use of ext3 for all HDFS disks.

    Disk Partitioning Recommendation for the Name Node There are two types of services running on the Name Node:

    JobTracker (Supports MapReduce job distribution)

    NameNode (Supports HDFS data storage)

    There are two types of services running on the Slave Node:

    TaskTracker Daemon (to support MapReduce job execution)

    SlaveNode Daemon (to support HDFS data storage)

    All disk configuration parameters are documented in the Dell | Cloudera Hadoop Solution Deployment Guide,

    as well as Linux Kickstart scripts for proper configuration at the time of operating system installation.

    Hadoop Ecosystem Services and Utilities Mapping

    Table 6: Hadoop Ecosystem Utilities Mapping

    Component Master Node Slave Node Edge Node Utilize From Administer From

    Pig X X X Edge Node Edge Node

    Hive Edge Node Edge Node

    Sqoop X Edge Node Edge Node

    Zookeeper X (5) Edge Node Edge Node

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    21

    Dell | Cloudera Hadoop Solution Software Configuration

    Dell | Cloudera Hadoop Solution Configuration Parameters Recommended Values

    Table 7: hdfs-site.xml

    Property Description Value

    dfs.block.size Lower value offers parallelism 134217728 (128Mb)

    dfs.name.dir Comma-separated list of folders (no space) where

    a SlaveNode stores its blocks Cluster specific

    dfs.datanode.handler.count Number of handlers dedicated to serve data block

    requests in Hadoop SlaveNodes

    16

    (start 2 x CORE_COUNT in each SlaveNode )

    dfs.namenode.handler.count More Master Node server threads to handle RPCs

    from large number of SlaveNodes

    Start with 10, increase large clusters

    (higher count will drive higher CPU, RAM

    and Network utilization)

    dfs.SlaveNode.du.reserved The amount of space on each storage volume

    which HDFS should not use. 10M

    dfs.replication Data replication factor. Default is 3. 3 (default)

    fs.trash.interval Time interval between HDFS space reclaiming ( 1440 (minutes)

    dfs.permissions true (default)

    dfs.datanode.handler.count 8

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    22

    Table 8: mapred-site.xml

    Table 9: default.xml

    Property Description Value

    SCAN_IPC_CACHE_LIMIT

    Number of rows cached in search

    engine for each scanner next call over

    the wire. It reduces the network round

    trip by 300 times caching 300 rows in

    each trip.

    100

    LOCAL_JOB_HANDLER_COUNT

    Number of parallel queries executed at

    one go. Query requests above than this

    limit gets queued up.

    30

    Property Description Value

    mapred.child.java.opts Larger heap-size for child JVMs of

    maps/reduces. -Xmx1024M

    mapred.job.tracker Hostname or IP address and port of the

    JobTracker. TBD

    mapred.job.tracker.handler.count More JobTracker server threads to handle

    RPCs from large number of TaskTrackers.

    Start with 32, increase large clusters

    (higher count will drive higher CPU, RAM

    and Network utilization)

    mapred.reduce.tasks The number of Reduce tasks per job. Set to a prime close to the number of

    available hosts

    mapred.local.dir

    Comma-separated list of folders (no space)

    where a TaskTracker stores runtime

    information

    Cluster-specific

    mapred.tasktracker.map.tasks.maximu

    m

    Maximum number of map tasks to run on

    the node 2 + (2/3) * number of cores per node

    mapred.tasktracker.reduce.tasks.maxim

    um

    Maximum number of reduce tasks to run

    per node 2 + (1/3) * number of cores per node

    mapred.child.ulimit 2097152

    mapred.map.tasks.speculative.executio

    n FALSE

    mapred.reduce.tasks.speculative.execut

    ion FALSE

    mapred.job.reuse.jvm.num.tasks -1

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    23

    Table 10: hadoop-env.sh

    Property Description Value

    java.net.preferIPv4Stack true

    JAVA_HOME

    HADOOP_*_OPTS -Xmx2048m

    Table 11: /etc/fstab

    Property Description Value

    File system mount options data=writeback,nodiratime, noatime

    Table 12: core-site.xml

    Property Description Value

    io.file.buffer.size

    The size of buffer for use in sequence

    files. The size of this buffer should

    probably be a multiple of hardware

    page size (4096 on Intel x86), and it

    determines how much data is buffered

    during read and write operations.

    65536 (64Kb)

    fs.default.name

    The name of the default file system. A

    URI whose scheme and authority

    determine the file system

    implementation.

    Example:

    hdfs://someserver.example.com:8020/

    fs.checkpoint.dir

    Comma-separated list of directories on

    the local file system of the Secondary

    Master Node where its checkpoint

    images are stored

    TBD

    io.sort.factor 80

    Io.sort.mb 512

    Table 13: /etc/security/limits.conf

    Property Description Value

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    24

    mapred nofile 32768

    hdfs nofile 32768

    hbase nofile 32768

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    25

    Dell | Cloudera Hadoop Solution Deployment Methodology

    Site Preparation Needed for the Deployment The heating, ventilation, air conditioning (HVAC), and power requirement can be estimated using the Dell

    Energy Smart Solution Advisor at:

    http://www.dell.com/content/topics/topic.aspx/global/products/pedge/topics/en/config_calculator?c=us&cs=555&l=en&s

    =biz

    Using this tool, you can plan the needs for your solution, order the correct PDUs, and ensure that the proper

    HVAC is ready for the installation.

    Detailed deployment instructions are documented in the Dell | Cloudera Hadoop Solution Deployment Guide.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    26

    Dell | Cloudera Hadoop Solution Hardware Monitoring and Alerting

    To automate the alert and response to unexpected events and failures within the Dell | Cloudera Hadoop Solution, the software

    stack includes Nagios and Ganglia. The Dell | Cloudera Hadoop Solution includes capabilities for three primary components of

    the monitoring environment:

    Monitoring of cluster activitiesThe Dell | Cloudera Hadoop Solution utilizes Nagios to monitor the cluster,

    including hardware, software, and users. The Nagios deployment as part of the Dell | Cloudera Hadoop Solution

    will keep historical information regarding system availability, maintenance, and failure events.

    Alerting of unexpected eventsThe Dell | Cloudera Hadoop Solution utilizes Nagios to alert system operations staff

    to events that occur that are unexpected from normal operation and that the administrator has designated for

    notification.

    Debugging of cluster runtime operations The Dell | Cloudera Hadoop Solution utilizes Cloudera Enterprise to

    provide the users and administrators of the Hadoop environment with the necessary tools for tracking, debugging,

    and monitoring job performance and characteristics.

    The Dell | Cloudera Hadoop Solution is designed to include the necessary components to monitor and

    respond to events in your Hadoop environment, while being flexible enough to allow integration with existing

    operations management frameworks in your environment.

    The monitoring components of the Dell | Cloudera Hadoop Solution Reference Architecture are designed to

    be proactive in nature, alerting the IT operations team when failures in the environment occur, but before they

    cause an outage that affects product workloads and users.

    Nagios Nagios is an open source solution for enterprise monitoring. It utilizes a pluggable architecture to allow for

    consistent event handling, while supporting a wide variety of sensors, plug-ins, applications, servers, and

    hardware platforms.

    The Dell | Cloudera Hadoop Solution includes Nagios as part of all default installations. The Dell | Cloudera

    Hadoop Solution will automatically install the Nagios console and the necessary Nagios plug-ins for

    monitoring the Hadoop cluster, including processes, operating systems, and physical servers.

    Ganglia Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters

    and grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used

    technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for

    data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low

    per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set

    of operating systems and processor architectures, and is currently in use on thousands of clusters around the

    world. It has been used to link clusters across university campuses and around the world and can scale to

    handle clusters with 2,000 nodes.

    The Dell | Cloudera Hadoop Solution automates the installation and configuration of Ganglia within the

    Hadoop cluster, enabling IT operations staff detailed reporting on the status and utilization of all Hadoop

    nodes.

    Cloudera Enterprise Cloudera Enterprise is a subscription service that comprises Cloudera support and a portfolio of software,

    including Cloudera Management Suite, which enables data-driven enterprises to run Apache Hadoop

    environments in production cost-effectively and with repeatable success. By combining expert support with a

    software layer that delivers deep visibility into and across Hadoop clusters, Cloudera Enterprise gives you an

    efficient way to precisely provision and manage cluster resources. It also allows your IT shop to apply familiar

    business metricssuch as measurable SLAs and chargebacksto your Hadoop environment so it can run at

    optimal utilization.

    Cloudera Enterprise consists of two core components:

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    27

    Cloudera Management Suite (CMS)

    As Hadoop is increasingly used as a shared platform for multiple applications, organizations require the same manageability

    characteristics that they expect from other popular enterprise technologies. Cloudera meets this need with a management

    suite that helps your organization improve quality of service, increase compliance, and reduce administrative costs.

    The Cloudera Management Suite includes:

    Service and Configuration Manager (SCM)

    Deploy Hadoop in minutes.

    Manage system services, automate changes and validate settings.

    Activity Monitor

    Consolidate all user activities into a single, real-time view.

    Diagnose user performance and track activity metrics.

    Resource Manager

    Report system resources usage.

    Plan for capacity expansion.

    Authorization Manager

    Centralize management of all users, groups and privileges.

    Manage permissions via delegated administration.

    Cloudera Support

    As the use of Hadoop grows and an increasing number of groups and applications move into production, your

    internal customers will expect greater levels of performance and consistency. Clouderas proactive

    production-level support gives your administrators the expertise and responsiveness they need.

    Cloudera Support includes:

    Flexible Support Windows

    Choose 85 or 247 to meet SLA requirements.

    Configuration Checks

    Verify that your Hadoop cluster is fine tuned for your environment.

    Escalation and Issue Resolution

    Resolve support cases with maximum efficiency.

    Comprehensive Knowledgebase

    Expand your Hadoop knowledge with hundreds of articles and tech notes.

    Support for Certified Integration

    Connect your Hadoop cluster to your existing data analysis tools.

    Proactive Notification

    Stay up to speed with new developments and events.

    With Cloudera Enterprise, you can leverage your existing teams experience and Clouderas expertise to

    operationalize your Hadoop system with ease. Built-in predictive capabilities anticipate shifts in the Hadoop

    infrastructure, helping to ensure reliable operation.

    Cloudera Enterprise makes it easy to run open source Hadoop in production:

    Simplify and accelerate Hadoop deployment.

    Reduce the costs and risks of adopting Hadoop in production.

    Reliably operate Hadoop in production with repeatable success.

    Apply SLAs to Hadoop.

    Increase control over Hadoop cluster provisioning and management.

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    28

    Dell | Cloudera Hadoop Solution Security Design

    What is available in CDH3 Clouderas CDH3 release offers the following security features:

    There are two levels of authentication:

    o Cluster: SlaveNode to NameNode, TaskTracker to JobTracker

    o User: Unix-style file permissions

    NOTE: Access to cluster can be restricted to Kerberos-authorization users.

    Sqoop and Pig support security with no configuration required.

    ZooKeeper will operate normally in an unsecured mode with a secure Hadoop cluster.

    All cluster nodes need direct access to the Kerberos server.

    Implementing Secure Hadoop https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide

    Figure 8: Kerberos Authentication in Hadoop

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    29

    Appendix A: Bill of Materials

    Name Node Bill of Materials

    Component Description SKU

    PowerEdge C2100 PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives Redundant Power Supplies

    224-8350

    Processor 2x Intel

    Xeon

    E5640 2.66Ghz, 12M

    Cache,Turbo, HT, 1066MHz Max Mem 317-4110

    Memory 96GB Memory (12x8GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized

    317-6324

    Operating System No Factory Installed Operating System

    OS Media None

    Documentation/Disks C2100 Documentation 330-8774

    Rails C2100 Sliding Rail Kit 330-8520

    Power Cords 2XC13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord

    330-6870

    Hard Drive Controller and Configuration

    Add-in LSI 9260-8i controllers for up to 12 HP Drives total

    342-1529, 342-0993

    Hardware Support Services 3 Year ProSupport and NBD On-site Service

    908-3264, 908-3304, 909-1668, 909-1677, 926-4080

    Installation Services CUSTOM INSTALLATION SOLUTION REQUIRED

    Keep Your Hard Drive None

    Dell Recycling None

    2nd Controller None

    Bios None

    Hard Drives - DellSta 600GB 15K RPM Serial-Attach SCSI 6Gbps 3.5in Hot Plug Hard Drive

    342-1544

    Network Card Intel Gigabit ET Dual Port 1GbE, PCIe x4 430-0937

    Slave Node Bill of Materials

    Component Description SKU

    PowerEdge C2100 PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives Single Power Supply

    224-8331

    Processor 2x Intel Xeon E5640 2.66Ghz, 12M Cache,Turbo, HT, 1066MHz Max Mem

    317-4110

    Memory 24GB Memory (6x4GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized

    317-6324

    Operating System No Factory Installed Operating System

    OS Media None

    Documentation/Disks C2100 Documentation 330-8774

    Rails C2100 Sliding Rail Kit 330-8520

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    30

    Power Cords C13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord

    330-6870

    Hard Drive Controller and Configuration

    Add-in 6Gb SAS Mezzanine controllers for up to 12 HP Drives total

    342-0989, 342-0990

    Hardware Support Services 3 Year ProSupport and NBD On-site Service

    908-3264, 908-3304, 909-1668, 909-1677, 926-4080

    Installation Services CUSTOM INSTALLATION SOLUTION REQUIRED

    Keep Your Hard Drive None

    Dell Recycling None

    2nd Controller None

    Bios None

    Hard Drives - DellSta 2TB 7.2K RPM SATA 3.5in Hot Plug Hard Drive

    342-0935

    Network Card None

    Edge Node Bill of Materials

    Component Description SKU

    PowerEdge C2100 PowerEdge C2100 Expander Backplane support for 3.5-inch Hard Drives Redundant Power Supplies

    224-8350

    Processor 2x Intel

    Xeon

    E5640 2.66Ghz, 12M

    Cache,Turbo, HT, 1066MHz Max Mem 317-4110

    Memory 24GB Memory (6x4GB), 1333MHz Dual Ranked RDIMMs for 2 Processors, Optimized

    317-6324

    Operating System No Factory Installed Operating System

    OS Media None

    Documentation/Disks C2100 Documentation 330-8774

    Rails C2100 Sliding Rail Kit 330-8520

    Power Cords 2XC13 to C14, PDU Style, 12 AMP, 2 Feet (.6m), Power Cord

    330-6870

    Hard Drive Controller and Configuration

    Add-in LSI 9260-8i controllers for up to 12 HP Drives total

    342-1529, 342-0993

    Hardware Support Services 3 Year ProSupport and NBD On-site Service

    908-3264, 908-3304, 909-1668, 909-1677, 926-4080

    Installation Services CUSTOM INSTALLATION SOLUTION REQUIRED

    Keep Your Hard Drive None

    Dell Recycling None

    2nd Controller None

    Bios None

    Hard Drives - DellSta 600GB 15K RPM Serial-Attach SCSI 6Gbps 3.5in Hot Plug Hard Drive

    342-1544

    Network Card Intel 82599 Dual Port 10GE Mezzanine Card

    342-0727

    Network Card Intel Gigabit ET Dual Port 1GbE, PCIe x4 430-0937

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    31

    Network Connectivity Bill of MaterialsTop of Rack

    Component Description SKU

    PowerConnect 6248P PowerConnect 6248P, 48 GbE Ports, Managed Switch, 10GbE and Stackg Capable

    Front-end SFP Fiber Transceivers

    None

    Modular Upgrade Bay 1: Modules

    Stacking Module, 48Gbps, Includes 1m Stacking Cable

    Modular Upgrade Bay 1: Optics None

    Modular Upgrade Bay 2: Modules

    None

    Modular Upgrade Bay 2: Optics None

    Cables Stacking Cable, 3m

    External Redundant Power Supply

    None

    Hardware Support Services 3 Year ProSupport and NBD On-site Service

    Installation Services No Installation Services Selected

    Asset Recovery Services None

    Software Package Description SKU

    RHEL 5.6 Red Hat Enterprise Linux 5.6, 1yr

    Subscription

    CDH Community 331-4370

    CDH Enterprise 5x8 Per Node 331-4371

    CDH Enterprise 24x7 Per Node 331-4372

    Cloudera Developer Training

    and Certification 331-4373

    Cloudera System Administrator

    Training and Certification 331-4374

    Cloudera Hbase Training 331-4375

    Cloudera Hive & Pig Training 331-4376

    Cloudera Essentials for

    Managers Training 331-4377

    Hadoop Informational SKU 331-3282

  • Dell | Cloudera Hadoop Solution Reference Architecture Guide v1.0

    32

    Appendix B: Dell | Cloudera Hadoop Solution Components Decoder Ring

    (source: http://hadoop.apache.org/)

    1. Hadoop: http://en.wikipedia.org/wiki/Hadoop

    2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to

    application data

    (http://en.wikipedia.org/wiki/Hadoop_Distributed_Filesystem#Hadoop_Distributed_File_System)

    3. MapReduce: a software framework for distributed processing of large data sets on compute clusters.

    (http://en.wikipedia.org/wiki/MapReduce)

    4. Avro: a data serialization system

    5. Chukwa: a data collection system for managing large distributed systems

    6. HBase: a scalable, distributed database that supports structured data storage for large tables

    7. Hive: a data warehouse infrastructure that provides data summarization and ad-hoc querying

    8. ZooKeeper: a high-performance coordination service for distributed applications

    9. Pig: a platform for analyzing large data sets that consists of high-level language for expressing data analysis

    programs, coupled with infrastructure for evaluating these programs.

    10. Sqoop (from Cloudera): a tool designed to import data from relational databases into Hadoop. Sqoop uses

    JDBC to connect to a database.

    11. Flume (from Cloudera): a distributed service for collecting, aggregating and moving large amounts of log

    data. Its architecture is based on streaming data flows.

    12. Crowbara Dell provided, supported, and maintained toolset for system deployment and configuration

    automation. Crowbar supports the bare-metal bring-up of new hardware and configuration management of

    existing hardware.

    Appendix C: External References

    Nagios: http://www.nagios.org

    Ganglia: http://ganglia.sourceforge.net/

    Cloudera: http://www.cloudera.com

    To Learn More

    For more information on the Dell | Cloudera Hadoop Solution, visit:

    www.Dell.com/Hadoop 2011 Dell Inc. All rights reserved. Trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Specifications are correct at date of publication but are subject to availability or change without notice at any time. Dell and its affiliates cannot be responsible for errors or omissions in typography or photography. Dells Terms and Conditions of Sales and Service apply and are available on request. Dell service offerings do not affect consumers statutory rights. Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.