
CEPH REFERENCE ARCHITECTURE

Single Rack Object Store

Copyright, Inktank Storage Inc. 2013


    Summary

This document presents a reference architecture for a small digital content repository, designed for simplicity and low cost while still delivering moderate throughput and high reliability. A good example might be an Indie film production company that needs:

- highly reliable storage for their valuable raw footage and edited results
- high performance temporary storage for their editing and rendering tools
- a system that can be built for a very low initial cost, and operated inexpensively
- a system that can grow incrementally as they grow

    There are many other applications, with similar needs, for which this system would also be appropriate:

- storage for work-group collaboration products
- photo, video, and music storage for a web site
- archival storage for on-line backups or data-tiering

It is also well suited as a proof-of-concept implementation for a much larger system: a small system on which performance, reliability, and operational scenario testing can be performed to validate its suitability for a much larger deployment.

This system requires only a single 10G switch, simple networking, and a single Ceph Object Gateway. Higher throughput and availability can be obtained by adding additional switches, networks, object gateways, and load balancers.

Structure of this document

1. brief overview of the use case, key system characteristics, and the hardware, software and networking components
2. detailed discussion of the servers and networking, and why those choices are right for this use case
3. discussion of the recommended software distributions, versions, and configuration
4. brief overview of Inktank Proof of Concept and product support services

Intended Audience for this Document

Solution architects and system administrators tasked with designing and deploying Ceph-based storage solutions will benefit from studying the design considerations of this reference architecture. Developers looking to improve their content repository solution by integrating with Ceph can also get a sense of how the storage subsystem will be deployed.

Brief Description of System

The storage sub-system presents S3-compatible RESTful APIs to the repository management software running on one or more servers. The Ceph system is built across four commodity servers, each holding twelve 3TB SATA drives. This provides the studio with:

- 35TB of editing scratch space
- 20TB (6,000 hours) of triple-replicated, high-resolution film projects
- sufficient free space to maintain this redundancy after the failure of any server

Such a system should be able to service up to 2000 storage requests per second. Streaming write throughput is expected to reach approximately 200MB/s and reads up to 600MB/s.
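As a back-of-the-envelope check (an illustrative calculation, not part of the original document), the capacities above fit within the raw storage as follows:

    raw capacity          = 4 nodes x 12 disks x 3TB   = 144TB
    replicated projects   = 20TB x 3 copies            =  60TB
    unreplicated scratch  = 35TB x 1 copy              =  35TB
    remaining free space  = 144TB - 60TB - 35TB        = ~49TB

The roughly 49TB of headroom exceeds the 36TB of raw capacity held by any single node, which is what allows the cluster to re-establish full redundancy after a complete server failure.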

This system could easily be expanded to five times this capacity and throughput, adding only additional servers. Further growth would also be incremental, but would require additional racks and switches.


Summary Diagram

The following diagram shows the logical components of the system: four applications which are consumers and producers of data, and the storage sub-system composed of four machines.

1. Solution Overview

This chapter provides an overview of the ingredients that went into the reference architecture, describes how the software components are deployed on the participating nodes, and notes their dependencies on the underlying operating system.

1.1 Relevant Use Cases and Environments

This is a good solution for the Indie film company because:

- it can provide three-copy redundancy for valuable raw footage and finished products while allowing scratch space to be unreplicated.
- it can provide very good streaming write throughput for a small number of editing stations and excellent streaming read throughput for a larger number of editing and viewing stations.
- it will continue providing service after the complete failure of any single node.
- it can be implemented with a single rack and switch and four servers.

From a technical viewpoint, it should be recognized that because this system uses only a single switch and a single (active) Ceph Object Gateway:

- it is not highly available (the switch is a single point of failure)
- the aggregate client throughput is limited to what can be handled by a single Ceph Object Gateway.

However, for our Indie film company, these two limitations are not much of a concern, and they are willing to make these trade-offs. They are getting excellent durability of their data for a very low budget. In the event of a switch outage, they are willing to take the risk of having to wait a while until a replacement part is installed.


1.2 Component Overview

This is a relatively small system, designed for a minimum of four nodes, and expandable to around twenty nodes and several hundred terabytes. It is intended to all fit in a single rack, served by a single switch. Because this system needs to be able to run on a small number of nodes, we have chosen to co-locate all of the services on identical servers (each with 12+2 disks and 64GB of RAM). In larger systems one would use different types of machines for storage nodes, monitor nodes, and Gateway servers.

1.3 Connectivity Overview

A small cluster, served by a single Gateway server, can carry all client and internal traffic on a single 10G network, served by a single switch, and requires only a single 10G NIC per storage node. Even small clusters must be lights-out manageable, i.e. even after the failure of a NIC or switch. For this reason, we recommend that separate 1G networks be set up for IPMI and management.

1.4 Software Overview

A reasonable Ceph system (whether for testing or deployment) should have at least three nodes:

- three nodes must be running the monitor service so that two can still form a quorum if one fails
- three nodes must be providing storage service so that we can still maintain two copies if one fails
- fortunately, we can run both monitor and object storage daemons on the same node

If three-copy replication is to be used, then a minimum of four nodes is needed. To run a cluster with a minimum number of servers it is necessary to co-locate multiple services on each node. In a minimal four-node system we might distribute functionality among the four nodes as follows:
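One plausible distribution (a sketch rather than a prescription; the Gateway could equally run on one of the monitor nodes) is:

    node 1:  12 OSDs + monitor
    node 2:  12 OSDs + monitor
    node 3:  12 OSDs + monitor
    node 4:  12 OSDs + Ceph Object Gateway (radosgw)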

The Object Storage Daemons, Monitors, and Ceph Object Gateway are 100% user-mode code and able to run on most recent Linux distributions. That having been said, however, these systems should be running stable releases with 3.0 or later kernels (to take advantage of bug fixes and the syncfs system call) and the best available version of the chosen OSD file system.

2. Hardware Components

In this chapter we will recommend specific classes of hardware for each component and briefly discuss the rationale for those recommendations.


2.1 OSD Nodes

One of the most fundamental system design questions is how many disks we want per storage node:

- More disks per node generally result in a denser, lower cost solution.
- Each disk represents added throughput capacity, but only up to the point of saturating the node's NIC, CPU, memory, or storage adaptor.
- A storage node is a single point of failure. The more disks per node, the greater the fraction of our storage that can be lost in a single incident. The amount of time, network traffic, and storage node load required to deal with a storage failure is proportional to the amount of storage that has been lost.

Thus there is a tradeoff to be made. For smaller Ceph deployments we recommend a balanced architecture that utilizes a standard 12-drive 2U chassis configuration that is offered by multiple popular hardware vendors. For much larger systems (with only moderate throughput demands) many more disks can easily be supported per node, as long as the memory and CPU power are increased accordingly. Generally we recommend roughly 1GHz of one CPU core and at least 1-2GB of memory per OSD.
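Applying that rule of thumb to the 12-OSD nodes in this architecture (an illustrative calculation, not from the original document):

    CPU:     12 OSDs x ~1GHz  = ~12GHz, covered by six 2.2GHz cores (~13GHz)
    Memory:  12 OSDs x 1-2GB  = 12-24GB minimum

The 64GB specified below therefore leaves ample headroom for co-located monitor and Gateway services and for buffer cache.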

2.1.1 System Disks

It is recommended that the operating system and Ceph software be installed on (and boot from) a RAID-mirrored disk pair. This prevents the (0.7%/year) failure of a system disk from taking an entire node out of service. If that cost is deemed too high, a single local disk can be used.

For a small operation, booting off of local disks is almost surely the right answer. In larger organizations that have the appropriate networking and image management infrastructure, centralized boot images may make node management much easier. Network booting reduces our dependency on local disks, but is a slower process that is dependent on network infrastructure and multiple additional servers.


2.1.2 Journal Configuration

Ceph storage nodes use a journal device (or partition) to quickly persist and acknowledge writes, while retaining the ability to efficiently schedule disk updates. For systems that are expected to receive heavy write traffic, performance can be increased by maintaining these journals on separate SSD drives. Journals can alternatively be stored on the same drives that hold the corresponding data. This is simpler, less expensive, and more reliable (having fewer components), but will not be capable of as high a write throughput.

Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend the simpler same-disk journal configuration. This is also the better fit for the Indie film company's tight budget.
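A minimal ceph.conf sketch of this choice, assuming the 2GB journal partition per data disk called out in the system specification below (the value is in megabytes):

    [osd]
        ; journal size in MB, matching a 2GB journal partition on each data disk
        osd journal size = 2048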

2.1.3 Storage Controllers

When determining what kind of disk controller to use with Ceph, there are two distinct classes of controllers that should be considered. The first is basic SAS JBOD controllers with no on-board cache; these work well when SSD journals are utilized, as there is no contention between journal writes and data writes on the same device. The second is RAID-capable controllers with battery backup units and write-back cache. This kind of controller is extremely useful when journals and data are stored on the same disk. Write-back cache reduces contention between journal and data writes and generally improves performance, though not necessarily to the levels that SSD journals do.

To see examples of how SSDs and write-back cache affect write performance, please see our Ceph Argonaut vs Bobtail Performance Preview:

http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/#4kbradoswrite

Typical System:

This reference architecture focuses on building simple, reliable, and well balanced Ceph nodes for small to medium sized clusters. To that end, we've chosen a very common 12-disk platform that is available from many different hardware vendors. Journals have been left on the same disks as the data, but utilize a controller with write-back cache to improve performance. We've specified 64GB or more of RAM, which is more than the minimum needed to support the OSDs on the system. The extra RAM provides additional buffer cache, allows the systems to also host MON or RGW services, and should not add significantly to the price.

System Specifications:

System Disks         1 or 2 (RAID1) 250GB+ hard drive(s)
OSD Data Disks       12 x 3.5" 3TB+ 7200RPM hard drives
OSD Journal Disks    2GB partition on each OSD Data Disk
CPU(s)               At least 6 Intel or AMD cores running at 2.2GHz+ (2.0GHz is acceptable if monitor or RGW services are not on the same nodes)
Memory               64GB+
Storage Controller   Battery-backed write-back cache recommended (i.e. LSI SAS2208-class card with BBU unit or similar)
Network Ports        At least 1 10GbE port for data
Management Ports     1 1GbE or 10GbE port for management
IPMI Port            Optional dedicated IPMI port


There are several offerings from different vendors that meet these specifications:

Vendor / Model / Link

- Supermicro 6027R-E1R12T (Note: optional parts MCP-220-82609-0N and BTR-0022L-LSI00279 recommended. CPU, memory, and disks purchased separately. Please speak with your Supermicro representative or system integrator.)
  http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm

- Dell R720xd (Note: Flex Bay option, H710 or H710P controller, and 10GbE adapter recommended. Please speak with your Dell representative.)
  http://www.dell.com/us/enterprise/p/poweredge-r720xd/pd

- HP DL380e (Note: optional rear drive bay, P420 controller, and 10GbE adapter recommended. Please speak with your HP representative.)
  http://shopping1.hp.com/is-bin/INTERSHOP.enfinity/WFS/WW-USSMBPublicStore-Site/en_US/-/USD/ViewStandardCatalog-Browse?CatalogCategoryID=DSwQ7hacs9sAAAE3Do9ObFx_

2.2 Monitor Nodes

For small clusters like this one, Ceph monitor services can be run on the same nodes that the OSDs are running on. We recommend slightly over-provisioning the CPU and memory resources if OSD nodes are also used for monitoring. For example, a node hosting 12 OSDs and 1 monitor could be configured with a 2.2+GHz 6-core CPU and 64GB of RAM to support the OSDs and MON, and provide additional memory for buffer cache. A larger system disk may be desired to store additional logs as well.

This configuration has been optimized for simplicity and low price. Larger clusters with more storage nodes and disks will cause the monitors to use more CPU and memory resources. For larger configurations we generally recommend dedicated monitor nodes:

System Disks         1 or 2 (RAID1) x 3.5" 250GB+ hard drive(s)
CPU(s)               64-bit Intel or AMD CPU (XEON E3-1200, XEON E5-2400, or Opteron 4100 series processor acceptable)
Memory               8GB+
Network Ports        1 1GbE or 10GbE port for monitor traffic
Management Ports     1 1GbE or 10GbE port for management
IPMI Port            Optional dedicated IPMI port

Example offerings from hardware vendors that meet these specifications include:

Vendor / Model / Link

- Supermicro 5017R-MTRF (Note: CPU, memory, and disks purchased separately. Please speak with your Supermicro representative or system integrator.)
  http://www.supermicro.com/products/system/1U/5017/SYS-5017R-MTRF.cfm

- Dell R420
  http://www.dell.com/us/enterprise/p/poweredge-r420/fs

- HP DL160
  http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/15351-15351-3328412-241644-3328421-5211699.html?dnr=1


2.3 Ceph Object Gateway Nodes

A Ceph Object Gateway implements RESTful (S3 or Swift) operations on top of a RADOS cluster. It receives S3/Swift requests from client nodes, and translates those into operations on the RADOS objects that represent the users, buckets, and file objects. Most of the processing in the Gateway server is receiving and sending network messages; all of the actual data storage is in the RADOS cluster. The same platforms described above for Monitor nodes would also be a good choice for a dedicated Ceph Object Gateway, with two key differences: networking and log storage:

- For high throughput applications it might be desirable to put incoming RESTful (S3 and Swift) traffic on a separate NIC (and perhaps network) from the outgoing RADOS object traffic. Forcing these two data streams to compete for a single NIC could significantly reduce the achievable throughput.

- Ceph Object Gateways maintain extensive logs of all of the requests they serve. These logs are often critical for diagnosing customer complaints (to determine exactly what requests were made when). For this reason, it is a good practice to dedicate a 1TB drive (or perhaps even a RAID-1 pair) to log storage.

Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend the simpler configuration, where the Ceph Object Gateway is co-located on one of the storage nodes. Adding a load balancer would make it possible to support multiple active Ceph Object Gateways, significantly improving both throughput and availability. But if our primary concern is availability, a stand-by Object Gateway can be run on another node, and DNS can be used to reroute traffic to the stand-by if the primary Object Gateway fails.
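As an illustration of how the repository software would be granted S3 access, a user can be created on the gateway host with radosgw-admin, which prints the access and secret keys to hand to the S3 client (the uid and display name below are hypothetical examples):

    radosgw-admin user create --uid=repoapp --display-name="Repository Application"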

2.4 Rack Networking

Network design is fairly simple for small systems, because it does not have to address high availability and inter-rack throughput requirements.

A basic four-node proof-of-concept system can be served by a few spare ports (four 10G and four 1G) on an existing switch. Even the largest system covered by this reference architecture (20 servers, each with separate front-side and back-side 10G NICs and separate 1G IPMI and management networks) can easily be handled by a pair of 48-port switches (one 1G, one 10G). But, as mentioned previously, putting all of the data traffic through a single switch creates a single point of failure for the entire cluster. Larger clusters that must provide higher availability require multiple switches (and are described in other Reference Architectures).

It is useful to distinguish as many as four data networks in an object storage system:

i.   the client service network by which client requests reach the load balancer(s).
ii.  the Gateway service network which interconnects the load balancer(s) to the Ceph Object Gateways.
iii. the front-side data network by which the Ceph Object Gateway reaches the RADOS servers.
iv.  the back-side data network across which RADOS storage nodes perform replication, data redistribution, and recovery.

In this (small) configuration, there are no load balancers (eliminating network ii), and the Ceph Object Gateway is co-located with RADOS storage nodes (combining networks i and iii). Because all traffic in this cluster is funneled through a single Ceph Object Gateway, it is not likely that there will be enough traffic to justify the separation of networks iii and iv. In larger configurations (with load balancers and discrete Gateway servers) these four networks would probably be distinct.

Whether you choose to use spare ports on an existing switch, dedicated small switches, or dedicated large switches depends on your expectations for the future:

- If this is a temporary proof-of-concept where you expect to do some testing and then recycle the components, there is little reason to dedicate new switches to this system.
- If this is expected to always be a small system (e.g. starting at four nodes and perhaps growing to eight), relatively small (e.g. eight or 16-port) switches will surely suffice.
- If this system is expected to grow to a full rack (or even multiple racks) you would be well advised to start out with rack-scale (e.g. 48-port) switches and separate front-side and back-side data networks.


2.4.1 Front-Side Data Network

A single client can easily generate data at rates of 1 gigabyte per second or more. A storage node with twelve drives could easily stream data to or from disk at an aggregate rate of 1 gigabyte per second or more. Unless it is known (e.g. this is an archival service) that data will only be trickling into this system, a 1G network fabric (or a Layer 1 switch) would surely become a critical bottleneck. We recommend at least a Layer 2, non-blocking, 10G switch.

If this cluster is to be more than four nodes and we expect it to see a great deal of traffic from clients who are not on the same switch, the interconnection to the client network may need to be much faster (e.g. 40Gb/s).

2.4.2 Back-Side Data Network

If a RADOS cluster is expected to receive significant write traffic, it is recommended that the cluster be served by separate 10G front-side and back-side data networks:

- the client can easily use 100% of its NIC throughput to write data into the RADOS cluster (front-side network).
- if multiple copies are to be made, the server that received the initial write will forward copies to secondary servers (over the back-side network). Thus, if the storage pool is configured for three copies, each front-side write will give rise to two back-side writes.
- in addition to initial write-replication, the back-side network is also used for rebalancing and recovery.

If a cluster is expected to make N copies of each write, the back-side network should be able to handle N-1 times the traffic that is on the front side. In extremely high throughput situations (continuous large writes) it may even be desirable to bond together multiple 10G interfaces to handle the corresponding back-side traffic. As with the front-side, if there is to be a separate back-side data network, we recommend at least a Layer 2, non-blocking, 10G switch.
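As a rough illustration using this document's own estimates (an assumption-laden calculation, not a measurement): with a three-copy pool and roughly 200MB/s of client streaming writes, the back-side network would carry about 2 x 200MB/s = 400MB/s of replication traffic, plus whatever rebalancing or recovery traffic is in flight, which is well within the capacity of a single non-blocking 10G switch.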

    Because the only traffic carried on the back-side network is data transfers between storage nodes, it may be desirable to provision this network as a distinct VLAN. There is no reason for any other systems to have access to this network.

You may wonder why we do not recommend putting the front-side and back-side networks on different switches. The switches are probably more reliable than the servers or their NICs, and adding a second 10G switch to this configuration would greatly increase the cost, without greatly increasing reliability. In larger (multi-rack) configurations, however, careful thought must be given to which servers are on which switches.

2.4.3 Management Network

While client data and replication can easily saturate 10G front-side and back-side networks, status reporting, statistics collection, logging, and management activity generate (comparatively) little traffic, and can easily be accommodated by a 1G network.

Creating a separate 1G network for management offers a few advantages:

- it prevents management traffic from interfering with performance-critical data traffic.
- it creates a completely independent path (including the switch) to each node, enabling better failure detection and easier diagnosis of failures in the data path.
- in larger systems (where multiple switches are required) it enables the use of a less expensive switch for the traffic that does not require 10G throughput.

As general (non-management) clients will have no need to participate in these interactions, this network too can be put on a distinct VLAN.

    A simple design with no separate back-side data network


2.4.4 Emergency Network

Independently of whether or not you are running Ceph software, remotely hosted servers in a lights-out environment will probably need additional networking to enable server and switch problems to be corrected without a service call:

- remote serial console access (to both servers and switches) from a highly accessible serial console server.
- a distinct IPMI subnet (and perhaps VLAN), preferably served by a different switch than the one that serves the management subnet.

    A network design with higher throughput, separation, reliability, and flexibility

3. Software Components

3.1 Distributions and Versions

Ceph Server OS   The recommended operating system for this RA is Ubuntu Precise (12.04). Inktank provides support for the following distributions: CentOS/RHEL, Debian, Ubuntu, SLES, and OpenSUSE.

Ceph Client OS   Since the Ceph Object Gateway will be sharing resources with OSD daemons, the Ceph Client OS should be the same as the Ceph Server OS.

Ceph Version     Inktank recommends the use of the latest Ceph Bobtail release (0.56.4) for this Reference Architecture.


3.2 Ceph Configuration

Ceph Policies

The underlying filesystem for the OSDs should be XFS, formatted with the following options: -i size=2048. The OSD filesystems should be mounted with the following options: noatime,inode64.
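A minimal shell sketch of preparing one OSD data disk with these options (the device name and mount point below are examples only; adapt them to your own disk layout and deployment tooling):

    # format the data disk with 2KB inodes, as recommended above
    mkfs.xfs -i size=2048 /dev/sdb

    # mount it with the recommended options at the default OSD data path
    mount -o noatime,inode64 /dev/sdb /var/lib/ceph/osd/ceph-0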

Inktank recommends the use of Ceph's apache2 and mod_fastcgi forks. The apache2 and mod_fastcgi forks have been optimized for the HTTP 100-Continue response, resulting in performance improvements. Our mod_fastcgi fork also provides support for HTTP/1.1 Chunked Transfer Encoding.

For better performance, it is also recommended to deactivate RGW operations logging on the host running the gateway. While it is possible to send RGW operations logs to a socket, this configuration is out of the scope of this RA. Logging should be deactivated for performance testing and reactivated afterwards.

CRUSH will, by default, place replicas on different hosts. The default number of replicas is 2: the primary and one copy. If you wish to have more replicas, you can do so by recreating the pools used by RGW. Note, however, that a replication level higher than 4 will not be possible in this RA.
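As a hedged illustration, the replica count of an existing pool can also be adjusted in place rather than recreating it (the pool name below is an example; check which pools your RGW instance actually uses):

    # raise the replica count of the RGW bucket data pool to 3
    ceph osd pool set .rgw.buckets size 3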

For this RA, we recommend the following default RGW configuration:

    [client.radosgw.`hostname`]
        host = `hostname`
        keyring = /etc/ceph/ceph.client.radosgw.`hostname`.keyring
        rgw socket path = /tmp/radosgw.sock
        log file = /var/log/ceph/radosgw.log
        rgw enable ops log = false

We also recommend the use of the default path values for OSD and mon directories, especially when using Upstart and/or Ceph deployment solutions (Chef cookbooks, ceph-deploy).

3.3 Other Service Configuration

It is important to make sure that your system disks do not fill up, especially on the nodes hosting the monitors. Setting a proper log rotation policy for all system services, including Ceph, is very important. Regular inspection of disk utilization is also suggested. Be aware that increasing Ceph debugging verbosity can generate over 1GB of data per hour. If you are planning on creating a separate partition for the /var directory on the system, please plan accordingly.

    For more information on setting Ceph log rotation policy, see: http://ceph.com/docs/master/rados/operations/debug/
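A minimal logrotate sketch for the Ceph logs is shown below; the rotation frequency and retention count are assumptions to be adapted to your own disk budget:

    /var/log/ceph/*.log {
        weekly
        rotate 7
        compress
        missingok
        notifempty
    }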

4. Next Steps

    If this sounds like an interesting architecture, Inktank can help you realize it.

4.1 Proof of Concept

Unlike proprietary storage solutions from established vendors, it does not take much to get started with Ceph. There is no need for upfront investment in specialized new hardware or software licenses. To try out the basic functionality, you can use any existing commodity server hardware, attach a bunch of hard drives, deploy Ceph, and take it for a test drive. Many hobbyists and software-defined storage enthusiasts have stood up sizeable Ceph clusters just by following the Ceph documentation and the discussions on the ceph-users community mailing list.

However, for many corporate users that is not an option. Firstly, because of resource constraints: the ideal person to do a Ceph POC is someone who understands Open Source technology and has an appreciation for scalable storage clusters as well as networking infrastructure. Those resources are not easy to find, and if you are lucky enough to have them, they will be in high demand for many projects. Secondly, tight project timelines often make it prohibitive to spend too much time on a proof-of-concept. For management to make a decision, you need to gather facts and have answers much more quickly.


For these reasons, many users turn to Inktank Professional Services to assist with proof-of-concept projects. Inktank PS can assist with your POC by providing the following services:

- analysing your use case, and documenting functional and non-functional requirements for your storage cluster
- selecting hardware to match these requirements, including review of the bill of materials for server and networking hardware (from CPU power to disk drives)
- designing a solution architecture that best fits the requirements, including design for future scaling
- performing performance analysis of the assembled configuration (on various levels, from pure disk performance to cluster performance under heavy load)
- making recommendations on how insights from the POC will apply if you build a much larger production system

Last but not least, the Inktank PS engineers are experts in proof-of-concept projects and pilot implementations. They have plenty of document templates, tried deployment scripts, and hands-on expertise. They are familiar with common problems that you might run into, can quickly help out with advice, and can restore the health of your cluster if you should accidentally damage it during experimentation.

4.2 Inktank Professional Services

Inktank Professional Services has the technical talent, business experience, and long-term vision to help you implement, optimize, and manage your infrastructure as your needs evolve. We are committed to helping you get the most value out of Ceph by leveraging our expertise, dedication, and enterprise-grade services and support.

If you would like to have a conversation with Inktank to plan a proof of concept, contact [email protected] or +(855) 465-8265 x 1 (Sales Team).
