
Reliable Datagram Sockets and InfiniBand

Hanan Hit, NoCOUG Staff, 2010

Agenda


• InfiniBand Basics
• What is RDS (Reliable Datagram Sockets)?
• Advantages of RDS over InfiniBand
• Architecture Overview
• TPC-H over 11g Benchmark
• InfiniBand vs. 10GE


Value Proposition - Oracle Database RAC

• Oracle Database Real Application Clusters (RAC) provides the ability to build an application platform from multiple systems clustered together

• Benefits
  – Performance
    • Increase performance of a RAC database by adding additional servers to the cluster
  – Fault Tolerance
    • A RAC database is constructed from multiple instances; loss of an instance does not bring down the entire database
  – Scalability
    • Scale a RAC database by adding instances to the cluster database

(Diagram: several Oracle instances accessing a shared database)

Some Facts


• High-end OLTP database applications are in the 10-20 TB size range with 2-10k IOPS.

• High-end DW applications fall into the 20-40 TB category, with I/O bandwidth requirements of around 4-8 GB per second.

• Two-socket x86_64 servers seem to offer the best price point today.

• The major limitations of these servers are the limited number of slots available for external I/O cards and the CPU cost of processing I/O with conventional kernel-based I/O mechanisms.

• The main challenge in building cluster databases that run on multiple servers is providing low-cost, balanced I/O bandwidth.

• Conventional Fibre Channel-based storage arrays, with their expensive plumbing, do not scale well enough to create the balance at which these database servers could be optimally utilized.


IBA/Reliable Datagram Sockets (RDS) Protocol


What is IBA?
InfiniBand Architecture (IBA) is an industry-standard, channel-based, switched-fabric, high-speed interconnect architecture with low latency and high throughput. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices.

What is RDS?
• A low-overhead, low-latency, high-bandwidth, ultra-reliable, supportable Inter-Process Communication (IPC) protocol and transport system
• Matches Oracle’s existing IPC models for RAC communication
• Optimized for transfers from 200 bytes to 8 MB
• Based on the socket API (a minimal sketch follows this list)
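As a rough illustration of the "based on the socket API" point: the Linux rds kernel module exposes RDS through ordinary sockets, and the small sender below follows the pattern described in the kernel's RDS documentation. The IPoIB addresses, port number, and payload are placeholders, and the sketch assumes the rds (and, for InfiniBand transport, rds_rdma) modules are loaded; it is illustrative only, not Oracle's RAC IPC code.

```c
/*
 * Minimal RDS sender sketch (placeholder addresses/port, minimal error handling).
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef AF_RDS
#define AF_RDS 21            /* kernel value; older libc headers may lack it */
#endif

int main(void)
{
    /* RDS sockets are datagram-oriented but reliable: SOCK_SEQPACKET. */
    int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
    if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

    /* Bind to a local IPoIB address; RDS names endpoints by IP address and port. */
    struct sockaddr_in local = { .sin_family = AF_INET, .sin_port = htons(4000) };
    inet_pton(AF_INET, "192.168.10.1", &local.sin_addr);
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); return 1;
    }

    /* Each sendmsg() names its destination, so one socket can talk to many
     * peers while the kernel guarantees delivery and ordering per peer. */
    struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(4000) };
    inet_pton(AF_INET, "192.168.10.2", &peer.sin_addr);

    char payload[] = "hello over RDS";
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    struct msghdr msg = { .msg_name = &peer, .msg_namelen = sizeof(peer),
                          .msg_iov = &iov,  .msg_iovlen = 1 };
    if (sendmsg(fd, &msg, 0) < 0)
        perror("sendmsg");

    close(fd);
    return 0;
}
```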


Reliable Datagram Sockets (RDS) Protocol


• Leverages InfiniBand’s built-in high availability and load balancing features
  – Port failover on the same HCA
  – HCA failover on the same system
  – Automatic load balancing

• Open source on OpenFabrics / OFED

http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/


Advantages of RDS over InfiniBand


• Lowering data center TCO requires efficient fabrics
• Oracle RAC 11g will scale for database-intensive applications only with the proper high-speed protocol and an efficient interconnect

• RDS over 10GE
  – 10 Gbps is not enough to feed multi-core server I/O needs
  – Each core may require > 3 Gbps
  – Packets can be lost and require retransmission
  – Statistics are not an accurate indication of throughput
  – Efficiency is much lower than reported

• RDS over InfiniBand
  – The network efficiency is always 100%
  – 40 Gbps today
  – Uses InfiniBand delivery capabilities that offload end-to-end checking to the InfiniBand fabric
  – Integrated in the Linux kernel
  – More tools will be ported to support RDS (e.g., netstat)
  – Shows significant real-world application performance boost for
    • Decision support systems
    • Mixed batch/OLTP workloads


InfiniBand Considerations


Why does Oracle use InfiniBand?

• High bandwidth (1x SDR = 2.5 Gbps, 1x DDR = 5.0 Gbps, 1x QDR = 10.0 Gbps)
  – The V2 DB machine uses 4x QDR links (40 Gbps in each direction, simultaneously)

• Low latency (a few µs end-to-end, 160 ns per switch hop)

• RDMA capable
  – Exadata cells receive/send large transfers using RDMA, saving CPU for other operations (a minimal sketch follows this list)
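To make the RDMA bullet more concrete, here is a minimal libibverbs sketch of the memory-registration step that lets the HCA move data directly to and from an application buffer without the host CPU copying it. It assumes an OFED/libibverbs installation and at least one HCA, stops after registration (no queue pairs, no actual transfer), and is in no way the Exadata implementation.

```c
/*
 * Register a buffer with the HCA so it can be the target of RDMA (zero-copy).
 * Build with: gcc rdma_reg.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no InfiniBand devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* open the first HCA */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    /* Pin 1 MB so the HCA may DMA into/out of it without CPU copies. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* The rkey is what a remote peer would use to target this buffer. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```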


Architecture Overview


#1 Price/Performance TPC-H over 11g Benchmark

• 11g over DDR
  – Servers: 64 x ProLiant BL460c
    • CPU: 2 x Intel Xeon X5450 Quad-Core
  – Fabric: Mellanox DDR InfiniBand
  – Storage: Native InfiniBand Storage, 6 x HP Oracle Exadata

World record clustered TPC-H performance and price/performance

(Chart: Price/QphH* @ 1000 GB DB, 11g over 1GE vs. 11g over DDR - 73% TCO saving)


POC Hardware Configuration

Application Servers
  – 2x HP BL480c
  – 2 processors / 8 cores, Xeon X5460 3.16 GHz
  – 64 GB RAM
  – 4x 72 GB 15K drives
  – NIC: HP NC373i 1Gb NIC

Concurrent Manager Servers
  – 6x HP BL480c
  – 2 processors / 8 cores, Xeon X5460 3.16 GHz
  – 64 GB RAM
  – 4x 72 GB 15K drives
  – NIC: HP NC373i 1Gb NIC

Database Servers
  – 6x HP DL580 G5
  – 4 processors / 24 cores, Xeon X7460 2.67 GHz
  – 256 GB RAM
  – 8x 72 GB 15K drives
  – NIC: Intel 10GbE XF SR 2-port PCIe NIC
  – Interconnect: Mellanox 4x PCIe InfiniBand

Storage Array
  – HP XP24000
  – 64 GB cache / 20 GB shared memory
  – 60 array groups of 4 spindles, 240 spindles total
  – 146 GB 15K Fibre Channel disk drives

Networks: 1 GbE, 10 GbE, InfiniBand, 4Gb Fibre Channel

(Diagram: Application Servers, Concurrent Manager Servers, Database Servers, and Storage Array connected over these networks)


CPU Utilization

• InfiniBand maximizes CPU efficiency
  – Enables >20% higher CPU efficiency than 10GE

(Chart: CPU utilization, InfiniBand interconnect vs. 10GigE interconnect)


Disk IO Rate

• InfiniBand maximizes disk utilization
  – Delivers 46% higher IO traffic than 10GE

(Chart: disk IO rate, InfiniBand interconnect vs. 10GigE interconnect)


InfiniBand delivers 63% more TPS vs. 10GE

   Activity                      Start Time      End Time        Duration   Records     TPS

InfiniBand Interconnect
1  Invoice Load - Load File      6/17/09 7:48    6/17/09 7:54    0:06:01    9,899,635   27,422.81
2  Invoice Load - Auto Invoice   6/17/09 8:00    6/17/09 9:54    1:54:21    9,899,635    1,442.89
3  Invoice Load - Total          N/A             N/A             2:00:22    9,899,635    1,370.76

10 GigE Interconnect
1  Invoice Load - Load File      6/25/09 17:15   6/25/09 17:20   0:05:21    7,196,171   22,417.98
2  Invoice Load - Auto Invoice   6/25/09 18:22   6/25/09 20:39   2:17:05    7,196,171      874.91
3  Invoice Load - Total          N/A             N/A             2:22:26    7,196,171      842.05
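As a sanity check on the table, each TPS value is simply records divided by elapsed seconds, and the 63% headline follows from the two totals; for example:

```latex
\[
\mathrm{TPS} = \frac{\text{records}}{\text{elapsed seconds}}, \qquad
\frac{9{,}899{,}635}{6 \cdot 60 + 1} = \frac{9{,}899{,}635}{361} \approx 27{,}422.8, \qquad
\frac{1370.76}{842.05} \approx 1.63 \;\Rightarrow\; \text{about 63\% more TPS on InfiniBand.}
\]
```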

• Workload
  – Nodes 1 through 4: Batch processing
  – Node 5: Extra node, not used
  – Node 6: EBS other activity

• Database size (2 TB)
  – ASM
  – 5 LUNs @ 400 GB

• TPS rates for the invoice load use case

Oracle RAC Workload

InfiniBand needs only 6 servers vs. the 10 servers needed by 10GE

(Chart: TPS, 10GE vs. InfiniBand interconnect)


Sun Oracle Database Machine

• Clustering is the architecture of the future
  – Highest performance, lowest cost, redundant, incrementally scalable

• The Sun Oracle Database Machine, based on 40 Gb/s InfiniBand, delivers a complete clustering architecture for all data management needs


Sun Oracle Database Server Hardware

• 8 Sun Fire X4170 DB servers per rack

• 8 CPU cores

• 72 GB memory

• Dual-port 40Gb/s InfiniBand card

• Fully redundant power and cooling


Exadata Storage Server Hardware

• Building block of massively parallel Exadata Storage Grid
  – Up to 1.5 GB/sec raw data bandwidth per cell
  – Up to 75,000 IOPS with Flash

• Sun Fire X4275 Server
  – 2 Quad-Core Intel Xeon E5540 processors
  – 24 GB RAM
  – Dual-port 4X QDR (40 Gb/s) InfiniBand card
  – Disk options:
    • 12 x 600 GB SAS disks (7.2 TB total)
    • 12 x 2 TB SATA disks (24 TB total)
  – 4 x 96 GB Sun Flash PCIe cards (384 GB total)

• Software pre-installed
  – Oracle Exadata Storage Server Software
  – Oracle Enterprise Linux
  – Drivers, utilities

• Single point of support from Oracle
  – 3-year, 24 x 7, 4-hour on-site response


Mellanox 40Gbps InfiniBand Networking

• Sun Datacenter InfiniBand Switch
  – 36 QSFP ports

• Fully redundant, non-blocking I/O paths from servers to storage

• 2.88 Tb/sec bi-sectional bandwidth per switch (see the arithmetic below)

• 40 Gb/s QDR, dual ports per server

Highest Bandwidth and Lowest Latency
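One way to reproduce the quoted 2.88 Tb/s switch figure, assuming it counts both directions of all 36 QDR ports (the usual marketing convention; the slide does not spell this out):

```latex
\[
36~\text{ports} \times 40~\mathrm{Gb/s} \times 2~\text{directions} = 2880~\mathrm{Gb/s} = 2.88~\mathrm{Tb/s}
\]
```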

• DB machine protocol stack


(Diagram: DB machine protocol stack - RAC, iDB, and Oracle IPC run over RDS; SQL*Net, CSS, etc. run over TCP/UDP on IPoIB; both paths use the InfiniBand HCA. RDS provides zero loss and zero copy (ZDP).)


What's new in V2

                           V1 DB machine              V2 DB machine
Switches                   2 managed, 2 unmanaged     3 managed
Switch ports               24-port DDR                36-port QDR
Min. SM failover timeout   15 seconds                 5 seconds
Connectors                 CX4                        QSFP
SNMP monitoring            Available                  Coming soon
Cell HCA                   x4 PCIe slot               x8 PCIe slot


InfiniBand Monitoring

• SNMP alerts on Sun IB switches are coming

• EM support for IB fabric coming

– Voltaire EM plugin available (at an extra cost)

• In the meantime, customers can & should monitor using

– IB commands from host

– Switch CLI to monitor various switch components

• Self monitoring exists

– Exadata cell software monitors its own IB ports

– Bonding driver monitors local port failures

– SM monitors all port failures on the fabric


Scale Performance and Capacity

• Scalable
  – Scales to an 8-rack database machine by just adding wires
    • More with external InfiniBand switches
  – Scales to hundreds of storage servers
    • Multi-petabyte databases

• Redundant and Fault Tolerant
  – Failure of any component is tolerated
  – Data is mirrored across storage servers


Competitive Advantage

“…everybody is using Ethernet, we are using InfiniBand, 40Gb/s InfiniBand”
Larry Ellison, keynote at Oracle OpenWorld introducing Exadata-2 (Sun Oracle DB machine), October 14, 2009, San Francisco