Clustering


Transcript of Clustering

Page 1: Clustering

Clustering: Next Wave in PC Computing

Page 2: Clustering

Cluster Concepts 101

This section is about clusters in general; we'll get to Microsoft's Wolfpack cluster implementation in the next section.

Page 3: Clustering

Why Learn About Clusters

Today clusters are a niche Unix market

But Microsoft will bring clusters to the masses. Last October, Microsoft announced NT clusters; SCO announced UnixWare clusters; Sun announced Solaris/Intel clusters; Novell announced Wolf Mountain clusters

In 1998, 2M Intel servers will ship, 100K of them in clusters

In 2001, 3M Intel servers will ship, 1M of them in clusters (IDC's forecast)

Clusters will be a huge market, and RAID is essential to clusters

Page 4: Clustering

What Are Clusters?

Group of independent systems that:

Function as a single system
Appear to users as a single system
And are managed as a single system

Clusters are “virtual servers”

Page 5: Clustering

Why Clusters

#1. Clusters Improve System Availability (the primary value in Wolfpack-I clusters)

#2. Clusters Enable Application Scaling

#3. Clusters Simplify System Management

#4. Clusters (with Intel servers) Are Cheap

Page 6: Clustering

Why Clusters - #1

#1. Clusters Improve System Availability

When a networked server fails, the service it provided is down

When a clustered server fails, the service it provided "fails over" and downtime is avoided

[Diagram: separate networked Mail and Internet servers vs. two clustered servers sharing the Mail & Internet services]

Page 7: Clustering

Why Clusters - #2

#2. Clusters Enable Application Scaling

With networked SMP servers, application scaling is limited to a single server

With clusters, applications scale across multiple SMP servers (typically up to 16 servers)

Page 8: Clustering

Why Clusters - #3

#3. Clusters Simplify System Management

Clusters present a Single System Image; the cluster looks like a single server to management applications

Hence, clusters reduce system management costs

[Diagram: three standalone servers form three management domains; a cluster is one management domain]

Page 9: Clustering

Why Clusters - #4

#4. Clusters (with Intel servers) Are Cheap

Essentially no additional hardware costs

Microsoft charges an extra $3K per node: Windows NT Server is $1,000; Windows NT Server, Enterprise Edition is $4,000

Note: Proprietary Unix cluster software costs $10K to $25K per node.

Page 10: Clustering

An Analogy to RAID

RAID Makes Disks Fault Tolerant

Clusters make servers fault tolerant

RAID Increases I/O Performance

Clusters increase compute performance

RAID Makes Disks Easier to Manage

Clusters make servers easier to manage


Page 11: Clustering

Two Flavors of Clusters

#1. High Availability Clusters

Microsoft's Wolfpack 1
Compaq's Recovery Server

#2. Load Balancing Clusters (a.k.a. Parallel Application Clusters)

Microsoft's Wolfpack 2
Digital's VAXClusters

Note: Load balancing clusters are a superset of high availability clusters.

Page 12: Clustering

High Availability Clusters

Two node clusters (node = server)

During normal operations, both servers do useful work

Failover: when a node fails, applications fail over to the surviving node, and it assumes the workload of both nodes

[Diagram: one node runs Mail, the other runs Web; after failover the surviving node runs Mail & Web]

Page 13: Clustering

High Availability Clusters

Failback: when the failed node is returned to service, the applications fail back

[Diagram: after failback, Mail and Web again run on separate nodes]

Page 14: Clustering

Load Balancing Clusters

Multi-node clusters (two or more nodes)

Load balancing clusters typically run a single application (e.g., a database) distributed across all nodes

Cluster capacity is increased by adding nodes (but, like SMP servers, scaling is less than linear -- see the sketch below)

[Diagram: adding a node raises cluster throughput from 3,000 TPM to 3,600 TPM]
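To make "less than linear" concrete, here is a toy Python model of cluster throughput. The per-node efficiency factor is an assumption chosen for illustration, not a Wolfpack or TPC figure.

# Toy model of sub-linear cluster scaling (illustrative only; the
# efficiency factor is assumed, not measured).
def cluster_throughput(single_node_tpm, nodes, efficiency=0.8):
    """Each added node contributes only a fraction (`efficiency`)
    of what the previous node contributed."""
    total = 0.0
    contribution = single_node_tpm
    for _ in range(nodes):
        total += contribution
        contribution *= efficiency   # each extra node helps a bit less
    return total

for n in (1, 2, 4, 8):
    print(n, "nodes ->", round(cluster_throughput(1500, n)), "TPM")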

Page 15: Clustering

Load Balancing Clusters

Cluster rebalances the workload when a node dies

If different apps are running on each server, they fail over to the least busy server, or as directed by predefined failover policies
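A minimal sketch of that failover-target decision, assuming a hypothetical policy list and load table; the names here are not MSCS APIs.

# Pick where an application fails over: honor a predefined policy if one
# exists, otherwise choose the least busy surviving node.
def pick_failover_node(failed_node, nodes, load, policy=None):
    """nodes: all node names; load: dict node -> current load;
    policy: optional ordered list of preferred target nodes."""
    survivors = [n for n in nodes if n != failed_node]
    if policy:                              # predefined failover policy wins
        for preferred in policy:
            if preferred in survivors:
                return preferred
    return min(survivors, key=lambda n: load[n])   # else least busy node

# With no policy, the app fails over to the least busy survivor ("nodeC").
print(pick_failover_node("nodeA", ["nodeA", "nodeB", "nodeC"],
                         {"nodeB": 0.7, "nodeC": 0.3}))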

Page 16: Clustering

Two Cluster Models

#1. "Shared Nothing" Model -- Microsoft's Wolfpack Cluster

#2. "Shared Disk" Model -- VAXClusters

Page 17: Clustering

#1. "Shared Nothing" Model

At any moment in time, each disk is owned and addressable by only one server

"Shared nothing" terminology is confusing: access to the disks is shared (they sit on the same bus), but at any moment in time, the disks are not shared

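A small sketch of the ownership rule, using a hypothetical table of disk owners; this is an illustration, not MSCS code.

# "Shared nothing": every disk has exactly one owner at a time, and a node
# may only address disks it currently owns.
disk_owner = {"disk1": "nodeA", "disk2": "nodeA", "disk3": "nodeB"}

def can_access(node, disk):
    # Both nodes sit on the same bus, but only the owner may address the disk.
    return disk_owner[disk] == node

def fail_over_disks(failed_node, surviving_node):
    # On failure, ownership of the failed node's disks moves to the survivor.
    for disk, owner in disk_owner.items():
        if owner == failed_node:
            disk_owner[disk] = surviving_node

print(can_access("nodeB", "disk1"))   # False: nodeA owns disk1
fail_over_disks("nodeA", "nodeB")
print(can_access("nodeB", "disk1"))   # True: ownership failed over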

Page 18: Clustering

#1. "Shared Nothing" Model

When a server fails, the disks that it owns "fail over" to the surviving server, transparently to the clients


Page 19: Clustering

#2. "Shared Disk" Model

Disks are not owned by servers but shared by all servers

At any moment in time, any server can access any disk

A Distributed Lock Manager arbitrates disk access so apps on different servers don't step on one another (corrupt data)

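A toy Distributed Lock Manager sketch, assuming one lock per resource; a real DLM is far more elaborate, so treat this only as an illustration of the arbitration idea.

# In the shared-disk model any node can reach any disk, so a lock per
# disk (or per block) keeps apps on different servers from corrupting data.
class DistributedLockManager:
    def __init__(self):
        self.locks = {}                  # resource -> holding node

    def acquire(self, node, resource):
        holder = self.locks.get(resource)
        if holder is None or holder == node:
            self.locks[resource] = node  # grant the lock
            return True
        return False                     # held elsewhere; caller must wait/retry

    def release(self, node, resource):
        if self.locks.get(resource) == node:
            del self.locks[resource]

dlm = DistributedLockManager()
print(dlm.acquire("nodeA", "disk1:block42"))   # True
print(dlm.acquire("nodeB", "disk1:block42"))   # False until nodeA releases
dlm.release("nodeA", "disk1:block42")
print(dlm.acquire("nodeB", "disk1:block42"))   # True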

Page 20: Clustering

Cluster Interconnect

This is about how servers are tied together and how disks are physically connected to the cluster

Page 21: Clustering

Cluster Interconnect

Clustered servers always have a client network interconnect, typically Ethernet, to talk to users

And at least one cluster interconnect to talk to other nodes and to disks

[Diagram: two nodes, each with an HBA, joined to users by the client network and to each other and a shared RAID array by the cluster interconnect]

Page 22: Clustering

Cluster Interconnects (cont'd)

Or they can have two cluster interconnects:

One for nodes to talk to each other -- the "Heartbeat Interconnect" -- typically Ethernet

And one for nodes to talk to disks -- the "Shared Disk Interconnect" -- typically SCSI or Fibre Channel

[Diagram: two nodes with NICs on the cluster (heartbeat) interconnect and HBAs on the shared disk interconnect to a RAID array]

Page 23: Clustering

Microsoft's Wolfpack Clusters

Page 24: Clustering

Clusters Are Not New

Clusters Have Been Around Since 1985

Most UNIX Systems Are Clustered

What's New Is Microsoft Clusters:

Code named "Wolfpack"
Named Microsoft Cluster Server (MSCS)
Software that provides clustering
MSCS is part of Windows NT, Enterprise Server

Page 25: Clustering

Microsoft Cluster Rollout

Wolfpack-I

In Windows NT, Enterprise Server, 4.0 (NT/E, 4.0) [also includes Transaction Server and Reliable Message Queue]
Two node "failover cluster"
Shipped October, 1997

Wolfpack-II

In Windows NT, Enterprise Server 5.0 (NT/E 5.0)
"N" node (probably up to 16) "load balancing cluster"
Beta in 1998 and ship in 1999

Page 26: Clustering

MSCS (NT/E, 4.0) Overview

Two Node "Failover" Cluster

"Shared Nothing" Model: at any moment in time, each disk is owned and addressable by only one server

Two Cluster Interconnects:

"Heartbeat" cluster interconnect -- Ethernet
Shared disk interconnect -- SCSI (any flavor) or Fibre Channel (SCSI protocol over Fibre Channel)

Each Node Has a "Private System Disk" (boot disk)

Page 27: Clustering

MSCS (NT/E, 4.0) Topologies

#1. Host-based (PCI) RAID Arrays

#2. External RAID Arrays

Page 28: Clustering

NT Cluster with Host-Based RAID Array

Each node has:

Ethernet NIC -- Heartbeat
Private system disk (generally on an HBA)
PCI-based RAID controller -- SCSI or Fibre

Nodes share access to data disks but do not share data

[Diagram: two nodes with NICs on the "Heartbeat" interconnect and PCI RAID controllers on the shared disk interconnect to the drives]

Page 29: Clustering

NT Cluster with SCSI External RAID Array

Each node has:

Ethernet NIC -- Heartbeat
Multi-channel HBAs connecting the boot disk and the external array

Shared external RAID controller on the SCSI bus -- DAC SX

[Diagram: two nodes with NICs on the "Heartbeat" interconnect and HBAs on the shared SCSI disk interconnect to the external RAID controller]

Page 30: Clustering

NT Cluster with Fibre External RAID Array

DAC SF or DAC FL (SCSI to disks)
DAC FF (Fibre to disks)

[Diagram: two nodes with NICs on the "Heartbeat" interconnect and HBAs on the shared Fibre disk interconnect to the external RAID controller]

Page 31: Clustering

MSCS -- A Few of the Details


Page 32: Clustering

Cluster Interconnect & Heartbeats

Cluster Interconnect

Private Ethernet between nodes
Used to transmit "I'm alive" heartbeat messages

Heartbeat Messages

When a node stops getting heartbeats, it assumes the other node has died and initiates failover
In some failure modes, both nodes stop getting heartbeats (a NIC dies, or someone trips over the cluster cable)
Both nodes are still alive, but each thinks the other is dead -- "split brain" syndrome
Both nodes initiate failover. Who wins?
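A minimal sketch of the heartbeat logic described above; the timeout value is illustrative, not an MSCS setting, and the function names are hypothetical.

# Each node timestamps the "I'm alive" messages it receives; if none arrive
# within the timeout, it presumes the peer dead and starts failover.
import time

HEARTBEAT_TIMEOUT = 5.0        # seconds without a heartbeat before presuming peer death
last_heartbeat = time.monotonic()

def on_heartbeat_received():
    global last_heartbeat
    last_heartbeat = time.monotonic()

def peer_is_alive():
    return (time.monotonic() - last_heartbeat) < HEARTBEAT_TIMEOUT

def initiate_failover():
    print("peer presumed dead: taking over its workload")

def monitor_once():
    if not peer_is_alive():
        # Either the peer really died, or only the heartbeat link failed
        # (the split-brain case) -- the quorum disk, described next, breaks the tie.
        initiate_failover()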

Page 33: Clustering

Quorum Disk

Special cluster resource that stores the cluster log

When a node joins a cluster, it attempts to reserve the quorum disk (the purple disk in the diagram)

If the quorum disk does not have an owner, the node takes ownership and forms a cluster

If the quorum disk has an owner, the node joins the cluster

[Diagram: two nodes with NICs on the cluster "Heartbeat" interconnect and HBAs on the disk interconnect; the quorum disk lives in the shared RAID array]

Page 34: Clustering

Quorum Disk

If the nodes cannot communicate (no heartbeats), then only one is allowed to continue operating

They use the quorum disk to decide which one lives:

Each node waits, then tries to reserve the quorum disk
The last owner waits the shortest time and, if it's still alive, takes ownership of the quorum disk
When the other node attempts to reserve the quorum disk, it finds that it's already owned
The node that doesn't own the quorum disk then fails over

This is called the Challenge / Defense Protocol
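A sketch of the Challenge / Defense idea: both nodes race for the quorum disk, but the previous owner gets a head start, so at most one node keeps running. The reserve() callable stands in for an atomic SCSI reserve, and the wait times are illustrative.

import time

def arbitrate_quorum(node, was_last_owner, reserve):
    """reserve(node) must atomically claim the quorum disk and return True
    only for the first claimant (e.g. a SCSI reserve)."""
    time.sleep(1 if was_last_owner else 3)   # defender waits less than challenger
    if reserve(node):
        return "survive"      # this node owns the quorum disk and keeps the cluster
    return "failover"         # the other node owns it: this node's work fails over

# Tiny stand-in for the disk reservation, just to exercise the logic.
already_reserved = set()
def fake_reserve(node):
    if "quorum" in already_reserved:
        return False
    already_reserved.add("quorum")
    return True

print(arbitrate_quorum("nodeA", was_last_owner=True, reserve=fake_reserve))   # survive
print(arbitrate_quorum("nodeB", was_last_owner=False, reserve=fake_reserve))  # failover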

Page 35: Clustering

Microsoft Cluster Server (MSCS)

MSCS Objects: there are lots of MSCS objects, but only two we care about -- Resources and Groups

Resources: applications, data files, disks, IP addresses, ...

Groups: an application and its related resources, like data on disks

Page 36: Clustering

Microsoft Cluster Server (MSCS)

When a server dies, groups fail over

When a server is repaired and returned to service, groups fail back

Since data on disks is included in groups, disks fail over and fail back

[Diagram: Mail and Web groups, each a bundle of resources, distributed across the two nodes]

Page 37: Clustering

Groups Failover

Groups are the entities that fail over

And they take their disks with them (see the sketch below)

[Diagram: the failed node's groups, with their resources and disks, move to the surviving node]
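Continuing the Group sketch from the MSCS objects slide: failover amounts to changing which node owns a group (disks included, since they are resources inside the group), and failback reverses the move. Names are hypothetical, not MSCS APIs.

from types import SimpleNamespace

def fail_over(groups, failed_node, surviving_node):
    for group in groups:
        if group.owner_node == failed_node:
            group.owner_node = surviving_node   # group, resources and disks move together

def fail_back(groups, repaired_node, preferred_owner):
    for group in groups:
        if preferred_owner.get(group.name) == repaired_node:
            group.owner_node = repaired_node    # group returns to its preferred node

# Stand-in for the Group object defined in the earlier sketch.
mail = SimpleNamespace(name="Mail", owner_node="nodeA")
fail_over([mail], "nodeA", "nodeB")             # nodeA dies: Mail now runs on nodeB
fail_back([mail], "nodeA", {"Mail": "nodeA"})   # nodeA repaired: Mail fails back
print(mail.owner_node)                          # nodeA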

Page 38: Clustering

Microsoft Cluster Certification

Two Levels of Certification:

Cluster Component Certification

HBAs and RAID controllers must be certified
When they pass: they're listed on the Microsoft web site (www.microsoft.com/hwtest/hcl/) and they're eligible for inclusion in cluster system certification

Cluster System Certification

Complete two node cluster
When they pass: they're listed on the Microsoft web site and they'll be supported by Microsoft

Each Certification Takes 30 - 60 Days

Page 39: Clustering

Mylex NT Cluster Solutions

Page 40: Clustering

Internal vs External RAID Positioning

Internal RAID

Lower cost solution
Higher performance in read-intensive applications
Proven TPC-C performance enhances cluster performance

External RAID

Higher performance in write-intensive applications (write-back cache is turned off in PCI-RAID controllers)
Higher connectivity -- attach more disk drives
Greater footprint flexibility (until PCI-RAID implements Fibre)

Page 41: Clustering

Why We're Better -- External RAID

Robust Active - Active Fibre Implementation

Shipping active - active for over a year
It works in NT (certified) and Unix environments
Will have Fibre on the back-end soon

Mirrored Cache Architecture

Without mirrored cache, data is inaccessible or dropped on the floor when a controller fails -- unless you turn off the write-back cache, which degrades write performance by 5x to 30x

Four to Six Disk Channels

I/O bandwidth and capacity scaling

Dual Fibre Host Ports

NT expects to access data over pre-configured paths; if it doesn't find the data over the expected path, then I/Os don't complete and applications fail

Page 42: Clustering

SX Active / Active Duplex

[Diagram: two nodes with HBAs on the Ultra SCSI disk interconnect to duplex DAC SX controllers; separate cluster interconnect]

Page 43: Clustering

SF (or FL) Active / Active Duplex

[Diagram: two nodes with FC HBAs on a single FC array interconnect to duplex DAC SF controllers]

Page 44: Clustering

SF (or FL) Active / Active Duplex

[Diagram: two nodes with FC HBAs on a dual FC array interconnect to duplex DAC SF controllers and an FC disk interconnect]

Page 45: Clustering

FF Active / Active Duplex

[Diagram: two nodes with FC HBAs on a single FC array interconnect to duplex DAC FF controllers]

Page 46: Clustering

FF Active / Active Duplex

[Diagram: two nodes with FC HBAs on a dual FC array interconnect to duplex DAC FF controllers]

Page 47: Clustering

Why We'll Be Better -- Internal RAID

Deliver Auto-Rebuild

Deliver RAID Expansion

MORE-I: Add Logical Units On-line
MORE-II: Add or Expand Logical Units On-line

Deliver RAID Level Migration

0 ---> 1, 1 ---> 0, 0 ---> 5, 5 ---> 0, 1 ---> 5, 5 ---> 1

And (of course) Award Winning Performance

Page 48: Clustering

NT Cluster with Host-Based RAID Array

Nodes have:

Ethernet NIC -- Heartbeat
Private system disks (HBA)
PCI-based RAID controller

[Diagram: two nodes with NICs on the "Heartbeat" interconnect and eXtremeRAID controllers on the shared disk interconnect]

Page 49: Clustering

Why eXtremeRAID & DAC960PJ Clusters

Typically four or fewer processors

Offers a less expensive, integrated RAID solution

Can combine clustered and non-clustered applications in the same enclosure

Uses today's readily available hardware

Page 50: Clustering

TPC-C Performance for Clusters

DAC960PJ:

Two external Ultra channels at 40 MB/sec
Three internal Ultra channels at 40 MB/sec
32-bit PCI bus between the controller and the server, providing burst data transfer rates up to 132 MB/sec
66 MHz i960 processor off-loads RAID management from the host CPU
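A quick back-of-the-envelope check on those numbers (peak rates only; real throughput is lower).

# Five Ultra SCSI channels at 40 MB/sec vs a 32-bit, 33 MHz PCI burst rate.
ultra_channels = 2 + 3          # two external + three internal Ultra channels
per_channel_mb_s = 40
pci_burst_mb_s = 132            # 32-bit PCI burst rate quoted above

aggregate_scsi = ultra_channels * per_channel_mb_s   # 200 MB/sec on the disk side
print(aggregate_scsi, "MB/sec of SCSI bandwidth vs", pci_burst_mb_s, "MB/sec PCI burst")
# The 32-bit PCI bus, not the Ultra channels, is the first bottleneck.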

Page 51: Clustering

eXtremeRAID™: Blazing Clusters

eXtremeRAID™ achieves a breakthrough in RAID technology, eliminates storage bottlenecks, and delivers scalable performance for NT clusters.

64-bit PCI bus doubles data bandwidth between the controller and the server, providing burst data transfer rates up to 266 MB/sec

3 Ultra2 SCSI LVD channels (80 MB/sec each) for up to 42 shared storage devices and connectivity up to 12 meters

233 MHz StrongARM RISC processor off-loads RAID management from the host CPU

Mylex's new firmware is optimized for performance and manageability

eXtremeRAID™ supports up to 42 drives per cluster, as much as 810 GB of capacity per controller; performance increases as you add drives

[Board diagram: 233 MHz RISC processor, CPU NVRAM, DAC memory module with BBU, SCSI-to-PCI bridge, LEDs, serial port, and three Ultra2 SCSI channels (Ch 0, Ch 1, Ch 2) at 80 MB/sec each]
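The same back-of-the-envelope check for the eXtremeRAID numbers above.

# Three Ultra2 LVD channels at 80 MB/sec vs the 64-bit PCI burst rate.
ultra2_channels = 3
per_channel_mb_s = 80           # Ultra2 SCSI LVD
pci_burst_mb_s = 266            # 64-bit PCI burst rate quoted above

print(ultra2_channels * per_channel_mb_s, "MB/sec of disk-side bandwidth vs",
      pci_burst_mb_s, "MB/sec PCI burst")
# Here the 64-bit PCI bus (266 MB/sec) exceeds the 240 MB/sec the three Ultra2
# channels can deliver, so the bus is no longer the first bottleneck.
# Capacity check: 810 GB / 42 drives ≈ 19.3 GB per drive, consistent with the
# roughly 18 GB drives of the day.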

Page 52: Clustering

eXtremeRAID™ 1100 NT Clusters

Nodes have:

Ethernet NIC -- Heartbeat
Private system disks (HBA)
PCI-based RAID controller

Nodes share access to data disks but do not share data

[Diagram: two nodes with NICs on the "Heartbeat" interconnect and eXtremeRAID controllers on 3 shared Ultra2 interconnects]

Page 53: Clustering

Cluster Support Plans

Internal RAID

Windows NT 4.0 -- 1998
Windows NT 5.0 -- 1999
Novell Orion -- Q4 98
SCO -- TBD
SUN -- TBD

External RAID

Windows NT 4.0 -- 1998
Windows NT 5.0 -- 1999
Novell Orion -- TBD
SCO -- TBD

Page 54: Clustering

Plans For NT Cluster Certification

Microsoft Clustering (submission dates):

DAC SX -- Completed (Simplex)
DAC SF -- Completed (Simplex)
DAC SX -- July (Duplex)
DAC SF -- July (Duplex)
DAC FL -- August (Simplex)
DAC FL -- August (Duplex)
DAC960PJ -- Q4 '99
eXtremeRAID™ 1164 -- Q4 '99
AcceleRAID™ -- Q4 '99

Page 55: Clustering

What RAID Arrays Are Right for Clusters

Internal RAID: eXtremeRAID™ 1100, AcceleRAID™ 200, AcceleRAID™ 250

External RAID: DAC SF, DAC FL, DAC FF