Post on 11-Jan-2016
Better answers
Compaq HPTC Solutions
Bruce Foster, Ph.D., MBA
bruce.foster@compaq.com
Top100 SuperComputer Architectures (June 1999)
[Bar chart: Percent CPUs, Percent GFlops, and Percent Sites by architecture (Alpha, Pentium, RS6000, MIPS, Hitachi, Fujitsu, NEC, SPARC, PA-RISC), on a 0% to 60% scale.]
The Barriers to Performance Scaling
[Chart: CPU cycle time (nsec, 1 to 100) versus number of CPUs (10 to 100K). System classes plotted: SC (PVP), SMP, Cluster, Farm, MPP. Annotations: Physical Limits, Complexity Limits, 100 to 200 GFLOP Limit.]
Clusters of SMPs Are Breaking Through to the TeraFlop Level
[Chart: CPU cycle time (nsec, 1 to 100) versus number of CPUs (10 to 100K), with SC (PVP), SMP, Cluster, Farm, and MPP regions bounded by Physical Limits and Complexity Limits.]
Fastest microprocessors with best interconnects for SMP clusters yield maximum application performance (TeraFLOP level).
High Performance Computing Systems
HPTC Solutions
• AlphaServer Systems
• Interconnects
• Software
• Services
Compaq is in it for the long haul!
• Alpha roadmap committed for 10 years and beyond of performance leadership.
• Tandem will use Alpha in their next generation systems.
• Tandem owns 36 of the top 38 stock markets worldwide.
• Over 50% of Compaq’s revenue is from Enterprise Systems
Wide Presence in HPTC market
• Intel/ServerNet clusters at NCSA
• Alpha Linux/ServerNet at Caltech
• Alpha Tru64 Unix/FastEthernet at Swinburne
• Alpha Linux/Myrinet “C-Plant” at Sandia (#44 on Top500 list)
• HPTi win at FSL (Alpha Linux/Myrinet) 4 TFlop system
• Compaq Visual Fortran for W95/NT
• Compaq Compilers for Alpha/Linux
• Several very large SC systems (#34 on Top500 list)
• Celera 300 x 4 CPU ES40s (1.2 TFlop)
• ASCI PathForward and ASCI Turquoise
1999 Small and Medium AlphaServers
• Compaq DS10
• Compaq DS20 System
 – 2 CPUs, small PC tower
 – 5.13 GB/s peak, 1.3 GB/s Single-CPU McCalpin Memory B/W
• Compaq ES40 System
 – 4 CPUs, bigger cabinet
 – EV67 systems: 2.5 GB/s 4-CPU McCalpin b/w
 – Double the I/O bandwidth & more slots
Next Generation DS/ES AlphaServers
Designed to Protect Your Investment
[Roadmap chart: three generations compared across Processor, Architecture, Memory, and Storage.]
First Generation
• Alpha 21264 500 MHz, 4 MB L2 Cache
• 16 GB of Memory, 83 MHz Data Bus
• 2 64-bit PCI busses, 33 MHz PCI
• Ultra2
Second and Third Generation
• EV67 600+ MHz
• EV68 800+ MHz
• 8 MB L2 Cache
• 125 MHz Data Bus
• 32 GB
• 4 PCI Busses, 66 MHz PCI, AGP
• Ultra2 64-bit RAID
• Ultra3 SCSI
• DVD
Note: Feature set varies between AlphaServer DS and ES products based on customer needs
SC’99
• 16x4 ES40 => 64 CPUs
• Quadrics Interconnect
• 1.7TB Storage
LINPACK NxN Rmax (GFlops)
[Chart: Rmax in GFlops (log scale, 1 to 1,000) versus number of CPUs.]
Number of CPUs    Rmax (GFlops)
16                10.70
32                21.53
64                42.04
128               85.41
256               154.42
512               271.40
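The scaling shown on the LINPACK slide can be quantified with a small calculation. The sketch below assumes that the six data labels (10.70 through 271.40 GFlops) pair in order with the six CPU counts on the axis (16 through 512); the function name `scaling_table` is hypothetical and introduced only for illustration.

```python
# Rmax values (GFlops) from the LINPACK NxN chart, keyed by CPU count.
# Assumption: the data labels pair in order with the x-axis CPU counts.
rmax_gflops = {16: 10.70, 32: 21.53, 64: 42.04, 128: 85.41, 256: 154.42, 512: 271.40}


def scaling_table(rmax):
    """Return (cpus, gflops_per_cpu, efficiency) rows.

    Efficiency is the per-CPU rate relative to the smallest configuration.
    """
    base_cpus = min(rmax)
    base_rate = rmax[base_cpus] / base_cpus
    rows = []
    for cpus in sorted(rmax):
        per_cpu = rmax[cpus] / cpus
        rows.append((cpus, per_cpu, per_cpu / base_rate))
    return rows


rows = scaling_table(rmax_gflops)
for cpus, per_cpu, eff in rows:
    print(f"{cpus:4d} CPUs  {per_cpu:.3f} GFlops/CPU  {eff:6.1%} of the 16-CPU rate")
```

Under that pairing, the per-CPU rate stays close to the 16-CPU figure up to 128 CPUs and falls to about 79% of it at 512 CPUs.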
MPI 8Byte Ping Pong
[Chart: latency in usecs (axis 5.25 to 5.70) against the node paired with node 0 (nodes 1 to 126). Linear trendline: y = 0.0005x + 5.5267, R² = 0.1728.]
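To see how flat the ping-pong latency is across the machine, the chart's linear trendline (y = 0.0005x + 5.5267, with x the node paired with node 0 and y in usecs) can be evaluated at the two ends of the axis. This is a minimal sketch that uses only the fit printed on the slide, not the underlying measurements; the names below are hypothetical.

```python
# Coefficients of the trendline printed on the ping-pong chart.
SLOPE_USEC_PER_NODE = 0.0005
INTERCEPT_USEC = 5.5267


def fitted_latency_usec(node):
    """Latency predicted by the slide's linear fit for a given partner node."""
    return SLOPE_USEC_PER_NODE * node + INTERCEPT_USEC


nearest = fitted_latency_usec(1)
farthest = fitted_latency_usec(126)
spread = farthest - nearest

print(f"node 1:   {nearest:.4f} usec")
print(f"node 126: {farthest:.4f} usec")
print(f"spread:   {spread:.4f} usec ({spread / nearest:.2%} of the node-1 value)")
```

The fitted difference between the nearest and farthest partner is a few hundredths of a microsecond, and the low R² on the slide suggests that most of the visible variation is scatter rather than a distance effect.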
Cluster and Parallel File System
• Cluster File System
 – File system mounted on any node is visible to all nodes without race conditions
 – Each node is both a CFS server and CFS client
 – Coherency is maintained by exchanging tokens
 – Semantics are POSIX and X/OPEN compliant
 – Performance depends on access type and pattern
• Parallel File System
 – Aggregates CFS files into a single parallel file
 – Enables striping a single logical file across multiple underlying local files
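The idea of striping one logical file across several underlying files can be illustrated with a minimal sketch. It assumes plain round-robin striping with a fixed stripe size; the slide does not state the actual layout policy of the Parallel File System, and `locate`, `stripe_size`, and `n_files` are hypothetical names introduced here.

```python
def locate(offset, stripe_size, n_files):
    """Map a byte offset in the logical file to (file_index, local_offset).

    Assumes round-robin striping: stripe 0 goes to file 0, stripe 1 to file 1,
    and so on, wrapping around after n_files stripes.
    """
    if offset < 0 or stripe_size <= 0 or n_files <= 0:
        raise ValueError("offset must be >= 0; stripe_size and n_files must be > 0")
    stripe_index = offset // stripe_size
    file_index = stripe_index % n_files
    # Each full pass over all files adds one stripe to every underlying file.
    local_offset = (stripe_index // n_files) * stripe_size + offset % stripe_size
    return file_index, local_offset


# Example: 64-byte stripes spread over 4 underlying files.
for logical in (0, 64, 200, 266):
    print(logical, "->", locate(logical, stripe_size=64, n_files=4))
```

With this layout, consecutive stripes land on different underlying files, which is what lets large sequential transfers draw on several files at once.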
Compilers & Tools
• Compaq F90, C, C++, Java, …
• Shared memory
 – Parallelization within SMP node by OpenMP
 – 3rd party decomposition tools (KAI)
• Cray T3D/E-compatible Shmem library
• MPI (MPI 2, MPI-I/O, thread-safe)
• Debugger: TotalView (Etnus, Inc.)
• Performance analysis: Vampir (PALLAS GmbH)
• Load balancing: LSF (Platform Computing)
Our Capability Machine is Here
• A 16-CPU AlphaServer at SC’99
 – 16-way GS160 AlphaServer
 – 16 * 1.46 GF/CPU = 23.4 GFLOPS
 – High sustainable memory bandwidth
• 32-way:
 – 32 CPUs: 46.8 GFLOPS
 – Very high sustainable memory bandwidth
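The peak figures on this slide follow from multiplying the CPU count by the quoted 1.46 GF/CPU. The sketch below reproduces that arithmetic; `peak_gflops` is a hypothetical helper, and these are peak rates, not measured application performance.

```python
GF_PER_CPU = 1.46  # peak GFLOPS per CPU, as quoted on the slide


def peak_gflops(cpus, gf_per_cpu=GF_PER_CPU):
    """Peak GFLOPS for an SMP with the given number of CPUs."""
    return cpus * gf_per_cpu


peak_16 = peak_gflops(16)
peak_32 = peak_gflops(32)
print(f"16 CPUs: {peak_16:.2f} GFLOPS")
print(f"32 CPUs: {peak_32:.2f} GFLOPS")
```

The unrounded products are 23.36 and 46.72 GFLOPS. The slide's 23.4 is the first value rounded to one decimal, and its 46.8 is consistent with doubling that rounded figure rather than rounding 46.72 directly.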
Alpha Microprocessor Summary
• EV6 (21264)
 – .35 µm, 466 - 500 MHz
 – 4-wide superscalar
 – Out-of-order execution
• EV67 (21264a)
 – .25 µm, 667 - 730 MHz
 – 8MB L2 cache
• EV68 (21264b)
 – .18 µm, 800 - 1042 MHz
• EV7 (21364)
 – .18 µm, ~1200 MHz
 – L2 cache on-chip
 – RAMBUS
 – Glueless MP
• EV8 (21464)
 – .13 µm, ~1500 MHz
 – 8-wide superscalar
 – SMT
. . . Future Alpha Microprocessors planned through to 2025 !
EV67/667MHz Preliminary HPTC Applications Results
• 30 to 45% improvement over ES40 EV6/500 MHz
• Competitive leadership
 – 1.15 to over 2 times HP N4000
  – Better than an 8 CPU N4000
 – Over 2 times SGI Origin 2000
  – Better than an 8 CPU Origin 2000
 – Over 2 times Sun UE3000
 – 2 to 4 times Intel Xeon III
[Diagram: eight quad building blocks, each with four EV67 CPUs, four memory arrays, and an I/O switch, connected through a Global Switch.]
New High-end AlphaServer Architecture
A new way of looking at Servers
• Each Quad Building Block
 – 4 EV67 CPUs (731 MHz, 1.46 GFlops)
 – 4 Memory Arrays (total of 16GB, 32-way)
 – 6.4 GB/s Local Switch
 – 28 PCI slots
• Quads aggregate via a Global Switch (8 ports)
 – Combines up to 8 quads
 – High Bandwidth, Low Latency
 – Preserves SMP programming model
• Up to 8 System Partitions
 – Hardware firewalls provide software fault isolation between partitions
 – Can be dynamically reconfigured
 – Support multiple instances and versions of same O/S or different O/S completely (Tru64 UNIX, OpenVMS, and soon Linux)
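The way quads add up to a full system can be checked with a short calculation. The sketch below takes the per-quad figures from this slide (4 CPUs, 16 GB, 28 PCI slots, 1.46 GFlops per CPU) and reads "total of 16GB" as the per-quad memory, which is an assumption; the `Quad` class and `aggregate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """Per-quad resources as listed on the slide (memory reading is assumed)."""
    cpus: int = 4
    memory_gb: int = 16
    pci_slots: int = 28
    gflops_per_cpu: float = 1.46


def aggregate(quads, quad=Quad()):
    """Total resources for a system built from 1 to 8 quads on the Global Switch."""
    if not 1 <= quads <= 8:
        raise ValueError("the Global Switch has 8 ports, so quads must be 1..8")
    return {
        "cpus": quads * quad.cpus,
        "memory_gb": quads * quad.memory_gb,
        "pci_slots": quads * quad.pci_slots,
        "peak_gflops": quads * quad.cpus * quad.gflops_per_cpu,
    }


full_system = aggregate(8)
print(full_system)
```

Eight quads give 32 CPUs, 128 GB, and 224 PCI slots, which lines up with the GS Series figures quoted on the product suite slide (1-32 processors, up to 128+GB of memory, up to 224 PCI slots).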
Overview of CY2000
• CPUs/SMP
 – DS10 (1 CPU), DS20 (2 CPUs), ES40 (4 CPUs), and GS80 (8), GS160 (16) and GS320 (32)
• Systems up to 4096 CPUs
 – 128-way
• Microprocessor speed
 – Around 1GHz at end-2000
Systems Area Network: FAST Message Passing
• Quadrics
 – Backbone of our AlphaServer SC systems.
 – High Bandwidth, Low Latency, High Node/CPU Count
 – It’s a PCI Card; this allows systems of both small and big servers.
• ServerNet
 – Engineered for low per-node SAN cost.
 – Brings Tandem Non-Stop technology to Alpha Linux Beowulfs
• Myrinet
 – Ties together hundreds of Alphas on Sandia’s C-Plant.
• Ethernet/Fast Ethernet
 – Low cost interconnect for medium size systems (Alpha at Swinburne, Sydney Uni (Gordon Bell winner), CSIRO multiple divisions)
Customer Comments: Alpha and Red Hat
• Comments from "The Center for the Neural Basis of Cognition":
 – It runs about six times faster on that {DS20} machine than on a Pentium II 400. (6 times!)
• Comments from West Coast University math department:
   PII-450, 512k cache           g77 -O3    75:02
   Celeron 450A, 128K cache      g77 -O3    74:44
   Alpha 21164-600, 4 MB cache   g77 -O3    29:27
   Alpha 21264-500, 4 MB cache   g77 -O3    17:16
   Alpha 21264-500, 4 MB cache   fort -O3    8:42
 – I'm impressed (both with the AlphaServer 21264 and Compaq Fortran). It's a 5 mesh fluid flow used for modeling blood flows. (9 times!)
• Comments from Canadian University:
 – With your Fortran compiler the DS20 is about 3.5x the speed of an SGI Origin 200 with a 180 MHz R10K CPU, pretty impressive. (3.5 times!)
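The "9 times" callout can be checked against the West Coast University timings. The sketch below assumes the times are two-field values in base 60 (for example minutes:seconds); the slide does not name the unit, but the ratios do not depend on it as long as every entry uses the same one. The dictionary keys and the `to_units` helper are hypothetical.

```python
# Run times from the West Coast University comment, as printed on the slide.
timings = {
    "PII-450, g77 -O3": "75:02",
    "Celeron 450A, g77 -O3": "74:44",
    "Alpha 21164-600, g77 -O3": "29:27",
    "Alpha 21264-500, g77 -O3": "17:16",
    "Alpha 21264-500, fort -O3": "8:42",
}


def to_units(stamp):
    """Convert 'MM:SS'-style text to a count of the smaller unit (assumed base 60)."""
    major, minor = stamp.split(":")
    return int(major) * 60 + int(minor)


baseline = to_units(timings["PII-450, g77 -O3"])
speedups = {name: baseline / to_units(stamp) for name, stamp in timings.items()}

for name, factor in speedups.items():
    print(f"{name:28s} {factor:5.2f}x vs PII-450")
```

On these numbers the Alpha 21264 with Compaq Fortran comes out at roughly 8.6 times the PII-450 run, which the slide rounds to 9 times, and about half of that gain comes from switching from g77 to the Compaq compiler on the same machine.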
Complete Suite of HPTC Systems
Switched based system - 64-bit PCI I/O subsystems - Very Large Memory
Scalable clusters on DIGITAL UNIX, OpenVMS and Linux
Modular system packaging - advanced systems management
DS Series
• 1-2 Processors
• Up to 4GB of memory
• 6 PCI slots
ES Series
• 1-4 Processors
• Up to 16GB of memory
• Up to 10 PCI slots
GS Series
• 1-32 Processors
• Up to 128+GB of memory
• Up to 224 PCI slots
SC Series
• EV67 667MHz
• 64-512 Processors
• Up to 2 TB memory
• Up to 1.2K I/O slots
Availability callouts on the slide: Apr, Feb, May, Coming Soon, Announcing
Thank You!
Please visit our HPTC Web Site or send eMail to Steve Tolnai or myself
http://www.compaq.com/hpc
eMail: tolnai@compaq.com, bruce.foster@compaq.com