Bandwidth, Throughput, IOPS, and FLOPS


This is a talk I gave at the 2009 HPC Workshop at Rice University on March 5th.

Transcript of Bandwidth, Throughput, IOPS, and FLOPS

Bandwidth, Throughput, IOPS, and FLOPS: The Delicate Balance

Bill Menger, Dan Wieder, Dave Glover – ConocoPhillips

March 5, 2009

Outline

The Problem – Provide enough computer to keep your customers happy (at low cost)

The Solution – Estimating needs and looking at options

The Kickoff – Never exactly works like you thought

The Tuning – Bring it into the Groove one way or the other

The Task – Get ‘er done!

The Finale – Evaluating the good, the bad, the ugly, and the beautiful

The Problem

•We all have tasks to do in “not enough time”

•Solve your customer’s problems with an eye to the future

•Don’t box yourself into a corner (if you can help it).

Dave’s Definition of a Supercomputer

•A supercomputer is a computer that is only one order of magnitude slower than what scientists and engineers need to solve their current problems.

•This definition is timeless and assures me job security.

Dave Glover

The Solution: Task-Driven Computer Design

• Get a bunch of Geophysicists in a room, and ask:

• What is the overall project you need to accomplish?

• How many tasks (jobs) must be run in order to complete?

• How long does each job take on architecture a,b,c…?

• So buy nnn of architecture xxx, back up your timeline so you can complete on time, and begin! (oh, if it were that easy)

•The FLOPS

• We always start here – How much horsepower needed to do the job?

•The Bandwidth

• Seems like this is the next step – MPI needs a good backbone!

•The Throughput

• Always want a disk system to jump in with all cylinders firing.

•The IOPS

• Never really thought about this before now. Maybe I should have!
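Taken together, the geophysicist questions and the four resource axes above boil down to a small piece of sizing arithmetic. Here is a minimal sketch of that step, assuming ideal scaling within a job and whole jobs packed onto nodes; the function name and parameters are illustrative, not something from the talk.

    import math

    def nodes_to_buy(total_jobs, cpu_hours_per_job, cores_per_job,
                     deadline_days, cores_per_node):
        """Rough count of nodes of architecture 'xxx' needed to finish
        total_jobs by the deadline, assuming each job scales ideally across
        cores_per_job cores and nodes run whole jobs back to back."""
        wall_hours_per_job = cpu_hours_per_job / cores_per_job
        slots_per_node = cores_per_node // cores_per_job          # jobs a node runs at once
        jobs_per_node_by_deadline = slots_per_node * (deadline_days * 24 / wall_hours_per_job)
        return math.ceil(total_jobs / jobs_per_node_by_deadline)

The case study that follows plugs real numbers into exactly this kind of estimate.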

Task-Driven Design – Case Study

•We get what we need… no more, no less

•The Need:

• 2 iterations

• 30,000 jobs/iteration

• ~200 cpu-hours/job

• 12MM cpu-hours in 50 days
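Those four bullets pin down the average demand. A tiny back-of-the-envelope check, assuming the machine is kept busy around the clock for the whole 50 days:

    cpu_hours = 2 * 30_000 * 200               # 12,000,000 cpu-hours in total
    window_hours = 50 * 24                     # the 50-day window
    avg_busy_cores = cpu_hours / window_hours  # = 10,000 cores busy non-stop
    print(avg_busy_cores)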

Task-Driven Design – Case Study

•A solution

• Run each job on 8 cores to get jobs down to 1 day

• Make it easy to load/unload data for jobs with global parallel file system

• Put enough memory into each computer to hold the job

• Keep enough local disk to store intermediate results

• Buy enough computers to make it happen
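A quick sanity check of this plan, assuming perfect 8-way scaling within a job and one 8-core job per node (the 1250-node figure comes from the chosen design described below):

    wall_hours_per_job = 200 / 8                   # 25 h, roughly the one-day target
    jobs_per_day = 1250 * 24 / wall_hours_per_job  # one job per node: 1200 jobs/day
    days_needed = (2 * 30_000) / jobs_per_day      # = 50 days, right at the deadline
    print(wall_hours_per_job, jobs_per_day, days_needed)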

Options

•Intel/AMD/Cell

•Infiniband/GigE/10GigE/Myrinet/Special

•NFS disk/Directly connected Disk (local client software)

•Specialty hardware

Considerations

•Fit within current network infrastructure

•Think about my Reservoir Engineer and CFD customers

•What about future algorithms/approaches?

The Chosen Design

•1250 Dual Socket, Quad-core AMD Opterons

•10GigE card in each node with iWarp

•Global Parallel Direct-connected disk system (client modules on each node)

•10GigE interconnect everywhere

Ideal Network Configuration

[Diagram: compute nodes attach through edge switches to four 10 Gbit central switches; storage hangs off two core switches; a separate 1 Gbit switch carries miscellaneous traffic.]

The Chosen Components

•1250 Rackable nodes, dual-socket quad-core AMD Opteron 2.3 GHz, 32 GBytes DRAM

•1250 NetEffect (Intel) 10 Gbit cards with iWARP low-latency software

•42 Force-10 5210 10 Gbit 48-port edge switches

•2 Foundry MLX-8 core switches (16 10 Gbit ports each)

•4 Foundry MLX-32 core switches (128 10 Gbit ports each)

•16 shelves of Panasas storage

The Specs

•160 Terabytes of usable storage

•160 Gbits connectivity to storage

•2.5 Gbit full bisection bandwidth, <10 usec latency

•46 Teraflops estimated capability

•400 Kwatts total power draw when running Linpack

•24 racks including management nodes, switches, and storage
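For what it’s worth, the 46 Teraflops figure is consistent with a simple peak estimate from the node specs above, if one assumes the estimate counts 2 double-precision flops per core per cycle (my assumption; counting the Opteron’s peak of 4 flops per cycle would give roughly twice this):

    nodes, cores_per_node, clock_ghz = 1250, 8, 2.3
    flops_per_cycle = 2    # assumed basis of the estimate, not stated in the talk
    teraflops = nodes * cores_per_node * clock_ghz * flops_per_cycle / 1000
    print(teraflops)       # 46.0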

The Kickoff

•Jobs ran (as modified) in 12 hours. Good!

•10 Gig Network cards not available for all nodes.

•Used 1Gig to nodes, 10 Gig from edge switches to core, and 10 Gig to disks.

•One data space on disk (a single huge file system)

A small gotcha

•Commercial software for processing

• logs back to a common area

• lots of small-file initialization for each job

• always initializes MPI, so many jobs hit the user’s login files all at once

•Output file format (JavaSeis)

• a nice structured sparse format, with good I/O for the actual data

• very small I/O to the common area associated with file updates

[Charts: disk MBytes/sec and disk I/O ops/sec, shown for 14:00–22:00 and zoomed in on 16:30–17:00.]

Consequences

•Start up 1000 jobs in the commercial system, and watch it take 5 seconds per job to enter the queue. That’s 1.4 hours!

•Jobs all reading at once: lots of I/O contention on /home and another directory

•On completion, jobs all writing small-block I/O in the common area at the same time (each operation requires a read-modify-write)

Getting into the Groove

•First, we decided to stagger the jobs: the auto-job builder adds a delay within each job, so jobs start and then sleep for linearly increasing times, producing a nice, steady I/O pattern (a minimal sketch follows this list).

•Subsequent jobs are simply queued and the staggered effect persists.

•Second, we added a smaller file system to house the common area. Now we have two file systems: one for large-block I/O and one for small.
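A minimal sketch of the stagger from the first bullet, written as a wrapper the auto-job builder could prepend to each job. The environment variable name and the per-job spacing are illustrative assumptions, not details from the talk:

    import os
    import time

    # Sleep for a linearly increasing delay before any small-file initialization
    # or MPI start-up, so a burst of 1000+ simultaneous submissions does not hit
    # the shared file system at the same instant.
    SPACING_SECONDS = 5                                  # illustrative spacing between start-ups
    job_index = int(os.environ.get("JOB_INDEX", "0"))    # hypothetical index set by the job builder

    time.sleep(job_index * SPACING_SECONDS)
    # ... now read parameter files, initialize MPI, and start the real work ...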

[Chart: ~1200 jobs running, ~50,000 I/O ops/sec]

Execution

•Jobs ran well, I/O held to a maintainable level

•1Gbit network sufficed

•Finished ahead of schedule

Smooth Sailing!

The good, bad, ugly, and beautiful

Good

• The job got done

Bad

• Frantic startup problems with the commercial system and the disk solution

Ugly

• Realizing we undersized the IOPS

Beautiful

• The system is ready for more work, with a “forwardly upgradable” network at the core.

Lessons Learned

•Don’t scrimp on ANY of the following (a quick sanity-check sketch follows below)

• Network infrastructure (10+ Gbit/s, <10 usec latency)

• I/O bandwidth (>2500 MBytes/sec)

• Memory (>= 4 GBytes/core)

• I/O ops/second (>100,000 IOPS)

•Vendor software is out of your control

• (deal with it)
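The thresholds in the list above lend themselves to a quick sanity check against any proposed design. Below is a hedged sketch of such a check; the function and parameter names are mine, the floors come from the slide, and the numbers in the example call are purely hypothetical.

    def undersized(net_gbit_s, net_latency_usec, io_mbytes_s, gbytes_per_core, iops):
        """Return the resources that fall below the rule-of-thumb floors above."""
        shortfalls = []
        if net_gbit_s < 10:
            shortfalls.append("network bandwidth (< 10 Gbit/s)")
        if net_latency_usec > 10:
            shortfalls.append("network latency (> 10 usec)")
        if io_mbytes_s < 2500:
            shortfalls.append("I/O bandwidth (< 2500 MBytes/s)")
        if gbytes_per_core < 4:
            shortfalls.append("memory (< 4 GBytes/core)")
        if iops < 100_000:
            shortfalls.append("I/O ops/second (< 100,000)")
        return shortfalls

    # Hypothetical design: plenty of disk bandwidth, but too few IOPS.
    print(undersized(net_gbit_s=10, net_latency_usec=8,
                     io_mbytes_s=2500, gbytes_per_core=4, iops=50_000))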