Extreme Scripting July 2009
-
Upload
ian-foster -
Category
Technology
-
view
1.104 -
download
0
description
Transcript of Extreme Scripting July 2009
![Page 1: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/1.jpg)
Ian FosterAllan Espinosa, Ioan Raicu, Mike Wilde, Zhao Zhang
Computation Institute
Argonne National Lab & University of Chicago
Extreme scriptingand other adventures
in data-intensive computing
![Page 2: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/2.jpg)
2
How data analysis happens at data-intensive computing workshops
![Page 3: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/3.jpg)
3How data analysis really happensin scientific laboratories
% foo file1 > file2
% bar file2 > file3
% foo file1 | bar > file3
% foreach f (f1 f2 f3 f4 f5 f6 f7 … f100)
foreach? foo $f.in | bar > $f.out
foreach? end
%
% Now where on earth is f98.out, and how did I generate it again?
Now: command not found.
%
![Page 4: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/4.jpg)
4
Extreme scripting
Simplescripts
Complexscripts
Bigcomputers
Smallcomputers
Many activitiesNumerous filesComplex data
Data dependencies
Many programs
Many processorsStorage hierarchy
FailureHeterogeneity
Swift
Preservingfile system semantics,
ability to call arbitrary executables
![Page 5: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/5.jpg)
5Functional magnetic resonance imaging (fMRI) data analysis
![Page 6: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/6.jpg)
6
AIRSN program definition
(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
Run yroRun = reorientRun( r , "y" );Run roRun = reorientRun( yroRun , "x" );Volume std = roRun[0];Run rndr = random_select( roRun, 0.1 );AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );Volume meanRand = softmean( reslicedRndr, "y", "null" );Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );Run nr = reslice_warp_run( boldNormWarp, roRun );Volume meanAll = strictmean( nr, "y", "null" )Volume boldMask = binarize( meanAll, "y" );snr = gsmoothRun( nr, boldMask, "6 6 6" );}
(Run or) reorientRun (Run ir, string direction) { foreach Volume iv, i in ir.v { or.v[i] = reorient(iv, direction); }}
![Page 7: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/7.jpg)
7
Many many tasks:Identifying potential drug targets
2M+ ligands Protein xtarget(s)
Benoit Roux et al.
![Page 8: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/8.jpg)
8
start
report
DOCK6Receptor
(1 per protein:defines pocket
to bind to)
ZINC3-D
structures
ligands complexes
NAB scriptparameters
(defines flexibleresidues,
#MDsteps)
Amber Score:1. AmberizeLigand
3. AmberizeComplex5. RunNABScript
end
BuildNABScript
NABScript
NABScript
Template
Amber prep:2. AmberizeReceptor4. perl: gen nabscript
FREDReceptor
(1 per protein:defines pocket
to bind to)
Manually prepDOCK6 rec file
Manually prepFRED rec file
1 protein(1MB)
6 GB2M
structures(6 GB)
DOCK6FRED~4M x 60s x 1 cpu
~60K cpu-hrs
Amber~10K x 20m x 1 cpu
~3K cpu-hrs
Select best ~500
~500 x 10hr x 100 cpu~500K cpu-hrsGCMC
PDBprotein
descriptions
Select best ~5KSelect best ~5K
For 1 target:4 million tasks
500,000 cpu-hrs(50 cpu-years)
![Page 9: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/9.jpg)
9
![Page 10: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/10.jpg)
10
IBM BG/P570 Teraflop/s, 164,000 cores, 80 TB
![Page 11: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/11.jpg)
11DOCK on BG/P: ~1M tasks on 119,000 CPUs
118784 cores
934803 tasks
Elapsed time: 7257 sec
Compute time: 21.43 CPU years
Average task: 667 sec
Relative efficiency 99.7% (from 16 to 32 racks)Utilization: 99.6% sustained, 78.3% overall
Ioan Raicu et al.
Time (sec)
![Page 12: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/12.jpg)
12
Managing 160,000 cores
Slower shared storage
High-speed local “disk”
Falkon
![Page 13: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/13.jpg)
13Scaling Posix to
petascale
LFS Computenode
(local datasets)
LFS Computenode
(local datasets)
IFSseg
IFScompute
node
IFSseg
IFScompute
node
…
. . .
Largedataset
CN-striped intermediate file system
Torus and tree interconnects
ZOID IFS ZOID on I/O node
Global file systemChirp(multicast)
MosaStore(striping)
Staging
Intermediate
Local
![Page 14: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/14.jpg)
14
Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors
![Page 15: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/15.jpg)
15
Provisioning for data-intensive workloads Example: on-demand
“stacking” of arbitrary locations within ~10TB sky survey
Challenges Random data access Much computing Time-varying load
Solution Dynamic acquisition
of compute & storage
Data diffusion
++++++
=
+
S SloanData
IoanRaicu
![Page 16: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/16.jpg)
16
“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node
IoanRaicu
![Page 17: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/17.jpg)
17
Same scenario, but with dynamic resource provisioning
![Page 18: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/18.jpg)
18
Data diffusion sine-wave workload: Summary
GPFS 5.70 hrs, ~8Gb/s, 1138 CPU hrs DD+SRP 1.80 hrs, ~25Gb/s, 361 CPU hrs DD+DRP 1.86 hrs, ~24Gb/s, 253 CPU hrs
![Page 19: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/19.jpg)
19Data-intensive computing @ Computation Institute: Example applications
Astrophysics Cognitive science East Asian studies Economics Environmental science Epidemiology Genomic medicine Neuroscience Political science Sociology Solid state physics
![Page 20: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/20.jpg)
20
![Page 21: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/21.jpg)
21
Sequencing outpaces Moore’s law
0.5 1 30 60 98 196 294$3,000
$30,000
$300,000
$3,000,000
$3,500
$7,000
$120,000
$240,000$300,000
$600,000$900,000
$15,000
$30,000 $15,000
$30,000
$15,000
$30,000
$45,000
BioinformaticsSequencing
Gigabases
454 Solexa
Next-gen Solexa
Folker Meyer, Computation Institute
BLASTOn EC2,
US$
![Page 22: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/22.jpg)
22Data-intensive computing @ Computation Institute: Hardware
500 TB reliable storage (data &
metadata)
180 TB, 180 GB/s
17 Top/s analysis
Dataingest
Dynamic provisioning
Parallel analysis
Remote access
Offload to remote data centers
P A D S
Diverseusers
Diversedata
sources
1000 TBtape backup
PADS: Petascale Active Data Store (NSF MRI)
![Page 23: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/23.jpg)
23Data-intensive computing @
Computation Institute: Software HPC systems software (MPICH, PVFS, ZeptOS) Collaborative data tagging (GLOSS) Data integration (XDTM) HPC data analytics and visualization Loosely coupled parallelism (Swift, Hadoop) Dynamic provisioning (Falkon) Service authoring (Introduce, caGrid, gRAVI) Provenance recording and query (Swift) Service composition and workflow (Taverna) Virtualization management (Workspace Service) Distributed data management (GridFTP, etc.)
![Page 24: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/24.jpg)
24
Low
LowHigh
High
Agreementabout
outcomes
Certainty about outcomes
Data-intensive computing is an end-to-end problem
Plan and
control
Chaos
Zone of
complexity
Ralph Stacey, Complexity and Creativity in Organizations, 1996
![Page 25: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/25.jpg)
25
Low
LowHigh
High
Agreementabout
outcomes
Certainty about outcomes
We need to function in the zone of complexity
Plan and
control
Chaos
Ralph Stacey, Complexity and Creativity in Organizations, 1996
![Page 26: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/26.jpg)
26
The Grid paradigm
1995 2000 2005 2010
Principles and mechanisms for dynamic virtual organizations
Leverage service oriented architecture Loose coupling of data and services Open software,
architecture
Computer science
Physics
Astronomy
Engineering
Biology
Biomedicine
Healthcare
![Page 27: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/27.jpg)
27
As of Oct19, 2008:
122 participants105 services
70 data35 analytical
![Page 28: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/28.jpg)
28
(Center for Health Informatics)
Multi-center clinical cancer trials image capture and review
![Page 29: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/29.jpg)
29
Summary
Extreme scripting offers the potential for easy scaling of proven working practices Interesting technical problems relating to
programming and I/O models Many wonderful applications
Data-intensive computing is an end-to-end problem Data generation, integration, analysis, etc.,
is a continuous, loosely coupled process
![Page 30: Extreme Scripting July 2009](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e9faeb4c905fb7c8b45c4/html5/thumbnails/30.jpg)
Thank you!
Computation Institutewww.ci.uchicago.edu