Research @ Northeastern University
![Page 1: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/1.jpg)
EMC Presentation April 2005 1
Research @ Northeastern University
• I/O storage modeling and performance – David Kaeli
• Soft error modeling and mitigation – Mehdi B. Tahoori
![Page 2: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/2.jpg)
I/O Storage Research at Northeastern University
David Kaeli, Yijian Wang
Department of Electrical and Computer Engineering Northeastern University
Boston, MA
[email protected]
![Page 3: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/3.jpg)
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
![Page 4: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/4.jpg)
Important File-based I/O Workloads
• Many subsurface sensing and imaging workloads involve file-based I/O
– Cellular biology – in-vitro fertilization with NU biologists
– Medical imaging – cancer therapy with MGH
– Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution
– Ground-penetrating radar – toxic waste tracking with Idaho National Labs
![Page 5: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/5.jpg)
The Impact of Profile-guided Parallelization on SSI Applications
• Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster
– Hot-path parallelization
– Data restructuring
• Reduced the runtime of a Monte Carlo scattered light simulation by 98% on a 16-node Silicon Graphics Origin 2000
– Matlab-to-C compilation
– Hot-path parallelization
• Obtained superlinear speedup of the Ellipsoid Algorithm run on a 16-node IBM SP2
– Matlab-to-C compilation
– Hot-path parallelization
[Figure: simulation geometry showing soil, air, and a buried mine.]
[Chart: Scattered Light Simulation Speedup; runtime in seconds (log scale, 1 to 100,000) for the Original, Matlab-to-C, and hot-path-parallelized versions.]
[Chart: Ellipsoid Algorithm Speedup (versus serial C version); speedup (0 to 20) against number of nodes (1, 2, 4, 8, 16) for 64-vector, 256-vector, and 1024-vector runs, with linear speedup for reference.]
![Page 6: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/6.jpg)
Limits of Parallelization
• For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers
• Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems
• Multiple clusters can be combined using Grid middleware (the Globus Toolkit)
• For file-based I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploit the parallelism these systems offer
![Page 7: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/7.jpg)
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
![Page 8: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/8.jpg)
Parallel I/O Acceleration
• The I/O bottleneck
– The growing gap between the speed of processors, networks, and underlying I/O devices
– Many imaging and scientific applications access disks very frequently
• I/O-intensive applications
– Out-of-core applications: work on large datasets that cannot fit in main memory
– File-intensive applications: access file-based datasets frequently, with a large number of file operations
![Page 9: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/9.jpg)
Introduction
• Storage architectures
– Direct Attached Storage (DAS): the storage device is directly attached to the computer
– Network Attached Storage (NAS): the storage subsystem is attached to a network of servers, and file requests are passed through a parallel filesystem to the centralized storage device
– Storage Area Network (SAN): a dedicated network providing an any-to-any connection between processors and disks
![Page 10: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/10.jpg)
I/O Partitioning
[Diagram: an I/O-intensive application's processes (P) accessing storage in three configurations: a single disk; data striping across multiple disks (i.e., RAID); and data partitioning across multiple disks by multiple processes (i.e., MPI-IO).]
![Page 11: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/11.jpg)
I/O Partitioning
• I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning)
• Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing)
• How do we recognize the access patterns? A profile-guided approach
![Page 12: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/12.jpg)
Profile Generation
1. Run the application
2. Capture I/O execution profiles
3. Apply our partitioning algorithm
4. Rerun the tuned application
![Page 13: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/13.jpg)
I/O Traces and Partitioning
• For every process, for every contiguous file access, we capture the following I/O profile information:
– Process ID
– File ID
– Address
– Chunk size
– I/O operation (read/write)
– Timestamp
• Generate a partition for every process
• Optimal partitioning is NP-complete, so we develop a greedy algorithm
• We have found we can use partial profiles to guide partitioning
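The trace record above can be sketched as a small data structure. This is an illustrative reconstruction, not the original tooling; the names `TraceRecord` and `accesses_by_chunk` are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceRecord:
    pid: int          # MPI process (rank) that issued the access
    file_id: int      # identifier of the accessed file
    address: int      # starting offset of the contiguous access
    chunk_size: int   # number of bytes accessed
    op: str           # "read" or "write"
    timestamp: float  # time the access was issued

def accesses_by_chunk(trace):
    """Group records by (file_id, address, chunk_size) so a partitioner
    can count readers and writers per contiguous chunk."""
    chunks = {}
    for rec in trace:
        key = (rec.file_id, rec.address, rec.chunk_size)
        chunks.setdefault(key, []).append(rec)
    return chunks

trace = [
    TraceRecord(0, 1, 0,    2040, "write", 0.1),
    TraceRecord(0, 1, 0,    2040, "read",  0.9),
    TraceRecord(1, 1, 2040, 2040, "read",  1.2),
]
by_chunk = accesses_by_chunk(trace)
print(len(by_chunk))  # two distinct contiguous chunks
```

Grouping by contiguous chunk is what lets the partitioning step decide ownership per chunk rather than per file.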
![Page 14: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/14.jpg)
Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to the associated partition;
    if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where read;
    if the chunk is written by one process, but later read by multiple
        assign the chunk to all partitions where read and
        broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort chunks based on the earliest timestamp for each chunk;
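A minimal Python rendering of the greedy algorithm above, under simplifying assumptions: each contiguous chunk is a hashable key, each trace entry is a `(pid, chunk, op, timestamp)` tuple, and the write-broadcast case is reduced to plain replication across the reading partitions.

```python
def partition(trace):
    """Greedy file partitioning over a trace of (pid, chunk, op, ts)."""
    # Tally readers, writers, and the earliest timestamp per chunk.
    readers, writers, first_seen = {}, {}, {}
    for pid, chunk, op, ts in trace:
        table = readers if op == "read" else writers
        table.setdefault(chunk, set()).add(pid)
        first_seen[chunk] = min(ts, first_seen.get(chunk, ts))

    parts = {pid: [] for pid, _, _, _ in trace}
    parts["shared"] = []
    for chunk in first_seen:
        r = readers.get(chunk, set())
        w = writers.get(chunk, set())
        owners = r | w
        if len(owners) == 1:                  # accessed by one process
            parts[owners.pop()].append(chunk)
        elif not w:                           # read-only: replicate
            for pid in r:
                parts[pid].append(chunk)
        elif len(w) == 1 and len(r) > 1:      # one writer, many readers:
            for pid in r:                     # replicate where read
                parts[pid].append(chunk)      # (writes would broadcast)
        else:                                 # multiple writers: shared
            parts["shared"].append(chunk)

    for p in parts.values():                  # order by earliest access
        p.sort(key=lambda c: first_seen[c])
    return parts

trace = [(0, "A", "write", 0.0), (0, "A", "read", 1.0),
         (0, "B", "write", 0.5), (1, "B", "read", 2.0), (2, "B", "read", 2.1),
         (1, "C", "read", 0.2), (2, "C", "read", 0.3),
         (1, "D", "write", 3.0), (2, "D", "write", 3.1)]
parts = partition(trace)
```

Here chunk "A" stays private to process 0, "B" and "C" are replicated to their readers, and the multiply-written "D" lands in the shared partition, sorted by first access within each partition.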
![Page 15: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/15.jpg)
Parallel I/O Workloads
• NAS Parallel Benchmarks (NPB2.4)/BT
– Computational fluid dynamics
– Generates a file (~1.6 GB) dynamically and then reads it back
– Writes/reads sequentially in chunk sizes of 2040 bytes
• SPEChpc96/seismic
– Seismic processing
– Generates a file (~1.5 GB) dynamically and then reads it back
– Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
• Tile-IO
– Parallel Benchmarking Consortium
– Tiled access to a two-dimensional matrix (~1 GB) with overlap
– Writes/reads sequential chunks of 32 KB, with 2 KB of overlap
• Perf
– Parallel I/O test program within MPICH
– Writes a 1 MB chunk at a location determined by rank, no overlap
• Mandelbrot
– An image processing application that includes visualization
– Chunk size depends on the number of processes
![Page 16: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/16.jpg)
Beowulf Cluster
[Diagram: six P2-350 MHz nodes connected by a 10/100 Mb Ethernet switch; two serve as RAID nodes, and nodes also have local PCI-IDE disks.]
![Page 17: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/17.jpg)
Hardware Specifics
• DAS configuration
– Linux box, Western Digital WD800BB (IDE), 80 GB, 7200 RPM
• Beowulf cluster (base configuration)
– Fast Ethernet, 100 Mbits/sec
– Network-attached RAID: Morstor TF200 with 6-9GB Seagate SCSI drives, 7200 RPM, RAID-5
– Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM
• Fibre Channel disks
– Seagate Cheetah X15 ST-336752FC, 15000 RPM
![Page 18: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/18.jpg)
Write/Read Bandwidth
[Charts: bandwidth in MB/sec (0 to 200) for Unix, MPI-IO, and P-IO writes and reads. NPB2.4/BT: 4, 9, 16, and 25 processes. SPECHPC/seis: 4, 8, 16, and 24 processes.]
![Page 19: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/19.jpg)
Write/Read Bandwidth
[Charts: bandwidth in MB/sec for MPI and PIO writes and reads with 4, 8, 16, and 24 processes, for MPI-Tile, Perf, and Mandelbrot.]
![Page 20: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/20.jpg)
Total Execution Time
[Chart: execution time in seconds (0 to 4000), comparing MPI-IO and PIO.]
![Page 21: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/21.jpg)
Profile training sensitivity analysis
• We have found that I/O access patterns are independent of file-based data values
• When we increase the problem size or reduce the number of processes, either:
– the number of I/Os increases, but access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or
– the number of I/Os and I/O access patterns remain the same, but the chunk size increases (NBT, Tile-IO, Perf)
• Re-profiling can be avoided
![Page 22: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/22.jpg)
Execution-driven Parallel I/O Modeling
• Growing need to process large, complex datasets in high-performance parallel computing applications
• Efficient implementation of storage architectures can significantly improve system performance
• An accurate simulation environment lets users test and evaluate different storage architectures and applications
![Page 23: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/23.jpg)
Execution-driven I/O Modeling
• Target applications: parallel scientific programs (MPI)
• Target machine/host machine: Beowulf clusters
• Use DiskSim as the underlying disk drive simulator
• Direct execution to model CPU and network communication
• We execute the real parallel I/O accesses and, meanwhile, calculate the simulated I/O response time
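The execution-driven idea can be sketched in a few lines: perform the real I/O so the program's data flow is preserved, while separately charging simulated disk time. This sketch substitutes a toy seek-plus-transfer model for DiskSim; the class name and constants are illustrative assumptions, not the actual simulator.

```python
import io

class ToyDiskModel:
    """Stand-in for DiskSim: a fixed positioning delay plus a
    bandwidth-limited transfer. Constants are illustrative only."""
    def __init__(self, seek_ms=8.0, mb_per_s=40.0):
        self.seek_s = seek_ms / 1000.0
        self.bytes_per_s = mb_per_s * 1e6
        self.simulated_time = 0.0  # accumulated simulated I/O response time

    def service(self, nbytes):
        # One positioning delay plus a bandwidth-limited transfer.
        self.simulated_time += self.seek_s + nbytes / self.bytes_per_s

def simulated_write(f, data, disk):
    """Execution-driven step: do the real write, charge simulated time."""
    f.write(data)
    disk.service(len(data))

disk = ToyDiskModel()
buf = io.BytesIO()
simulated_write(buf, b"x" * 1_000_000, disk)  # one 1 MB chunk
```

The same wrapper pattern applies to reads; swapping the toy model for a detailed simulator changes the timing fidelity without touching the application's I/O calls.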
![Page 24: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/24.jpg)
Validation – Synthetic I/O Workload on DAS
[Charts: modeled vs. real response time in seconds for sequential writes and reads over access sizes of 1 to 16 blocks, and for non-contiguous writes and reads over seek distances of 1 to 32 blocks at an access size of 1 block; 1000 accesses in each case.]
![Page 25: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/25.jpg)
Simulation Framework – NAS
[Diagram: application processes issue logical file access addresses over the LAN/WAN to a network file system; per-node local I/O traces and filesystem metadata become I/O requests at the RAID controller, which drives DiskSim.]
![Page 26: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/26.jpg)
Execution Time of NPB2.4/BT on NAS – base configuration
[Chart: modeled vs. real execution time in seconds (0 to 4000) for 4, 9, 16, and 25 processors.]
![Page 27: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/27.jpg)
Simulation Framework – SAN-direct
[Diagram: filesystems on each node generate local I/O traces over the LAN/WAN, each feeding its own DiskSim instance.]
• A variant of SAN where disks are distributed across the network and each server is directly connected to a single device
• File partitioning
• Utilize I/O profiling and data-partitioning heuristics to distribute portions of files to disks close to the processing nodes
![Page 28: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/28.jpg)
Execution Time of NPB2.4/BT on SAN-direct – base configuration
[Chart: modeled vs. real execution time in seconds (0 to 3000) for 4, 9, 16, and 25 processors.]
![Page 29: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/29.jpg)
Hardware Specifications
![Page 30: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/30.jpg)
I/O Bandwidth of SPEChpc/seis
[Chart: bandwidth in MB/s (0 to 250) for 4, 8, and 16 processors across storage architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian, SAN-direct-ATA, SAN-direct-SCSI, and SAN-direct-FC.]
![Page 31: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/31.jpg)
I/O Bandwidth of Mandelbrot
[Chart: bandwidth in MB/s (0 to 400) for 4, 8, and 16 processors across storage architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian, SAN-direct-ATA, SAN-direct-SCSI, and SAN-direct-FC.]
![Page 32: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/32.jpg)
Publications
• “Profile-guided File Partitioning on Beowulf Clusters,” Journal of Cluster Computing, Special Issue on Parallel I/O, to appear 2005.
• “Execution-Driven Simulation of Network Storage Systems,” Proceedings of the 12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611.
• “Profile-Guided I/O Partitioning,” Proceedings of the 17th ACM International Symposium on Supercomputing, June 2003, pp. 252-260.
• “Source Level Transformations to Apply I/O Data Partitioning,” Proceedings of the IEEE Workshop on Storage Network Architecture and Parallel I/O, Oct. 2003, pp. 12-21.
• “Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications,” International Journal of Systems, Science and Technology, September 2002, pp. 40-55.
![Page 33: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/33.jpg)
Summary of Cluster-based Work
• Many imaging applications are dominated by file-based I/O
• Parallel systems can only be effectively utilized if I/O is also parallelized
• Developed a profile-guided approach to I/O data partitioning
• Impacting clinical trials at MGH
• Reduced overall execution time by 27-82% over MPI-IO
• The execution-driven I/O model is highly accurate and provides significant modeling flexibility
![Page 34: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/34.jpg)
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
![Page 35: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/35.jpg)
I/O Qualification Laboratory
• Working with the Enterprise Strategy Group
• Develop a state-of-the-art facility to provide independent performance qualification of enterprise storage (ES) systems
• Provide a quarterly report to the ES customer base on the status of current ES offerings
• Work with leading ES vendors to provide them with custom early performance evaluation of their beta products
![Page 36: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/36.jpg)
I/O Qualification Laboratory
• Contacted by IOIntegrity and SANGATE for product qualification
• Developed potential partnerships with leaders in the ES field
• Initial proposals already reviewed by IBM, Hitachi, and other ES vendors
• Looking for initial endorsement from industry
![Page 37: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/37.jpg)
I/O Qualification Laboratory
• Why @ NU
– Track record with industry (EMC, IBM, Sun)
– Experience with benchmarking and I/O characterization
– Interesting set of applications (medical, environmental, etc.)
– Great opportunity to work within the cooperative education model
![Page 38: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/38.jpg)
Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file I/O
• I/O Qualification Laboratory @ NU
• Areas for future work
![Page 39: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/39.jpg)
Areas for Future Work
• Designing a peer-to-peer storage system on a Grid system by partitioning datasets across geographically distributed storage devices
[Diagram: joulian.hpcl.neu.edu (head node plus 31 sub-nodes, with RAID) and keys.ece.neu.edu (head node plus 8 sub-nodes) connected over the Internet via 1 Gbit/s and 100 Mbit/s links.]
![Page 40: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/40.jpg)
NPB2.4/BT read performance
[Chart: read bandwidth in MB/s (0 to 180) for single-server, dual-server, and P2P configurations with 4, 9, 16, and 25 processes.]
![Page 41: Research @ Northeastern University](https://reader035.fdocuments.us/reader035/viewer/2022062521/568167f8550346895ddd74b2/html5/thumbnails/41.jpg)
Areas for Future Work
• Reduce simulation time by identifying characteristic “phases” in I/O workloads
• Apply machine learning algorithms to identify clusters of representative I/O behavior
• Utilize K-Means and multinomial clustering to obtain high fidelity in simulation runs utilizing sampled I/O behavior

“A Multinomial Clustering Model for Fast Simulation of Architecture Designs,” submitted to the 2005 ACM KDD Conference.
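The phase-clustering idea above can be sketched with a plain K-Means (Lloyd's algorithm) over per-interval I/O feature vectors. The feature layout (read count, write count, mean chunk size) and the deterministic initialization are assumptions for illustration; the original work pairs K-Means with multinomial clustering.

```python
def kmeans(points, k, iters=10):
    """Plain Lloyd's-algorithm k-means over per-interval I/O feature
    vectors, e.g. (read count, write count, mean chunk size in KB)."""
    # Deterministic init: the first k distinct points.
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    for _ in range(iters):
        # Assign each point to its nearest center (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Recompute each center as its cluster's per-dimension mean.
        centers = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers, clusters

# Two synthetic I/O "phases": read-heavy small chunks vs. write-heavy large chunks.
phase_a = [(100.0, 2.0, 2.0)] * 5
phase_b = [(3.0, 60.0, 96.0)] * 5
centers, clusters = kmeans(phase_a + phase_b, k=2)
```

Once intervals are clustered, simulating one representative interval per cluster (instead of the whole trace) is what cuts simulation time.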