Enabling Data-Intensive Science Through Data Infrastructures
DI-MMAP: A High Performance Memory-Map Runtime for Data-Intensive Applications
DISCS 2012, Nov. 16, 2012
Brian Van Essen, Henry Hsieh (UCLA), Sasha Ames, Maya Gokhale
LLNL-PRES-602272. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Motivation
Enable scalable out-of-core computations for data-intensive computing.

Effectively integrate non-volatile random access memory into the HPC node's memory architecture.

Address data-intensive computing scalability challenges:
- Use node-local NVRAM to support larger working sets
- DRAM-cached NVRAM to extend main memory

Allow latency-tolerant applications to be oblivious to transitions from dynamic to persistent memory when accessing out-of-core data.
HPC Challenges and opportunities
Data-intensive high-performance computing applications:
- processing of massive real-world graphs
- bioinformatics / computational biology
- streamline tracing (in-situ VDA)

Creating data-intensive architecture is costly and power-intensive:
- in traditional HPC architecture, DRAM per core is going down
- DRAM is expensive: cost and power

NVRAM technologies promise:
- lower latency
- higher density
- better concurrency
- minimal static power → lower average power
Data-Intensive High-Performance Computing
Data-intensive applications:
- large data sets
- large working sets that exceed the capacity of main memory
- memory bound
- irregular data access
- latency sensitive
- minimal computation

Latency-tolerant algorithms:
- highly concurrent
- avoid bulk synchronous communication
- potentially asynchronous execution
Integrating future NVRAM
Peripherally attached storage in the near term:
- 2-4 year horizon
- existing PCIe-attached Flash storage

High-performance PCIe-attached NVRAM:
- low access latency
- efficient random access
- faster peripheral bus
Challenges for HPC Runtime
Integrating high-performance storage requires:
- explicit out-of-core algorithms, or
- seamless integration of storage into the memory hierarchy (e.g. a high-performance memory-map)

The Linux memory-map runtime does not:
- scale well with increased concurrency
- perform well when memory is not freely available

→ optimize the memory-map runtime for data-intensive computing
Direct I/O or memory-mapped I/O
Direct I/O - direct access to NVRAM pages:
- avoids overheads of the software stack
- good for fetching multiple pages of data at once

Memory-mapped I/O - map a file/device into the application's virtual memory:
- good for word-level access
- word access to cached pages is at memory speeds
- eliminates the dichotomy between storage and memory
- data structures easily transition out-of-core
- can sacrifice performance

Memory-mapped I/O can seamlessly extend the memory hierarchy.
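To make the contrast concrete, here is a minimal C sketch (not from the slides; the file path is hypothetical and error handling is omitted) of the two access styles: direct I/O fetches whole aligned pages explicitly, while memory-mapped I/O lets ordinary loads touch the data word by word.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE 4096

int main(void) {
    /* Direct I/O: the application manages aligned, page-sized buffers itself
       and fetches whole pages at a time. */
    int dfd = open("/mnt/nvram/data.bin", O_RDONLY | O_DIRECT);
    void *buf = NULL;
    posix_memalign(&buf, PAGE, PAGE);
    pread(dfd, buf, PAGE, 0);                     /* read one whole page */
    uint64_t via_direct = ((uint64_t *)buf)[0];

    /* Memory-mapped I/O: the kernel fetches pages on demand, so the
       application dereferences pointers and word access to cached pages
       runs at memory speed. */
    int mfd = open("/mnt/nvram/data.bin", O_RDONLY);
    off_t len = lseek(mfd, 0, SEEK_END);
    uint64_t *data = mmap(NULL, len, PROT_READ, MAP_SHARED, mfd, 0);
    uint64_t via_mmap = data[0];                  /* page fault on first touch */

    printf("%llu %llu\n", (unsigned long long)via_direct,
                          (unsigned long long)via_mmap);
    munmap(data, len);
    free(buf);
    close(dfd);
    close(mfd);
    return 0;
}
```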
Data-intensive memory-map runtime (DI-MMAP)
A high-performance alternative to Linux mmap:
- performance scales with increased concurrency
- performance does not degrade under memory pressure
- explicit assignment from data structures to buffers

DI-MMAP features:
- a fixed-size page buffer
- minimal dynamic memory allocation
- a simple FIFO buffer replacement policy
- preferential caching for frequently accessed pages
Using DI-MMAP
The DI-MMAP device driver:
1. is loaded into a running Linux kernel
2. allocates a fixed amount of main memory for page buffering
3. creates a control interface file in the /dev filesystem

Once loaded:
1. the control file is used to create pseudo-files in /dev
2. pseudo-files link (i.e. redirect) to block devices in the system
3. accesses to a pseudo-file are redirected to the linked block device
4. a pseudo-file is memory-mapped into the application's virtual memory space
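As an illustration only: the slides do not show the control-file commands, so assume a pseudo-file /dev/di-mmap/nvram0 (hypothetical name) has already been created and linked to a Flash block device. An application then maps it like any other file:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical pseudo-file created through the DI-MMAP control interface
       and linked to a Flash block device. */
    int fd = open("/dev/di-mmap/nvram0", O_RDWR);
    size_t len = (size_t)64 << 30;            /* assumed 64 GiB mapping */

    /* Faults on this mapping are serviced by the DI-MMAP page buffer rather
       than the regular Linux page cache. */
    uint64_t *data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    data[0] = 42;                    /* store to a DI-MMAP-buffered page */
    uint64_t v = data[1u << 20];     /* load faults a page in from the device */
    (void)v;

    munmap(data, len);
    close(fd);
    return 0;
}
```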
Buffer Management
[Figure: DI-MMAP buffer organization - page faults enter a Primary FIFO; hot pages are promoted to a Hotpage FIFO; evicted pages pass through an Eviction Queue (removal from page tables, TLB flush / writeback) to the Free Page List; the Buffer Page Location Table allows recently evicted pages to be recovered.]
Minimize the amount of effort needed to find a page to evict:
- In the steady state a page is evicted on each page fault
- Track recently evicted pages to maintain temporal reuse
- Allow bulk TLB operations to reduce inter-processor interrupts
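The following user-level toy model (my sketch, not the driver source) illustrates the steady-state behavior implied by the figure and bullets: each miss evicts exactly one buffered page, and frequently touched pages get preferential treatment, simplified here to a second chance rather than the separate hot-page FIFO and eviction queue of the real design. All names and constants are hypothetical.

```c
/* User-level toy model of the buffer-replacement behavior described above.
 * The real DI-MMAP runtime is a Linux kernel module with separate primary and
 * hot-page FIFOs, an eviction queue, and bulk TLB flushes. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES 4          /* fixed-size page buffer */
#define HOT_THRESHOLD 3   /* touches before a page counts as "hot" */

struct page { int64_t block; int touches; int valid; };

static struct page buf[NPAGES];
static int hand = 0;      /* next FIFO victim */

/* Stand-in for the buffer page-location table. */
static int lookup(int64_t block) {
    for (int i = 0; i < NPAGES; i++)
        if (buf[i].valid && buf[i].block == block) return i;
    return -1;
}

static int fault(int64_t block) {
    int slot = lookup(block);
    if (slot >= 0) { buf[slot].touches++; return slot; }   /* page recovered */

    /* Steady state: each fault evicts one page; hot victims are skipped once. */
    while (buf[hand].valid && buf[hand].touches >= HOT_THRESHOLD) {
        buf[hand].touches = 0;
        hand = (hand + 1) % NPAGES;
    }
    slot = hand;
    if (buf[slot].valid)
        printf("evict block %lld\n", (long long)buf[slot].block);
    buf[slot] = (struct page){ .block = block, .touches = 1, .valid = 1 };
    hand = (hand + 1) % NPAGES;
    return slot;
}

int main(void) {
    int64_t refs[] = { 1, 2, 1, 1, 3, 4, 5, 6, 1, 2 };
    for (size_t i = 0; i < sizeof refs / sizeof refs[0]; i++)
        fault(refs[i]);
    return 0;
}
```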
Livermore random I/O testbench (LRIOT)

We currently lack tools to effectively measure and evaluate NVRAM:
- at high speed
- with high concurrency
- under complex and unstructured access patterns
FIO: industry standard for benchmarking
- does not scale well
- cannot mix concurrency with both processes and threads

LRIOT: high-concurrency / high-throughput benchmarking tool
- supports a mixture of processes and threads
- multiple random and deterministic access patterns
- more deterministic timing measurements
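LRIOT's own interface is not shown in the slides, so the sketch below is only an approximation of the workload shape it generates: many concurrent workers issuing uniform-random, page-sized reads against one file. LRIOT additionally mixes processes with threads and supports other access patterns; the path, thread count, and I/O count here are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 64
#define IOS_PER_THREAD 100000
#define PAGE 4096

static int fd;
static off_t npages;

static void *worker(void *arg) {
    unsigned seed = (unsigned)(uintptr_t)arg;
    char *buf;
    posix_memalign((void **)&buf, PAGE, PAGE);        /* O_DIRECT needs alignment */
    for (long i = 0; i < IOS_PER_THREAD; i++) {
        off_t page = rand_r(&seed) % npages;          /* uniform-random page */
        pread(fd, buf, PAGE, page * PAGE);
    }
    free(buf);
    return NULL;
}

int main(void) {
    fd = open("/mnt/nvram/testfile", O_RDONLY | O_DIRECT);   /* hypothetical */
    npages = lseek(fd, 0, SEEK_END) / PAGE;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(uintptr_t)(t + 1));
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("issued %ld reads\n", (long)NTHREADS * IOS_PER_THREAD);
    close(fd);
    return 0;
}
```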
LRIOT system setup
Test platform:
- 16-core AMD 8356 Opteron system @ 2.3 GHz
- 64 GiB of DRAM
- RHEL 6, Linux kernel 2.6.32
- 3× 80 GiB SLC NAND Flash Fusion-io ioDrive PCIe 1.1 x4 cards, striped RAID 0

Benchmark:
- uniform random I/O pattern
- 6.4 million reads (unique pages) → 24 GiB working set
- 128 GiB file
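(For context: assuming the standard 4 KiB page size, 6.4 million unique pages × 4 KiB ≈ 24.4 GiB, consistent with the stated 24 GiB working set.)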
Read-only LRIOT benchmark
Linux mmap:
- unconstrained performs well
- drops dramatically with 8 GiB of page cache

DI-MMAP:
- much better with a fixed-size buffer
- only loses 15% performance vs. direct I/O with a 1 GiB buffer

[Figure: "LRIOT read-only: 3x Fusion-io" - IOPS vs. number of threads (0-250) for (0) Direct I/O, (1) Memory-mapped I/O (unconstrained), (2) Memory-mapped I/O 8 GiB buffer, (3) DI-MMAP I/O 4 GiB buffer, (4) DI-MMAP I/O 1 GiB buffer.]
Write-only LRIOT benchmark
DI-MMAP:
- on par with unconstrained Linux mmap
- > 2× Linux mmap with an 8 GiB page cache
- does not match the performance of direct I/O (subject to further investigation)

[Figure: "LRIOT write-only: 3x Fusion-io" - IOPS vs. number of threads (0-250) for (0) Direct I/O, (1) Memory-mapped I/O (unconstrained), (2) Memory-mapped I/O 8 GiB buffer, (3) DI-MMAP I/O 4 GiB buffer, (4) DI-MMAP I/O 1 GiB buffer.]
Microbenchmarks system setup
Test platform:
- 8-core AMD 2378 Opteron system @ 2.4 GHz
- 16 GiB of DRAM
- 2× 200 GiB SLC NAND Flash Virident tachIOn Drive PCIe 1.1 x8

Benchmarks:
1. Binary search on sorted vector
2. Lookup on Ordered Map (Red-Black Tree)
3. Lookup on Unordered Map (Hash Table)

- database size ranged from ~112 GiB to ~135 GiB
- each micro-benchmark issued 2^20 queries
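As a sketch of the first micro-benchmark's access pattern (my illustration, not the benchmark code; the path and key type are hypothetical): a sorted array of 64-bit keys on Flash is memory-mapped and probed with an ordinary binary search, so each probe that misses the page buffer becomes a random page fault.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Binary search directly over the memory-mapped file: each probe touches one
 * page, so probes that miss the page buffer become random page faults. */
static int64_t bsearch_mmap(const uint64_t *keys, int64_t n, uint64_t target) {
    int64_t lo = 0, hi = n - 1;
    while (lo <= hi) {
        int64_t mid = lo + (hi - lo) / 2;
        if (keys[mid] == target) return mid;
        if (keys[mid] < target)  lo = mid + 1;
        else                     hi = mid - 1;
    }
    return -1;
}

int main(void) {
    int fd = open("/mnt/flash/sorted_keys.bin", O_RDONLY);   /* hypothetical */
    off_t bytes = lseek(fd, 0, SEEK_END);
    const uint64_t *keys = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
    int64_t n = bytes / (int64_t)sizeof(uint64_t);

    printf("index of 12345: %lld\n", (long long)bsearch_mmap(keys, n, 12345));

    munmap((void *)keys, bytes);
    close(fd);
    return 0;
}
```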
Microbenchmarks: BST and Ordered Map

DI-MMAP:
- significantly exceeds the performance of Linux mmap when each is constrained to an equal amount of buffering
- approaches the performance of mmap with no memory constraint

[Figure: "Binary Search on Sorted Vector: 2x Virident" and "Lookup on Ordered Map (Red-Black Tree): 2x Virident" - lookups per second vs. number of threads (0-250) for (0) Memory-mapped I/O (unconstrained), (1) Memory-mapped I/O 8 GiB buffer, (2) Memory-mapped I/O 4 GiB buffer, (3) DI-MMAP I/O 4 GiB buffer.]
Microbenchmarks: Unordered Map

DI-MMAP:
- significantly exceeds the performance of Linux mmap when each is constrained to an equal amount of buffering
- approaches the performance of mmap with no memory constraint

[Figure: "Lookup on Unordered Map (Hash Table): 2x Virident" - lookups per second vs. number of threads (0-250) for (0) Memory-mapped I/O (unconstrained), (1) Memory-mapped I/O 8 GiB buffer, (2) Memory-mapped I/O 4 GiB buffer, (3) DI-MMAP I/O 4 GiB buffer.]
Metagenomic Search & Classification
Metagenomics:
- sequencing of heterogeneous genetic fragments
- fragments (a.k.a. reads) may be derived from many organisms

The application queries a database of genetic markers called k-mers:
- length-k sequences over a DNA, RNA, or protein alphabet
- k-mer database stored in Flash storage
- access patterns to the datasets are extremely random
- classification requires a global view of the reference database

Two tests:
- k-mer lookup (sketched below)
- sample classification
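The database layout is not described in the slides, so the following is a purely hypothetical sketch of the k-mer lookup access pattern: an 18-mer (the k = 18 setting from the system-setup slide) is packed into a 64-bit key at 2 bits per base and probed against an open-addressed hash table stored in a memory-mapped file on Flash, which is why the accesses are effectively random.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define K 18

struct slot { uint64_t kmer; uint32_t marker_id; };   /* invented record layout */

static uint64_t base_code(char b) {
    switch (b) { case 'A': return 0; case 'C': return 1;
                 case 'G': return 2; default:  return 3; /* 'T' */ }
}

static uint64_t encode_kmer(const char *s) {
    uint64_t key = 0;
    for (int i = 0; i < K; i++) key = (key << 2) | base_code(s[i]);
    return key;   /* 18 bases x 2 bits = 36 bits */
}

int main(void) {
    int fd = open("/mnt/flash/kmer_table.bin", O_RDONLY);   /* hypothetical */
    off_t bytes = lseek(fd, 0, SEEK_END);
    const struct slot *table = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
    uint64_t nslots = bytes / sizeof(struct slot);

    uint64_t key = encode_kmer("ACGTACGTACGTACGTAC");
    /* Linear probing: consecutive probes land on effectively random pages. */
    for (uint64_t p = 0, i = key % nslots; p < nslots; p++, i = (i + 1) % nslots) {
        if (table[i].kmer == 0)   { puts("not found"); break; }
        if (table[i].kmer == key) { printf("marker id: %u\n", table[i].marker_id); break; }
    }

    munmap((void *)table, bytes);
    close(fd);
    return 0;
}
```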
Metagenomics Search & Classification system setup
Test platform:
- 4-socket, 40-core Intel E7 4850 @ 2 GHz
- 1 TiB DRAM
- Linux kernel 2.6.32 (Red Hat Enterprise 6)
- 2× Fusion-io 1.2 TB ioDrive2 cards, PCIe 2.0 x4
- RAID 0, block size of 4 KiB
- 16 GiB DRAM available for buffer cache

Application:
- k = 18
- database size is 635 GiB
K-mer lookup
[Figure: K-mers per second vs. number of threads (0-350) for DI-MMAP and standard-fs-mmap.]

Peak performance:
- 16 threads with Linux mmap
- 240 threads for DI-MMAP

Lookups per second with DI-MMAP are 4.92× higher than with Linux mmap.
Metagenomic Sample Classification
[Figure: Bases per second for input sets (metagenomes) SRX, DRR, and ERR, comparing FS_MMAP and DI-MMAP at 16 and 160 threads.]

Near-peak performance:
- 16 threads for Linux mmap
- 160 threads for DI-MMAP

Performance advantage of DI-MMAP vs. Linux mmap:
- 4.88× for the SRX input set
- 3.66× for the ERR input set

Performance varies with:
- % of redundant k-mers
- diversity of the metagenome (e.g. ERR)
Conclusions
The data-intensive memory-map (DI-MMAP) runtime:
1. provides scalable, out-of-core performance for data-intensive applications
2. allows algorithm performance to increase with increased concurrency
3. does not significantly degrade in performance with a smaller buffer size

DI-MMAP:
- provides a viable solution for scalable out-of-core algorithms
- offloads the explicit buffering requirements from the application to the runtime
- allows the application to access its external data through a simple load/store interface
- hides much of the complexity of data movement
- approaches the raw, peak performance of direct I/O
Thank You!
Questions?
Open source release is in progress:
https://computation.llnl.gov/casc/dcca-pub/dcca/Data-centric_architecture.html