Transcript of Flash Memory for Full-Throttle GPU Acceleration - GTC...
FLASH MEMORY FOR FULL-THROTTLE
GPU ACCELERATION
Vincent Brisebois
• 12 years at Autodesk Media & Entertainment
Tech Support / Product Specialist / Product Designer
• Member of the Visual Effects Society Technology Council
• 2 years at Fusion-io
Entertainment Business Development
Performance Computing Industry Manager
THE DATA SUPPLY PROBLEM LEADS TO IDLE
CAPACITY
According to Moore's Law, processing performance doubles every 18 months
[Chart: relative performance of CPUs, memory, and storage from 1985 to 2010, showing a growing performance gap as processors outpace memory and storage.]
FOUNDER & CEO – DAVID FLYNN
TECHNOLOGY ENABLERS
• Flash memory
• Software-enabled reprogrammable controllers
• PCIe ecosystem
ARCHITECTURAL HIGHLIGHTS
Reliability
• N+1 redundancy: like having a RAID between chips, without the capacity sacrifice (see the parity sketch after this slide)
• Over-provisioning: reserve space for handling individual cells dying; the reserve is adjustable if higher write performance is needed
• High ECC strength: 72-bit error correction
[Card photo with callouts: NAND flash chips, heat sink / FPGA controller, parity chip]
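A minimal sketch of the idea behind N+1 redundancy at the chip level: one extra chip holds the XOR parity of the data chips, so the contents of any single failed chip can be rebuilt from the survivors. This illustrates the concept only and is not Fusion-io's controller logic; the chip count and C++ helper names are assumptions.

  #include <array>
  #include <cstdint>
  #include <vector>

  // Illustration only: XOR parity across N data "chips" plus one parity "chip".
  constexpr int kDataChips = 4;   // assumed chip count for the example

  using Stripe = std::array<std::vector<uint8_t>, kDataChips + 1>;

  // Fill the last element of the stripe with the XOR of all data chips.
  void computeParity(Stripe& s) {
      auto& parity = s[kDataChips];
      parity.assign(s[0].size(), 0);
      for (int chip = 0; chip < kDataChips; ++chip)
          for (size_t i = 0; i < parity.size(); ++i)
              parity[i] ^= s[chip][i];
  }

  // Rebuild one failed chip by XOR-ing every surviving chip, parity included.
  std::vector<uint8_t> rebuildChip(const Stripe& s, int failed) {
      std::vector<uint8_t> out(s[0].size(), 0);
      for (int chip = 0; chip <= kDataChips; ++chip) {
          if (chip == failed) continue;
          for (size_t i = 0; i < out.size(); ++i)
              out[i] ^= s[chip][i];
      }
      return out;
  }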
NETWORKED STORAGE DATA SUPPLY
CHAIN FROM APPLICATION TO FLASH
9 Intermediary components required
Each one adds access delay, cost, and complexity, and lowers reliability (especially the super capacitors)
Requests must do a round trip touching everything TWICE…
Application → Server Processor → Network Adapter → Network Switch → Network Adapter → Storage Appliance Processor → Disk RAID Controller → SAS/SATA Bus and Protocol → SSD Embedded CPU → SSD RAM (Battery / Super Capacitors) → NAND Flash
SSD DATA SUPPLY CHAIN
FROM APPLICATION TO FLASH
5 Intermediary components required
Each one adds access delay, cost, and complexity, and lowers reliability (especially the super capacitors)
Application → Server Processor → Disk RAID Controller → SAS/SATA Bus and Protocol → SSD Embedded CPU → SSD RAM (Battery / Super Capacitors) → NAND Flash
FUSION-IO DATA SUPPLY CHAIN
FROM APPLICATION TO FLASH
0 Intermediary components required
No need for super capacitors because data is not "buffered" in DRAM
Application → Server Processor → NAND Flash
FUSION-IO FIRST MOVER MILESTONES
[Timeline, 2006-2012]: mission to consolidate memory and storage; ioMemory technology unveiled; first products launched; 1 million IOPS (IBM Quicksilver); Dell strategic investment; HP OEMs products; IBM OEMs products; Samsung strategic investment; Dell OEMs products; VSL introduced; IPO on NYSE; ioTurbine acquired; ioDrive2 announced; 50+ petabytes shipped; 1 billion IOPS; 2,500 customers; >120 channel and alliance partners; ioFX; ioMemory SDK
CHIEF SCIENTIST – STEVE WOZNIAK
PERFORMANCE COUNTS
COMPREHENSIVE CUSTOMER SUCCESS
30+ case studies at http://fusionio.com/casetudies
Industries: Financials, Manufacturing, Government, Web Technology, Retail
Customer results include 40x faster data warehouse queries, 15x query processing throughput, 30x faster database replication, 5x faster data analysis, and 15x faster queries (customers include "the world's leading Q&A site").
FUSION-IO ACCELERATES
Workloads accelerated: analytics, search (Oracle Text), messaging (MQ), databases (Informix), virtualization (KVM), HPC (GPFS), big data, security/logging, collaboration (Lotus), development, web (LAMP), caching, workstation
IOMEMORY PLATFORM
FUSION IOFX MEMORY TIER
▸ Tuned for sustained performance in multithreaded applications
▸ Work on 2K, 4K and 5K digital content interactively, in full resolution
▸ Manipulate stereoscopic content in real-time
▸ Accelerate video and image editing and compositing
▸ Speed video playback
▸ Powerful throughput to maximize GPU processing
▸ Simplify and accelerate encoding and transcoding
▸ Accelerate compiling code for software programmers
ioFX: 420 GB capacity, 1.4 GB/s read, 700 MB/s write, 42 µs latency, QDP MLC NAND
IOMEMORY PERFORMANCE
Capacity             365GB         Duo 2.4TB     400GB         Duo 1.2TB     ioFX
NAND Type            MLC           MLC           SLC           SLC           MLC
Read Bandwidth       910 MB/s      3.0 GB/s      1.4 GB/s      3.0 GB/s      1.5 GB/s
Write Bandwidth      590 MB/s      2.5 GB/s      1.3 GB/s      2.6 GB/s      700 MB/s
Read IOPS (Seq)      415,000       892,000       351,000       702,000       -
Write IOPS (Seq)     535,000       935,000       511,000       937,000       -
Read IOPS (Rand)     137,000       285,000       -             -             -
Write IOPS (Rand)    535,000       725,000       -             -             -
Read Latency         68 µs         68 µs         47 µs         47 µs         68 µs
Write Latency        15 µs         15 µs         15 µs         15 µs         15 µs
Bus Interface        PCIe 2.0 x4   PCIe 2.0 x8   PCIe 2.0 x4   PCIe 2.0 x8   PCIe 2.0 x4
FLASH MEMORY EVOLUTION
[Diagram: five stages of flash access, from legacy SSDs to native memory semantics]
• Legacy SSDs: Application → OS block I/O → file system → block layer → SAS/SATA, network, RAID controller → flash (read/write)
• ioMemory as block device: Application → OS block I/O → file system → block layer → VSL (Virtual Storage Layer) → flash (read/write)
• ioMemory as transparent cache: Application → OS block I/O → file system → block layer → directCache → VSL → flash (read/write)
• ioMemory with direct-access I/O: Application (with open source extensions) → direct-access I/O API family → directFS, a native file system service → VSL → flash (read/write)
• ioMemory with memory semantics: Application (with open source extensions) → memory semantics API family → VSL → flash (load/store)
IOMEMORY AS BLOCK DEVICE: DIRECT I/O DEMO
SYSTEM DIAGRAM
QUADRO DUAL COPY ENGINE
OPENGL PIXEL BUFFER OBJECTS (PBO)
File system direct I/O:
  file_handle = CreateFile(LPCSTR(video_file), GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);

GPU DMA-able system buffer:
  glGenBuffers(1, &buffer_handle);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
  glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
READ FROM IOMEMORY
Map PBO for write:
  glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer_handle);
  void* pbomem = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_WRITE_ONLY);
  glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

Read from ioMemory:
  BOOL ret = ReadFile(file_handle, pbomem, size, &num_bytes_read, NULL);

Unmap PBO for DMA:
  glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer_handle);
  glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
  glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
TRANSFER TO GPU
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, buffer_handle);
  glBindTexture(GL_TEXTURE_2D, texture_handle);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA,
               GL_UNSIGNED_BYTE, 0);
  glBindTexture(GL_TEXTURE_2D, 0);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

Barrier sync DMA:
  GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
  glClientWaitSync(fence, 0, 0);
  glDeleteSync(fence);
PIPELINE
Read from ioMemory → DMA to GPU → Draw from GPU
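Putting the preceding snippets together, here is a minimal sketch of the per-frame loop under that pipeline: read a frame from the flash volume with unbuffered I/O into one PBO while the other PBO's contents are uploaded to a texture and drawn. The double-buffering, the readFrameIntoPBO helper, and the drawTexturedQuad call are assumptions for illustration; the original demo source is not in the deck.

  // Sketch only: double-buffered PBO streaming from a file opened with
  // FILE_FLAG_NO_BUFFERING (reads must be sector-aligned and sector-sized).
  GLuint pbo[2], tex;
  HANDLE file_handle;          // opened as on the previous slides
  size_t frame_bytes;          // multiple of the volume's sector size

  void readFrameIntoPBO(GLuint pbo_handle) {
      glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo_handle);
      void* dst = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_WRITE_ONLY);
      DWORD got = 0;
      ReadFile(file_handle, dst, (DWORD)frame_bytes, &got, NULL);
      glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
      glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
  }

  void drawLoop(int width, int height, int frames) {
      int cur = 0;
      readFrameIntoPBO(pbo[cur]);                      // prime the pipeline
      for (int f = 1; f <= frames; ++f) {
          int next = cur ^ 1;
          // Kick off the DMA upload of the current frame from its PBO.
          glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[cur]);
          glBindTexture(GL_TEXTURE_2D, tex);
          glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                       GL_BGRA, GL_UNSIGNED_BYTE, 0);
          glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
          // Overlap: read the next frame from ioMemory while the GPU draws.
          if (f < frames) readFrameIntoPBO(pbo[next]);
          // drawTexturedQuad(tex);  // application draw call (assumed)
          cur = next;
      }
  }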
CUDA GPU DIRECT
• Copy data directly to/from CUDA pinned host memory (avoids one copy)
• Peer-to-peer transfers between GPUs (utilizes PCIe DMA)
• Peer-to-peer memory access between GPUs (NUMA from within CUDA kernels)
• Pipeline transfers for GP-GPU: read from ioMemory, write to ioMemory
• Unified Virtual Address Space
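A minimal sketch of the peer-to-peer path mentioned above, using standard CUDA runtime calls (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, cudaMemcpyPeerAsync); the buffer size and device IDs are assumptions for illustration, not code from the talk.

  #include <cuda_runtime.h>

  // Sketch: copy a buffer from GPU 0 to GPU 1 over PCIe without staging in host memory.
  void p2pCopy(size_t bytes) {
      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, 1, 0);       // can device 1 reach device 0?
      void *src = nullptr, *dst = nullptr;

      cudaSetDevice(0);
      cudaMalloc(&src, bytes);
      cudaSetDevice(1);
      cudaMalloc(&dst, bytes);
      if (canAccess) cudaDeviceEnablePeerAccess(0, 0); // enable P2P from device 1 to 0

      cudaStream_t stream;
      cudaStreamCreate(&stream);
      // DMA directly between the two GPUs (falls back to host staging if P2P is unavailable).
      cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
      cudaStreamSynchronize(stream);

      cudaStreamDestroy(stream);
      cudaFree(dst);
      cudaSetDevice(0);
      cudaFree(src);
  }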
CUDA
OS-pinned CUDA buffer:
  // Alloc OS-pinned (page-locked) memory
  cudaHostAlloc((void**)&h_odata, memSize,
                (wc) ? cudaHostAllocWriteCombined : 0);

Read from ioMemory:
  fd = open("/mnt/cudaMemory", O_RDWR | O_DIRECT);
  if (fd >= 0) {
      rc = read(fd, h_odata, memSize);
  }

Copy (DMA) to GPU:
  cudaMemcpyAsync(d_idata, h_odata, memSize,
                  cudaMemcpyHostToDevice, stream);
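One caveat worth noting when combining O_DIRECT with pinned buffers: direct I/O typically requires the buffer address, file offset, and transfer size to be aligned to the device's sector size. cudaHostAlloc returns page-aligned memory, so the address is normally fine, but the transfer size should be rounded up. A hedged sketch of that rounding (the 4096-byte sector size is an assumption; query the device for the real value):

  #include <cstddef>

  // Round a transfer size up to a multiple of the assumed sector size so an
  // O_DIRECT read() is not rejected with EINVAL.
  constexpr size_t kSectorSize = 4096;   // assumption for illustration

  size_t alignUpForDirectIO(size_t bytes) {
      return (bytes + kSectorSize - 1) & ~(kSectorSize - 1);
  }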
PROGRAMMING PATTERNS
• Pipelines
• CPU threads
• CUDA streams
• Ring buffers
• Parallel DMA
• Direct I/O
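As an illustration of those patterns working together, here is a minimal sketch of a ring buffer of pinned host buffers, each with its own CUDA stream, so that O_DIRECT reads from the flash device overlap with host-to-device DMA. The slot count, chunk size, and the commented-out processFrame kernel are assumptions, not code from the talk.

  #include <cuda_runtime.h>
  #include <fcntl.h>
  #include <unistd.h>

  // Sketch: ring of pinned buffers feeding CUDA streams. Error handling elided.
  constexpr int    kSlots = 4;            // assumed ring depth
  constexpr size_t kChunk = 8 << 20;      // assumed 8 MiB chunks (sector-aligned)

  struct Slot {
      void*        host;                  // pinned (DMA-able) host buffer
      void*        dev;                   // device-side destination
      cudaStream_t stream;
  };

  void streamFile(const char* path, int frames) {
      Slot ring[kSlots];
      for (auto& s : ring) {
          cudaHostAlloc(&s.host, kChunk, cudaHostAllocDefault);
          cudaMalloc(&s.dev, kChunk);
          cudaStreamCreate(&s.stream);
      }

      int fd = open(path, O_RDONLY | O_DIRECT);
      for (int f = 0; f < frames && fd >= 0; ++f) {
          Slot& s = ring[f % kSlots];
          cudaStreamSynchronize(s.stream);           // slot's previous DMA finished
          ssize_t rc = read(fd, s.host, kChunk);     // direct I/O from ioMemory
          if (rc <= 0) break;
          cudaMemcpyAsync(s.dev, s.host, (size_t)rc,
                          cudaMemcpyHostToDevice, s.stream);
          // processFrame<<<grid, block, 0, s.stream>>>(s.dev);  // assumed kernel
      }

      for (auto& s : ring) {
          cudaStreamSynchronize(s.stream);
          cudaStreamDestroy(s.stream);
          cudaFree(s.dev);
          cudaFreeHost(s.host);
      }
      if (fd >= 0) close(fd);
  }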
But ioMemory is much more than a block device
It’s non-volatile memory with native access semantics…
EXPLOITING NATIVE CHARACTERISTICS OF IOMEMORY
1. Native log-append writes incorporate copy-on-write basics
2. Native block mapping and allocation incorporate file system basics
3. Native large virtual address space incorporates sparse semantics
4. Native storage methods incorporate key-value store basics
SDK INTRO
Fusion-io Software Development Kit enables native flash memory access:
• directPrimitives API, including Atomic Writes and the MySQL InnoDB extension
• directKey-Value Store API *
• directFS, native file-access layer *
• Auto-Commit Memory API
• Extended Memory API
FLASH MEMORY EVOLUTION: NATIVE API ACCESS
[Diagram: the direct-access I/O stage expanded into its API families]
• Legacy SSDs: Application → OS block I/O → file system → block layer → SAS/SATA, network, RAID controller → flash (read/write)
• ioMemory as block device: Application → OS block I/O → file system → block layer → VSL (Virtual Storage Layer) → flash (read/write)
• ioMemory as transparent cache: Application → OS block I/O → file system → block layer → directCache → VSL → flash (read/write)
• ioMemory with direct-access I/O: Application (with open source extensions) → direct I/O Primitives, directKey-Value Store API, directCache API → directFS, a native file system service → VSL → flash (read/write)
KEY-VALUE STORE API LIBRARY
[Diagram: software stack]
Application → Key-Value API and Library → VSL (dynamic provisioning, block allocation, logging, etc.)
Operations: lookup/exists(), atomic write(), atomic delete (PTRIM), coordinated garbage collection
Citrusleaf NoSQL Demo – April 2012
400,000 transactions/second on a single server
CUDA & KEY-VALUE STORE
OS-pinned CUDA buffer:
  // Alloc OS-pinned (page-locked) memory
  cudaHostAlloc((void**)&h_odata, memSize,
                (wc) ? cudaHostAllocWriteCombined : 0);

KeyGet from ioMemory:
  rc = directKeyGet("key", h_odata, &memSize);

Copy (DMA) to GPU:
  cudaMemcpyAsync(d_idata, h_odata, memSize,
                  cudaMemcpyHostToDevice, stream);
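One practical follow-up, not shown on the slide: cudaMemcpyAsync only enqueues the transfer, so the host buffer must stay valid and the stream must be synchronized (or a kernel launched on the same stream) before the data is consumed. A hedged sketch of that pattern, reusing the slide's names (the kernel name is an assumption):

  // Enqueue the upload, then either launch work on the same stream
  // (it runs after the copy completes) or block until the copy is done.
  cudaMemcpyAsync(d_idata, h_odata, memSize, cudaMemcpyHostToDevice, stream);
  // processKernel<<<grid, block, 0, stream>>>(d_idata);  // assumed kernel name
  cudaStreamSynchronize(stream);   // only needed if the host must wait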
DIRECTFS – NATIVE FILE SERVICES LAYER
[Diagram: software stack]
Application → directFS namespace (file/offset → sparse address) → VSL (dynamic provisioning, block allocation, logging, etc.)
Operations: lookup/exists(), atomic write(), atomic delete (PTRIM)
CUDA & DIRECTFS
OS-pinned CUDA buffer:
  // Alloc OS-pinned (page-locked) memory
  cudaHostAlloc((void**)&h_odata, memSize,
                (wc) ? cudaHostAllocWriteCombined : 0);

Read from ioMemory:
  fd = open("/mnt/cudaMemory", O_RDWR | O_DIRECT);
  if (fd >= 0) {
      rc = read(fd, h_odata, memSize);
  }

Copy (DMA) to GPU:
  cudaMemcpyAsync(d_idata, h_odata, memSize,
                  cudaMemcpyHostToDevice, stream);
FLASH MEMORY EVOLUTION: NATIVE API ACCESS
[Diagram: the memory-semantics stage expanded into its API families]
• Legacy SSDs: Application → OS block I/O → file system → block layer → SAS/SATA, network, RAID controller → flash (read/write)
• ioMemory as block device: Application → OS block I/O → file system → block layer → VSL (Virtual Storage Layer) → flash (read/write)
• ioMemory as transparent cache: Application → OS block I/O → file system → block layer → directCache → VSL → flash (read/write)
• ioMemory with direct-access I/O: Application (with open source extensions) → direct I/O Primitives, directKey-Value Store API, directCache API → directFS, a native file system service → VSL → flash (read/write)
• ioMemory with memory semantics: Application (with open source extensions) → Extended Memory, Check-pointed Memory, Auto-Commit Memory → VSL → flash (load/store)
CONCLUSION
Early access to ioMemory SDK libraries and technical documentation:
http://developer.fusionio.com
ioMemory SDK Web Seminars (Wednesdays):
• May 2 – directPrimitives API, including Atomic Writes and the MySQL InnoDB extension
• May 9 – directKey-Value Store API
• May 23 – directFS, native file-access layer
• May 30 – Auto-Commit Memory API
• June 6 – Extended Memory API
WHAT WE WANT TO SEE
We encourage you to “Go Native” and engage us in discussion
as to where you want to see the technology grow.
We would love your input.
THANK YOU