
0"

50,000,000"

100,000,000"

150,000,000"

200,000,000"

250,000,000"

0 10 20

throughp

ut([record/sec]

num(records([109(record]

in)core)cpu(18) in)core)cpu(36)in)core)cpu(54) in)core)cpu(72)in)core)gpu out)of)core)cpu(18)out)of)core)cpu(36) out)of)core)cpu(54)out)of)core)cpu(72) out)of)core)gpuxtr2sort

Out-of-core Sorting Acceleration using GPU and Flash NVM

Hitoshi Sato†‡, Ryo Mizote†‡, Satoshi Matsuoka†‡
† Tokyo Institute of Technology, ‡ CREST, JST

Introduction

Motivation:
✓ How to overcome the memory capacity limitation?
✓ How to offload bandwidth-oblivious operations onto low-throughput devices?

Proposal: xtr2sort (Extreme External Sort)

Experiment  

Summary

1. Unsorted records are located on Flash NVM.
2. Divide the input records into c chunks to fit the GPU memory capacity.
3. Sort the chunks on the GPU, in a pipeline with the data transfers (a simplified sketch follows the overview figure below).
4. Partition each of the chunks into c buckets using c-1 randomly sampled splitters.
5. Swap the buckets between chunks.
6. Sort each of the chunks on the GPU, in a pipeline with the data transfers.
7. Sorted records are placed on Flash NVM.

[Figure: algorithm overview. Unsorted records on NVM flow through in-core GPU sorting of c chunks, partitioning with c-1 splitters into c buckets, a CPU-side swap of buckets between chunks, and a second in-core GPU sorting pass, yielding sorted records.]
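As a concrete illustration of steps 2-3, here is a minimal, synchronous sketch (not the authors' implementation; file names, the chunk size, and the omission of steps 4-6 are assumptions) that chunks a file of int64_t records to fit device memory and sorts each chunk in-core with Thrust:

```cpp
// Simplified sketch of steps 2-3: chunked in-core GPU sorting with Thrust.
// Synchronous and unpipelined; splitter sampling and bucket exchange
// (steps 4-6) are omitted. File names and chunk size are illustrative.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const size_t kChunkRecords = size_t(1) << 26;   // 64M int64_t = 512 MiB per chunk
    std::vector<int64_t> host(kChunkRecords);
    FILE *in  = fopen("unsorted.bin", "rb");        // hypothetical input on Flash NVM
    FILE *out = fopen("sorted_runs.bin", "wb");     // sorted runs, one per chunk
    if (!in || !out) return 1;

    size_t n;
    while ((n = fread(host.data(), sizeof(int64_t), kChunkRecords, in)) > 0) {
        thrust::device_vector<int64_t> dev(host.begin(), host.begin() + n);
        thrust::sort(dev.begin(), dev.end());       // in-core GPU sort (step 3)
        thrust::copy(dev.begin(), dev.end(), host.begin());
        fwrite(host.data(), sizeof(int64_t), n, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}
```

Each pass through the loop handles one of the c chunks; xtr2sort additionally overlaps the file I/O and the PCIe transfers, as described next.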

• Sample-Sort-Based Out-of-core Sorting Approach [1][2] for Deep Memory Hierarchy Systems w/ GPU and Flash NVM
• I/O Chunking to fit GPU Memory Capacity in order to exploit Massive Parallelism and Memory Bandwidth of GPU
✓ Employ Asynchronous Data Transfers using CUDA Streams and cudaMemcpyAsync() between CPU and GPU (see the first sketch below)
✓ Page-locked Memory (a.k.a. Pinned Memory) Volumes required
• Pipeline-based Latency Hiding to overlap File I/O between Flash NVM and CPU using Linux Asynchronous I/O System Calls (see the second sketch below)
✓ Pros: fully overlapped READ/WRITE File I/O
✓ Cons: Direct I/O required, e.g., O_DIRECT Flag; Aligned File Offset, Memory Buffer, and Transfer Size
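A minimal sketch of the stream/pinned-memory pattern the first bullet describes, with a trivial kernel standing in for the per-chunk sort (all names, sizes, and the kernel itself are illustrative, not the authors' code):

```cpp
// Overlap H2D / EX / D2H across chunks using three CUDA streams, events for
// per-chunk ordering, and pinned host buffers (required for true overlap).
#include <cuda_runtime.h>
#include <cstdio>

// trivial stand-in for the per-chunk sort (EX stage)
__global__ void square(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

int main() {
    const int kChunks = 4;
    const int kN = 1 << 20;
    const size_t kBytes = kN * sizeof(int);

    cudaStream_t s[3];                       // one stream each for H2D, EX, D2H
    for (int i = 0; i < 3; ++i) cudaStreamCreate(&s[i]);

    int *h[kChunks], *d[kChunks];
    cudaEvent_t done_h2d[kChunks], done_ex[kChunks];
    for (int c = 0; c < kChunks; ++c) {
        cudaMallocHost(&h[c], kBytes);       // page-locked (pinned) host memory
        cudaMalloc(&d[c], kBytes);
        cudaEventCreate(&done_h2d[c]);
        cudaEventCreate(&done_ex[c]);
        for (int i = 0; i < kN; ++i) h[c][i] = i;
    }

    for (int c = 0; c < kChunks; ++c) {
        cudaMemcpyAsync(d[c], h[c], kBytes, cudaMemcpyHostToDevice, s[0]);
        cudaEventRecord(done_h2d[c], s[0]);
        cudaStreamWaitEvent(s[1], done_h2d[c], 0);  // EX waits on this chunk's H2D
        square<<<(kN + 255) / 256, 256, 0, s[1]>>>(d[c], kN);
        cudaEventRecord(done_ex[c], s[1]);
        cudaStreamWaitEvent(s[2], done_ex[c], 0);   // D2H waits on this chunk's EX
        cudaMemcpyAsync(h[c], d[c], kBytes, cudaMemcpyDeviceToHost, s[2]);
    }
    cudaDeviceSynchronize();                 // drain the pipeline
    printf("h[0][2] = %d\n", h[0][2]);       // 4
    return 0;
}
```

With one stream per stage and events ordering the stages of each chunk, chunk i+1's H2D can proceed while chunk i is still computing.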
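And a minimal sketch of the second bullet's Linux asynchronous I/O under the direct-I/O constraints it lists (the file name, queue depth, and 4 KiB alignment are assumptions; error handling is abbreviated; link with -laio):

```cpp
// O_DIRECT read via Linux native AIO: the file offset, buffer address, and
// transfer size must all satisfy the device's alignment requirement.
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t kAlign = 4096, kSize = 1 << 20;    // assumed 4 KiB alignment
    void *buf = nullptr;
    if (posix_memalign(&buf, kAlign, kSize)) return 1;  // aligned memory buffer

    int fd = open("records.bin", O_RDONLY | O_DIRECT);  // hypothetical file on NVM
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    io_setup(8, &ctx);                              // arbitrary queue depth
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, kSize, 0);          // offset 0: aligned
    io_submit(ctx, 1, cbs);                         // returns immediately

    // ... overlap other work here; xtr2sort overlaps GPU transfers and sorting ...

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, nullptr);          // wait for completion
    printf("read %ld bytes\n", (long)ev.res);
    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```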

• Sorting is a Key Building Block for Big Data Applications
✓ e.g., Database Management Systems, Programming Frameworks, Supercomputing Applications, etc.
✓ Large Memory Capacity Requirement
• Towards Future Computing Architectures
✓ Dropping Available Memory Capacity per Core for Achieving Efficient Bandwidth by Increasing Parallelism, Heterogeneity, and Density of Processors
➢ e.g., Multi-core CPUs, Many-core Accelerators
➢ Post-Moore Era
✓ Deepening Memory/Storage Architectures
➢ Device Memory on Many-core Accelerators
➢ Host Memory on Compute Nodes
➢ Semi-external Memory connected w/ Compute Nodes, such as Non-volatile Memory (NVM), Storage Class Memory (SCM)

[Figure: 7-stage pipeline timeline. Each chunk flows through RD → R2H → H2D → EX → D2H → H2W → WR; chunks i, i+1, ..., i+6 each start one stage behind the previous one, so all seven stages stay busy across the c chunks over time.]

[Figure: 5-stage pipeline timeline. Each chunk flows through RD → H2D → EX → D2H → WR; chunks i, i+1, ..., i+4 each start one stage behind the previous one, so all five stages stay busy across the c chunks over time.]

5-Stage Pipeline Approach (Regular Chunk Size for Aligned File Offset, Memory Buffer, Transfer Size)
✓ 3 CUDA Streams for H2D, EX, D2H
✓ Asynchronous I/O for RD, WR
✓ 2 READ Pinned Buffers for RD, H2D, and 2 WRITE Pinned Buffers for D2H, WR

7-Stage Pipeline Approach (Irregular Chunk Size depending on Sampling (Splitting) Results)
✓ 3 CUDA Streams for H2D, EX, D2H
✓ Asynchronous I/O for RD, WR
✓ 2 POSIX Threads for R2H, H2W
✓ 2 READ Aligned Buffers for RD, H2D, 2 WRITE Aligned Buffers for D2H, WR, and 4 Device Pinned Buffers for R2H, H2D, D2H, H2W

Stage legend:
RD: READ I/O from NVM
WR: WRITE I/O to NVM
R2H: memcpy from Host (Aligned) to Host (Pinned)
H2W: memcpy from Host (Pinned) to Host (Aligned)
H2D: memcpy from Host (Pinned) to Device
D2H: memcpy from Device to Host (Pinned)
EX: Compute on Device
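A minimal sketch of the staging copy that makes the 7-stage pipeline necessary: a POSIX thread copies from the O_DIRECT-aligned read buffer, at an arbitrary (unaligned) bucket offset, into a pinned buffer so the subsequent H2D transfer can start from page-locked memory (buffer sizes and the offset are illustrative, not the authors' code):

```cpp
// Sketch of the R2H stage: stage bytes from an aligned (direct-I/O) buffer
// into a pinned buffer on a POSIX thread, so H2D can run from pinned memory.
#include <cuda_runtime.h>
#include <pthread.h>
#include <cstdlib>
#include <cstring>

struct R2HArgs { const char *aligned_src; char *pinned_dst; size_t off, len; };

static void *r2h_worker(void *p) {
    R2HArgs *a = static_cast<R2HArgs *>(p);
    std::memcpy(a->pinned_dst, a->aligned_src + a->off, a->len);  // host-to-host copy
    return nullptr;
}

int main() {
    const size_t kBuf = size_t(1) << 20;
    char *aligned_src = nullptr, *pinned_dst = nullptr;
    if (posix_memalign(reinterpret_cast<void **>(&aligned_src), 4096, kBuf)) return 1;
    cudaMallocHost(&pinned_dst, kBuf);              // page-locked destination

    // bucket boundaries from sampling are not sector-aligned: offset 513 here
    R2HArgs args{aligned_src, pinned_dst, 513, kBuf / 2};
    pthread_t t;
    pthread_create(&t, nullptr, r2h_worker, &args);
    // ... RD / H2D / EX of other chunks can proceed concurrently here ...
    pthread_join(t, nullptr);

    cudaFreeHost(pinned_dst);
    free(aligned_src);
    return 0;
}
```

The H2W stage is the mirror image, staging from pinned memory back into the aligned write buffer before WR.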

Hardware
CPU: Intel Xeon E5-2699 v3 2.30 GHz (18 cores) x 2 sockets, HT enabled
MEM: DDR4-2133, 128 GB
GPU: NVIDIA Tesla K40 w/ 12 GB Mem
NVM: Huawei ES3000 v1 PCIe SSD, 2.4 TB

Software
OS: Linux 3.19.8
Compiler: gcc 4.4.7
CUDA: v7.0
Thrust: v1.8.1
File System: xfs

Comparison using uniformly distributed random records w/ int64_t:
✓ in-core-cpu(n): in-core CPU sorting w/ libstdc++ Parallel Mode using n threads (sketched below)
✓ in-core-gpu: in-core GPU sorting w/ Thrust
✓ out-of-core-cpu(n): same technique as xtr2sort, but only using CPU (same device memory size, n threads)
✓ out-of-core-gpu: same technique as xtr2sort, but only using GPU, no file I/O
✓ xtr2sort: proposed technique
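For reference, a minimal sketch of what the in-core-cpu(n) baseline plausibly looks like, assuming GCC's libstdc++ parallel mode (the record count and seed are illustrative; compile with g++ -fopenmp):

```cpp
// In-core CPU sort of uniformly distributed int64_t records using
// libstdc++ parallel mode with n threads. Illustrative baseline only.
#include <parallel/algorithm>
#include <cstdint>
#include <omp.h>
#include <random>
#include <vector>

int main() {
    std::vector<int64_t> records(size_t(1) << 24);  // 16M records (illustrative)
    std::mt19937_64 rng(42);
    for (auto &r : records) r = static_cast<int64_t>(rng());  // uniform random

    omp_set_num_threads(18);                        // n = 18, as in in-core-cpu(18)
    __gnu_parallel::sort(records.begin(), records.end());
    return 0;
}
```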

Sorting Throughput

[Figure: throughput [record/sec] vs. number of records [10^9 records] for in-core-cpu(18/36/54/72), in-core-gpu, out-of-core-cpu(18/36/54/72), out-of-core-gpu, and xtr2sort]

Distribution  of  Execution  Time  in  Each  Pipeline  Stage

[Figure: elapsed time [ms] (0 to 500) of each pipeline stage: RD, R2H, H2D, EX, D2H, H2W, WR]

• xtr2sort: Sample-Sort-Based Out-of-core Sorting for Deep Memory Hierarchy Systems w/ GPU and Flash NVM
• Experimental results show that xtr2sort achieves up to:
✓ x64 larger record size than in-core GPU sorting
✓ x4 larger record size than in-core CPU sorting
✓ x2.16 faster than out-of-core CPU sorting using 72 threads
• The I/O chunking and latency-hiding approach works well for GPU and Flash NVM
• Future work includes performance modeling, power measurement, etc.

In-core GPU Sorting: ~0.4 G records (GPU Memory Capacity Limitation)
In-core CPU Sorting: ~6.4 G records (Host (CPU) Memory Capacity Limitation)
xtr2sort: ~25.6 G records (x64 larger record size than in-core-gpu, x4 larger than in-core-cpu, x2.16 faster)

Outlook: Next-gen NVM devices (NVMe, 3D XPoint, etc.), NVLink etc., Next-gen Accelerators (GPUs) etc.

[1] Peters et al., "Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead", IPDPSW PhD Forum, pp. 1-8, 2010.
[2] Ye et al., "GPUMemSort: A High Performance Graphics Co-processors Sorting Algorithm for Large Scale In-Memory Data", GSTF International Journal on Computing, Vol. 1, No. 2, pp. 23-28, 2011.