Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... ·...

37
Scalable EndtoEnd DataI/O over Enterprise and DataCenter Networks Yufei Ren now with IBM T. J. Watson ResearchCenter graduated from Stony Brook University Prof. Dantong Yu (Advisor ), BNL & Stony Brook University March 16 th , 2016 2015 SPEC Distinguished Dissertation Award

Transcript of Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... ·...

Page 1: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Scalable  End-­‐to-­‐End  Data  I/O  over  Enterprise  and  Data-­‐Center  Networks

Yufei  Rennow  with IBM  T.  J.  Watson  Research  Centergraduated  from  Stony  Brook  University

Prof.  Dantong Yu  (Advisor),  BNL  &  Stony  Brook  University

March  16th,  2016

2015  SPEC  Distinguished  Dissertation  Award

Page 2: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Data  I/O  on  Modern  Hardware

• Bulk  data  movement• 100  Gbps high  throughput

• Across  WANs  with  long  latency

• Storage  data  caching• Advanced  hardware

• Zero-­‐copy  network• Large-­‐scale  asymmetric  memory  layout

• Existing  solutions• TCP/IP  based• GridFTP,  OpenSSH• iSCSI,  NFS

2

WAN

SAN

Network/Secure  App  Group

Computation  Group

ArgonneData  Center

BrookhavenData  Center

Oak  RidgeData  Center

NERSC

Page 3: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

3

ENTERPRISE DEEP LEARNING AUTOGAMINGDATACENTERPRO

VISUALIZATION

WATSON CHAINER THEANO MATCONVNET

TENSORFLOW CNTK TORCH CAFFE

Page 4: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Research  Challenges

• Achieve  zero-­‐copy  in  end-­‐to-­‐end  data  path• Scale  zero-­‐copy  based  network  transmission  in  WANs• Improve  data  locality  in  large-­‐scale  NUMA  systems

3

Our  Goal:  Build systems  for  scalable  end-­‐to-­‐end  data  I/O  and  Understand system  performance  characteristicsEfficient  bulk  data  movement  in  LAN  and  WAN

Effective  cache  design  for  large  scale  memory  systemsPerformance  evaluation  with  advanced  hardware

Page 5: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Thesis  Outline

4

• Software  design  (Ch3)• RDMA-­‐based  data  transfer  protocol (Ch4)• Storage  caching optimization  (Ch5)• End-­‐to-­‐end  performance  optimization (Ch6)

WAN

SAN

Network/Secure  App  Group

Computation  Group

ArgonneData  Center

BrookhavenData  Center

Oak  RidgeData  Center

NERSC

Ch3

Ch4Ch5

Ch6

Page 6: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

End-­‐to-­‐End  Asynchronous  I/OPublications

• Yufei  Ren,  Tan  Li,  Dantong Yu,  Shudong Jin,  “End-­‐to-­‐End  Asynchronous  Processing  for  High  Throughput  Computing”,  under  revision.

• [SC’13]   Yufei  Ren,  Tan  Li,  Dantong Yu,  Shudong Jin,   Thomas  Robertazzi,  “Design  and  Performance  Evaluation  of  NUMA-­‐Aware  RDMA-­‐Based  End-­‐to-­‐End  Data  Transfer  Systems”,  In  Proceedings  of  the  International  Conference  on  High  Performance  Computing,  Networking,  Storage  and  Analysis   (SC   ’13),  Denver,  Colorado,  November  2013.

• [JSS’13]   Yufei  Ren,  Tan  Li,  Dantong Yu,   Shudong Jin,  Thomas  Robertazzi,  “Design  and  Testbed Evaluation  of  RDMA-­‐based  Middleware  for  High-­‐performance  Data  Transfer  Applications”,   Journal   of  Systems  and  Software  (JSS),   Volume  86,  Issue  7,  July  2013,   Pages  1850-­‐1863,   ISSN  0164-­‐1212,  10.1016/j.jss.2013.01.070.

• [SC’12]   Yufei  Ren,  Tan  Li,  Dantong Yu,  Shudong Jin,   Thomas  Robertazzi,  Brian  L.  Tierney,   Eric  Pouyoul,  “Protocols  for  Wide-­‐Area  Data-­‐intensive  Applications:  Design  and  Performance  Issues”,   In  Proceedings  of  the  International  Conference  on  High  Performance  Computing,  Networking,  Storage  and  Analysis   (SC  ’12),   Salt  Lake  City,  Utah,  November  2012.

Page 7: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Classical  Software  Design

• OpenSSH &  HPN-­‐SCP• Heavily  rely  on  data  copy  based  OS  services• Inter-­‐process  communication• TCP-­‐based  data  transfer• Massive  interrupt  handling• Synchronous   and  blocking  I/O

• Hardware  Performance• Network:  80Gbps• Storage:  100Gbps• CPU  AES  Encryption:  142Gbps

6

pipescpsftp

ssh sshdsocket connection(single stream)

fork

scpsftp

pipe

fork

01020304050607080

Throughput  (G

bps)

Application  -­‐ #  of  parallel  processes

Secure  Data  Movement  (AEC-­‐CBC-­‐128bit)

Page 8: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Network  Transmission  Profile

7

3  x  40  Gbps RoCE

• Memory  Bandwidth:  400Gbps• NUMA  Random  pick:  83.5Gbps• NUMA-­‐aware:  10%  improvement• TCP  wasn’t  able  to  saturate  this  fat  link• TCP  benchmark  takes  shortcut

• User  space  memory  -­‐>  CPU  Cache  -­‐>  Kernel  space  memory  -­‐>  DMA  to  NIC

• TCP:  35%  CPU  is  used  for  memory  copy• copy_user_generic_string()

• RDMA  achieved  98% bare-­‐metal  throughput.

91.883.5

0

20

40

60

80

100

120

TCP RDMA

Throughput  (Gbps) 91.8

117.6

iperf iperf-rdma

Page 9: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Related  work

• High  Performance  TCP-­‐based  solutions• GridFTP• bbcp

• RDMA  Send/Receive  for  bulk  data  movement• Lai  et  al.  ICPP’09

• RDMA  write  based  solution• Tian  et  al.  Journal  of  Comp  &  Elec  Eng ’12• One  RTT,  not  scalable  to  WAN

• Shared  control  channel  and  data  channel• Not  studied  in  WAN  environment

8

Page 10: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

• Efficient  secure  data  transfer• Storage  I/O,  Computing,  network  I/O

• Orchestrate  hardware  and  retain  zero-­‐copy• Observations  on  hardware  trend

• Network  – 2 magnitude  improvement  (1  Gbps -­‐ 100  Gbpswith  a  single  port)

• Storage  – 1 magnitude  improvement  (200MB/s  disk  – 2GB/s  flash)

• Multi-­‐core– 18  cores  per  die  X  8  socket• Memory capacity  improvement throughput  (200Gbps  on  a  NUMA  node)

• Memory-­‐centric  design

Overarching  Goal

9

Page 11: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

ACES  Framework

• Asynchronous  Concurrent  Event-­‐driven  Staged• End-­‐to-­‐end  zero-­‐copy 10

SourceSink

Monitor  thread(Accounting  and  thread  adjustment)

Monitor  thread(Accounting  and  thread  adjustment)

Storage  I/O

Data  Integrity

Data  Security

Network  I/O

Storage  I/O

Data  Integrity

Data  Security

Network  I/O

Disks CPU/GPU CPU/GPU NICs

Task  Queue

Page 12: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Asynchronous  Storage  I/O

• Asynchronous  I/O  separates  I/O  submission  and  I/O  completion• libaio – Linux  asynchronous  I/O  library• io_submit()• io_getevents()

11

New  task:Prepare,  submit

I/O  Completion:Poll,  post  to  next

Storage  I/O  Thread

Incoming  Queue Outgoing   Queue

Page 13: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Parallel  Computing

• Multi-­‐threaded  parallelism• Encryption/Decryption• Advanced  Encryption  Standard  (AES)• Session-­‐based  (i.e.  single  file)

• Data  integrity• Block-­‐based• crc32,  Adler32

• Adjust  #  of  threads  dynamically

12

Computing  Thread

Computing  Thread

Computing  Thread

Incoming  Queue Outgoing   Queue

Page 14: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Zero-­‐copy  Network  I/O• High  throughput

• Achieve  line  speed• Low  cost

• CPU  utilization• Memory  footprints

• Scalability• 100  Gbps and  Beyond• Long  latency  WAN

• Network  independent• InfiniBand:  High  I/O  depth  on  a  single  QP

• RoCE• iWARP:  Multiple  QPs

• Verbs• “Verbs  Programming  is  like  the  assembly  language  version  of  network  programming.”

13

Page 15: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Network  Protocol  Overview• Design  choice

• One-­‐sided  for  bulk  data  movement  (credit-­‐based)• Two-­‐sided  for  control  messages  (send  with  inline)

• Asynchronous  protocol  to  overlap  latency  impact• Multiple  reliable  queue  pairs  for  data  transfer• Proactive  feedback

14

Process  Load  Data

Data

Source

Data

Sink

Control  Msg  QPget_free_blkput_ready_blk put_free_blkget_ready_blk

Bulk  Data  Transfer  QPs

Process  Offload  Data

RFTP:  RDMA-­‐enabled  FTP  Service

Page 16: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Implementation:  Software  Architecture

• Data  Structure• Memory  management• Control  message  management

• Credit  management• Connection  management

• Threads• Multi-­‐threaded• Master-­‐worker

• Event-­‐driven• Polling  and  dispatch

15

ThreadsData  Structure

CQQP-­1 QP-­2 QP-­n

Data  Block  List

Receive  Control  Message  List

Send  Control  Message  List

Remote  MR  Info  List

application

system

Queue  Pair  List

Memory

Sender

CE  dispatcher

...

CE  worker-­2

CE  worker-­1

Logger

Hardware

RNIC

1

234

Receive  a  Block  of  User  Payload

Page 17: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

LAN  Evaluation

16

HP  DL380 IBM  X3650

MellanoxConnectX 3  VPI56Gbps  FDR

Mellanox FDR  SwitchInfiniBandSX6018

MellanoxQDR  SwitchEthernet  SX1036

HP  DL380 IBM  X3650

MellanoxConnectX 3  Ethernet

40Gbps  QDR

RFTPOpenSSH SCPHPN-­SCP

Page 18: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Secure  Bulk  Data  Movement

17

01020304050607080

Throughput  (G

bps)

Application  -­‐ #  of  parallel  processes

Secure  (AES-­‐128bit)  End-­‐to-­‐End  Data  Movement

0102030405060708090

100

CPU  Utiliz

ation  (%)

Application  -­‐ #  of  parallel  processes

CPU  Utilization

Data  Source Data  Sink

Page 19: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

RFTP  vs.  GridFTP

18

0102030405060708090

100

Bandwidth  (G

bps)

25  Minutes

Bandwidth  Comparison

RFTP GridFTP

0

200

400

600

800

1000

1200

1400

RFTP  source

GridFTP  source

RFTP  sink GridFTP  sink

CPU  Utiliz

ation  (%)

CPU  Comparison

user sys wait

Storage  Threshold

3x3x

6x

Page 20: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

40  Gbps Long  Distance  WAN  Testbed

• 40  Gbps RoCEWAN• 4,000  miles  loopback• RTT:  95  millisecond• BDP:  500  MB• Will  RFTP  be  scalable  in  WAN?

19

OaklandCA

ChicagoIL

35

36

37

38

39

40

1M 2M 4M 8M 16MBa

ndwidth  (G

bps)

Block  Size

1 2 4 8 16 Streams

Page 21: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

NUMA-­‐aware  Cache  for  iSCSI  Servers

Publications

• [TPDS’15]  Yufei  Ren,  Tan  Li,  Dantong Yu,  Shudong Jin,  Thomas  Robertazzi,  "Design,   Implementation,  and  Evaluation  of  a  NUMA-­‐Aware  Cache  for  iSCSI  Storage  Servers",  IEEE  Transactions  on  Parallel  &  Distributed  Systems  (TPDS),  vol.26,  no.  2,  pp.  413-­‐422,  Feb.  2015

Page 22: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

SAN  Overview

• Large-­‐scale  asymmetric  memory  caching  system• Massive  I/O  event  processing• iSCSI/iSER:  SCSI  cmd and  data  in  TCP/IP  and  RDMA• Cache:  storage  performance  acceleration

21

iSCSIService

Initiator Target

CacheMassive  I/O  

event  processing

Page 23: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Related  work:  Caching  in  SAN

22

Replacement  Algorithm

Collaborative  Cache  Mgmt

Architecture-­‐aware

Optimization

Aggressive-­‐Collaborative

Hierarchy-­‐aware

NUMA-­‐aware  Cache

Johnson  VLDB  1994

MegiddoFAST2003

Wong  ATC  2002

Zhou  TPDS  2004

Page 24: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Caching  in  a  NUMA  host

23

Node  2Node  3

Node  1Node  0

T

src

dst

NUMA-­‐unaware

T

src dst

NUMA-­‐aware

3.3  GB/s 18.9  GB/sSTREAM

memcpy(dst, src, len);

• [ICPP’13]  Tan  Li,  Yufei  Ren,  Dantong Yu,  Shudong Jin,  Thomas  Robertazzi,  “Characterization  of  Input/OutputBandwidth  Performance  Models  in  NUMA  Architecture  for  Data  Intensive  Applications”,  In  Proceedings  of  the  International  Conference  on  Parallel  Processing  ICPP   '13,  Lyon,  France,  October  2013

Page 25: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

iSCSI/iSER Performance  Modeling  for  Cached  Data

24

ayQueuingDelNetworkBW

OSizeIMemBWOSizeIRTTLatency +++=

//

i

n

iiCachedData MemBWWeightMemBW ∑

=

=1

0

*

• Latency  (from  initiator  perspective)

• Bandwidth

),min( networkmemoryCachedData BWBWBW =

NetworkBW<<MemoryBW NUMARDMA

NetworkBW<MemoryBW

Page 26: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

NUMA-­‐aware  Cache  for  SAN

• Allocate  thread,  cache,  and  network  buffer  in  each  node 25

Master

Worker

Network  Buffers

Worker

NUMA-­‐aware  Cache

node-0

Worker

Network  Buffers

Worker

NUMA-­‐aware  Cache

node-1

Page 27: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Evaluation• 4-­‐node  NUMA  host  – Dell  R820• Intel  Xeon  E5-­‐4620  with  QPI• 768  GB  RAM• Local  memory  bw:  18.71  GB/s• Remote  memory  bw:  3.26  GB/s

• 4  initiators• Network  adapters:  56Gbps  InfiniBand FDR• OS  page  cache  vs.  NUMA-­‐aware  Cache• Fully  cached  data  access• Benchmark:  fiowith  random  read  I/Os

26

IBM  x3650 IBM  x3650

HP  DL380 HP  DL380

Mellanox FDR  IB  Switch

Initiators

Target

node0 node1

node3 node2

Dell  R820

Page 28: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Latency  (from  Initiator  perspective)

• Prompt  response• Overhead  vs.  NUMA-­‐aware  Benefit

27

0100200300400500600700

4KB 8KB 16KB 32KB 64KB 128KB 256KB 512KB

Latency  (us)

Request  Size

NUMA-­‐aware  Cache Page  Cache

overhead  > benefit

overhead  == benefit

overhead  < benefit

23%

Page 29: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Throughput  Comparison

28

0

2000

4000

6000

8000

10000

12000

16k 32k 64k 128k 256k 512k 1m

Throughput  (M

B/s)

I/O  Request  Size

Throughput

NUMA-­‐aware  Cache OS  page  cache

80%

0

100

200

300

400

500

600

700

16k 32k 64k 128k 256k 512k 1mCPU  Utiliz

ation  (%)

I/O  Request  Size

CPU  Utilization

NUMA-­‐aware  Cache OS  page  cache

Save  3  CPU  cores

• Concurrent  requests  serving• Aggregate  throughput

Page 30: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Throughput  Comparison

• Master  thread  becomes  the  bottleneck29

0

2000

4000

6000

8000

10000

12000

16k 32k 64k 128k 256k 512k 1m

Throughput  (M

B/s)

I/O  Request  Size

Throughput

NUMA-­‐aware  Cache OS  page  cache

0

100

200

300

400

500

600

700

16k 32k 64k 128k 256k 512k 1mCPU  Utiliz

ation  (%)

I/O  Request  Size

CPU  Utilization

NUMA-­‐aware  Cache OS  page  cache

80%

Memory  Bottleneck

Network  Bottleneck Save  3  CPU  cores

Page 31: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Decentralized  Event  Processing

• Standard  iSCSI:  Centralized  event  processing  with  a  single  thread• Create  multiple  threads  and  distribute  event  processing• Processing  events  is  3X  as  the  number  of  requests

30

Connection  Requests Master  

Thread I/O  Thread

Ack  Helper  Thread

Scheduling  Events  List

I/O  Request  List

I/O  Thread

I/O  Thread

I/O  Thread

Finished  List

Notification  Pipe

Initiator

Initiator

Initiator

Page 32: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Event  Processing  Scalability

31

01002003004005006007008009001000

1 2 4 8 16 32 64

Throughput  (K

-­‐IOPS)

#  of  concurrent  clients

IOPS  with  512  Byte  Small  Requests

Decentralized Standard

1  Million   IOPS

Page 33: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Real-­‐life  workloads

• YCSB  – MongoDB – iSER Caching  and  Storage32

0

2000

4000

6000

8000

10000

12000

9:1 8:2 5:5

Transactions/sec

read-­‐to-­‐update  ratio

Throughput   (transactions/sec)

NUMA-­‐aware  Cache Page  Cache

12.8%

3.5%

0%

YCSB

MongoDB

NUMA-­‐awareCache Page  Cache

Disks

tgtd

localhost

InfiniBand

Page 34: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Conclusions

Page 35: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

ConclusionsChapter Contributions ResultsACES  (ch3) Orchestrate  resources  on  memory-­‐centric  

architectureUnder revision

RFTP  (ch4) Scalable zero-­‐copy  based  protocol  for  WAN [SC12,  JSS13,  HPGC12]

NUMA-­‐aware Cache  (ch5)

Improve  SAN  caching data  operation  locality

[TPDS15]

End-­‐to-­‐EndOptimization  (ch6)

Extract  line-­‐speed I/O  performance  alongthe  end-­‐to-­‐end  data  path

[SC13]

34

Page 36: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Software  Contributions

• RFTP:  RDMA-­‐base  FTP  Software• http://ftp100.cewit.stonybrook.edu/rftp

• The  Three  Clause  BSD  License• User:  High  Energy  Physics  at  Caltech

• RDMA  benchmark• fio:  RDMA  engine  and  NUMA  module

• Chelsio:  http://www.chelsio.com/wp-­‐content/uploads/resources/Lustre-­‐Over-­‐iWARP-­‐vs-­‐IB-­‐FDR.pdf

• SanDisk:  http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

• iperf-­‐rdma• https://github.com/yufeiren/iperf-­‐rdma

• China  MinSheng Bank

• NUMA-­‐aware  Cache  for  iSCSI/iSER• Linux  SCSI  Target  Framework  tgtd

• https://bitbucket.org/dtyu/tgt

35

Page 37: Scalable’End+to+End’Data’I/O’over’Enterprise’ and’Data ... · and’Data+Center’Networks ... International&Conference&on&High&Performance&Computing,&Networking,&Storage&and&Analysis&

Thank  you Q  &  A

36