Middleware15 SRC YongseokOh final 20151209 web · 12/9/2015

Enabling Cost-Effective Flash Based Caching with an Array of Commodity SSDs
Yongseok Oh* (SK telecom), Eunjae Lee, Choulseung Hyun (University of Seoul), Jongmoo Choi (Dankook University), Donghee Lee (University of Seoul), Sam H. Noh (UNIST)
*Work done while student at University of Seoul


Page 1

Enabling Cost-Effective Flash Based Caching with an Array of Commodity SSDs

Yongseok Oh* (SK telecom)
Eunjae Lee, Choulseung Hyun (University of Seoul)
Jongmoo Choi (Dankook University)
Donghee Lee (University of Seoul)
Sam H. Noh (UNIST)

*Work done while student at University of Seoul

Page 2

Unreliable SSD based Cache

[Figure: datacenter environment with a cache in front of backend storage; labels: 1 GB/s, 8 MB/s]

•  Cache plays a CRITICAL ROLE in enhancing performance
•  However, Flash is still UNRELIABLE media [Liu FAST’12] [Jeong FAST’14] [Cai DSN’15]
   – Lifetime, read disturb, retention, and so on
•  Write-back caching incurs RISKY scenario [Qin ATC’14]
•  Re-warming up takes HOURS to DAYS [Zhang FAST’13]

Page 3

Our Idea: Take Advantage of RAID

SSD RAID as a Cache
– High performance
– High reliability
– Online replacement
– Performance scaling

Failure Recovery via Parity

[Figure: datacenter environment with an SSD RAID cache in front of backend storage; labels: 1 GB/s, 8 MB/s]

Page 4

Our Contributions

•  This is the first study to exploit SSD RAID as a cache
•  We build two fast prototypes in Linux
   – Bcache and Flashcache configured with Linux RAID
•  We propose a new solution, namely SRC (SSD RAID as a Cache)
   – Borrow LFS and RAID techniques
   – Propose optimization schemes for performance and reliability
•  We evaluate SRC with other solutions
   – Cost-effective analysis
   – SATA SSDs vs NVMe SSD

Page 5

Outline of Contents

•  Introduction
•  Existing Solutions (Bcache and Flashcache)
•  SRC (SSD RAID as a Cache)
•  Performance Evaluation
•  Cost-effective Analysis (SATA vs NVMe SSD)
•  Conclusion

Page 6

Open Source Solutions

•  Flashcache
   – Read optimized layout (e.g., hash bucket)
•  Bcache
   – Write optimized layout (e.g., Log based B+-tree)
•  They provide several options
   – Write-through and write-back policies
   – FIFO and LRU policies
•  They don’t provide the RAID feature
   – Only single SSD can be employed (as of 2014)

Page 7

SSD Caching on RAID

[Figure: experimental setup]
•  Benchmark: Worker, Worker, Worker, Worker, …
•  Caching software: Bcache / Flashcache
•  RAID Volume used as Caching Space: Linux RAID over 4 x Samsung 840 Pro SSDs
•  Server: Ubuntu 13.10 server (Kernel 3.11.7), Intel Xeon CPU (E5-2640), 32GB RAM
•  Backend Storage: 8TB (8 x 7.2K 2TB Disks), MD RAID10
•  iSCSI 1Gbps

Page 8

RAID is Beneficial for SSD Caching (slide overlay: "NOT ?")

•  FIO benchmark (4KB random)
   – Single SSD Caching: SATA SSD 128GB
   – RAID-4/-5 Caching: 4 x SATA SSD 128GB
•  Single SSD caching is comparable or faster

[Figure: bar chart of Bandwidth (MB/s), axis 0 to 120, for Bcache and FlashCache under Single SSD, RAID-4 (4 SSDs), and RAID-5 (4 SSDs); annotations: 80%, 17%]

Page 9

Analysis of I/O Caching Path

[Figure: a 4KB Random Write passes from the Cache layer (data, meta) through RAID-4/-5 (Read D and P, Write D’ and P’) to the SSDs, where FTL GC occurs. P: Parity, D: Data]

Page 10

Analysis of I/O Caching Path

[Figure: a 4KB Random Write passes from the Cache layer (data, meta) through RAID-4/-5 (Read D and P, Write D’ and P’) to the SSDs, where FTL GC occurs. P: Parity, D: Data]

A single write request requires 8 I/Os, plus possibly numerous FTL GCs.
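
The count of 8 I/Os can be reproduced with a small sketch of the standard RAID-4/-5 read-modify-write update, where the new parity is the old parity XOR the old data XOR the new data. The sketch assumes, as the figure suggests, that the cache layer issues one data write and one metadata write per request; the function names and block sizes are illustrative and not taken from the slides.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def small_write(stripe: list, parity: bytes, index: int, new_data: bytes):
    """Read-modify-write of one block; returns (new_parity, io_count)."""
    old_data = stripe[index]      # I/O 1: read old data (D)
    old_parity = parity           # I/O 2: read old parity (P)
    new_parity = xor_blocks(old_parity, old_data, new_data)
    stripe[index] = new_data      # I/O 3: write new data (D')
    return new_parity, 4          # I/O 4: write new parity (P')


# Three data blocks plus one parity block, as with 4 SSDs in RAID-5.
stripe = [bytes([i] * 8) for i in (1, 2, 3)]
parity = xor_blocks(*stripe)

total_ios = 0
# Assumption: one cached data block and one cache metadata block are updated.
for idx, payload in ((0, b"D" * 8), (1, b"M" * 8)):
    parity, ios = small_write(stripe, parity, idx, payload)
    total_ios += ios

print(total_ios)  # 2 small writes x 4 I/Os each
```

Each of these small writes may additionally trigger FTL garbage collection inside the SSDs, which the sketch does not model.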

Page 11

Our Approach

[Figure: I/O caching path for a 4KB Random Write through the Cache layer, RAID-4/-5, and the SSDs with FTL GC. P: Parity, D: Data]

1.  Log-structured Layout
    – We pack caching data, metadata, and parity together into a segment, like LFS
    – Reduce I/O amplification

2.  Large Writes
    – We make large writes to SSDs (e.g., 256MB)
    – Remedy FTL GC cost in SSDs

Page 12

Large Write Size for SSD Caching

•  Large Write reduces internal GCs [Min FAST’12], [Li ATC’13], [Tang FAST’14]
   – e.g., erase-before-write property of NAND Flash
•  256MB writes achieve maximum performance for our case
   – Regardless of over-provisioned space (OPS) used by GC
•  Considered as a cache replacement unit

[Figure: Bandwidth (MB/s), axis 0 to 400, versus Write Size (MB) from 1 to 1024 for Company X; series: OPS (40%), OPS (20%)]

Size Continues to Increase

Model (Year)        Large Write
Company X (‘13)     256MB
Company Y (‘14)     512MB
Company Z (‘15)     1024MB

Page 13

Outline of Contents

•  Introduction
•  Existing Solutions (Bcache and Flashcache)
•  SRC (SSD RAID as a Cache)
•  Performance Evaluation
•  Cost-effective Analysis (SATA vs NVMe SSD)
•  Conclusion

Page 14

SRC Architecture

[Figure: an I/O Request enters the SRC Layer, which contains the Cache Manager (Dirty Seg Buffer, Clean Seg Buffer) and the RAID Manager on top of the SSDs; labels: Hit, Miss, Primary Storage (Disks), NAS/SAN]

Key Features (slide labels: "Focus", "In the paper")
•  Segment Group layout
•  LFS layout
•  Selective GC
•  No parity for GC
•  Erasure coding
   – E.g., RAID-4, -5, -6
   – RAID-5 (Default)
   – Replacement
   – FIFO, Greedy

Page 15

Large Write Aware Segment Group Layout

[Figure: a Log of SGs striped across SSD0 to SSD3 with one Active SG; each SG is made of Segments; labels: I/O, Disk, NAS/SAN]

•  SG (Segment Group) = Large Write * # of SSDs
   – 1GB = 256MB * 4
   – Large Write unit for each SSD
•  Seg. = Max Transfer Size * # of SSDs
   – 2MB = 512KB * 4

➢  With Large Write, sustained performance is satisfactory
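
As a quick check of the sizing arithmetic on this slide, the following sketch computes the segment and Segment Group sizes from the per-SSD values given above. The function name `layout` and the derived segments-per-SG figure are illustrative additions, not taken from the slides.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3


def layout(num_ssds: int, large_write: int, max_transfer: int):
    """Return (segment size, segment group size, segments per group) in bytes."""
    segment = max_transfer * num_ssds        # Seg. = Max Transfer Size * # of SSDs
    segment_group = large_write * num_ssds   # SG = Large Write * # of SSDs
    return segment, segment_group, segment_group // segment


seg, sg, per_sg = layout(num_ssds=4, large_write=256 * MB, max_transfer=512 * KB)
print(seg // MB, "MB per segment")
print(sg // GB, "GB per segment group")
print(per_sg, "segments per segment group")
```

With the slide's values, each SSD receives 256MB of an SG, which matches the Large Write unit reported for that SSD model.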

Page 16

Caching Process based Log-structured Layout

[Figure: the SRC Layer keeps a Dirty Buffer and a Clean Buffer and appends segments to the Active SG of the Log on SSD0 to SSD3; labels: Disk, NAS/SAN]

Dirty (Write) Caching
1.  Collect dirty data: D D D D
2.  Embed metadata: D D D D + M
3.  Calculate parity (XOR, ⊕): D D D D M P
4.  Submit Seg. I/O with flush

Clean (Read) Caching
•  No Parity for Clean: C C C C C + M gives C C C C C M

Metadata (M)
– LBA
– Gen No
– Checksum

➢  Multiple random writes are aggregated
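
The four steps for dirty data, and the parity-free path for clean data, can be sketched as follows. This is a minimal illustration, not the SRC implementation: the on-disk metadata format, the 4KB block size, and the names `make_metadata` and `build_segment` are assumptions; only the metadata fields (LBA, generation number, checksum) and the XOR parity come from the slide.

```python
import struct
import zlib
from functools import reduce

BLOCK = 4096  # assumed cache block size


def xor_blocks(blocks) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_metadata(entries, generation: int) -> bytes:
    """Hypothetical metadata block: generation number, then (LBA, checksum) per block."""
    body = struct.pack("<QI", generation, len(entries))
    for lba, data in entries:
        body += struct.pack("<QI", lba, zlib.crc32(data))
    return body.ljust(BLOCK, b"\0")


def build_segment(entries, generation: int, dirty: bool):
    """Pack cached blocks into one log segment."""
    blocks = [data for _, data in entries]              # 1. collect data
    blocks.append(make_metadata(entries, generation))   # 2. embed metadata
    if dirty:
        blocks.append(xor_blocks(blocks))               # 3. parity, dirty data only
    return blocks                                       # 4. caller writes it out, then flushes


entries = [(10, b"a" * BLOCK), (11, b"b" * BLOCK), (12, b"c" * BLOCK)]
dirty_seg = build_segment(entries, generation=1, dirty=True)
clean_seg = build_segment(entries, generation=1, dirty=False)

# Any single lost block of a dirty segment is the XOR of the remaining blocks.
recovered = xor_blocks(dirty_seg[1:])
print(recovered == dirty_seg[0])
```

Clean data can be refetched from primary storage, which is presumably why the slide omits parity for it; the clean segment above therefore carries one block fewer.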

Page 17

Free Space Reclamation: Selective GC

•  S2S (SSD to SSD) GC (if current utilization < uMAX)
   – Re-insert valid cached blocks
•  S2D (SSD to Disk) GC (if current utilization >= uMAX)
   – Destage valid cached blocks
•  uMAX: pre-defined value

[Figure: for each mode, SG 0 to SG 4 on SSD0 to SSD3 above Disk, showing valid and invalid cached blocks; labels: Less utilized space, More utilized space, HOT data, COLD data]

➢  Larger SG may be under-utilized, resulting in low hit ratio
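
The choice between the two GC modes can be written as a short decision function. This is only a sketch of the rule stated on the slide; the `GcResult` type, the function name, and the default threshold of 0.90 (the uMAX value a later slide reports as best) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GcResult:
    mode: str                                       # "S2S" or "S2D"
    reinserted: list = field(default_factory=list)  # blocks kept in the SSD cache
    destaged: list = field(default_factory=list)    # blocks written back to disk


def selective_gc(valid_blocks, utilization: float, u_max: float = 0.90) -> GcResult:
    """Reclaim a victim segment group according to current cache utilization."""
    if utilization < u_max:
        # S2S: space is still available, so valid blocks stay on the SSDs.
        return GcResult("S2S", reinserted=list(valid_blocks))
    # S2D: the cache is too full, so valid blocks are destaged to disk.
    return GcResult("S2D", destaged=list(valid_blocks))


print(selective_gc([101, 102], utilization=0.50).mode)
print(selective_gc([101, 102], utilization=0.95).mode)
```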

Page 18

Outline of Contents

•  Introduction
•  Existing Solutions (Bcache and Flashcache)
•  SRC (SSD RAID as a Cache)
•  Performance Evaluation
•  Cost-effective Analysis (SATA vs NVMe SSD)
•  Conclusion

Page 19

Evaluation Setup

[Figure: experimental setup]
•  Trace Replay: Worker, Worker, Worker, Worker, …
•  Caching software: SRC / Bcache / Flashcache
•  SSD RAID as a Cache: 4 x Samsung 840 Pro SSDs
•  Server: Ubuntu 13.10 server (Kernel 3.11.7), Intel Xeon CPU (E5-2640), 32GB RAM
•  Backend Storage: 8TB (8 x 7.2K 2TB Disks), MD RAID10
•  iSCSI 1Gbps

•  We developed SRC based on DM-Writeboost
•  We implemented a trace replay tool to mimic VM-like workloads

Page 20

Realistic Workloads

•  Several block-level traces
   – From SNIA and UMASS repositories
•  Read group (5 traces included): read intensive
   – ts0, usr0, …, msn0 traces
•  Write group (10 traces included): write intensive
   – prxy0, exch9, …, src22
•  Mixed group (7 traces included): read & write mixed
   – rsrch0, hm0, …, prn0 traces

Page 21

Comparison with Bcache and Flashcache

•  SRC outperforms Bcache and Flashcache
   – Up to 2.7X
•  Flashcache is better than Bcache due to no flush
   – Bcache issues flush for metadata durability

[Figure: bar chart of Throughput (MB/s), axis 0 to 800, for the Write, Mix, and Read groups; series: SRC5, Bcache5, FlashCache5; annotations: 2.5X, 2.7X, 2.3X]

Page 22

Interesting Results in Paper

•  SEL-GC outperforms traditional destage by 60%
   – Hot data are re-inserted back to SSDs
•  SEL-GC is optimized at uMAX 90%
•  No parity for clean brings 17% improvement
•  FIFO is better for Write intensive workload
   – Greedy is better for Read intensive workload
•  Performance of RAID-5 degrades by 27%
   – Compared to that of RAID-0
•  Please refer to our paper for more information

Page 23

Outline of Contents

•  Introduction
•  Existing Solutions (Bcache and Flashcache)
•  SRC (SSD RAID as a Cache)
•  Performance Evaluation
•  Cost-effective Analysis (SATA vs NVMe SSD)
•  Conclusion

Page 24

SATA and NVMe based SRCs

MLC or TLC based SATA SSDs vs MLC based NVMe Card
•  4 x 128GB SATA SSDs (including parity SSD): Cheaper
•  1 x 400GB SSD (No parity SSD): More expensive

Interface   SATA 3.0                                            NVMe 1.0
Vendor      Company A                 Company B                 Company C
NAND        MLC          TLC          MLC          TLC          MLC
Capacity    4 x 128GB    4 x 120GB    4 x 128GB    4 x 128GB    1 x 400GB
Cost        $418         $272         $374         $222         $496

Page 25

Estimation of Cost-effectiveness

•  Performance per Dollar ($)
   = MB/s ÷ Total Price ($)

•  Lifetime (days) per Dollar ($)
   = [(Capacity x P/E cycles) ÷ (WAF x Wday)] ÷ Total Price ($)
   – The bracketed term is the expected days to live [Jeong FAST’14]
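
The two metrics can be computed directly from the formulas on this slide. In the sketch below, Wday is read as the amount of data written per day, which is an assumption, and every numeric input except the $374 price (taken from the cost table on the previous slide) is a made-up placeholder rather than a measured value.

```python
def perf_per_dollar(mb_per_s: float, price: float) -> float:
    """(MB/s) / $"""
    return mb_per_s / price


def expected_days(capacity_gb: float, pe_cycles: float,
                  waf: float, gb_written_per_day: float) -> float:
    """Expected days to live: (Capacity x P/E cycles) / (WAF x Wday)."""
    return capacity_gb * pe_cycles / (waf * gb_written_per_day)


def lifetime_per_dollar(capacity_gb: float, pe_cycles: float, waf: float,
                        gb_written_per_day: float, price: float) -> float:
    """Lifetime(days) / $"""
    return expected_days(capacity_gb, pe_cycles, waf, gb_written_per_day) / price


# Hypothetical inputs for a 4 x 128GB SATA array priced at $374.
price = 374.0
days = expected_days(capacity_gb=512, pe_cycles=3000, waf=2.0, gb_written_per_day=512)
print(days)
print(lifetime_per_dollar(512, 3000, 2.0, 512, price))
print(perf_per_dollar(374.0, price))
```

A higher write amplification factor (WAF) or a heavier daily write volume shortens the expected lifetime proportionally, which is why the lifetime metric differs between the Write, Mix, and Read workload groups.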

Page 26

Analysis of Cost Effectiveness

•  B TLC (SATA) SRC is better in terms of (MB/s)/$
•  B MLC (SATA) SRC is better in terms of Lifetime(days)/$
•  SATA SSD based SRCs are still superior to NVMe based SRC

[Figure: bar chart of (MB/s)/$, axis 0 to 2.5, for the Write, Mix, and Read groups]

[Figure: bar chart of Lifetime(days)/$, axis 0 to 16, for the Write, Mix, and Read groups]

Page 27

Conclusions

•  Bcache and FlashCache are NOT TRULY optimized for SSD and RAID
   – 80% performance reduction
•  We proposed SRC (SSD RAID as a Cache)
   – Segment Group (SG) to align to Large Writes
   – Log-based segment to pack data, meta, and parity
   – SEL-GC to recover less utilized SG
•  Experimental Results
   – 2.7X better than Bcache and Flashcache
   – SATA SRC is better than NVMe SRC
      •  Performance by 20%
      •  Lifetime by 60%
