DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017...

24
PDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor, Garth Gibson Brad Settlemyer, Gary Grider, Fan Guo Carnegie Mellon University Los Alamos National Laboratory (LANL) DeltaFS Indexed Massive Dir Software-Defined Storage For Fast Query

Transcript of DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017...

Page 1: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

PDSW-DISCS 2017

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael KuchnikChuck Cranor, Garth Gibson

Brad Settlemyer, Gary Grider, Fan GuoCarnegie Mellon University

Los Alamos National Laboratory (LANL)

DeltaFS Indexed Massive Dir

Software-Defined StorageFor Fast Query

Page 2: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Key features1. Require no dedicated resources

2. Almost no post-processing is needed

3. Low I/O overhead

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 2

DeltaFS Indexed Massive Dir

Page 3: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Target workloads1. Data-intensive HPC simulations

2. Not designed for indexing checkpoints

3. I/O bandwidth is limited

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 3

DeltaFS Indexed Massive Dir

Page 4: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

AgendaPart 1 – Motivation

Part 2 – In-situ indexing design

Part 3 – API, LANL VPIC integration

Conclusion

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 4

Page 5: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Existing HPC builds indexes during post-processing

Delay queries until post-processing done (5-20% simulation time)

App Lustre

Queries

IndexingWrite

Tmp

1 2

3

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 5

Page 6: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Problem faced:

The increasing time-to-scienceDue to the growing gap between compute and I/O

Inefficient support on small data

simulation start query finish

Page 7: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Processing data in-transit while data is written to storage

Need separate resources for sorting and indexing

App Lustre QueriesIndexing

Tmp

MapReduce

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 7

Page 8: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

In-situ indexing directly on app nodes using app resources

Lustre

Queries

data + index

Tmp

No need for a separate indexing cluster

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 8

App + Indexing

Page 9: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 9

Key idea:Reuse storage write-back buffering and

idle CPU cycles for in-situ indexing

Page 10: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Example app: LANL VPIC

VPIC simulation Each VPIC process simulates millions of particles

Particle40 bytes

Particles move across processes during a simulation

Small random writesAfter simulation: high-selective queries

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 10

Page 11: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

TBs I/O per trajectory fetch

Query a single particle trajectoryA B C TBs search

Data object 1M...

Simulation procs

One output file per VPIC process

AB

EC

DF

P P P

...

1M...

1MACE

file-per-process

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 11

Page 12: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

5,000x faster than baseline with DeltaFS in-situ indexing

0.0625 0.25 1 4 16 64 256 1024 4096Query Time (sec)

DeltaFS (w/ 1 CPU core) Baseline (Full-system parallel scan w/ 3k CPU cores)

Time for reading a single particle trajectory(10TB, 48 billion particles)

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 12

Page 13: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

System design:

Light-weight in-situ indexing

1. Tiny mem footprint2. Zero write amplification

3. No read back

Part II

Page 14: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Resource-efficient indexing by log-structured I/O

Tiny mem footprint, full storage b/w util.

data log

index

Lustre

bufferApp thread

Indexing thread

App proc

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 14

Page 15: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

LSM-Trees compacts all the time, but we can’t afford it

Must aim for low I/O overhead at 10%-20%

Compute I/O Compute I/O

Total simulation

Compaction easily causes 1000% I/O overhead by reading/writing previously written data

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 15

Page 16: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

In-situ indexing by aggressive data partitioning

A BC D EF

A B C D E FAll-to-all shuffle

App process #0 App process #1 App process #2

…Compute I/O Compute I/O

Bound the number of data needed per query per timestep

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 16

Page 17: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

...

data blockindex block

filter

...

data blockdata block

In-situ indexing as a file system lib component

No dedicated cluster needed

shuffle receiver

Index Log

WriteBuffer

Data Log

shuffle sender

App data

All-to-all shuffle

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 17

Page 18: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Programming interface:

Indexed Massive Directory (IMD)

Part III

In-situ indexing keyed on filenames

mkdir(“./particles”, DELTAFS_IMD)

Page 19: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 19

How to use Indexed Massive Dir (IMD)

1. Data searched together go into a single IMD file e.g. one file for each particle

2. Create as many IMD files as you wante.g. 1 trillion files for 1 trillions particles

Query you data by “open-read-close”

Page 20: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

VPIC using DeltaFS IMD

Simulation procs

One IMD file per VPIC particle

P P P 1M...

1T Indexed Massive Directory

file-per-particle

AA

DD

BB

EE

CC

FF...

Data object1M...Index object

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 20

A B C TBs MBs search

Page 21: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

LANL Trinity Experiments

Compute Node32 cores/node

1-99 compute nodes, 496 million – 48 billion particles

bufferVPIC

VPIC-DeltaFS

buffer

VPIC-Baseline

VPIC

Queries

No post-processing

SSD

Burst-buffer Lustre

HDD

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 21

DeltaFS indexing

Page 22: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

245x 665x 532x 625x 992x 2221x 4049x 5112x

0.0156250.0625

0.2514

1664

25610244096

496 992 1,984 3,968 7,936 16,368 32,736 49,104

Quer

y Tim

e (se

c)

Simulation Size (million particles)

Baseline (Full-system parallel scan)DeltaFS (w/ 1 CPU core)

1 node 99 nodes2 nodes 4 node 8 node 66 nodes33 nodes16 nodes

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 22

Page 23: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

9.63x 4.78x 2.42x1.56x

1.29x

1.13x 1.15x 1.13x

0

40

80

120

160

200

496 992 1,984 3,968 7,936 16,368 32,736 49,104I/O Ti

me p

er D

ump (

sec)

Simulation Size (million particles)

Baseline DeltaFS

Tiny simulations Bigger simulations

1 node 99 nodes2 nodes 4 node 8 node 66 nodes33 nodes16 nodes

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 23

Page 24: DeltaFS Indexed Massive Dir Software-Defined …qingzhen/talk/deltafs_talk_pdsw17.pdfPDSW-DISCS 2017 Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Michael Kuchnik Chuck Cranor,

Conclusion

• Indexed Massive Dir (~3% app mem, compaction-free, POSIX API)

• Powered by Mercury RPC

• DeltaFS is one of the Mochi micro-services

PDSW-DISCS 2017http://www.pdl.cmu.edu/ 24

In-situ indexing for transparent, almost-free query accelerationno dedicated nodes, no post-processing, ~15% I/O overhead

https://mercury-hpc.github.io/

https://press3.mcs.anl.gov/mochi/

https://github.com/pdlfs/deltafs