1 HDF5 Life cycle of data Boeing September 19, 2006.

35
1 HDF5 Life cycle of data Boeing September 19, 2006

Transcript of 1 HDF5 Life cycle of data Boeing September 19, 2006.

Page 1: 1 HDF5 Life cycle of data Boeing September 19, 2006.

1

HDF5Life cycle of data

Boeing

September 19, 2006

Page 2: 1 HDF5 Life cycle of data Boeing September 19, 2006.

2

Overview• “Life cycle” of HDF5 data• I/O operations for datasets with

different storage layouts• Compact dataset• Contiguous dataset

• Datatype conversion • Partial I/O for contiguous dataset

• Chunked dataset• I/O for chunked dataset

• Variable length datasets and I/O

Page 3: 1 HDF5 Life cycle of data Boeing September 19, 2006.

3

• Life cycle: what does happen to data when it is transferred from application buffer to HDF5 file?

File or other “storage”

Virtual file I/O

Library internals

Object API

ApplicationApplication Data buffer

H5Dwrite

Magic box

Unbuffered I/O

Data in a file

Page 4: 1 HDF5 Life cycle of data Boeing September 19, 2006.

4

“Life cycle” of HDF5 data: inside the magic box• Operations on data inside the magic box

• Datatype conversion• Scattering - gathering • Data transformation (filters, compression)• Copying to/from internal buffers

• Concepts involved• HDF5 metadata, metadata cache• Chunking, chunk cache

• Data structures used• B-trees (groups, dataset chunks)• Hash tables• Local and Global heaps (variable length data: link names,

strings, etc.)

Page 5: 1 HDF5 Life cycle of data Boeing September 19, 2006.

5

“Life cycle” of HDF5 data: inside the magic box

• Understanding of what is happening to data inside the magic box will help to write efficient applications

• HDF5 library has mechanisms to control behavior inside the magic box

• Goals of this and the next talk are to • Introduce the basic concepts and internal data

structures and explain how they affect performance and storage sizes

• Give some “recipes” for how to improve performance

Page 6: 1 HDF5 Life cycle of data Boeing September 19, 2006.

6

Operations on data inside the magic box• Datatype conversion

• Examples: • float integer• LE BE• 64-bit integer to 16-bit integer (overflow may occur!)

• Scattering - gathering • Data is scattered/gathered from/to user’s buffers into internal

buffers for datatype conversion and partial I/O• Data transformation (filters, compression)

• Checksum on raw data and metadata (in 1.8.0)• Algebraic transform• GZIP and SZIP compressions• User-defined filters

• Copying to/from internal buffers

Page 7: 1 HDF5 Life cycle of data Boeing September 19, 2006.

7

“Life cycle” of HDF5 data: inside the magic box

• HDF5 metadata• Information about HDF5 objects used by the library• Examples: object headers, B-tree nodes for group, B-

Tree nodes for chunks, heaps, super-block, etc. • Usually small compared to raw data sizes (KB vs. MB-

GB)• Metadata cache

• Space allocated to handle pieces of the HDF5 metadata • Allocated by the HDF5 library in application’s memory

space• Cache behavior affects overall performance• Will cover in the next talk

Page 8: 1 HDF5 Life cycle of data Boeing September 19, 2006.

8

“Life cycle” of HDF5 data: inside the magic box

• Chunking mechanism• Chunking – storage layout where a dataset is

partitioned in fixed-size multi-dimensional tiles or chunks

• Used for extendible datasets and datasets with filters applied (checksum, compression)

• HDF5 library treats each chunk as atomic object• Greatly affects performance and file sizes

• Chunk cache• Created for each chunked dataset• Default size 1MB

Page 9: 1 HDF5 Life cycle of data Boeing September 19, 2006.

9

HDF5 file structure

User block File header infoVersion #, etc.

Root group

Symbol Table

Group

Symbol Table

Object

Global Heap

Local Heap

Page 10: 1 HDF5 Life cycle of data Boeing September 19, 2006.

10

Writing a contiguous dataset of atomic type

DataMetadataDataspace

3

RankRank

Dim_2 = 5Dim_1 = 4

DimensionsDimensions

Time = 32.4

Pressure = 987

Temp = 56

AttributesAttributes

Chunked

Compressed

Dim_3 = 7

Storage infoStorage info

IEEE 32-bit float

DatatypeDatatype

Page 11: 1 HDF5 Life cycle of data Boeing September 19, 2006.

11

I/O operations for HDF5 datasets with different storage layouts

• Storage layouts• Compact• Contiguous• Chunked

• I/O performance depends on • Dataset storage properties• Chunking strategy• Metadata cache performance• Etc.

Page 12: 1 HDF5 Life cycle of data Boeing September 19, 2006.

12

Application memory

Writing a compact dataset

Dataset header

………….Datatype

Dataspace………….Attribute 1Attribute 2

Data

Metadata cache

File

Raw data is stored within the dataset header

Page 13: 1 HDF5 Life cycle of data Boeing September 19, 2006.

13

Writing a contiguous dataset with no datatype conversion

User buffer (matrix 5x4x7)

Dataset header

………….Datatype

Dataspace………….Attribute 1Attribute 2………… Application memory

Metadata cache

File

Dataset header Dataset raw data

Page 14: 1 HDF5 Life cycle of data Boeing September 19, 2006.

14

Writing a contiguous dataset with conversion

Dataset header

………….Datatype

Dataspace………….Attribute 1Attribute 2………… Application memory

Metadata cache

File

Dataset header Dataset raw data

Conversion buffer 1MB

Dataset raw data

Page 15: 1 HDF5 Life cycle of data Boeing September 19, 2006.

15

Sub-setting of contiguous datasetSeries of adjacent rows

File

N

Application data in memory

Data is contiguous in a file

One I/O operation

M rows

M

Page 16: 1 HDF5 Life cycle of data Boeing September 19, 2006.

16

Sub-setting of contiguous datasetAdjacent, partial rows

File

N

M

Application data in memory

Data is scattered in a file in M contiguous blocks

Several small I/O operation

N elements

Page 17: 1 HDF5 Life cycle of data Boeing September 19, 2006.

17

Sub-setting of contiguous datasetExtreme case: writing a column

File

N

M

Application data in memory

Data is scattered in a file in M contiguous blocks

Several small I/O operation

1 element

Page 18: 1 HDF5 Life cycle of data Boeing September 19, 2006.

18

Sub-setting of contiguous datasetData sieve buffer

File

N

M

Application data in memory

Data is scattered in a file in M contiguous blocks

1 element

Data is gathered in a sieve buffer in memory 64K

memcopy

Page 19: 1 HDF5 Life cycle of data Boeing September 19, 2006.

19

Performance tuning for contiguous dataset

• Datatype conversion• Avoid for better performance• Use H5Pset_buffer function to customize

conversion buffer size

• Partial I/O• Write/read in big contiguous blocks (at least the

size of a block on FS)• Use H5Pset_sieve_buf_size to improve

performance for complex subsetting

Page 20: 1 HDF5 Life cycle of data Boeing September 19, 2006.

20

Possible tuning work

• Datatype conversion• Use of multiple threads for datatype conversion

• Partial I/O• OS vector I/O

• Asynchronous I/O

Page 21: 1 HDF5 Life cycle of data Boeing September 19, 2006.

21

Writing chunked dataset

Dataset is partitioned into fixed-size multi-dimensional chunks of sizes X/4 x Y/2 x Z

Dimension sizes X x Y x Z

Page 22: 1 HDF5 Life cycle of data Boeing September 19, 2006.

22

Extending chunked dataset in any dimension

•Data can be added in any dimensions•Compression is applied to each chunk•Datatype conversion is applied to each chunk

Page 23: 1 HDF5 Life cycle of data Boeing September 19, 2006.

23

Writing chunked dataset

C BA

…………..

• Each chunk is written as a contiguous blob• Chunks may be scattered all over the file • Compression is performed when chunk is evicted from the chunk cache• Other filters when data goes through filter pipeline (e.g. encryption)

AB C

C

File

Chunk cacheChunked dataset

Filter pipeline

Page 24: 1 HDF5 Life cycle of data Boeing September 19, 2006.

24

Writing chunked dataset

Dataset_1 header

…………

Application memory

Metadata cache

Chunking B-tree nodesChunk cache

Default size is 1MB

• Size of chunk cache is set for file • Each chunked dataset has its own chunk cache• Chunk may be too big to fit into cache• Memory may grow if application keeps opening datasets

Dataset_N header

…………

………

Page 25: 1 HDF5 Life cycle of data Boeing September 19, 2006.

25

Partial I/O for chunked dataset

• Build list of chunks and loop through the list• For each chunk:• Bring chunk into memory• Map selection in memory to selection in file• Gather elements into conversion buffer and perform conversion• Scatter elements back to the chunk• Perform conversion when chunk is flushed from chunk cacheFor each element 3 memcopy performed

1 2

3 4

Page 26: 1 HDF5 Life cycle of data Boeing September 19, 2006.

26

Partial I/O for chunked dataset

3

Application memory

memcopy

Application buffer

Chunk

Elements participated in I/O are gathered into corresponding chunk

Page 27: 1 HDF5 Life cycle of data Boeing September 19, 2006.

27

Partial I/O for chunked dataset

3

Conversion bufferGather data

Scatter dataApplication memory

Chunk cache

On eviction from cache chunk is compressed and is written to the file

File Chunk

Page 28: 1 HDF5 Life cycle of data Boeing September 19, 2006.

28

Variable length datasets and I/O

• Examples of variable-length data• String

A[0] “the first string we want to write”…………………………………A[N-1] “the N-th string we want to write”

• Each element is a record of variable-lengthA[0] (1,1,0,0,0,5,6,7,8,9) length of the first record is

10 A[1] (0,0,110,2005)………………………..A[N] (1,2,3,4,5,6,7,8,9,10,11,12,….,M) length of the

N+1 record is M

Page 29: 1 HDF5 Life cycle of data Boeing September 19, 2006.

29

Variable length datasets and I/O

• Variable length description in HDF5 applicationtypedef struct {

size_t length;

void *p;

}hvl_t;

• Base type can be any HDF5 typeH5Tvlen_create(base_type)

• ~ 20 bytes overhead for each element

• Raw data cannot be compressed

Page 30: 1 HDF5 Life cycle of data Boeing September 19, 2006.

30

Variable length datasets and I/O

Global heapGlobal heap

Application bufferApplication buffer

Raw dataRaw data

Elements in application buffer point to global heaps where actual data is stored

Global heapGlobal heap

Page 31: 1 HDF5 Life cycle of data Boeing September 19, 2006.

31

Writing VL datasets

Dataset header

…………

Application memoryMetadata cache B-tree nodes

Chunk cache

………

Conversion buffer

Raw data

Global heap

Chunk cache

VL chunked dataset with selected region

File

Filter pipeline

Page 32: 1 HDF5 Life cycle of data Boeing September 19, 2006.

32

VL chunked dataset in a file

File

Dataset header

Chunking B-tree

Dataset chunksRaw data

Page 33: 1 HDF5 Life cycle of data Boeing September 19, 2006.

33

Variable length datasets and I/O

• Hints • Avoid closing/opening a file while writing VL

datasets • global heap information is lost• global heaps may have unused space

• Avoid writing VL datasets interchangeably • data from different datasets will is written to the

same heap

• If maximum length of the record is known, use fixed-length records and compression

Page 34: 1 HDF5 Life cycle of data Boeing September 19, 2006.

34

Example: Boeing time-segment library application

• Multiple extendible 1-dim arrays of variable-length records• Uses HDF5 Packet Table APIs (H5PT)• HDF5 features used

• Chunked storage• Chunk cache• Compound and VL datatypes • Datatype conversion• Partial I/O

• Complexity affects performance• Performance tuning is needed

Page 35: 1 HDF5 Life cycle of data Boeing September 19, 2006.

35

Thank you!

Questions ?