Overview of HDF5. HDF Summit, Boeing, Seattle. The HDF Group (THG). September 19, 2006.


1

Overview of HDF5 HDF Summit

Boeing, Seattle

The HDF Group (THG)

September 19, 2006

2

Topics

• What is HDF?
• Sample uses of HDF
• THG the Company

3

What is HDF?

4

Answering big questions …

Matter & the universe

[Figure: Total Column Ozone (Dobson), panels for August 24, 2001 and August 24, 2002]

Weather and climate

Life and nature

5

involves big data …

6

varied data…

caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

7

and complex relationships…

[Diagram labels: Contig Summaries, Discrepancies, Contig Qualities, Coverage Depth, Read quality, Aligned bases, Contig, Reads, Percent match, SNP Score, Trace]

8

on big computers…

9

and on little computers.

10

How do we…

• Describe the data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?

11

HDF is

• A file format for managing any kind of data
• Software to store and access data in the format
• Suited especially to large or complex data collections
• Suited for every size of system
• Platform independent – runs almost anywhere
• Open – both file formats and software

12

HDF solution

I/O software & tools

Common data models

Standard APIs

Scientific data file format

Efficient storage, I/O

13

An HDF file is a container…

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

palette

palette

…into which you can put your data objects.

14

HDF structures for organizing objects in files

palette

Raster image

3-D array

2-D array

Raster image

lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

Table

“/” (root)

“/foo”
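The "/" (root) and "/foo" labels on this slide suggest a file-system-like hierarchy in which groups hold data objects and other groups. The following is a toy, pure-Python sketch of that idea only. It is not the HDF5 library API: the `Group` class and its `create_group`, `put`, `get`, and `walk` methods are hypothetical names chosen for illustration, and nothing here reads or writes a real HDF file.

```python
class Group:
    """Toy stand-in for an HDF group: a container of groups and data objects.

    Hypothetical illustration only; this is not the HDF5 library API.
    """

    def __init__(self):
        self.children = {}  # name -> Group or data object

    def create_group(self, path):
        """Create (or reuse) every group along a slash-separated path."""
        node = self
        for name in filter(None, path.split("/")):
            child = node.children.setdefault(name, Group())
            if not isinstance(child, Group):
                raise ValueError(f"{name!r} is a data object, not a group")
            node = child
        return node

    def put(self, path, data):
        """Store a data object at path, creating parent groups as needed."""
        parent_path, _, name = path.rstrip("/").rpartition("/")
        if not name:
            raise ValueError("path must name an object")
        self.create_group(parent_path).children[name] = data

    def get(self, path):
        """Return the group or data object stored at path."""
        node = self
        for name in filter(None, path.split("/")):
            if not isinstance(node, Group):
                raise KeyError(path)
            node = node.children[name]
        return node

    def walk(self, prefix=""):
        """Yield the absolute path of every object below this group."""
        for name, child in sorted(self.children.items()):
            path = f"{prefix}/{name}"
            yield path
            if isinstance(child, Group):
                yield from child.walk(path)


# Build a container like the one pictured: a table and a palette under "/foo",
# and a small raster image directly under the root.
root = Group()
root.put("/foo/table", [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)])
root.put("/foo/palette", [(0, 0, 0), (255, 255, 255)])
root.put("/image", [[0, 1], [1, 0]])

for object_path in root.walk():
    print(object_path)
```

Addressing objects by path keeps unrelated data (a table, an image, its palette) in one file while still letting a reader fetch a single object without touching the rest.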

16

Mesh Example, in HDFView

17

HDF5 Software

Tools & Applications

HDF File

HDF I/O Library

18

Goals of HDF5 Library

• Flexible API to support a wide range of operations on data
• High performance access in serial and parallel computing environments
• Compatibility with common data models and programming languages

19

Features

• Ability to create complex data structures
• Complex subsetting
• Efficient storage
• Flexible I/O (parallel, remote, etc.)
• Ability to transform data during I/O
• Support for key language models
  • OO compatible
  • C & Fortran primarily
  • Also Java, C++
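The slide does not say what "complex subsetting" covers. One form it takes in HDF5 is a hyperslab selection, which describes a regular pattern of elements per dimension with four values: start, stride, count, and block. The sketch below imitates that selection rule on a plain nested list. It is a toy under stated assumptions: `select_hyperslab` is a hypothetical helper, not a library call, and it copies elements in memory rather than reading from a file.

```python
def select_hyperslab(array2d, start, stride, count, block=(1, 1)):
    """Pick elements of a 2-D list using a hyperslab-style selection.

    In each dimension, begin at `start`, take `count` blocks that are each
    `block` elements wide, with consecutive blocks beginning `stride`
    elements apart. Hypothetical helper for illustration only.
    """
    def indices(dim):
        return [
            start[dim] + i * stride[dim] + j
            for i in range(count[dim])
            for j in range(block[dim])
        ]

    rows, cols = indices(0), indices(1)
    return [[array2d[r][c] for c in cols] for r in rows]


# A 6 x 8 grid whose values encode their own position (10 * row + column).
grid = [[10 * r + c for c in range(8)] for r in range(6)]

# Every second row starting at row 1, every third column starting at column 2.
subset = select_hyperslab(grid, start=(1, 2), stride=(2, 3), count=(2, 2))
print(subset)
```

Describing the selection with a few integers per dimension, rather than listing every index, is what lets a reader ask for a small, regularly spaced piece of a very large array.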

20

Sample uses of HDF

21

1. NASA Earth Observing System (EOS)

Aura: TES, HIRDLS, MLS, OMI

Terra: CERES, MISR, MODIS, MOPITT

Aqua (6/01): CERES, MODIS, AMSR

22

2. Advanced Simulation & Computing (ASC)

Question: How do we maintain a nuclear stockpile in the absence of testing?

23

Answer: Very large simulations on very large computers

24

ASC Data requirements

• Large datasets (> a terabyte)
• Good I/O performance on massively parallel systems
• Complex data and extensive metadata

25

26

3. Bioinformatics -- Managing genomic data

caacaagccaaaactcgtacaa
Cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
Acaatcgttgacattgcgacct
aatacagcccagcaagcagaat

27

DNA sequencing workflows

• Diverse formats
• Highly redundant data
• Repeated file processing
• Disconnected programs
• Non-scalable storage
• Lack of persistence

28

Multiple levels and relationships

[Diagram labels: Contig Summaries, Discrepancies, Contig Qualities, Coverage Depth, Read quality, Aligned bases, Contig, Reads, Percent match, SNP Score, Trace]

29

HDF5 as binary format for bioinformatics

30

4. Flight test data

31

4. Boeing flight test

32

Flight test data requirements

• Fast data acquisition from 1000s of sources
• Wide variety of data types
• Active archive
• Standardization for data/software exchange
• Special features

35

THG the Company

36

What is the HDF Group?

• 18 years at the National Center for Supercomputing Applications (NCSA) at the University of Illinois
• Recent spin-off from U of I
• Non-profit 501(c)(3)
• 17 scientific, technology, and professional staff
• 5 students
• 2+ million product users worldwide
• Cross industry sectors and disciplines

37

THG mission

To support the vast community of HDF users and to ensure the sustainable development of HDF technologies and the ongoing accessibility of HDF-stored data.

38

Business model

• Non-profit: mission driven
• Intellectual property:
  • U of I plans to assign ownership to THG
  • The HDF formats will remain free, and HDF software will remain open source.
• Continue close ties to U of I and NCSA.

39

Income-generating activities

• Major client support
• Targeted HDF development
• Grant-supported R&D
• Consulting

40

Thank you

41

HDF Information

• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • hdfhelp@hdfgroup.org
• HDF users mailing list
  • hdfnews@hdfgroup.org