DM_PPT_NP_v01 SESIP_0715_GH2
Putting some Spark into HDF5
Gerd Heber & Joe Lee
The HDF Group
Champaign Illinois USA
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C
The Return of
Outline
• “The Big Schism”
• A Shiny New Engine
• Getting off the Ground
• Future Work
July 14 – 17, 2015
“The Big Schism”
• An HDF5 file is a Smart Data Container
• “This is what happens, Larry, when you copy an HDF5 file into HDFS!” (Walter Sobchak)
Natural Habitat: Traditional File System
Block Store: Hadoop “File System” (HDFS)
Ouch! Don’t mess with HDF5!
Now What?
• Ask questions:
  – Who wants HDF5 files in Hadoop? (volatile)
    • Who wants to program MapReduce? (nobody)
  – How big are your HDF5 files? (long-tailed distribution)
    • No size (solution) fits all...
• Do experiments:
  – Reverse-engineer the format (students, weirdos)
  – In-core processing (fiddly)
  – Convert to Avro (some success)
• Sit tight and wait for something better!
Spark Concepts
Formally, an RDD (Resilient Distributed Dataset) is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs.
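To make the definition concrete, here is a toy, pure-Python model of the idea. It is not Spark's implementation: the class name ToyRDD and its methods are illustrative assumptions that only mirror the shape of Spark's parallelize, map, filter, and collect. Records live in immutable partitions, and a new collection can only be derived from existing data or from another collection through a deterministic function.

```python
from typing import Callable, Iterable, Tuple


class ToyRDD:
    """Toy stand-in for an RDD: a read-only, partitioned collection of records."""

    def __init__(self, partitions: Iterable[Iterable]):
        # Tuples make the partitions immutable (the "read-only" part).
        self._partitions: Tuple[tuple, ...] = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, records: Iterable, num_partitions: int) -> "ToyRDD":
        """Case (1): build from data that already exists, split round-robin."""
        records = list(records)
        return cls(records[i::num_partitions] for i in range(num_partitions))

    def map(self, fn: Callable) -> "ToyRDD":
        """Case (2): derive a new collection from an existing one."""
        return ToyRDD(tuple(fn(r) for r in p) for p in self._partitions)

    def filter(self, pred: Callable) -> "ToyRDD":
        """Case (2) again: keep only the records that satisfy pred."""
        return ToyRDD(tuple(r for r in p if pred(r)) for p in self._partitions)

    def num_partitions(self) -> int:
        return len(self._partitions)

    def collect(self) -> list:
        """Gather all records, partition by partition."""
        return [r for p in self._partitions for r in p]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(sorted(evens_squared.collect()))   # [0, 4, 16, 36, 64]
```

Neither filter nor map touches the source collection; each returns a new one. Because every step is deterministic, a lost partition could in principle be recomputed from its parent.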
What’s Great about Spark
• Refreshingly abstract
• Supports Python
• Typically runs in RAM
• Has batteries included
Experimental Setup
• GSSTF_NCEP.3 collection, 7/1/1987 to 12/31/2008
• 7,850 HDF-EOS5 files, 16 MB per file, ~120 GB total
• 4 variables on daily 1440x720 grid
  – Sea level pressure (hPa)
  – 2m air temperature (C)
  – Sea surface skin temperature (C)
  – Sea surface saturation humidity (g/kg)
• Lenovo ThinkPad X230T
  – Intel Core i5-3320M (2 cores, 4 threads), 8GB of RAM, Samsung SSD 840 Pro
  – Windows 8.1 (64-bit), Apache Spark 1.3.0
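The figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes one file per day and, as a guess not stated on the slide, that each grid value is stored as an uncompressed 4-byte float. Under those assumptions the numbers land close to the reported 7,850 files, 16 MB per file, and ~120 GB total.

```python
from datetime import date

# Figures quoted on the slide.
first_day, last_day = date(1987, 7, 1), date(2008, 12, 31)
files_reported = 7850
grid_cells = 1440 * 720
variables = 4

# Assumption (not from the slide): uncompressed 4-byte floats.
bytes_per_value = 4

days_in_range = (last_day - first_day).days + 1   # inclusive day count
bytes_per_file = grid_cells * variables * bytes_per_value
total_bytes = files_reported * bytes_per_file

print(days_in_range)                       # 7855
print(round(bytes_per_file / 2**20, 1))    # 15.8 (MiB per file)
print(round(total_bytes / 2**30, 1))       # 121.3 (GiB in total)
```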
Getting off the Ground
Where do they dwell?
General Strategy
1. Create our first RDD – “list of file names/paths/...”
   a. Traverse base directory, compile list of HDF5 files
   b. Partition the list via SparkContext.parallelize()
2. Use the RDD’s flatMap method to calculate something interesting, e.g., summary statistics
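The two steps can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the code used for the talk: DATASET_PATH and FILL_VALUE are hypothetical placeholders, the helper names are invented for illustration, and h5py and pyspark are imported lazily, so the file-listing and statistics parts run without either installed.

```python
import os
import statistics
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical placeholders: adjust to the layout of your own files.
DATASET_PATH = "/path/to/Tair_2m"   # location of the variable inside each file
FILL_VALUE = -999.0                 # assumed marker for missing grid cells


def list_hdf5_files(base_dir: str,
                    suffixes: Sequence[str] = (".h5", ".he5")) -> List[str]:
    """Step 1a: walk the base directory and collect HDF5 file paths."""
    found = []
    for root, _dirs, names in os.walk(base_dir):
        for name in names:
            if name.lower().endswith(tuple(suffixes)):
                found.append(os.path.join(root, name))
    return sorted(found)


def read_with_h5py(path: str) -> Iterable[float]:
    """Read one variable as a flat list of floats (needs h5py installed)."""
    import h5py  # imported lazily so the rest of the sketch runs without it
    with h5py.File(path, "r") as f:
        return f[DATASET_PATH][...].ravel().tolist()


def summarize(path: str,
              read: Callable[[str], Iterable[float]] = read_with_h5py
              ) -> List[Tuple[str, Tuple[int, float, float]]]:
    """Step 2: return zero or one (file, (count, mean, median)) records.

    Returning a list is what makes this suitable for flatMap: a file with
    no valid samples contributes nothing instead of a placeholder record.
    """
    valid = [v for v in read(path) if v != FILL_VALUE]
    if not valid:
        return []
    stats = (len(valid), statistics.fmean(valid), statistics.median(valid))
    return [(os.path.basename(path), stats)]


def run(base_dir: str, partitions: int = 8):
    """Steps 1b and 2 on a local Spark installation (needs pyspark)."""
    from pyspark import SparkContext
    sc = SparkContext(appName="hdf5-summary")
    try:
        paths = sc.parallelize(list_hdf5_files(base_dir), partitions)
        return paths.flatMap(summarize).collect()
    finally:
        sc.stop()
```

Only file paths are shipped to the workers; each worker opens its own files, which is why the files must be reachable from wherever the workers run.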
Calculating Tair_2m mean and median for 3.5 years took about 10 seconds on my notebook.
Variations
• Instead of traversing directories, you can provide a CSV file of [HDF5 file names, path names, hyperslab selections, etc.] to partition
• A fast SSD array goes a long way
• If you have a distributed file system (e.g., GPFS, Lustre, Ceph), you should be able to feed large numbers of Spark workers (running on a cluster)
• If you don’t have a parallel file system and use most of the data in a file, you can stage (copy) the files first on the cluster nodes
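The CSV variation in the first bullet could look like the sketch below. The row layout (file name, dataset path, hyperslab), the "start:stop;start:stop" selection syntax, the example file and dataset names, and the WorkItem type are all assumptions made for illustration. Only the h5py slicing in read_hyperslab touches a real library, and it is imported lazily.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class WorkItem(NamedTuple):
    file_name: str                 # HDF5 file to open
    dataset_path: str              # path of the dataset inside the file
    selection: Tuple[slice, ...]   # one slice per dimension (the hyperslab)


def parse_selection(text: str) -> Tuple[slice, ...]:
    """Turn '0:360;0:720' into (slice(0, 360), slice(0, 720))."""
    slices = []
    for part in text.split(";"):
        start, stop = (int(n) for n in part.split(":"))
        slices.append(slice(start, stop))
    return tuple(slices)


def parse_work_items(csv_text: str) -> List[WorkItem]:
    """Read rows of the form: file name, dataset path, hyperslab selection."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [WorkItem(row[0], row[1], parse_selection(row[2]))
            for row in reader if row]


def read_hyperslab(item: WorkItem):
    """Read only the selected region (needs h5py; imported lazily)."""
    import h5py
    with h5py.File(item.file_name, "r") as f:
        return f[item.dataset_path][item.selection]


example = """\
day1.he5, /grid/Tair_2m, 0:360;0:720
day2.he5, /grid/Tair_2m, 360:720;720:1440
"""
items = parse_work_items(example)
print(items[1].selection)   # (slice(360, 720, None), slice(720, 1440, None))
```

The resulting list can be handed to SparkContext.parallelize() in place of a list of file names, so that each Spark task reads only the hyperslab it was assigned.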
Conclusion
• Forget MapReduce, stop worrying about HDFS
• With Spark, exploiting data parallelism has never been more accessible (easier and cheaper)
• Current HDF5 to Spark on-ramps can be effective under the right circumstances, but are kludgy
• Work with us to build the right things right!
References
[BigHDF] https://www.hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf
[Blog] https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/
[Report] Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, UC Berkeley, 2011.
[Spark] https://spark.apache.org/
[YouTube] Mark Madsen: Big Data, Bad Analogies, 2014.
THANK YOU