15 minute presentation about Thesis
![Page 1: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/1.jpg)
Too much Data!
Sven Meys
Saturday 9 February 13
![Page 2: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/2.jpg)
Topic
On-demand Information Extraction from Remote Sensing Images with MapReduce
![Page 3: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/3.jpg)
Contents
• Context
• Literature study
• Planning
![Page 4: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/4.jpg)
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
![Page 5: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/5.jpg)
700
€103 million
84% / 16%
Government / Private
![Page 6: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/6.jpg)
Energy, Industrial Innovation
• Energy Technology
• Transition Energy & Environment
• Environmental Analysis & Technology
• Material Technology
• Separation & Conversion Technology
Quality of Environment
• Remote Sensing
• Environmental Modelling
• Environmental Health
![Page 7: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/7.jpg)
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
![Page 8: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/8.jpg)
![Page 9: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/9.jpg)
![Page 10: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/10.jpg)
Remote Sensing
![Page 11: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/11.jpg)
1 km² per pixel
0.5 billion pixels
1.2 GB
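As a rough sanity check on these figures, the sketch below derives the average storage per pixel and the number of 128 MiB blocks such an image would occupy. Reading "GB" as 10^9 bytes is an assumption, and the 128 MiB block size is borrowed from the HDFS slide later in the deck; the slide itself does not state the pixel encoding.

```python
import math

# Figures taken from the slide.
PIXELS = 0.5e9                 # 0.5 billion pixels
IMAGE_BYTES = 1.2e9            # 1.2 GB, assumed to mean 1.2 * 10**9 bytes
BLOCK_BYTES = 128 * 2**20      # 128 MiB, the block size quoted on the HDFS slide


def bytes_per_pixel(image_bytes: float, pixels: float) -> float:
    """Average number of bytes stored per pixel."""
    return image_bytes / pixels


def blocks_needed(image_bytes: float, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold the image."""
    return math.ceil(image_bytes / block_bytes)


print(f"{bytes_per_pixel(IMAGE_BYTES, PIXELS):.1f} bytes per pixel")
print(f"{blocks_needed(IMAGE_BYTES, BLOCK_BYTES)} blocks of 128 MiB")
```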
![Page 12: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/12.jpg)
RS Applications
![Page 13: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/13.jpg)
Time Series: NDVI, 01-01-2001 to 01-01-2012
Algorithm: Mean
Output:
SUBMIT
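A minimal sketch of what a request like this might compute, assuming NDVI is derived per pixel as (NIR - Red) / (NIR + Red) and that "Mean" denotes the per-pixel mean over the time series. The function names and the tiny sample images are hypothetical and only serve as illustration.

```python
from typing import List

Image = List[List[float]]  # rows of pixel values


def ndvi(nir: Image, red: Image) -> Image:
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where both bands are 0."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0 for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


def temporal_mean(series: List[Image]) -> Image:
    """Per-pixel mean over a time series of equally sized images."""
    count = len(series)
    rows, cols = len(series[0]), len(series[0][0])
    return [
        [sum(img[y][x] for img in series) / count for x in range(cols)]
        for y in range(rows)
    ]


# Two hypothetical 1x2 acquisitions, each given as (NIR band, red band).
t0 = ndvi([[0.5, 0.25]], [[0.5, 0.75]])
t1 = ndvi([[0.75, 0.25]], [[0.25, 0.25]])
mean_image = temporal_mean([t0, t1])
print(mean_image)
```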
![Page 14: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/14.jpg)
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
![Page 15: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/15.jpg)
Problem statement
Better sensors
Better images
More data
More expensive storage
More information
Data transport
More computation
Expensive supercomputers
Parallel processing
![Page 16: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/16.jpg)
Goals
• Fast enough
• Affordable
• Scalable
File system + software framework
![Page 17: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/17.jpg)
Research questions
• How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
![Page 18: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/18.jpg)
Contents
• Context
• Literature study
• Planning
![Page 19: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/19.jpg)
Literature study
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
![Page 20: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/20.jpg)
Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• Large Hadron Collider (14)
• 20 PB/year
![Page 21: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/21.jpg)
HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append (1)
[Figure: blocks 1, 2, ..., n]
![Page 22: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/22.jpg)
HDFS
Calvalus Final Report Brockmann Consult GmbH
Page 8 / 43 Copyright © Brockmann Consult GmbH
3 Technical Approach
3.1 Hadoop Distributed Computing
The basis of the Calvalus processing system is Apache Hadoop. Hadoop is industry-proven open-source software capable of running clusters of tens to tens of thousands of computers and of processing very large amounts of data, based on massive parallelisation and a distributed file system.
3.1.1 Distributed File System (DFS)
In contrast to a local file system, the Network File System (NFS) or the Common Internet File System (CIFS), a distributed file system (DFS) uses multiple nodes in a cluster to store the files and data resources [RD-5]. A DFS usually provides transparent file replication and fault tolerance, and furthermore enables data locality for processing tasks. A DFS does this by subdividing files into blocks and replicating these blocks within a cluster of computers. Figure 2 shows the distribution and replication (right) of a file (left) subdivided into three blocks.
Figure 2: File blocks, distribution and replication in a distributed file system
Figure 3 demonstrates how the file system handles node failure by automated recovery of under-replicated blocks. HDFS further uses checksums to verify block integrity. As long as there is at least one intact and accessible copy of a block, it can automatically be re-replicated to return to the requested replication rate.
Figure 3: Automatic repair in case of cluster node failure by additional replication
Figure 4 shows how a distributed file system re-assembles blocks to retrieve the complete file for external retrieval.
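The subdivision, replication and repair steps described above can be sketched in a few lines. This is a toy model under stated assumptions, not how HDFS is implemented: replicas are placed round-robin (the real placement policy is more involved), and all function and node names are hypothetical.

```python
import math
from typing import Dict, List, Set


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return block ids 0..n-1 for a file of file_size bytes."""
    return list(range(math.ceil(file_size / block_size)))


def place_replicas(blocks: List[int], nodes: List[str], replication: int) -> Dict[int, Set[str]]:
    """Assign each block to `replication` distinct nodes (round-robin, an assumption)."""
    return {
        b: {nodes[(b + i) % len(nodes)] for i in range(replication)}
        for b in blocks
    }


def handle_node_failure(placement: Dict[int, Set[str]], failed: str,
                        nodes: List[str], replication: int) -> Dict[int, Set[str]]:
    """Drop the failed node and re-replicate under-replicated blocks."""
    alive = [n for n in nodes if n != failed]
    for block, holders in placement.items():
        holders.discard(failed)
        if not holders:
            raise RuntimeError(f"block {block} lost: no intact copy left")
        for candidate in alive:
            if len(holders) >= replication:
                break
            holders.add(candidate)  # stands in for copying from a surviving replica
    return placement


MIB = 2**20
nodes = ["n1", "n2", "n3", "n4"]
blocks = split_into_blocks(300 * MIB, 128 * MIB)      # a 300 MiB file gives three blocks
placement = place_replicas(blocks, nodes, replication=3)
placement = handle_node_failure(placement, "n3", nodes, replication=3)
print(placement)
```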
![Page 23: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/23.jpg)
HDFS
Figure 4: Block assembly for data retrieval from the distributed file system
3.1.2 Data Locality
Data processing systems that need to read and write large amounts of data perform best if the data I/O takes place on local storage devices. In clusters, where storage nodes are separated from compute nodes, two situations are likely:
1. Network bandwidth is the bottleneck, especially when multiple tasks work in parallel on the same input data but from different compute nodes and when storage nodes are separated from compute nodes.
2. Transfer rates of the local hard drives are the bottleneck, especially when multiple tasks are working in parallel on single (multi-CPU, multi-core) compute nodes.
A solution to these problems is, first, to use a cluster whose nodes are both compute and storage nodes and, second, to distribute the processing tasks and execute them on the nodes that are “close” to the data with respect to the network topology (see Figure 5). Parallel processing of inputs is done on splits. A split is a logical part of an input file that usually has the size of the blocks that store the data. In contrast to a block, which ends at an arbitrary byte position, a split is always aligned at file-format-specific record boundaries (see next chapter, step 1). Since splits are roughly aligned with file blocks, processing of input splits can be performed data-locally.
Figure 5: Data-local processing and result assembly for retrieval
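The difference between a block and a split can be illustrated with a small sketch. It assumes fixed-length records (for example one image row per record), whereas real input formats locate record boundaries in format-specific ways; the function name and the numbers are hypothetical.

```python
from typing import List, Tuple


def record_aligned_splits(file_size: int, block_size: int,
                          record_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges, roughly one per block, where every
    boundary is moved forward to the next record boundary."""

    def align_up(offset: int) -> int:
        # Round up to a multiple of record_size, never past the end of the file.
        return min(file_size, -(-offset // record_size) * record_size)

    splits = []
    start = 0
    block_end = block_size
    while start < file_size:
        end = align_up(min(block_end, file_size))
        if end > start:  # skip blocks that fall entirely inside one record
            splits.append((start, end))
            start = end
        block_end += block_size
    return splits


# Blocks end at bytes 256, 512, 768; records are 100 bytes long.
print(record_aligned_splits(file_size=1000, block_size=256, record_size=100))
```

Each split still starts inside the block it is derived from, which is why a task scheduled on a node holding that block reads most of its input locally.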
3.1.3 MapReduce Programming Model
The MapReduce programming model was published in 2004 by the two Google scientists J. Dean and S. Ghemawat [RD 4]. It is used for processing and generating huge datasets on clusters for certain kinds of distributable problems. The model is composed of a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world problems can be expressed in terms of this model, and programs written in this functional style can be easily parallelised.
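The map and reduce functions described above can be illustrated with the classic word count, here as a single-process sketch in plain Python. It does not use the Hadoop API; `run_mapreduce` is a hypothetical stand-in for the framework, which would run the map calls in parallel and perform the grouping (shuffle) across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(lines: List[str]) -> Dict[str, int]:
    """Sequential stand-in for the framework: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in map_fn(offset, line):
            intermediate[word].append(one)  # shuffle: group values by key
    return dict(reduce_fn(word, values) for word, values in sorted(intermediate.items()))


counts = run_mapreduce(["the quick fox", "the lazy dog"])
print(counts)
```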
![Page 24: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/24.jpg)
HDFS
![Page 25: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/25.jpg)
HDFS - Overview
• Scalable
• Fast read/write
• Robust
• 10 times cheaper (2)
![Page 26: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/26.jpg)
MapReduce
![Page 27: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/27.jpg)
MapReduce - WordCount
![Page 28: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/28.jpg)
MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
![Page 29: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/29.jpg)
Implementations
• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)

| | Hadoop | Stratosphere | HPCC |
|---|---|---|---|
| Support | + | - | + |
| Extensions | + | - | ? |
| Community | +++ | +/- | - |
| Target | ANY | EDU | BI |
![Page 30: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/30.jpg)
Distributions
• Hortonworks (7)
• (8)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licensing model
![Page 31: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/31.jpg)
General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
![Page 32: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/32.jpg)
Contents
• Context
• Literature study
• Planning
![Page 33: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/33.jpg)
Planning
[Timeline figure. Phases: literature, phase 1, phase 2, phase 3, phase 4. Markers: today, hand in report, master's thesis, internship. Dates: 01/02, 15/03, 20/05, 01/09.]
![Page 34: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/34.jpg)
Phase 1 - Done
[Cluster diagram: machines labelled Sven, Patrick, Bruno and Tim, with Master and Workstation labels; IP addresses 192.168.10.245 to 192.168.10.249]
Legend:
JT = Job Tracker
TT = Task Tracker
NN = Name Node
DN = Data Node
Machine types: RedHat 6.2 Workstation, RedHat 6.2 Virtual Machine
![Page 35: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/35.jpg)
Phase 2
• Simple algorithm
• Rotate image
• Standard I/O
• HDFS
![Page 36: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/36.jpg)
Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI
• Temporal/spectral: multiple images
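A spatial operation such as a convolution mask reads the neighbours of every pixel, so a tile that is processed in isolation also needs a border of pixels from the adjacent tiles. The sketch below is a plain-Python illustration of that dependency, not the thesis implementation: edge pixels are clamped (an assumption), and the mask is applied without flipping, which gives the same result as a true convolution for symmetric masks.

```python
from typing import List

Image = List[List[float]]


def convolve(image: Image, mask: Image) -> Image:
    """Apply a square, odd-sized mask; out-of-range pixels are clamped to the edge."""
    rows, cols = len(image), len(image[0])
    radius = len(mask) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), rows - 1)
                    xx = min(max(x + dx, 0), cols - 1)
                    acc += image[yy][xx] * mask[dy + radius][dx + radius]
            out[y][x] = acc
    return out


def halo_width(mask: Image) -> int:
    """Number of neighbouring pixel rows/columns a tile needs on each side."""
    return len(mask) // 2


mean_mask = [[1 / 9] * 3 for _ in range(3)]
identity_mask = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
image = [[1.0, 2.0], [3.0, 4.0]]
print(convolve(image, identity_mask), halo_width(mean_mask))
```

The larger the mask, the wider the border each tile needs, which is one way the pixel distance of an algorithm can affect how well it fits block-wise storage.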
![Page 37: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/37.jpg)
Phase 4
• Performance as a function of pixel distance
![Page 38: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/38.jpg)
Planning
![Page 39: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/39.jpg)
The End
• Lots of data
• Thinking differently
• Many possibilities
• RLZ or a new elective course on Big Data? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
![Page 40: 15 minute presentation about Thesis](https://reader031.fdocuments.us/reader031/viewer/2022020110/54b41cab4a7959620e8b45dd/html5/thumbnails/40.jpg)
References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LIDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: Efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D. Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’