A 15-minute presentation about the thesis.
Too much Data!
Sven Meys
Saturday 9 February 13
On-demand
Information Extraction from
Remote Sensing Images
with
MapReduce
Topic
Contents
• Context
• Literature study
• Planning
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
700 · €103 million
84% Government · 16% Private
[Organisation chart: VITO research units, grouped under the themes Energy and Quality of Environment]
• Energy Technology
• Industrial Innovation
• Transition Energy & Environment
• Environmental Analysis & Technology
• Material Technology
• Separation & Conversion Technology
• Remote Sensing
• Environmental Modelling
• Environmental Health
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
Remote Sensing
1 km² per pixel · 0.5 billion pixels · 1.2 GB
RS Applications
Time Series: NDVI, 01-01-2001 – 01-01-2012
Algorithm: Mean
Output:
SUBMIT
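The job sketched on this slide, a per-pixel mean over an NDVI time series, can be illustrated with a short NumPy sketch (hypothetical array shapes and random data, not the actual VITO pipeline):

```python
import numpy as np

# Hypothetical stack: 12 NDVI composites of 4 x 4 pixels each,
# with values in [-1, 1] as is conventional for NDVI.
rng = np.random.default_rng(0)
stack = rng.uniform(-1.0, 1.0, size=(12, 4, 4))

# The "Mean" algorithm from the slide: average each pixel over time.
mean_ndvi = stack.mean(axis=0)  # shape (4, 4), one value per pixel

print(mean_ndvi.shape)  # (4, 4)
```

On a real archive, each map task would compute such a mean over the pixels in its input split, and a reduce step would assemble the output image.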
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
Problem statement
• Better sensors → better images
• More data → more expensive storage
• More information
• Data transport
• More computation → expensive supercomputers → parallel processing
Goals
• Fast enough
• Affordable
• Scalable
→ File system + software framework
Research questions
• How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
Contents
• Context
• Literature study
• Planning
Literature study
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• CERN: Large Hadron Collider (14)
• 20 PB/year
HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append (1)
[Diagram: a file divided into blocks 1, 2, …, n]
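With 128 MiB blocks, the 1.2 GB image from the earlier slide occupies only a handful of blocks; a tiny sketch (treating GB as GiB for illustration):

```python
import math

BLOCK = 128 * 1024 * 1024   # HDFS block size from the slide: 128 MiB
size = 1.2 * 1024 ** 3      # the 1.2 GB image, read here as 1.2 GiB

# Number of HDFS blocks needed to store the file.
blocks = math.ceil(size / BLOCK)
print(blocks)  # 10
```

Large blocks keep the block-location metadata on the Name Node small and give each map task a substantial chunk of work.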
HDFS
Calvalus Final Report Brockmann Consult GmbH
Page 8 / 43 Copyright © Brockmann Consult GmbH
3 Technical Approach
3.1 Hadoop Distributed Computing
The basis of the Calvalus processing system is Apache Hadoop. Hadoop is industry-proven open-source software capable of running clusters of tens to tens of thousands of computers, processing ultra-large amounts of data through massive parallelisation and a distributed file system.
3.1.1 Distributed File System (DFS)
In contrast to a local file system, the Network File System (NFS) or the Common Internet File System (CIFS), a distributed file system (DFS) uses multiple nodes in a cluster to store files and data resources [RD-5]. A DFS usually provides transparent file replication and fault tolerance, and furthermore enables data locality for processing tasks. It does this by subdividing files into blocks and replicating these blocks within a cluster of computers. Figure 2 shows the distribution and replication (right) of a file (left) subdivided into three blocks.
Figure 2: File blocks, distribution and replication in a distributed file system
Figure 3 demonstrates how the file system handles node failure by automated recovery of under-replicated blocks. HDFS further uses checksums to verify block integrity. As long as there is at least one intact and accessible copy of a block, it can automatically be re-replicated to return to the requested replication factor.
Figure 3: Automatic repair in case of cluster node failure by additional replication
Figure 4 shows how a distributed file system re-assembles blocks to retrieve the complete file for external retrieval.
HDFS
Figure 4: Block assembly for data retrieval from the distributed file system
3.1.2 Data Locality
Data processing systems that need to read and write large amounts of data perform best if the data I/O takes place on local storage devices. In clusters where storage nodes are separated from compute nodes, two situations are likely:
1. Network bandwidth is the bottleneck, especially when multiple tasks work in parallel on the same input data but from different compute nodes and when storage nodes are separated from compute nodes.
2. Transfer rates of the local hard drives are the bottleneck, especially when multiple tasks are working in parallel on single (multi-CPU, multi-core) compute nodes.
A solution to these problems is first to use a cluster whose nodes are both compute and storage nodes, and second to distribute the processing tasks and execute them on the nodes that are “close” to the data with respect to the network topology (see Figure 5). Parallel processing of inputs is done on splits. A split is a logical part of an input file that usually has the size of the blocks that store the data, but in contrast to a block, which ends at an arbitrary byte position, a split is always aligned at file-format-specific record boundaries (see next chapter, step 1). Since splits are roughly aligned with file blocks, processing of input splits can be performed data-locally.
Figure 5: Data-local processing and result assembly for retrieval
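The split alignment described in the excerpt above can be sketched in a few lines (hypothetical record format: newline-terminated records; the real logic lives in Hadoop's InputFormat classes):

```python
def aligned_splits(data: bytes, block_size: int):
    """Split data at roughly block_size boundaries, but always extend
    each split to the end of the current record (a newline here)."""
    splits, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        # Extend past the block boundary until a record boundary.
        while end < len(data) and data[end - 1:end] != b"\n":
            end += 1
        splits.append((start, end))
        start = end
    return splits

data = b"rec1\nrecord2\nr3\n"
print(aligned_splits(data, 6))  # [(0, 13), (13, 16)]
```

Note how the first split overshoots its 6-byte "block" so that no record is cut in half; this is why splits are only roughly the block size.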
3.1.3 MapReduce Programming Model
The MapReduce programming model was published in 2004 by the two Google scientists J. Dean and S. Ghemawat [RD 4]. It is used for processing and generating huge datasets on clusters for certain kinds of distributable problems. The model is composed of a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world problems can be expressed in terms of this model, and programs written in this functional style can be easily parallelised.
HDFS - Overview
• Scalable
• Fast read/write
• Robust
• A factor of 10 cheaper (2)
MapReduce
MapReduce - WordCount
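The WordCount example can be simulated in plain Python, with the map, shuffle and reduce phases made explicit (no Hadoop required; purely illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # Emit one (word, 1) pair for every word in the input line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum all counts that the shuffle grouped under this word.
    return (key, sum(values))

def mapreduce(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle phase: bring identical keys together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [reduce_fn(k, (v for _, v in grp))
            for k, grp in groupby(intermediate, key=itemgetter(0))]

counts = mapreduce([(1, "to be or not to be")])
print(dict(counts))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In Hadoop the three phases run on different cluster nodes and the shuffle moves data over the network; the functional shape is the same.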
MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
Implementations
• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)

            Hadoop   Stratosphere   HPCC
Support       +           -           +
Extensions    +           -           ?
Community    +++         +/-          -
Target       ANY         EDU          BI
Distributions
• Hortonworks (7)
• MapR (8)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licence model
General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
Contents
• Context
• Literature study
• Planning
Planning
[Timeline: literature study, phase 1, phase 2, phase 3, phase 4; today; report hand-in; master's thesis; internship; milestones 01/02, 15/03, 20/05, 01/09]
Phase 1 - Done
[Cluster diagram: workstations of Sven (master node), Patrick, Bruno and Tim at 192.168.10.245–192.168.10.249, running the Name Node, Data Nodes, Job Tracker and Task Trackers]
• JT = Job Tracker
• TT = Task Tracker
• NN = Name Node
• DN = Data Node
• RedHat 6.2 Workstation
• RedHat 6.2 Virtual Machine
Phase 2
• A simple algorithm
• Rotate an image
• Standard IO
• HDFS
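Rotating an image, the simple algorithm planned for this phase, is a one-liner once a band is in memory (NumPy sketch; the thesis would read the raster from HDFS instead of hard-coding it):

```python
import numpy as np

# Hypothetical 2 x 3 single-band image tile.
img = np.array([[1, 2, 3],
                [4, 5, 6]])

# Rotate the tile 90 degrees counter-clockwise.
rotated = np.rot90(img)

print(rotated)
# [[3 6]
#  [2 5]
#  [1 4]]
```

The interesting part of the phase is not the rotation itself but moving the pixels through standard IO and HDFS.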
Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI
• Temporal/Spectral: multiple images
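The spatial convolution mask mentioned above could be sketched as follows (naive 3 x 3 mean filter on a hypothetical tile; a real MapReduce version would also need overlapping split borders). The mask is not flipped here, which is fine for symmetric masks:

```python
import numpy as np

def convolve2d(tile, mask):
    """Naive 'valid'-mode filtering of an image tile with a mask."""
    mh, mw = mask.shape
    th, tw = tile.shape
    out = np.zeros((th - mh + 1, tw - mw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the neighbourhood under the mask.
            out[i, j] = np.sum(tile[i:i + mh, j:j + mw] * mask)
    return out

tile = np.arange(25, dtype=float).reshape(5, 5)
mean_mask = np.full((3, 3), 1.0 / 9.0)
smoothed = convolve2d(tile, mean_mask)
print(smoothed.shape)  # (3, 3)
```

Because each output pixel needs its neighbours, splits that cut the image must share a border of mask-radius pixels, which is exactly what makes spatial operations harder in MapReduce than per-pixel ones.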
Phase 4
• Performance as a function of pixel distance
Planning
[Timeline: literature study, phase 1, phase 2, phase 3, phase 4; today; report hand-in; master's thesis; internship; milestones 01/02, 15/03, 20/05, 01/09]
The End
• Lots of data
• A different way of thinking
• Many possibilities
• RLZ or a new Big Data elective? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LiDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the Earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D. Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’