A 15-minute presentation about the thesis.
Too much Data!
Sven Meys
Saturday 9 February 13
On-demand
Information Extraction from
Remote Sensing Images
with
MapReduce
Topic
Contents
• Context
• Literature study
• Planning
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
700 · €103 million
84% Government · 16% Private
[Organisation chart: VITO research units, grouped under the themes Energy and Quality of Environment]
• Energy Technology
• Industrial Innovation
• Transition Energy & Environment
• Environmental Analysis & Technology
• Material Technology
• Separation & Conversion Technology
• Remote Sensing
• Environmental Modelling
• Environmental Health
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
Remote Sensing
1 km² per pixel · 0.5 billion pixels · 1.2 GB
RS Applications
Time Series: NDVI, 01-01-2001 – 01-01-2012
Algorithm: Mean
Output:
SUBMIT
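The job sketched on this slide, a per-pixel mean over an NDVI time series, can be illustrated with a short NumPy sketch (hypothetical array shapes and random data, not the actual VITO pipeline):

```python
import numpy as np

# Hypothetical stack: 12 NDVI composites of 4 x 4 pixels each,
# with values in [-1, 1] as is conventional for NDVI.
rng = np.random.default_rng(0)
stack = rng.uniform(-1.0, 1.0, size=(12, 4, 4))

# The "Mean" algorithm from the slide: average each pixel over time.
mean_ndvi = stack.mean(axis=0)  # shape (4, 4), one value per pixel

print(mean_ndvi.shape)  # (4, 4)
```

On a real archive, each map task would compute such a mean over the pixels in its input split, and a reduce step would assemble the output image.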
Context
• VITO
• Remote Sensing
• Problem statement
• Research questions
Problem statement
• Better sensors → better images
• More data → more expensive storage
• More information
• Data transport
• More computation → expensive supercomputers → parallel processing
Goals
• Fast enough
• Affordable
• Scalable
→ File system + software framework
Research questions
• How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel?
• Which algorithms can be used with this storage technique and MapReduce?
Contents
• Context
• Literature study
• Planning
Literature study
• Interesting projects
• HDFS
• MapReduce
• Implementations
• Distributions
• Current literature
Interesting projects
• NASA (12)
• Center for Climate Simulation
• Square Kilometer Array: 700 TB/sec
• Open Cloud Consortium (13)
• Project Matsu: Elastic Clouds for Disaster Relief
• CERN: Large Hadron Collider (14)
• 20 PB/year
HDFS
• Distributed file system
• Based on the Google File System (1)
• Large blocks (128 MiB)
• Commodity hardware
• Failure is the norm
• Read & append (1)
[Diagram: a file divided into blocks 1, 2, …, n]
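With 128 MiB blocks, the 1.2 GB image from the earlier slide occupies only a handful of blocks; a tiny sketch (treating GB as GiB for illustration):

```python
import math

BLOCK = 128 * 1024 * 1024   # HDFS block size from the slide: 128 MiB
size = 1.2 * 1024 ** 3      # the 1.2 GB image, read here as 1.2 GiB

# Number of HDFS blocks needed to store the file.
blocks = math.ceil(size / BLOCK)
print(blocks)  # 10
```

Large blocks keep the block-location metadata on the Name Node small and give each map task a substantial chunk of work.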
HDFS
Calvalus Final Report Brockmann Consult GmbH
Page 8 / 43 Copyright © Brockmann Consult GmbH
3 Technical Approach
3.1 Hadoop Distributed Computing
The basis of the Calvalus processing system is Apache Hadoop. Hadoop is industry-proven open-source software capable of running clusters of tens to tens of thousands of computers, processing ultra-large amounts of data through massive parallelisation and a distributed file system.
3.1.1 Distributed File System (DFS)
In contrast to a local file system, the Network File System (NFS) or the Common Internet File System (CIFS), a distributed file system (DFS) uses multiple nodes in a cluster to store files and data resources [RD-5]. A DFS usually provides transparent file replication and fault tolerance, and furthermore enables data locality for processing tasks. It does this by subdividing files into blocks and replicating these blocks within a cluster of computers. Figure 2 shows the distribution and replication (right) of a file (left) subdivided into three blocks.
Figure 2: File blocks, distribution and replication in a distributed file system
Figure 3 demonstrates how the file system handles node failure by automated recovery of under-replicated blocks. HDFS further uses checksums to verify block integrity. As long as there is at least one intact and accessible copy of a block, it can automatically be re-replicated to return to the requested replication factor.
Figure 3: Automatic repair in case of cluster node failure by additional replication
Figure 4 shows how a distributed file system re-assembles blocks to retrieve the complete file for external retrieval.
HDFS
Figure 4: Block assembly for data retrieval from the distributed file system
3.1.2 Data Locality
Data processing systems that need to read and write large amounts of data perform best if the data I/O takes place on local storage devices. In clusters where storage nodes are separated from compute nodes, two situations are likely:
1. Network bandwidth is the bottleneck, especially when multiple tasks work in parallel on the same input data but from different compute nodes and when storage nodes are separated from compute nodes.
2. Transfer rates of the local hard drives are the bottleneck, especially when multiple tasks are working in parallel on single (multi-CPU, multi-core) compute nodes.
A solution to these problems is first to use a cluster whose nodes are both compute and storage nodes, and second to distribute the processing tasks and execute them on the nodes that are “close” to the data with respect to the network topology (see Figure 5). Parallel processing of inputs is done on splits. A split is a logical part of an input file that usually has the size of the blocks that store the data, but in contrast to a block, which ends at an arbitrary byte position, a split is always aligned at file-format-specific record boundaries (see next chapter, step 1). Since splits are roughly aligned with file blocks, processing of input splits can be performed data-locally.
Figure 5: Data-local processing and result assembly for retrieval
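The split alignment described in the excerpt above can be sketched in a few lines (hypothetical record format: newline-terminated records; the real logic lives in Hadoop's InputFormat classes):

```python
def aligned_splits(data: bytes, block_size: int):
    """Split data at roughly block_size boundaries, but always extend
    each split to the end of the current record (a newline here)."""
    splits, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        # Extend past the block boundary until a record boundary.
        while end < len(data) and data[end - 1:end] != b"\n":
            end += 1
        splits.append((start, end))
        start = end
    return splits

data = b"rec1\nrecord2\nr3\n"
print(aligned_splits(data, 6))  # [(0, 13), (13, 16)]
```

Note how the first split overshoots its 6-byte "block" so that no record is cut in half; this is why splits are only roughly the block size.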
3.1.3 MapReduce Programming Model
The MapReduce programming model was published in 2004 by the two Google scientists J. Dean and S. Ghemawat [RD 4]. It is used for processing and generating huge datasets on clusters for certain kinds of distributable problems. The model is composed of a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world problems can be expressed in terms of this model, and programs written in this functional style can be easily parallelised.
HDFS - Overview
• Scalable
• Fast read/write
• Robust
• A factor of 10 cheaper (2)
MapReduce
MapReduce - WordCount
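The WordCount example can be simulated in plain Python, with the map, shuffle and reduce phases made explicit (no Hadoop required; purely illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # Emit one (word, 1) pair for every word in the input line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum all counts that the shuffle grouped under this word.
    return (key, sum(values))

def mapreduce(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle phase: bring identical keys together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [reduce_fn(k, (v for _, v in grp))
            for k, grp in groupby(intermediate, key=itemgetter(0))]

counts = mapreduce([(1, "to be or not to be")])
print(dict(counts))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In Hadoop the three phases run on different cluster nodes and the shuffle moves data over the network; the functional shape is the same.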
MapReduce - Overview
• Based on Google MapReduce (3)
• Data locality
• Key/value pairs
• Very fast
• A different way of thinking
Implementations
• Apache Software Foundation
• Others: outdated, commercial, little support (4-6)

            Hadoop   Stratosphere   HPCC
Support       +           -           +
Extensions    +           -           ?
Community    +++         +/-          -
Target       ANY         EDU          BI
Distributions
• Hortonworks (7)
• MapR (8)
• Cloudera: Cloudera Manager (9)
• Web interface
• 1-click install (yeah right...)
• Interesting licence model
General
• Mostly text processing
• For small images (10)
• Little detail
• Commercial (11)
Contents
• Context
• Literature study
• Planning
Planning
[Timeline: literature study, phase 1, phase 2, phase 3, phase 4; today; report hand-in; master's thesis; internship; milestones 01/02, 15/03, 20/05, 01/09]
Phase 1 - Done
[Cluster diagram: workstations of Sven (master node), Patrick, Bruno and Tim at 192.168.10.245–192.168.10.249, running the Name Node, Data Nodes, Job Tracker and Task Trackers]
• JT = Job Tracker
• TT = Task Tracker
• NN = Name Node
• DN = Data Node
• RedHat 6.2 Workstation
• RedHat 6.2 Virtual Machine
Phase 2
• A simple algorithm
• Rotate an image
• Standard IO
• HDFS
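Rotating an image, the simple algorithm planned for this phase, is a one-liner once a band is in memory (NumPy sketch; the thesis would read the raster from HDFS instead of hard-coding it):

```python
import numpy as np

# Hypothetical 2 x 3 single-band image tile.
img = np.array([[1, 2, 3],
                [4, 5, 6]])

# Rotate the tile 90 degrees counter-clockwise.
rotated = np.rot90(img)

print(rotated)
# [[3 6]
#  [2 5]
#  [1 4]]
```

The interesting part of the phase is not the rotation itself but moving the pixels through standard IO and HDFS.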
Phase 3
• More complexity: MapReduce
• Spatial: convolution mask, ROI
• Temporal/Spectral: multiple images
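The spatial convolution mask mentioned above could be sketched as follows (naive 3 x 3 mean filter on a hypothetical tile; a real MapReduce version would also need overlapping split borders). The mask is not flipped here, which is fine for symmetric masks:

```python
import numpy as np

def convolve2d(tile, mask):
    """Naive 'valid'-mode filtering of an image tile with a mask."""
    mh, mw = mask.shape
    th, tw = tile.shape
    out = np.zeros((th - mh + 1, tw - mw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the neighbourhood under the mask.
            out[i, j] = np.sum(tile[i:i + mh, j:j + mw] * mask)
    return out

tile = np.arange(25, dtype=float).reshape(5, 5)
mean_mask = np.full((3, 3), 1.0 / 9.0)
smoothed = convolve2d(tile, mean_mask)
print(smoothed.shape)  # (3, 3)
```

Because each output pixel needs its neighbours, splits that cut the image must share a border of mask-radius pixels, which is exactly what makes spatial operations harder in MapReduce than per-pixel ones.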
Phase 4
• Performance as a function of pixel distance
Planning
[Timeline: literature study, phase 1, phase 2, phase 3, phase 4; today; report hand-in; master's thesis; internship; milestones 01/02, 15/03, 20/05, 01/09]
The End
• Lots of data
• A different way of thinking
• Many possibilities
• RLZ or a new Big Data elective? ;)
• MapReduce + OpenCL?
• Many challenges
• Many questions
References
(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), ‘The Google file system’
(2) Krishnan, S., Baru, C. and Crosby, C. (2010), ‘Evaluation of MapReduce for gridding LiDAR data’
(3) Dean, J. and Ghemawat, S. (2004), ‘MapReduce: simplified data processing on large clusters’
(4) http://hadoop.apache.org/
(5) Warneke, D. and Kao, O. (2009), ‘Nephele: efficient parallel data processing in the cloud’, http://www.stratosphere.eu
(6) http://hpccsystems.com/
(7) http://hortonworks.com/
(8) http://mapr.com/
(9) http://cloudera.com/
(10) Sweeney, C. (2011), ‘HIPI: Hadoop image processing interface for image-based MapReduce’
(11) Guinan, O. (2011), ‘Indexing the Earth - large scale satellite image processing using Hadoop’, http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.htmt
(12) Duffy, D. Q. (2013), ‘Untangling the computing landscape for NASA climate simulations’, http://www.nas.nasa.gov/SC12/demos/demo20.html
(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief
(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), ‘ATLAS data management accounting with Hadoop Pig and HBase’