Distributed Data Management SS 2013
Dr.-Ing. Sebastian Michel
M. Sc. Johannes Schildgen
Fachbereich Informatik
Exercise Sheet 1: MapReduce Tuesday, May 07, 2013 – 15:30 to 17:00 – Room 52-203
Exercise 1: Distributed Data Management / MapReduce
1. Give and briefly discuss examples (at least one each) of good and bad use cases for
distributed data management.
2. Find two arguments for and against MapReduce as THE novel multi-purpose computing tool.
3. The failure model presented in the lecture covered the case of at least one machine
failing per day. Now we want to derive the probability that more than m machines fail.
Assume there are n machines. The failure probability of a single machine is p.
(a) Derive P[more than m machines fail]
(b) Compute values for p=1/365, n=10000, and for m=10, 20, 30, 40 respectively.
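If the n machines fail independently, the number of failures X follows a Binomial(n, p) distribution, so P[more than m machines fail] = P[X > m] = sum over k = m+1..n of C(n, k) * p^k * (1-p)^(n-k). Part (b) can be checked numerically; the following is a minimal Java sketch (class and method names are illustrative, not part of the exercise) that evaluates this sum in log space to avoid overflow for large n:

```java
public class FailureTail {
    // P[X > m] for X ~ Binomial(n, p): sum_{k=m+1}^{n} C(n,k) p^k (1-p)^(n-k).
    // Each term is computed via logarithms because C(10000, k) overflows double.
    static double tailProb(int n, double p, int m) {
        // Precompute log(k!) for k = 0..n once.
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) logFact[i] = logFact[i - 1] + Math.log(i);
        double logP = Math.log(p), logQ = Math.log1p(-p);
        double sum = 0.0;
        for (int k = m + 1; k <= n; k++) {
            double logTerm = logFact[n] - logFact[k] - logFact[n - k]
                           + k * logP + (n - k) * logQ;
            sum += Math.exp(logTerm);
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10000;
        double p = 1.0 / 365;
        for (int m : new int[]{10, 20, 30, 40})
            System.out.printf("m = %d: P[X > m] = %.6f%n", m, tailProb(n, p, m));
    }
}
```

As a sanity check: the expected number of failures is n*p = 10000/365, roughly 27, so the tail probability should be close to 1 for m = 10 and small for m = 40.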
Exercise 2: Working with Apache Hadoop
The objective of this exercise is to familiarize you with Hadoop, especially with HDFS1 and the
implementation of MapReduce programs in Java. Your first job is to perform a large-scale text
analysis of a collection of texts by Shakespeare.
a) Install Oracle VirtualBox and download the modified Cloudera Quick Start Image2:
http://db.tt/PhvEGsKQ
Alternatively you can install and configure Hadoop and Eclipse manually.
On the top panel you find a terminal, Firefox, and Eclipse.
b) You find the texts of Shakespeare in /home/cloudera/shakespeare.txt. Load this file into
HDFS using hadoop dfs -put on the command line. Before you can do this, you have to
create a new HDFS directory: hadoop dfs -mkdir /shakespeare
c) Use Firefox to browse the HDFS via the web interface: http://localhost:50070
(on VM: Firefox bookmark “Hadoop Namenode”)
Check if the file was loaded into HDFS correctly.
d) Now you can write your first MapReduce job. Start Eclipse: /home/cloudera/eclipse/eclipse
As you can see, we prepared a simple map-only job for you which just writes a dummy
output into the HDFS directory /output. Can you imagine what the output will look like?
e) Right-click on the project "MapReduce Sandbox", choose "Export" - "Java" - "JAR File", and
run the job: hadoop jar <your exported JAR> wordcount.WordCount. Take a look at the output file.
You can monitor the execution of the job via the JobTracker web interface.
f) Change the Map function so that it emits (word, 1) for each word in the given text line.
g) Write a Reduce function so that your job counts the occurrences of every word in the text.
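To make the data flow of steps f) and g) concrete, here is a sketch of the WordCount logic in plain Java collections, without the Hadoop API (class and method names are illustrative). The map method emits (word, 1) pairs; the driver simulates the shuffle by grouping emitted pairs by key; the reduce method sums the counts for each word:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit (word, 1) for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Reduce phase: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver simulating the shuffle: group map output by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }
}
```

In the real Hadoop job, the same map and reduce logic goes into the Mapper and Reducer classes, and the framework performs the grouping step for you across machines.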
1 Hadoop Distributed File System
2 This is a 64-bit VM, and requires a 64-bit host OS and a virtualization product that can support a 64-bit guest
OS. This VM uses 4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running. The demo VM file is approximately 3 GB.
Exercise 3: SQL -> MapReduce
MapReduce is often used to analyze CSV3 data, e.g. large log files. Write a new MapReduce job
that produces output equivalent to each of the following SQL queries.
a) First load the two files ~/City.dat and ~/Country.dat into one HDFS folder.
City(name, country, province, population, latitude, longitude)
Country(name, code, capital, province, population, area)
b) SELECT name, country, province, population FROM City
c) SELECT name, country, province, population FROM City WHERE population > 100000
d) SELECT name, country, province, population FROM City WHERE population > 100000
ORDER BY population
e) SELECT country, SUM(population) FROM City GROUP BY country
f) SELECT country, province, AVG(population) FROM City GROUP BY country, province
g) SELECT Country.name, City.name, City.population
FROM Country JOIN City ON Country.capital = City.name
AND Country.province = City.province AND Country.code = City.country
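As one worked example of the SQL-to-MapReduce translation, query e) (GROUP BY with SUM) maps directly onto the two phases: the map function parses a City line and emits (country, population), and the reduce function sums the values per country. The sketch below simulates this in plain Java, assuming comma-separated fields in the schema order listed under a); the actual .dat files may use a different delimiter, so adjust the split accordingly:

```java
import java.util.*;

public class CitySumSketch {
    // Assumed City line format: name,country,province,population,latitude,longitude
    // Map: emit (country, population) for each city.
    static Map.Entry<String, Long> map(String csvLine) {
        String[] f = csvLine.split(",");
        return new AbstractMap.SimpleEntry<>(f[1], Long.parseLong(f[3]));
    }

    // Reduce: sum the populations per country key (SUM ... GROUP BY country).
    static long reduce(String country, List<Long> populations) {
        long sum = 0;
        for (long p : populations) sum += p;
        return sum;
    }

    // Driver simulating the shuffle: group emitted pairs by country, then reduce.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Long> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Long> out = new TreeMap<>();
        grouped.forEach((country, pops) -> out.put(country, reduce(country, pops)));
        return out;
    }
}
```

The other queries follow the same pattern: b) and c) are map-only jobs (projection and selection need no reduce), d) additionally exploits the sorted order of keys during the shuffle, and g) is a join, typically implemented by tagging records from both files with their source relation and joining them in the reducer.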
3 Comma-separated values