Distributed Data Management SS 2013
Dr.-Ing. Sebastian Michel
M. Sc. Johannes Schildgen
Fachbereich Informatik
Exercise Sheet 1: MapReduce Tuesday, May 07, 2013 – 15:30 to 17:00 – Room 52-203
Exercise 1: Distributed Data Management / MapReduce
1. Give and briefly discuss examples (at least one each) of good and bad use cases for
distributed data management.
2. Find two arguments for and against MapReduce as THE novel multi-purpose computing tool.
3. The failure model presented in the lecture covered the case of at least one machine
failing per day. Now we want to derive the probability that more than m machines fail.
Assume there are n machines. The failure probability of a single machine is p.
(a) Derive P[more than m machines fail]
(b) Compute values for p=1/365, n=10000, and for m=10, 20, 30, 40 respectively.
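If the n machines fail independently, the number of failures X follows a Binomial(n, p) distribution, so P[more than m machines fail] = P[X > m] = sum over k = m+1..n of C(n, k) * p^k * (1-p)^(n-k). Part (b) can be checked numerically; the following is a minimal Java sketch (class and method names are illustrative, not part of the exercise) that evaluates this sum in log space to avoid overflow for large n:

```java
public class FailureTail {
    // P[X > m] for X ~ Binomial(n, p): sum_{k=m+1}^{n} C(n,k) p^k (1-p)^(n-k).
    // Each term is computed via logarithms because C(10000, k) overflows double.
    static double tailProb(int n, double p, int m) {
        // Precompute log(k!) for k = 0..n once.
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) logFact[i] = logFact[i - 1] + Math.log(i);
        double logP = Math.log(p), logQ = Math.log1p(-p);
        double sum = 0.0;
        for (int k = m + 1; k <= n; k++) {
            double logTerm = logFact[n] - logFact[k] - logFact[n - k]
                           + k * logP + (n - k) * logQ;
            sum += Math.exp(logTerm);
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10000;
        double p = 1.0 / 365;
        for (int m : new int[]{10, 20, 30, 40})
            System.out.printf("m = %d: P[X > m] = %.6f%n", m, tailProb(n, p, m));
    }
}
```

As a sanity check: the expected number of failures is n*p = 10000/365, roughly 27, so the tail probability should be close to 1 for m = 10 and small for m = 40.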
Exercise 2: Working with Apache Hadoop
The objective of this exercise is to familiarize you with Hadoop, especially with HDFS1 and the
implementation of MapReduce programs in Java. Your first job is to perform a large-scale text
analysis of a collection of texts by Shakespeare.
a) Install Oracle VirtualBox and download the modified Cloudera Quick Start Image2:
http://db.tt/PhvEGsKQ
Alternatively you can install and configure Hadoop and Eclipse manually.
On the top panel you find a terminal, Firefox, and Eclipse.
b) You find the texts of Shakespeare in /home/cloudera/shakespeare.txt. Load this file into
HDFS using hadoop dfs -put on the command line. Before you can do this, you have to
create a new HDFS directory: hadoop dfs -mkdir /shakespeare
c) Use Firefox to browse the HDFS via the web interface: http://localhost:50070
(on VM: Firefox bookmark “Hadoop Namenode”)
Check if the file was loaded into HDFS correctly.
d) Now you can write your first MapReduce job. Start Eclipse: /home/cloudera/eclipse/eclipse
As you can see, we prepared a simple map-only job for you which just writes a dummy
output into the HDFS directory /output. Can you imagine what the output will look like?
e) Right-click on the project "MapReduce Sandbox", choose "Export" - "Java" - "JAR File", and
run the job: hadoop jar <your exported JAR> wordcount.WordCount. Take a look at the output file.
You can monitor the execution of the job via the JobTracker web interface.
f) Change the Map function so that it emits (word, 1) for each word in the given text line.
g) Write a Reduce function so that your job counts the occurrences of every word in the text.
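To make the data flow of steps f) and g) concrete, here is a sketch of the WordCount logic in plain Java collections, without the Hadoop API (class and method names are illustrative). The map method emits (word, 1) pairs; the driver simulates the shuffle by grouping emitted pairs by key; the reduce method sums the counts for each word:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit (word, 1) for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Reduce phase: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver simulating the shuffle: group map output by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }
}
```

In the real Hadoop job, the same map and reduce logic goes into the Mapper and Reducer classes, and the framework performs the grouping step for you across machines.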
1 Hadoop Distributed File System
2 This is a 64-bit VM, and requires a 64-bit host OS and a virtualization product that can support a 64-bit guest
OS. This VM uses 4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running. The demo VM file is approximately 3 GB.
Exercise 3: SQL -> MapReduce
MapReduce is often used to analyze CSV3 data, e.g. large log files. Write a new MapReduce job
that produces output equivalent to each of the following SQL queries.
a) First load the two files ~/City.dat and ~/Country.dat into one HDFS folder.
City(name, country, province, population, latitude, longitude)
Country(name, code, capital, province, population, area)
b) SELECT name, country, province, population FROM City
c) SELECT name, country, province, population FROM City WHERE population > 100000
d) SELECT name, country, province, population FROM City WHERE population > 100000
ORDER BY population
e) SELECT country, SUM(population) FROM City GROUP BY country
f) SELECT country, province, AVG(population) FROM City GROUP BY country, province
g) SELECT Country.name, City.name, City.population
FROM Country JOIN City ON Country.capital = City.name
AND Country.province = City.province AND Country.code = City.country
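As one worked example of the SQL-to-MapReduce translation, query e) (GROUP BY with SUM) maps directly onto the two phases: the map function parses a City line and emits (country, population), and the reduce function sums the values per country. The sketch below simulates this in plain Java, assuming comma-separated fields in the schema order listed under a); the actual .dat files may use a different delimiter, so adjust the split accordingly:

```java
import java.util.*;

public class CitySumSketch {
    // Assumed City line format: name,country,province,population,latitude,longitude
    // Map: emit (country, population) for each city.
    static Map.Entry<String, Long> map(String csvLine) {
        String[] f = csvLine.split(",");
        return new AbstractMap.SimpleEntry<>(f[1], Long.parseLong(f[3]));
    }

    // Reduce: sum the populations per country key (SUM ... GROUP BY country).
    static long reduce(String country, List<Long> populations) {
        long sum = 0;
        for (long p : populations) sum += p;
        return sum;
    }

    // Driver simulating the shuffle: group emitted pairs by country, then reduce.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Long> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Long> out = new TreeMap<>();
        grouped.forEach((country, pops) -> out.put(country, reduce(country, pops)));
        return out;
    }
}
```

The other queries follow the same pattern: b) and c) are map-only jobs (projection and selection need no reduce), d) additionally exploits the sorted order of keys during the shuffle, and g) is a join, typically implemented by tagging records from both files with their source relation and joining them in the reducer.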
3 Comma-separated values