jimmy lin the ischool university of maryland sunday, may 31, 2009

36
Data-Intensive Text Processing with MapReduce Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details Chris Dyer Department of Linguistics University of Maryland Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009) (Bonus session)

Upload: berk-strickland

Post on 31-Dec-2015

32 views

Category:

Documents


1 download

DESCRIPTION

Data-Intensive Text Processing with MapReduce. (Bonus session). Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009). Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009. Chris Dyer - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Data-Intensive Text Processing with MapReduce

Jimmy LinThe iSchoolUniversity of Maryland

Sunday, May 31, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Chris DyerDepartment of LinguisticsUniversity of Maryland

Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009)

(Bonus session)

Page 2: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Agenda Hadoop “nuts and bolts”

“Hello World” Hadoop example(distributed word count)

Running Hadoop in “standalone” mode

Running Hadoop on EC2

Open-source Hadoop ecosystem

Exercises and “office hours”

Page 3: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop “nuts and bolts”

Page 4: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Page 5: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop Zen Don’t get frustrated (take a deep breath)…

Remember this when you experience those W$*#T@F! moments

This is bleeding edge technology: Lots of bugs Stability issues Even lost data To upgrade or not to upgrade (damned either way)? Poor documentation (or none)

But… Hadoop is the path to data nirvana?

Page 6: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Cloud9

Library used for teaching cloud computing courses at Maryland

Demos, sample code, etc. Computing conditional probabilities Pairs vs. stripes Complex data types Boilerplate code for working various IR collections

Dog food for research

Open source, anonymous svn access

Page 7: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

JobTracker

TaskTracker TaskTracker TaskTracker

Master node

Slave node Slave node Slave node

Client

Page 8: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

From Theory to Practice

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

Page 9: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Data Types in Hadoop

Writable Defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComprable Defines a sort order. All keys must be of this type (but not values).

IntWritableLongWritableText…

Concrete classes for different data types.

Page 10: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Complex Data Types in Hadoop How do you implement complex data types?

The easiest way: Encoded it as Text, e.g., (a, b) = “a:b” Use regular expressions to parse and extract data Works, but pretty hack-ish

The hard way: Define a custom implementation of WritableComprable Must implement: readFields, write, compareTo Computationally efficient, but slow for rapid prototyping

Alternatives: Cloud9 offers two other choices: Tuple and JSON Plus, a number of frequently-used data types

Page 11: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Input file (on HDFS)

InputSplit

RecordReader

Mapper

Partitioner

Reducer

RecordWriter

Output file (on HDFS)

InputFormat

OutputFormat

Page 12: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

What version should I use?

Page 13: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

“Hello World” Hadoop example

Page 14: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop in “standalone” mode

Page 15: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop in EC2

Page 16: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

From Theory to Practice

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

Page 17: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

On Amazon: With EC2

You

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

0. Allocate Hadoop cluster

EC2

Your Hadoop Cluster

7. Clean up!

Uh oh. Where did the data go?

Page 18: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

On Amazon: EC2 and S3

Your Hadoop Cluster

S3(Persistent Store)

EC2(The Cloud)

Copy from S3 to HDFS

Copy from HFDS to S3

Page 19: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Open-source Hadoop ecosystem

Page 20: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop/HDFS

Page 21: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hadoop streaming

Page 22: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

HDFS/FUSE

Page 23: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

EC2/S3/EBS

Page 24: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

EMR

Page 25: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Pig

Page 26: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

HBase

Page 27: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hypertable

Page 28: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Hive

Page 29: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Mahout

Page 30: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Cassandra

Page 31: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Dryad

Page 32: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

CUDA

Page 33: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

CELL

Page 34: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Beware of toys!

Page 35: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Exercises

Page 36: Jimmy Lin The  iSchool University of Maryland Sunday, May 31, 2009

Questions?Comments?

Thanks to the organizations who support our work: