Paahuni Khandelwal

Email: [email protected]

6th September, 2019

[Recitation 2]

CS 435 - Introduction to Big Data


Agenda

• Wrapping up PA0

• Introduction to Programming Assignment 1

CS 435: Introduction to Big Data [Fall 2019]


PA0 (Recap)

• Programming Assignment 1 comprises PA0 + PA1 (3 + 7 = 10 points)

• Submission through Canvas + demo (includes a viva)

• PA0 Demo – Monday, Tuesday and Wednesday

• If no time slot works for you, email us at [email protected]


Demo

• The test file will be uploaded on Monday morning

• Running HDFS and the resource manager

• Performing word count on the given test file using the resource manager (YARN)

• Questions on your non-MapReduce word count program

• The viva will include questions that show your understanding of MapReduce and the word count program


Common Issues faced during Setup

• Using a 64-bit JDK for $JAVA_HOME

• YARN resource manager not starting

• Port conflicts

• Looking at logs – present in $HADOOP_LOG_DIR and $YARN_LOG_DIR

• DFS getting corrupted – remove the tmp files

• Memory issues – add properties to mapred-site.xml and yarn-site.xml


Word Count Program


Programming Assignment 1

• This will be worth 7 points (7% of your total grade)

• Submission due on 30th September by 5 pm

• Uploaded on the CS435 assignment page


Dataset

• Summaries of approx. 1.5 million Wikipedia articles

• Total size of dataset: ~1 GB

• What will your input file look like?

Title_of_Article-1<====>DocumentID<====>Summary_of_Article-1

NEWLINE

NEWLINE

Title_of_Article-2<====>DocumentID<====>Summary_of_Article-2

NEWLINE

NEWLINE
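The record layout above could be parsed along these lines. This is a minimal plain-Python sketch, not the assignment's required code: `parse_record` is a hypothetical helper name, the document ID `42` is made-up sample data, and treating every line without two `<====>` separators as skippable is an assumption.

```python
SEPARATOR = "<====>"

def parse_record(line):
    """Split one article line into (title, doc_id, summary), or None if malformed."""
    # maxsplit=2 keeps any later separator occurrences inside the summary.
    parts = line.strip().split(SEPARATOR, 2)
    if len(parts) != 3:
        return None  # blank lines between articles (and malformed lines) are skipped
    title, doc_id, summary = parts
    return title, doc_id, summary

# One article followed by the two blank lines shown in the layout above.
sample = "Title_of_Article-1<====>42<====>Summary of article one\n\n\n"
records = [r for r in map(parse_record, sample.split("\n")) if r is not None]
```

In a MapReduce job, the same split would typically happen inside the mapper, once per input line.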


PA1 : N-gram

• An N-gram is a contiguous sequence of N items from a given sample of text or speech

Text: type of language model for predicting the next item in a sequence

Example:

– Unigrams (1-grams): type, of, language, model, …, sequence

– Bigrams (2-grams): (__, type), (type, of), (of, language), (language, model), (model, for), …, (a, sequence), (sequence, __)

– Trigrams (3-grams): (__, type, of), (type, of, language), …
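The N-gram examples above can be reproduced with a short plain-Python sketch. The function name `ngrams` and the choice to add exactly one `__` boundary marker on each side when N > 1 are assumptions modeled on the slide's bigram and trigram examples, not a rule stated in the assignment.

```python
def ngrams(tokens, n, pad="__"):
    """Return the list of n-grams (as tuples) over a token sequence."""
    if n > 1:
        # Assumption: one boundary marker on each side, as in the examples above.
        tokens = [pad] + list(tokens) + [pad]
    # Slide a window of width n over the tokens; too-short input yields [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "type of language model for predicting the next item in a sequence"
tokens = text.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

PA1 itself only asks for unigrams, so splitting each summary into words is sufficient for the three profiles.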


PA1 requires three Profiles

PROFILE 1

o First 500 unigrams in the dataset

o Must be sorted in alphabetical order (ascending)

o Eliminate duplicates

o In your design, you can use a combiner to eliminate local duplicates
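The Profile 1 logic can be simulated outside Hadoop as follows. This is a sketch under stated assumptions: `combine` and `profile1` are hypothetical names, each inner list stands in for one mapper's output, and "first 500" is read as the first 500 entries of the sorted, de-duplicated list.

```python
def combine(split_words):
    """Combiner step: drop duplicates within one mapper's output."""
    return set(split_words)

def profile1(splits, limit=500):
    """Reducer step: merge the de-duplicated sets, sort, keep the first `limit`."""
    unique = set()
    for words in splits:
        unique |= combine(words)
    return sorted(unique)[:limit]

# Two toy input splits; "the" and "cat" are duplicated locally and globally.
splits = [["the", "cat", "the"], ["a", "cat", "sat"]]
result = profile1(splits, limit=3)
```

The combiner only reduces the data shuffled between mappers and reducers; the final de-duplication still has to happen in the reducer, because the same unigram can appear in several splits.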


PA1 requires three Profiles

PROFILE 2

o A list of the top 500 unigrams and their frequencies within each article

o Similar to the word-count program

o Each article has a unique integer Document ID

o Output should be grouped by Document ID

o The number of output files will correspond to the number of reducers used

o Requires two jobs to be executed in sequence
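One way to picture the two-job sequence for Profile 2 is the plain-Python simulation below. The function names, the toy articles, and the tie-breaking rule (alphabetical order among equal frequencies) are assumptions for illustration; the assignment text here does not specify them.

```python
from collections import Counter, defaultdict

def job1_count(articles):
    """Job 1: word count keyed by (doc_id, unigram)."""
    counts = Counter()
    for doc_id, words in articles:
        for word in words:
            counts[(doc_id, word)] += 1
    return counts

def job2_top_per_doc(counts, limit=500):
    """Job 2: group by doc_id and keep the `limit` most frequent unigrams."""
    by_doc = defaultdict(list)
    for (doc_id, word), freq in counts.items():
        by_doc[doc_id].append((word, freq))
    return {
        # Sort by descending frequency, then alphabetically (assumed tie-break).
        doc_id: sorted(pairs, key=lambda p: (-p[1], p[0]))[:limit]
        for doc_id, pairs in by_doc.items()
    }

articles = [(1, ["big", "data", "big"]), (2, ["data", "data", "map"])]
top = job2_top_per_doc(job1_count(articles), limit=2)
```

The split mirrors the slide: the first pass is essentially word count with a composite key, and the second pass regroups that output by Document ID.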


PROFILE 3

o A list of the top 500 unigrams and their frequencies in the corpus

o The list should be sorted from the most frequent unigrams to the least frequent ones.

o The solution for generating profile 3 is a combination of profile 1 and profile 2

o Here, we list each unigram with its total number of occurrences in the complete dataset (without considering the per-article grouping used in profile 2)
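Profile 3 can be sketched in plain Python as a corpus-wide count followed by a frequency sort. `profile3` is a hypothetical name, the articles are toy data, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def profile3(articles, limit=500):
    """Count each unigram over the whole corpus and keep the most frequent."""
    totals = Counter()
    for _doc_id, words in articles:
        totals.update(words)  # the Document ID is ignored for this profile
    # Descending frequency, then alphabetical order (assumed tie-break).
    return sorted(totals.items(), key=lambda p: (-p[1], p[0]))[:limit]

articles = [(1, ["big", "data", "big"]), (2, ["data", "data", "map"])]
top = profile3(articles, limit=2)
```

Compared with Profile 2, the only structural change is the key: counts are accumulated per unigram rather than per (Document ID, unigram) pair.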


Pre-processing of Dataset

o Consider only alphabetic and numeric characters.

o Convert uppercase letters to lowercase.

o Example (Original → processed → toLowerCase)

D.I.A. → DIA → dia

South. → South → south

(U-S-A) → USA → usa

D.C., → DC → dc

9.8 → 98 → 98
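The pre-processing rules above can be expressed with one regular expression. This is a minimal sketch: `normalize` and `tokenize` are hypothetical names, splitting on whitespace is an assumption, and so is treating only ASCII letters and digits as "alphabetic and numeric".

```python
import re

# Anything that is not an ASCII letter or digit is removed (assumed reading).
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")

def normalize(token):
    """Strip non-alphanumeric characters, then convert to lowercase."""
    return NON_ALNUM.sub("", token).lower()

def tokenize(summary):
    """Split on whitespace, normalize each token, and drop empty results."""
    return [w for w in (normalize(t) for t in summary.split()) if w]

words = tokenize("South. of D.C., (U-S-A) - 9.8")
```

Tokens made up entirely of punctuation, such as the lone `-` in the sample, normalize to an empty string and are dropped rather than counted as unigrams.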


Few tips

o Design your system first

o Make sure you understand the Word-Count program

o Start writing your program by modifying the given Word-Count program

o Test your program in standalone/local mode with the given test file (~95 KB)

o Verify your output on an even smaller text file, if required, before moving on to larger files

o Parsing (pre-processing) the dataset will be an important part and will require considerable effort. Spend some time studying the dataset provided.


Thank you


Questions?