google star –project...

36
Google Star – Project Presentation CS8621: Advanced Computer Architecture Course Instructor : Dr.Ted Pedersen Team : Fluminense Aneerudh Naik Ankur Nepalia Prafulla Bhalekar Sarika Mehta

Upload: others

Post on 31-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Google Star – Project Presentation

CS8621: Advanced Computer Architecture

Course Instructor : Dr.Ted Pedersen

Team : Fluminense

Aneerudh Naik

Ankur Nepalia

Prafulla Bhalekar

Sarika Mehta

Page 2: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Architecture

Unigram Network

Read Bi- gram

Search in Unigram Network

Calc Assoc Score

>= Assoc weight

Create Network

Bi-gram Network

Search Network

Disjoint N/w search

Weight >= Uni Cut

Page 3: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Unigram and Associativity Cut

� Overview :

�Alpha Stage – to store the network in files.

�Beta Stage – to handle the unigram cut-off

�Final Stage – to handle the associativity cut-

off and modifications in the unigram network.

Page 4: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Alpha Stage

� Storing the network in the files

� Locating and Creating Directories.

� Creating Files.

� Traversing the network.

� Writing to the files.

Note : This has been dropped from the current implementation considering the file I/O cost.

Page 5: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Beta Stage

� Handling unigram cut

� Creating unigram network on processes 0 and 1(due to memory limitations)9

� Split vocab in two files

� Create Hash tables on 0 and 1

� Read and Hash each unigram

� Attach unigram node to the binary search tree if the weight is >= unigram cut

� Remove temporary files.

Page 6: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Beta Stage contd...

� When the bigram file is being read, accept each word and pass it on to search function

� Search function : (Tree traversals)9

� Proc 0 :

� Searches it's own.

� Sends string to proc1

� Searches for all other processors.

� Proc 1:

� Same as 0

� Other Processors

� Just send their unigrams to 0 and 1.

� Accept the return value and send to the create network.

Page 7: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Final Stage

� Unigram network

� Was modified to hold and return unigram weights

� Handling Associativity cut

� Accept bigram weight

� Accept weights of searched unigram

� Calculate Associativity cut-off

Page 8: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Bi-Gram Network Creation

Page 9: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Data structure

0

1

2

3

4

5

The

0

1

2

Sat

Cat

Weather

Bat

0

1

2

First String Hash Table

Second String Hash Table Binary Search Tree :

Second String

Binary Search Tree : First String

Page 10: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Insertion Method

� The Apple 100 ( Bi-gram )

� ‘The -> Apple’ linked as front entry� Front Flag : 1

� ‘Apple -> The’ linked as back entry� Back Flag : 1

� Helps get back connected strings during search

� Short Algorithm : � Hash the String

� Access hash table location

� Insert into binary search tree

Page 11: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Hash ( “The” ) = 2

0

1

2

3

4

5

Wait

Hash Table For First String

Already Existing Node At Hash Location 2

Page 12: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

The : Inserted

0

1

2

3

4

5

Wait

Left Link

The

0

1

2

Hash Table For First String

Already Existing Node At Hash Location 2

Page 13: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Apple : Hash = 1

0

1

2

3

4

5

Wait

Left Link

The

0

1

2

Aim

Already Existing Node At Hash Location 1

Page 14: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Apple : Inserted

0

1

2

3

4

5

Wait

Left Link

The

0

1

2

Aim

Apple

Right Link

Page 15: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Search Method : ‘The’

0

1

2

3

4

5

Weather The

Lion

Bat

Yes

Page 16: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Search Method : ‘The’

0

1

2

3

4

5

Weather The

Lion

Bat

Yes

Left Link Right Link

Page 17: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

The

0

1

2

Bat

Wall

Apple

Cat

Pick

tall

Weather

The Node Structure

Second String hash table Second String Binary

Search Trees

Page 18: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Linked List : Connected Strings

Apple Bat Cat Pick Wall tall

Weather

All Connected String Sent to Process 0 and Linked list of all connected strings created

Linked List Created On Process 0

Front Connection : Front Flag = 1 Back Connections : Back Flag = 1

Page 19: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Searching Strings

Page 20: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� Read the target-list file one line at a time.

� Suppose the target string is ‘The’.

� Create a data structure for storing all the connected strings to the target string ‘The’, both in front and back.

Creating Target-string network

The Front Connected StringsBack Connected Strings

Page 21: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Search String: ‘The’ Path Length: 1

� Passing the target string to the search function.

� Returns back a connected link list of all the sub-strings to the target string

TheLift Get Ran Cat Dog Big

Front connected sub-stringsBack connected sub-strings Target-String

Page 22: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Creating the sub-string network

� Passing the substrings connected to the target string to the search function.

� Front Connected Sub-strings to ‘The’

� Back Connected Sub-strings to ‘The’

Cat Dog Big

Lift Get Ran

Page 23: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Creating the sub-string network…

� This again gives back all the connected strings to the substring passed to the search function in the form of a connected link-list.

� This process is done for both the front and back connected substrings to the target string ‘The’.

Page 24: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Search String: ‘The’ Path Length: 2

Big

Fat

Rat

Black

white

Boy

You

TheLift Get Ran Cat Dog Big

Bat

Ball

IWe

Me

You

Front Connected StringsBack Connected Strings

Back Connected Sub-strin

gs

Front Connected Sub-Strin

gs

Page 25: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Display Search String Network

� The target-string network creation is limited by the path-length.

� Recursively displaying the target-string network both in front and back directions.

Page 26: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Hashing Function

4 Ideal Characteristics:

1) Hash value is fully determined by the data being hashed.

2) Hash function uses all the input data.

3) Hash function "uniformly" distributes data across entire set of possible hash values.

4) Hash function generates very different hash values for similar strings.

Page 27: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� djb2 hash function has been implemented

� Ideally suited to string hashing

� Supports any character.

� Fast.

� Robust.

� Uses XOR function.

Hashing Function…

Page 28: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Hash

Table

[0]

Hash

Table

[1]

[0]

[0]

[n-1]

[n-1]

[0]

[n-1]

Page 29: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Disjoint Networks

Creation of disjoint networks:

� Each processor creates its own disjoint network files.

� Files created are network and stat files.

� Network files hold actual data (words).

� Stat files hold number of nodes and number of edges.

Page 30: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� Search for creating networks starts at the root.

� Root is traversed to find all connected nodes.

� Once, all strings connected to the root are traversed, that entire network is separated or deducted from the unigram cut.

� Process is repeated for each root, in the remaining networks.

� What results is the set of disjoint networks.

Disjoint Networks…

Page 31: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� Each network file holding nodes, from each processor, is compared to every other processor’s network file.

� Nodes are inserted into arrays and compared.

� If any string is found common among them, then total network count is reduced.

Disjoint Networks…

Page 32: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� This means those 2 networks are, in fact, joint.

� Then, nodes in the files with a common word are added.

� Count of common words is subtracted.

� Edges are added, as the joint network contains edges from both the networks.

Disjoint Networks…

Page 33: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Implementation Example

� Network files named according to following convention:

process 0: 000netw000, 000netw001,…process 1: 001netw000, 001netw001,…

process i: 00 i netw000, 00 i netw001,…

Page 34: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

� Stat files named similarly,

process 0: 000stat000, 000stat001,…process 1: 001stat000, 001stat001,…

process i: 00 i stat000, 00 i stat001,…

Implementation Example…

Page 35: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Acknowledgments

Dr. Ted Pedersen

Page 36: Google Star –Project Presentationgoogle-star.sourceforge.net/TarBalls/Fluminense-Presentation.pdfGoogle Star –Project Presentation CS8621: Advanced Computer Architecture Course

Thank You…