Building a scalable distributed WWW search engine NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05


Page 1: Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky () at Birmingham Perl Mongers

Building a scalable distributed WWW search engine … NOT in Perl!

Presented by Alex Chudnovsky (http://www.majestic12.co.uk)

at Birmingham Perl Mongers User Group (http://birmingham.pm.org)

V1.0 27/07/05


Contents

1. History

2. Goals

3. Architecture

4. Implementation

5. Why not Perl?

6. Conclusions

7. Credits

8. Recommended reading


History (of my work in the area of information retrieval)

1. First primitive pathetic stone-age search engine: 1000 documents in the “index” (1997, Perl)

2. Second engine using proper inverted indexing for Jungle.com: 500,000 products indexed (Perl + Java, 2002)

3. Current: 50,000,000 pages indexed with a lot more to go (to be revealed, 2005)


Goals

1. Build a distributed WWW search engine capable of dealing with at least 1 billion web pages, based on the principles of SETI@Home and distributed.net (D.NET)

2. Confirm that the language chosen for the implementation (more on this later) fits the purpose or, more likely, learn how to make it work

3. Eventually make some money out of it


Architecture

1. Data collection (crawling)

2. Indexing: turning text into numbers

3. Merging: turning indexed barrels into a single searchable index

4. Searching: locating documents for given keywords


Data collection (crawling)

Base: issues URLs to crawl and receives compressed pages.

Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data. In the future they will also do distributed indexing.

Note: this stage is optional if you already have data to index, e.g. a list of products with their descriptions.
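
The exchange between the base and a crawler could be sketched roughly as follows. This is only an illustration in Python, not the project's actual C# code: the names `Base`, `crawl_batch` and `fake_fetch`, the batch size, and the use of zlib for compression are all assumptions, and the network is replaced by a stand-in function so the sketch runs offline.

```python
import zlib
from typing import Callable, Dict, List


def crawl_batch(urls: List[str], fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Crawler side: fetch each URL and return zlib-compressed page bodies."""
    return {url: zlib.compress(fetch(url)) for url in urls}


class Base:
    """Base side: hands out URL batches and stores what crawlers send back."""

    def __init__(self, urls: List[str], batch_size: int = 2) -> None:
        self.pending = list(urls)          # URLs not yet issued to any crawler
        self.batch_size = batch_size       # hypothetical batch size
        self.pages: Dict[str, bytes] = {}  # url -> decompressed page body

    def issue_batch(self) -> List[str]:
        """Return the next list of URLs to crawl (empty when nothing is left)."""
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch

    def receive(self, compressed: Dict[str, bytes]) -> None:
        """Decompress and store the pages a crawler has sent back."""
        for url, blob in compressed.items():
            self.pages[url] = zlib.decompress(blob)


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request (assumption for this sketch).
    return f"<html>content of {url}</html>".encode("utf-8")


if __name__ == "__main__":
    base = Base(["http://a.example/", "http://b.example/", "http://c.example/"])
    while True:
        batch = base.issue_batch()
        if not batch:
            break
        base.receive(crawl_batch(batch, fake_fetch))
    print(len(base.pages))  # number of pages collected
```

Passing the fetch function in as a parameter keeps the crawler logic testable without touching the network; a real crawler would also need politeness delays, robots.txt handling and error recovery, none of which is shown here.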


Crawler screenshot 1


Crawler screenshot 2


Crawler screenshot 3


Crawler screenshot 4


Crawler screenshot 5


Current Stats

Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05


Indexing

Indexing is the process of turning words into numbers and creating an inverted index.

Data barrel

Doc #0: Birmingham Perl Mongers

Doc #1: Birmingham City

Doc #2: Perl City

Lexicon (maps words to their numeric WordIDs)

Birmingham – 0
Perl – 1
Mongers – 2
City – 3

Inverted Index (each WordID has a list of, ideally sorted, DocIDs)

0 -> 0, 1
1 -> 0, 2
2 -> 0
3 -> 1, 2

Note: if you use a database then it makes sense to have a clustered index on WordID
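
A minimal Python sketch of this indexing step, using the three example documents from the slide. The function name `build_index` and the whitespace tokenisation are assumptions made for illustration; a real indexer would normalise case, strip markup and store far more per posting.

```python
from typing import Dict, List, Tuple


def build_index(docs: List[str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Return (lexicon, inverted_index) for one barrel of documents.

    DocIDs are positions in `docs`; WordIDs are assigned in order of first
    appearance. Splitting on whitespace is a deliberate simplification.
    """
    lexicon: Dict[str, int] = {}
    inverted: Dict[int, List[int]] = {}
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # Reuse the existing WordID or assign the next free one.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted.setdefault(word_id, [])
            # Documents are visited in increasing DocID order, so each
            # posting list stays sorted; only duplicates need guarding.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, inverted


if __name__ == "__main__":
    barrel = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
    lexicon, inverted = build_index(barrel)
    print(lexicon)   # {'Birmingham': 0, 'Perl': 1, 'Mongers': 2, 'City': 3}
    print(inverted)  # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

The output reproduces the lexicon and inverted index shown on the slide.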


Merging

Individual indexed barrels -> Single searchable index

Note: this stage is not necessary if just one barrel is used, as there will be no need to remap all IDs from local to their global equivalents.
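
The remapping from local to global IDs could look something like the Python sketch below. The barrel layout (a local lexicon, a local inverted index and a document count) and the function name `merge_barrels` are assumptions for illustration; the slides do not describe the actual on-disk format.

```python
from typing import Dict, List, Tuple

# One barrel: (local lexicon, local inverted index, number of documents).
Barrel = Tuple[Dict[str, int], Dict[int, List[int]], int]


def merge_barrels(barrels: List[Barrel]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Merge barrels into one index, remapping local WordIDs and DocIDs."""
    global_lexicon: Dict[str, int] = {}
    global_index: Dict[int, List[int]] = {}
    doc_offset = 0  # global DocID of the current barrel's local doc #0
    for lexicon, index, doc_count in barrels:
        for word, local_word_id in lexicon.items():
            # Local WordID -> global WordID, via the word itself.
            global_word_id = global_lexicon.setdefault(word, len(global_lexicon))
            postings = global_index.setdefault(global_word_id, [])
            # Local DocID -> global DocID by adding the barrel's offset.
            # Offsets only grow, so sorted local lists stay sorted globally.
            postings.extend(doc_offset + d for d in index.get(local_word_id, []))
        doc_offset += doc_count
    return global_lexicon, global_index


if __name__ == "__main__":
    # Barrel A holds "Birmingham Perl Mongers"; barrel B holds
    # "Birmingham City" and "Perl City", each with its own local IDs.
    barrel_a = ({"Birmingham": 0, "Perl": 1, "Mongers": 2}, {0: [0], 1: [0], 2: [0]}, 1)
    barrel_b = ({"Birmingham": 0, "City": 1, "Perl": 2}, {0: [0], 1: [0, 1], 2: [1]}, 2)
    lexicon, index = merge_barrels([barrel_a, barrel_b])
    print(index)  # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

Here "Perl" has WordID 1 in barrel A but 2 in barrel B, and both barrels have a local Doc #0, which is exactly why the remapping is needed once there is more than one barrel.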


Searching

Searching is the process of finding documents that contain the words from a search query

Doc #0: Birmingham Perl Mongers

Doc #1: Birmingham City

Doc #2: Perl City

Lexicon (maps words to their numeric WordIDs)

Birmingham – 0
Perl – 1
Mongers – 2
City – 3

Inverted Index (lists DocIDs for each WordID)

0 -> 0, 1
1 -> 0, 2
2 -> 0
3 -> 1, 2

Note: if you use a database then it makes sense to cluster on WordID

Search query: “Birmingham Perl”

WordIDs: 0, 1

Intersection of DocIDs present in both lists (implementation of boolean AND logic):

DocIDs for 0 (Brum)   DocIDs for 1 (Perl)   Result
0                     0                     Matched!
1                     n/a                   Not matched!
n/a                   2                     Not matched!
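
The AND logic above can be sketched in Python as a walk over two sorted DocID lists. The helper names `intersect_sorted` and `search` are hypothetical, and the lexicon and index are hard-coded from the slide's example rather than loaded from a real index.

```python
from typing import Dict, List, Optional


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return DocIDs present in both sorted lists (boolean AND)."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # DocID is in both lists: matched
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # a[i] cannot appear later in b: not matched
        else:
            j += 1            # b[j] cannot appear later in a: not matched
    return out


def search(query: str, lexicon: Dict[str, int],
           inverted: Dict[int, List[int]]) -> List[int]:
    """Return DocIDs of documents containing every word of the query."""
    result: Optional[List[int]] = None
    for word in query.split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # an unknown word can never match under AND
        postings = inverted.get(word_id, [])
        result = postings if result is None else intersect_sorted(result, postings)
    return list(result) if result else []


if __name__ == "__main__":
    lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
    inverted = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
    print(search("Birmingham Perl", lexicon, inverted))  # [0]
```

Because both lists are sorted, the intersection needs only one pass over each, which is why the earlier slide asks for the DocIDs to be "ideally sorted".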


Search engine screenshot 1


Search engine screenshot 2


Implementation

1. Microsoft .NET C# ported to Linux using Mono (http://www.mono-project.com)

2. ~90k lines of code (minimal copy/paste) written from scratch

3. Low level of dependencies (SharpZipLib/SQLite/NPlot)


Why not Perl? (using C# instead)

1. Not strong in GUI department

2. Hard to deal with multi-threading and asynchronous sockets

3. OOP is more of a hack

4. Lax compile-time checks due to not being strictly typed

5. Fear of performance bottlenecks forcing a move to C++

6. Hard to profile for performance analysis

7. Managed memory lacks support for pointers (?)

8. Poor exception handling

9. I wanted something new :)


Conclusions

Still work in progress, but some conclusions can be made already:

1. The inverted indexing approach helps to achieve fast searches

2. It's tough to build one – don’t try if you ain’t going to see it through!

3. The crawler is one tough piece of code – 6 months vs 2 months on searching

4. .NET C# is a decent language suitable for heavy duty tasks like this


Credits

1. R&D: Alex Chudnovsky <[email protected]>

2. Pioneers*: FiddleAbout, dazza12, lazytom, Mordac, linuxbren, Cyber911, www.vanginkel.info, Vari, ASB, SEOBy.org, arni, japonicus, webstek.info | Pimpel, DimPrawn, Zyron, partys-bei-uns.de, jake, bull at webmasterworld, nada, dodgy4, sri-heinz

* Volunteers running the crawler who had crawled at least 1 million URLs as of 27/07/05


Recommended reading

1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine” Sergey Brin and Lawrence Page of Google (http://www-db.stanford.edu/~backrub/google.html)

2. “Managing Gigabytes” Ian H. Witten et al., ISBN 1-55860-570-3


Join!

Join the project (unmetered broadband required!):

majestic12.co.uk

Your name could be here!