challenges in large-scale web crawling

204
WEB CRAWLING introduction to by Nate Murray & extraction Wednesday, September 14, 2011

Upload: nate-murray

Post on 14-May-2015

10.345 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Challenges in Large-Scale Web Crawling

WEB CRAWLINGintroduction to

by Nate Murray& extraction

Wednesday, September 14, 2011

Page 2: Challenges in Large-Scale Web Crawling

WHO AM I ?

Wednesday, September 14, 2011

Page 3: Challenges in Large-Scale Web Crawling

Nate Murray

AT&T Interactive (Yellowpages.com)

TB-scale data since 2009

Various crawlers since 2005

Wednesday, September 14, 2011

Page 4: Challenges in Large-Scale Web Crawling

WEB CRAWLINGwhat is

?

Wednesday, September 14, 2011

Page 5: Challenges in Large-Scale Web Crawling

web crawlerdefinition:

a program that browses the web.

Wednesday, September 14, 2011

Page 6: Challenges in Large-Scale Web Crawling

transforming unstructured web data into structured data

web extractiondefinition:

Wednesday, September 14, 2011

Page 7: Challenges in Large-Scale Web Crawling

transforming semistructured web data into structured data

web extractiondefinition:

Wednesday, September 14, 2011

Page 8: Challenges in Large-Scale Web Crawling

motivation

Wednesday, September 14, 2011

Page 9: Challenges in Large-Scale Web Crawling

motivation: bookmark buddies

Wednesday, September 14, 2011

Page 10: Challenges in Large-Scale Web Crawling

motivation: bookmark buddies

URL TitleUsers

Wednesday, September 14, 2011

Page 11: Challenges in Large-Scale Web Crawling

motivation:

Wednesday, September 14, 2011

Page 12: Challenges in Large-Scale Web Crawling

motivation: business hours

Wednesday, September 14, 2011

Page 13: Challenges in Large-Scale Web Crawling

motivation: business hours

Day OpennessOpennessMon ClosedClosedTue 11:30-14:30 17:30-22:00

Wed 11:30-14:30 17:30-22:00

Thur 11:30-14:30 17:30-22:00

Fri 11:30-14:30 17:30-22:00

Sat 12:00-14:30 17:00-22:00

Sun - 17:00-21:00

Wednesday, September 14, 2011

Page 14: Challenges in Large-Scale Web Crawling

motivation:

Wednesday, September 14, 2011

Page 15: Challenges in Large-Scale Web Crawling

motivation: recommend videos

Wednesday, September 14, 2011

Page 16: Challenges in Large-Scale Web Crawling

motivation: recommend videos

Users

Wednesday, September 14, 2011

Page 17: Challenges in Large-Scale Web Crawling

motivation:

Wednesday, September 14, 2011

Page 18: Challenges in Large-Scale Web Crawling

motivation: vertical search

Wednesday, September 14, 2011

Page 19: Challenges in Large-Scale Web Crawling

motivation: vertical search

ImageSKU

NamePriceRating

Wednesday, September 14, 2011

Page 20: Challenges in Large-Scale Web Crawling

motivation:

Wednesday, September 14, 2011

Page 21: Challenges in Large-Scale Web Crawling

DESIRED PROPERTIES

Wednesday, September 14, 2011

Page 22: Challenges in Large-Scale Web Crawling

DESIRED PROPERTIES

SPEED

Wednesday, September 14, 2011

Page 23: Challenges in Large-Scale Web Crawling

CONSTRAINTS

Wednesday, September 14, 2011

Page 24: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness

Wednesday, September 14, 2011

Page 25: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed

Wednesday, September 14, 2011

Page 26: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability

Wednesday, September 14, 2011

Page 27: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning

Wednesday, September 14, 2011

Page 28: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

Wednesday, September 14, 2011

Page 29: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

it’s easy to burden small servers

Wednesday, September 14, 2011

Page 30: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

(for any significant crawl)

Wednesday, September 14, 2011

Page 31: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

n machines = n*m pages-per-second

Wednesday, September 14, 2011

Page 32: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

every machine should perform equal work

Wednesday, September 14, 2011

Page 33: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap crawl each page

exactly once

Wednesday, September 14, 2011

Page 34: Challenges in Large-Scale Web Crawling

CONSTRAINTS

• Politeness• Distributed• Linear Scalability• Even partitioning• Minimum overlap

Wednesday, September 14, 2011

Page 35: Challenges in Large-Scale Web Crawling

BASIC ALGORITHM

Wednesday, September 14, 2011

Page 36: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 37: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 38: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 39: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 40: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 41: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 42: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 43: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 44: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 45: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 46: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 47: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 48: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 49: Challenges in Large-Scale Web Crawling

architecture overview

FETCHER

STORAGE

CRAWL PLANNER INTERNET

URL QUEUE

Web Data

Web Data

Web Data

URLs

Wednesday, September 14, 2011

Page 50: Challenges in Large-Scale Web Crawling

CHALLENGES

Wednesday, September 14, 2011

Page 51: Challenges in Large-Scale Web Crawling

challenges:

depends on your ambitions

Wednesday, September 14, 2011

Page 52: Challenges in Large-Scale Web Crawling

challenges:

1998 - 26 million2005 - 8 billion2008 - 1 trillion

http://www.nytimes.com/2005/08/15/technology/15search.htmlhttp://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

Google’s Index Size:

Wednesday, September 14, 2011

Page 53: Challenges in Large-Scale Web Crawling

challenges:

small crawls are easy

Wednesday, September 14, 2011

Page 54: Challenges in Large-Scale Web Crawling

challenges:

small crawls are easy

< 10MM

Wednesday, September 14, 2011

Page 55: Challenges in Large-Scale Web Crawling

challenges:

large crawls are interesting

Wednesday, September 14, 2011

Page 56: Challenges in Large-Scale Web Crawling

challenges:

Wednesday, September 14, 2011

Page 57: Challenges in Large-Scale Web Crawling

challenges:

DNS Lookup

Wednesday, September 14, 2011

Page 58: Challenges in Large-Scale Web Crawling

challenges:

DNS LookupURLs Crawled

Wednesday, September 14, 2011

Page 59: Challenges in Large-Scale Web Crawling

challenges:

DNS LookupURLs Crawled

Politeness

Wednesday, September 14, 2011

Page 60: Challenges in Large-Scale Web Crawling

challenges:

DNS LookupURLs Crawled

PolitenessURL Frontier

Wednesday, September 14, 2011

Page 61: Challenges in Large-Scale Web Crawling

challenges:

DNS LookupURLs Crawled

PolitenessURL Frontier

Queueing URLs

Wednesday, September 14, 2011

Page 62: Challenges in Large-Scale Web Crawling

challenges:

DNS LookupURLs Crawled

PolitenessURL Frontier

Queueing URLsExtracting URLs

Wednesday, September 14, 2011

Page 63: Challenges in Large-Scale Web Crawling

DNS LOOKUPchallenges:

Wednesday, September 14, 2011

Page 64: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 65: Challenges in Large-Scale Web Crawling

DNS LOOKUPchallenges:

can easily be a bottleneck

Wednesday, September 14, 2011

Page 66: Challenges in Large-Scale Web Crawling

DNS LOOKUPchallenges:

• consider running your own DNS servers• djbdns• PowerDNS• etc.

Wednesday, September 14, 2011

Page 67: Challenges in Large-Scale Web Crawling

DNS LOOKUPchallenges:

• be aware of software limitations• gethostbyaddr is synchronized• same with many “default” DNS clients

Wednesday, September 14, 2011

Page 68: Challenges in Large-Scale Web Crawling

DNS LOOKUPchallenges:

You’ll know when you need it

Wednesday, September 14, 2011

Page 69: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

Wednesday, September 14, 2011

Page 70: Challenges in Large-Scale Web Crawling

Initialize:    UrlsDone = null    UrlFrontier = {'google.com/index.html', ..}Repeat    url = UrlFrontier.getNext()    ip = DNSlookup(url.getHostname())    html = DownloadPage(ip, url.getPath())    UrlsDone.insert(url)    newUrls = parseForLinks(html)    For each newUrl      If not UrlsDone.contains(newUrl)      then UrlsTodo.insert(newUrl)

Wednesday, September 14, 2011

Page 71: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

Wednesday, September 14, 2011

Page 72: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

NAPKIN CALCULATION

Wednesday, September 14, 2011

Page 73: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

NAPKIN CALCULATION~50 bytes per URL

e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations

Wednesday, September 14, 2011

Page 74: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

NAPKIN CALCULATION~50 bytes per URL

e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations

+8 bytes for time-last-crawledas long e.g. System.currentTimeMillis() -> 1314392455712

Wednesday, September 14, 2011

Page 75: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

NAPKIN CALCULATION~50 bytes per URL

e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations

+8 bytes for time-last-crawledas long e.g. System.currentTimeMillis() -> 1314392455712

x 100 million

Wednesday, September 14, 2011

Page 76: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, store in memory

NAPKIN CALCULATION~50 bytes per URL

e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations

+8 bytes for time-last-crawledas long e.g. System.currentTimeMillis() -> 1314392455712

x 100 million

=~ 5.4 gigabytes

Wednesday, September 14, 2011

Page 77: Challenges in Large-Scale Web Crawling

can we do better?

Wednesday, September 14, 2011

Page 78: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

Wednesday, September 14, 2011

Page 79: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers the question:

is this item in the set?

Wednesday, September 14, 2011

Page 80: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers either:

Wednesday, September 14, 2011

Page 81: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers either:

• yes, probably

Wednesday, September 14, 2011

Page 82: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers either:

• yes, probably• definitely not

Wednesday, September 14, 2011

Page 83: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers either:

• yes, probably• definitely not

Have we crawled: http://www.xcombinator.com?

Wednesday, September 14, 2011

Page 84: Challenges in Large-Scale Web Crawling

BLOOM FILTERS

answers either:

• yes, probably• definitely not

Have we crawled: http://www.xcombinator.com?

Wednesday, September 14, 2011

Page 85: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, bloom filter

100 million URLs

1 in 100 million chanceof false positive

see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8

Wednesday, September 14, 2011

Page 86: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, bloom filter

NAPKIN CALCULATION100 million URLs

1 in 100 million chanceof false positive

see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8

Wednesday, September 14, 2011

Page 87: Challenges in Large-Scale Web Crawling

URLs CRAWLEDchallenges:

1 machine, bloom filter

NAPKIN CALCULATION100 million URLs

1 in 100 million chanceof false positive

=~ 457 megabytes

see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8

Wednesday, September 14, 2011

Page 88: Challenges in Large-Scale Web Crawling

BLOOM FILTER

Wednesday, September 14, 2011

Page 89: Challenges in Large-Scale Web Crawling

BLOOM FILTERdrawbacks

Wednesday, September 14, 2011

Page 90: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

drawbacks

Wednesday, September 14, 2011

Page 91: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

drawbacks

Wednesday, September 14, 2011

Page 92: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbacks

Wednesday, September 14, 2011

Page 93: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbackssolutions

Wednesday, September 14, 2011

Page 94: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbacks

• acceptable

solutions

Wednesday, September 14, 2011

Page 95: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbacks

• acceptable

• not hard, see Dynamic BFs

solutions

Wednesday, September 14, 2011

Page 96: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbacks

• acceptable

• not hard, see Dynamic BFs

• pick granularity (days)

solutions

Wednesday, September 14, 2011

Page 97: Challenges in Large-Scale Web Crawling

BLOOM FILTER

• probabilistic - occasional errors

• estimate # of items ahead of time

• can’t delete

drawbacks

• acceptable

• not hard, see Dynamic BFs

• pick granularity (days)• cascade them

solutions

Wednesday, September 14, 2011

Page 98: Challenges in Large-Scale Web Crawling

BLOOM FILTERSreferences:

http://en.wikipedia.org/wiki/Bloom_filterhttp://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.htmlhttp://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/

Wednesday, September 14, 2011

Page 99: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

Wednesday, September 14, 2011

Page 100: Challenges in Large-Scale Web Crawling

obey robots.txt

Wednesday, September 14, 2011

Page 101: Challenges in Large-Scale Web Crawling

wait 2 seconds (w.r.t. ip)

rule of thumb:

Wednesday, September 14, 2011

Page 102: Challenges in Large-Scale Web Crawling

centralized politeness

Wednesday, September 14, 2011

Page 103: Challenges in Large-Scale Web Crawling

centralized politeness

SPOF

Wednesday, September 14, 2011

Page 104: Challenges in Large-Scale Web Crawling

centralized politeness

SPOFcontention

Wednesday, September 14, 2011

Page 105: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

Wednesday, September 14, 2011

Page 106: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:

Wednesday, September 14, 2011

Page 107: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:• central database

Wednesday, September 14, 2011

Page 108: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:• central database • distributed locks (paxos/sigma/zookeeper)

Wednesday, September 14, 2011

Page 109: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:• central database • distributed locks (paxos/sigma/zookeeper)• controlled URL distribution

Wednesday, September 14, 2011

Page 110: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:• central database • distributed locks (paxos/sigma/zookeeper)• controlled URL distribution

http://en.wikipedia.org/wiki/Paxos_(computer_science)

Wednesday, September 14, 2011

Page 111: Challenges in Large-Scale Web Crawling

POLITENESSchallenges:

• Options:• central database • distributed locks (paxos/sigma/zookeeper)• controlled URL distribution

http://en.wikipedia.org/wiki/Paxos_(computer_science)http://zookeeper.apache.org/

Wednesday, September 14, 2011

Page 112: Challenges in Large-Scale Web Crawling

URL FRONTIERchallenges:

Wednesday, September 14, 2011

Page 113: Challenges in Large-Scale Web Crawling

url frontier

Wednesday, September 14, 2011

Page 114: Challenges in Large-Scale Web Crawling

consistently distribute URLs based on IP

idea:

Wednesday, September 14, 2011

Page 115: Challenges in Large-Scale Web Crawling

moduloIP SHA-1 bucket (mod 5)

174.132.225.106 4dd14b0b... 2

74.125.224.115 cf4b7594... 1

157.166.255.19 0ac4d141... 4

69.22.138.129 6c1584fa... 4

98.139.50.166 327252c5... 3

Wednesday, September 14, 2011

Page 116: Challenges in Large-Scale Web Crawling

same IP always goes to same machine

benefits:

simple

Wednesday, September 14, 2011

Page 117: Challenges in Large-Scale Web Crawling

susceptible to skew

drawbacks:

can’t add / remove nodes without pain

Wednesday, September 14, 2011

Page 118: Challenges in Large-Scale Web Crawling

consistent hashing

Wednesday, September 14, 2011

Page 119: Challenges in Large-Scale Web Crawling

source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011

Page 120: Challenges in Large-Scale Web Crawling

source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011

Page 121: Challenges in Large-Scale Web Crawling

source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011

Page 122: Challenges in Large-Scale Web Crawling

source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011

Page 123: Challenges in Large-Scale Web Crawling

~ 1/(n+1) URLs move on add/remove

benefits:

virtual nodes help skewrobust (no SOP)

Wednesday, September 14, 2011

Page 124: Challenges in Large-Scale Web Crawling

naive solution won’t work for large sites

drawbacks:

Wednesday, September 14, 2011

Page 126: Challenges in Large-Scale Web Crawling

QUEUEING URLSchallenges:

Wednesday, September 14, 2011

Page 127: Challenges in Large-Scale Web Crawling

situation:

Wednesday, September 14, 2011

Page 128: Challenges in Large-Scale Web Crawling

situation:URL

Wednesday, September 14, 2011

Page 129: Challenges in Large-Scale Web Crawling

situation:URLnot recently crawled

Wednesday, September 14, 2011

Page 130: Challenges in Large-Scale Web Crawling

situation:URLnot recently crawledallowed by robots.txt

Wednesday, September 14, 2011

Page 131: Challenges in Large-Scale Web Crawling

situation:URLnot recently crawledallowed by robots.txtpolite

Wednesday, September 14, 2011

Page 132: Challenges in Large-Scale Web Crawling

how to you order them?

(within a single machine)

Wednesday, September 14, 2011

Page 133: Challenges in Large-Scale Web Crawling

http://yachtmaintenanceco.com/http://www.amsterdamports.nl/http://www.4s-dawn.com/http://www.embassysuiteslittlerock.com/http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htmhttp://mdgroover.iweb.bsu.eduhttp://music.imbc.com/http://www.robertjbradshaw.comhttp://www.kerkattenhoven.behttp://www.escolania.org/http://www.musiciansdfw.org/http://www.ariana.org/

1 2 3hash each lane:

Wednesday, September 14, 2011

Page 134: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 135: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 136: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 137: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 138: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 139: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 140: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 141: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 142: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 143: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 144: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 145: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 146: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 147: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 148: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 149: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 150: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 151: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 152: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 153: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 154: Challenges in Large-Scale Web Crawling

1 2 3

Wednesday, September 14, 2011

Page 155: Challenges in Large-Scale Web Crawling

ERLANG

lookup: erlang B / C / engset

Wednesday, September 14, 2011

Page 156: Challenges in Large-Scale Web Crawling

as many threads as possible

Wednesday, September 14, 2011

Page 157: Challenges in Large-Scale Web Crawling

don’t sort input URLs

Wednesday, September 14, 2011

Page 158: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

Wednesday, September 14, 2011

Page 159: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

fetch

Wednesday, September 14, 2011

Page 160: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

wait

Wednesday, September 14, 2011

Page 161: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

fetch

Wednesday, September 14, 2011

Page 162: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

wait

Wednesday, September 14, 2011

Page 163: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

fetch

Wednesday, September 14, 2011

Page 164: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

wait

Wednesday, September 14, 2011

Page 165: Challenges in Large-Scale Web Crawling

http://abcnews.go.com/http://abcnews.go.com/2020/ABCNEWSSpecial/http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/2020/story?id=207269&amp;page=1http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395http://abcnews.go.com/International/News/story?id=203089&amp;page=1http://abcnews.go.com/International/Pope/http://abcnews.go.com/International/story?id=81417&amp;page=1

Wednesday, September 14, 2011

Page 166: Challenges in Large-Scale Web Crawling

http://yachtmaintenanceco.com/http://www.amsterdamports.nl/http://www.4s-dawn.com/http://www.embassysuiteslittlerock.com/http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htmhttp://mdgroover.iweb.bsu.eduhttp://music.imbc.com/http://www.robertjbradshaw.comhttp://www.kerkattenhoven.behttp://www.escolania.org/http://www.musiciansdfw.org/http://www.ariana.org/

Wednesday, September 14, 2011

Page 167: Challenges in Large-Scale Web Crawling

http://yachtmaintenanceco.com/http://www.amsterdamports.nl/http://www.4s-dawn.com/http://www.embassysuiteslittlerock.com/http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htmhttp://mdgroover.iweb.bsu.eduhttp://music.imbc.com/http://www.robertjbradshaw.comhttp://www.kerkattenhoven.behttp://www.escolania.org/http://www.musiciansdfw.org/http://www.ariana.org/

Wednesday, September 14, 2011

Page 168: Challenges in Large-Scale Web Crawling

http://yachtmaintenanceco.com/http://www.amsterdamports.nl/http://www.4s-dawn.com/http://www.embassysuiteslittlerock.com/http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htmhttp://mdgroover.iweb.bsu.eduhttp://music.imbc.com/http://www.robertjbradshaw.comhttp://www.kerkattenhoven.behttp://www.escolania.org/http://www.musiciansdfw.org/http://www.ariana.org/

no waiting!

Wednesday, September 14, 2011

Page 169: Challenges in Large-Scale Web Crawling

EXTRACTING URLSchallenges:

Wednesday, September 14, 2011

Page 170: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

the internet is full of garbage

Wednesday, September 14, 2011

Page 171: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

Wednesday, September 14, 2011

Page 172: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

enormous pages

Wednesday, September 14, 2011

Page 173: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

enormous pages

terrible markup

Wednesday, September 14, 2011

Page 174: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

enormous pages

terrible markup

ridiculous urls

Wednesday, September 14, 2011

Page 175: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

enormous pages

terrible markup

ridiculous urls

☃.net/

Wednesday, September 14, 2011

Page 176: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLS

enormous pages

terrible markup

ridiculous urls

☃.net/“unicode snowman dot net”

Wednesday, September 14, 2011

Page 177: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLSbe prepared:

Wednesday, September 14, 2011

Page 178: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLSbe prepared:

use a streaming XML parser

Wednesday, September 14, 2011

Page 179: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLSbe prepared:

use a streaming XML parser

use a library that handle’s bad markup

Wednesday, September 14, 2011

Page 180: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLSbe prepared:

use a streaming XML parser

use a library that handle’s bad markup

be aware that URLs aren’t ASCII

Wednesday, September 14, 2011

Page 181: Challenges in Large-Scale Web Crawling

challenges:

EXTRACTING URLSbe prepared:

use a streaming XML parser

use a library that handle’s bad markup

be aware that URLs aren’t ASCII

use a URL normalizer

Wednesday, September 14, 2011

Page 182: Challenges in Large-Scale Web Crawling

SOFTWARE

Wednesday, September 14, 2011

Page 183: Challenges in Large-Scale Web Crawling

software advice:

Wednesday, September 14, 2011

Page 184: Challenges in Large-Scale Web Crawling

software advice:

• goals determine scale

Wednesday, September 14, 2011

Page 185: Challenges in Large-Scale Web Crawling

software advice:

• goals determine scale

• someone else has already done it

Wednesday, September 14, 2011

Page 186: Challenges in Large-Scale Web Crawling

2 second crawler:

function wgetspider() { wget --html-extension --convert-links --mirror \ --page-requisites --progress=bar --level=5 \ --no-parent --no-verbose \ --no-check-certificate "$@"; }

$ wgetspider http://www.ischool.berkeley.edu/

Wednesday, September 14, 2011

Page 187: Challenges in Large-Scale Web Crawling

java crawlers:

Wednesday, September 14, 2011

Page 188: Challenges in Large-Scale Web Crawling

java crawlers:

• Heritrix (Internet Archive)

Wednesday, September 14, 2011

Page 189: Challenges in Large-Scale Web Crawling

java crawlers:

• Heritrix (Internet Archive)

• Nutch (Lucene)

Wednesday, September 14, 2011

Page 190: Challenges in Large-Scale Web Crawling

java crawlers:

• Heritrix (Internet Archive)

• Nutch (Lucene)

• Bixo (Hadoop / Cascading)

Wednesday, September 14, 2011

Page 192: Challenges in Large-Scale Web Crawling

extraction packages:

Wednesday, September 14, 2011

Page 193: Challenges in Large-Scale Web Crawling

extraction packages:

• mechanize

Wednesday, September 14, 2011

Page 194: Challenges in Large-Scale Web Crawling

extraction packages:

• mechanize

• BeautifulSoup & urllib2

Wednesday, September 14, 2011

Page 195: Challenges in Large-Scale Web Crawling

extraction packages:

• mechanize

• BeautifulSoup & urllib2

• Scrapy

Wednesday, September 14, 2011

Page 197: Challenges in Large-Scale Web Crawling

wrapper induction(ish)

Wednesday, September 14, 2011

Page 198: Challenges in Large-Scale Web Crawling

wrapper induction(ish)• Ariel

Wednesday, September 14, 2011

Page 199: Challenges in Large-Scale Web Crawling

wrapper induction(ish)• Ariel

• RoadRunner

Wednesday, September 14, 2011

Page 200: Challenges in Large-Scale Web Crawling

wrapper induction(ish)• Ariel

• RoadRunner

• TemplateMaker

Wednesday, September 14, 2011

Page 201: Challenges in Large-Scale Web Crawling

wrapper induction(ish)• Ariel

• RoadRunner

• TemplateMaker

• scrubyt

Wednesday, September 14, 2011

Page 202: Challenges in Large-Scale Web Crawling

wrapper induction(ish)• Ariel

• RoadRunner

• TemplateMaker

• scrubyt

http://ariel.rubyforge.org/index.htmlhttp://www.dia.uniroma3.it/db/roadRunner/http://code.google.com/p/templatemaker/http://scrubyt.rubyforge.org/files/README.html

Wednesday, September 14, 2011

Page 203: Challenges in Large-Scale Web Crawling

QUESTIONS?

Wednesday, September 14, 2011