CS172 Project: Crawling, Ranking, Indexing
Agenda: overview of the project, resources
Phase 1 Options
• Web data
– You need to come up with your own crawling strategy
• Twitter data
– Can use a third-party library for the Twitter Streaming API
– Still needs some web crawling
[Figure: the crawling loop]
1. Download the contents of the page.
2. Parse the downloaded file to extract links from the page.
3. Clean and normalize the extracted links.
4. Store the extracted links in the Frontier.
The Frontier (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis) hands the crawler the next URL via getNext() and receives newly extracted links via addAll(List).
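A minimal sketch of such a Frontier, assuming a simple FIFO queue with duplicate filtering (everything beyond the getNext/addAll names is illustrative):

import java.util.*;

// Simple FIFO frontier that skips URLs it has already seen.
public class Frontier {
    private final Queue<String> queue = new LinkedList<>();
    private final Set<String> seen = new HashSet<>();

    // Returns the next URL to crawl, or null if the frontier is empty.
    public synchronized String getNext() {
        return queue.poll();
    }

    // Adds all new (unseen) URLs to the end of the queue.
    public synchronized void addAll(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {   // add() returns false for duplicates
                queue.offer(url);
            }
        }
    }
}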
Crawling
This is what you will see when you download a page. Notice the HTML tags.
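A minimal sketch of step 1 (downloading a page's raw HTML) using only the standard library; the seed URL and the UTF-8 assumption are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Downloader {
    // Fetches the raw HTML of a page as a single string.
    public static String download(String address) throws Exception {
        URL url = new URL(address);
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints the raw HTML, tags and all.
        System.out.println(download("http://www.cs.ucr.edu"));
    }
}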
2. Parsing HTML to extract links
2. Parsing HTML file
• Write your own parser
– Some suggestions: parse the HTML file as XML. Two parsing methods:
– SAX (Simple API for XML)
– DOM (Document Object Model)
• Use an existing library
– JSoup (http://jsoup.org/). Can also be used to download the page (see the sketch below).
– HTML Parser (http://htmlparser.sourceforge.net/)
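A minimal sketch of link extraction with JSoup; Jsoup.connect(...).get() downloads and parses the page in one step, and the seed URL is just an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Downloads a page with JSoup and returns the href values of all <a> tags.
    public static List<String> extractLinks(String pageUrl) throws Exception {
        Document doc = Jsoup.connect(pageUrl).get();
        Elements anchors = doc.select("a[href]");
        List<String> links = new ArrayList<>();
        for (Element a : anchors) {
            links.add(a.attr("abs:href"));  // "abs:href" resolves relative URLs against the page
        }
        return links;
    }
}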
2. Parsing HTML file
• Things to think about
– How do you handle malformed HTML?
– A browser can still display it, but how does your parser handle it?
3. Clean extracted URLs
• Some URL entries found while crawling www.cs.ucr.edu:
– /intranet/
– /inventthefuture.html
– systems.engr.ucr.edu
– news/e-newsletter.html
– http://www.engr.ucr.edu/sendmail.html
– http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
– /faculty/
– /
– /about/#main
– http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
3. Clean extracted URLs
What to avoid
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
– Bookmarks: #main – bookmarks should be stripped off
– Self paths: /
• Avoid downloading PDFs or images
– /news/GraphenePublicationsIndex.pdf
– It's OK to download them, but you cannot parse them.
• Take care of invalid characters in URLs
– Space: www.cs.ucr.edu/vagelis hristidis
– Ampersand: www.cs.ucr.edu/vagelis&hristidis
– These characters should be encoded, else you will get a MalformedURLException (see the cleaning sketch after this list)
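A minimal sketch of these cleaning rules, assuming the crawler passes URLs around as plain strings; the class name and the exact skip rules are illustrative:

public class UrlCleaner {
    // Returns a cleaned URL, or null if the URL should be skipped entirely.
    public static String clean(String url) {
        // Keep only http links; skip ftp, https, mailto, etc.
        if (url.startsWith("ftp:") || url.startsWith("https:") || url.startsWith("mailto:")) {
            return null;
        }
        // Skip PDFs and images, which we cannot parse for links.
        String lower = url.toLowerCase();
        if (lower.endsWith(".pdf") || lower.endsWith(".jpg")
                || lower.endsWith(".png") || lower.endsWith(".gif")) {
            return null;
        }
        // Strip bookmarks (e.g. #main) and skip self paths ("/").
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        if (url.isEmpty() || url.equals("/")) {
            return null;
        }
        // Encode characters that would otherwise cause a MalformedURLException.
        url = url.replace(" ", "%20");
        return url;
    }
}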
Normalize Links Found on the page
• Relative URLs:
– These URLs have no host address.
– E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
– Case 1: /find_people.php
• A "/" at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case).
– Case 2: all
• No "/" means the path is relative to the current path.
• Normalize them (respectively) to:
– www.cs.ucr.edu/find_people.php
– www.cs.ucr.edu/faculty/all
Clean extracted URLs
• Different parts of the URL (highlighted with different colors on the slide):
• http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
– Protocol
– Port
– Host
– Path
– Query
– Bookmark
• java.net.URL has methods that can separate the different parts of the URL:
– getProtocol: http
– getHost: www.pe.com
– getPort: -1
– getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
– getQuery: ssimg=532988
– getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
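A minimal sketch of the calls that produce the output above, using the example URL from the slide:

import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.pe.com/local-news/riverside-county/riverside/"
                + "riverside-headlines-index/20120408-riverside-ucr-develops-sensory-"
                + "detection-for-smartphones.ece?ssimg=532988");

        System.out.println("getProtocol: " + url.getProtocol()); // http
        System.out.println("getHost: "     + url.getHost());     // www.pe.com
        System.out.println("getPort: "     + url.getPort());     // -1 (no explicit port)
        System.out.println("getPath: "     + url.getPath());     // the path part only
        System.out.println("getQuery: "    + url.getQuery());    // ssimg=532988
        System.out.println("getFile: "     + url.getFile());     // path + "?" + query
    }
}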
Normalizing with java.net.URL
• You can normalize URLs with simple string manipulations and methods from the java.net.URL class.
• Here is a snippet for normalizing "Case 1" root-relative URLs.
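A minimal sketch of this kind of normalization, handling both Case 1 and Case 2 (the method name is illustrative, and this is not necessarily the exact snippet from the slide):

import java.net.URL;

public class UrlNormalizer {
    // Normalizes a link found on basePage to an absolute URL string.
    public static String normalize(String basePage, String link) throws Exception {
        URL base = new URL(basePage);
        if (link.startsWith("http://")) {
            return link;                                   // already absolute
        } else if (link.startsWith("/")) {
            // Case 1: root-relative, resolve against the root of the host.
            return base.getProtocol() + "://" + base.getHost() + link;
        } else {
            // Case 2: relative to the current path; java.net.URL resolves it for us.
            return new URL(base, link).toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "/find_people.php"));
        // -> http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "all"));
        // -> http://www.cs.ucr.edu/faculty/all
    }
}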
Crawler Ethics
• Some websites don’t want crawlers swarming all over them.
• Why?
– Increases load on the server
– Private websites
– Dynamic websites
– …
Crawler Ethics
• How does the website tell you (the crawler) whether and what is off limits?
• Two options:
– Site-wide restrictions: robots.txt
– Webpage-specific restrictions: meta tag
Crawler Ethics: robots.txt
• A file called “robots.txt” in the root directory of the website
• Example: http://www.about.com/robots.txt
• Format:
User-Agent: <crawler name>
Disallow: <don't-follow paths>
Allow: <can-follow paths>
Crawler Ethics: robots.txt
• What should you do?
– Before starting on a new website, check whether robots.txt exists.
– If it does, download it and parse it for all inclusions and exclusions for the "generic crawler", i.e. User-Agent: *
– Don't crawl anything in the exclusion list, including sub-directories (see the sketch below)
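A minimal sketch of downloading and parsing robots.txt for the User-Agent: * section; it only collects Disallow prefixes, so the class name and ignoring Allow lines are simplifying assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsChecker {
    private final List<String> disallowed = new ArrayList<>();

    // Downloads http://<host>/robots.txt and records Disallow paths for User-Agent: *.
    public RobotsChecker(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream()))) {
            boolean inStarSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = line.substring(11).trim().equals("*");
                } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        } catch (Exception e) {
            // No robots.txt (or unreachable): nothing is disallowed.
        }
    }

    // True if the path falls under any disallowed prefix (covers sub-directories too).
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}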
Crawler Ethics: Webpage-Specific: Meta Tags
• Some webpages have one of the following meta-tag entries:
• <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
• <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
• <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
– INDEX or NOINDEX
– FOLLOW or NOFOLLOW (checked in the sketch below)
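A minimal sketch of checking this meta tag on a JSoup-parsed Document (the method name is illustrative):

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRobots {
    // Returns true if a <META NAME="ROBOTS" ...> tag on the page forbids following its links.
    public static boolean isNoFollow(Document doc) {
        for (Element meta : doc.select("meta[name]")) {
            if (meta.attr("name").equalsIgnoreCase("robots")
                    && meta.attr("content").toUpperCase().contains("NOFOLLOW")) {
                return true;
            }
        }
        return false;   // no restriction declared
    }
}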
Collecting Twitter Data
• Collecting through the Twitter Streaming API
– https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema.
– Rate limit: you will get up to 1% of the whole Twitter traffic, so you can get about 4.3M tweets per day (about 2 GB).
– You need a Twitter account for that. Check https://dev.twitter.com/
Third-Party Library
• Twitter4J for Java.
• Support for other languages is also available.
• Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
Important Fields
• You should save at least the following fields (see the sketch below):
– Text
– Timestamp
– Geolocation
– User of the tweet
– Links
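A minimal Twitter4J sketch along these lines, assuming OAuth credentials are already configured (e.g., in twitter4j.properties); how the fields are persisted is left as a placeholder:

import twitter4j.GeoLocation;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;

public class TweetCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Pull out the fields we want to save.
                String text = status.getText();
                java.util.Date timestamp = status.getCreatedAt();
                GeoLocation geo = status.getGeoLocation();      // may be null
                String user = status.getUser().getScreenName();
                StringBuilder links = new StringBuilder();
                for (URLEntity url : status.getURLEntities()) {
                    links.append(url.getURL()).append(' ');
                }
                // TODO: write text, timestamp, geo, user, and links to your store.
                System.out.println(user + " @ " + timestamp + ": " + text);
            }
        });
        stream.sample();   // ~1% random sample of all public tweets
    }
}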