CS172 Project: Crawling, Ranking, Indexing
Agenda: overview of the project, resources
Phase 1 Options
• Web data
– You need to come up with your own crawling strategy
• Twitter data
– Can use a third-party library for the Twitter Streaming API
– Still needs some web crawling
[Figure: the crawling loop]
1. Download the contents of the page.
2. Parse the downloaded file to extract links from the page.
3. Clean and normalize the extracted links.
4. Store the extracted links in the Frontier.
The Frontier (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis) hands the crawler the next URL via getNext() and receives newly extracted links via addAll(List).
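A minimal sketch of such a Frontier, assuming a simple FIFO queue with duplicate filtering (everything beyond the getNext/addAll names is illustrative):

import java.util.*;

// Simple FIFO frontier that skips URLs it has already seen.
public class Frontier {
    private final Queue<String> queue = new LinkedList<>();
    private final Set<String> seen = new HashSet<>();

    // Returns the next URL to crawl, or null if the frontier is empty.
    public synchronized String getNext() {
        return queue.poll();
    }

    // Adds all new (unseen) URLs to the end of the queue.
    public synchronized void addAll(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {   // add() returns false for duplicates
                queue.offer(url);
            }
        }
    }
}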
Crawling
This is what you will see when you download a page. Notice the HTML tags.
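A minimal sketch of step 1 (downloading a page's raw HTML) using only the standard library; the seed URL and the UTF-8 assumption are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Downloader {
    // Fetches the raw HTML of a page as a single string.
    public static String download(String address) throws Exception {
        URL url = new URL(address);
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints the raw HTML, tags and all.
        System.out.println(download("http://www.cs.ucr.edu"));
    }
}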
2. Parsing HTML to extract links
2. Parsing HTML file
• Write your own parser
– Some suggestions: parse the HTML file as XML. Two parsing methods:
– SAX (Simple API for XML)
– DOM (Document Object Model)
• Use an existing library
– JSoup (http://jsoup.org/). Can also be used to download the page (see the sketch below).
– HTML Parser (http://htmlparser.sourceforge.net/)
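A minimal sketch of link extraction with JSoup; Jsoup.connect(...).get() downloads and parses the page in one step, and the seed URL is just an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Downloads a page with JSoup and returns the href values of all <a> tags.
    public static List<String> extractLinks(String pageUrl) throws Exception {
        Document doc = Jsoup.connect(pageUrl).get();
        Elements anchors = doc.select("a[href]");
        List<String> links = new ArrayList<>();
        for (Element a : anchors) {
            links.add(a.attr("abs:href"));  // "abs:href" resolves relative URLs against the page
        }
        return links;
    }
}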
2. Parsing HTML file
• Things to think about
– How do you handle malformed HTML?
– A browser can still display it, but how does your parser handle it?
3. Clean extracted URLs
• Some URL entries found while crawling www.cs.ucr.edu:
– /intranet/
– /inventthefuture.html
– systems.engr.ucr.edu
– news/e-newsletter.html
– http://www.engr.ucr.edu/sendmail.html
– http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
– /faculty/
– /
– /about/#main
– http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
3. Clean extracted URLs
What to avoid
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
– Bookmarks: #main – bookmarks should be stripped off
– Self paths: /
• Avoid downloading PDFs or images
– /news/GraphenePublicationsIndex.pdf
– It's OK to download them, but you cannot parse them.
• Take care of invalid characters in URLs
– Space: www.cs.ucr.edu/vagelis hristidis
– Ampersand: www.cs.ucr.edu/vagelis&hristidis
– These characters should be encoded, else you will get a MalformedURLException (see the cleaning sketch after this list)
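A minimal sketch of these cleaning rules, assuming the crawler passes URLs around as plain strings; the class name and the exact skip rules are illustrative:

public class UrlCleaner {
    // Returns a cleaned URL, or null if the URL should be skipped entirely.
    public static String clean(String url) {
        // Keep only http links; skip ftp, https, mailto, etc.
        if (url.startsWith("ftp:") || url.startsWith("https:") || url.startsWith("mailto:")) {
            return null;
        }
        // Skip PDFs and images, which we cannot parse for links.
        String lower = url.toLowerCase();
        if (lower.endsWith(".pdf") || lower.endsWith(".jpg")
                || lower.endsWith(".png") || lower.endsWith(".gif")) {
            return null;
        }
        // Strip bookmarks (e.g. #main) and skip self paths ("/").
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        if (url.isEmpty() || url.equals("/")) {
            return null;
        }
        // Encode characters that would otherwise cause a MalformedURLException.
        url = url.replace(" ", "%20");
        return url;
    }
}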
Normalize Links Found on the page
• Relative URLs:
– These URLs have no host address.
– E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
– Case 1: /find_people.php
• A "/" at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case).
– Case 2: all
• No "/" means the path is relative to the current path.
• Normalize them (respectively) to:
– www.cs.ucr.edu/find_people.php
– www.cs.ucr.edu/faculty/all
Clean extracted URLs
• Different parts of the URL (highlighted with different colors on the slide):
• http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
– Protocol
– Port
– Host
– Path
– Query
– Bookmark
• java.net.URL has methods that can separate the different parts of the URL:
– getProtocol: http
– getHost: www.pe.com
– getPort: -1
– getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
– getQuery: ssimg=532988
– getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
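A minimal sketch of the calls that produce the output above, using the example URL from the slide:

import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.pe.com/local-news/riverside-county/riverside/"
                + "riverside-headlines-index/20120408-riverside-ucr-develops-sensory-"
                + "detection-for-smartphones.ece?ssimg=532988");

        System.out.println("getProtocol: " + url.getProtocol()); // http
        System.out.println("getHost: "     + url.getHost());     // www.pe.com
        System.out.println("getPort: "     + url.getPort());     // -1 (no explicit port)
        System.out.println("getPath: "     + url.getPath());     // the path part only
        System.out.println("getQuery: "    + url.getQuery());    // ssimg=532988
        System.out.println("getFile: "     + url.getFile());     // path + "?" + query
    }
}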
Normalizing with java.net.URL
• You can normalize URLs with simple string manipulations and methods from the java.net.URL class.
• Here is a snippet for normalizing "Case 1" root-relative URLs.
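A minimal sketch of this kind of normalization, handling both Case 1 and Case 2 (the method name is illustrative, and this is not necessarily the exact snippet from the slide):

import java.net.URL;

public class UrlNormalizer {
    // Normalizes a link found on basePage to an absolute URL string.
    public static String normalize(String basePage, String link) throws Exception {
        URL base = new URL(basePage);
        if (link.startsWith("http://")) {
            return link;                                   // already absolute
        } else if (link.startsWith("/")) {
            // Case 1: root-relative, resolve against the root of the host.
            return base.getProtocol() + "://" + base.getHost() + link;
        } else {
            // Case 2: relative to the current path; java.net.URL resolves it for us.
            return new URL(base, link).toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "/find_people.php"));
        // -> http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "all"));
        // -> http://www.cs.ucr.edu/faculty/all
    }
}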
Crawler Ethics
• Some websites don’t want crawlers swarming all over them.
• Why?
– Increases load on the server
– Private websites
– Dynamic websites
– …
Crawler Ethics
• How does the website tell you (the crawler) whether and what is off limits?
• Two options:
– Site-wide restrictions: robots.txt
– Webpage-specific restrictions: meta tag
Crawler Ethics: robots.txt
• A file called “robots.txt” in the root directory of the website
• Example: http://www.about.com/robots.txt
• Format:
User-Agent: <crawler name>
Disallow: <don't-follow paths>
Allow: <can-follow paths>
Crawler Ethics: robots.txt
• What should you do?
– Before starting on a new website, check whether robots.txt exists.
– If it does, download it and parse it for all inclusions and exclusions for the "generic crawler", i.e. User-Agent: *
– Don't crawl anything in the exclusion list, including sub-directories (see the sketch below)
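A minimal sketch of downloading and parsing robots.txt for the User-Agent: * section; it only collects Disallow prefixes, so the class name and ignoring Allow lines are simplifying assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsChecker {
    private final List<String> disallowed = new ArrayList<>();

    // Downloads http://<host>/robots.txt and records Disallow paths for User-Agent: *.
    public RobotsChecker(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream()))) {
            boolean inStarSection = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = line.substring(11).trim().equals("*");
                } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        } catch (Exception e) {
            // No robots.txt (or unreachable): nothing is disallowed.
        }
    }

    // True if the path falls under any disallowed prefix (covers sub-directories too).
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}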
Crawler Ethics: Webpage-Specific: Meta Tags
• Some webpages have one of the following meta-tag entries:
• <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
• <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
• <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
– INDEX or NOINDEX
– FOLLOW or NOFOLLOW (checked in the sketch below)
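A minimal sketch of checking this meta tag on a JSoup-parsed Document (the method name is illustrative):

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRobots {
    // Returns true if a <META NAME="ROBOTS" ...> tag on the page forbids following its links.
    public static boolean isNoFollow(Document doc) {
        for (Element meta : doc.select("meta[name]")) {
            if (meta.attr("name").equalsIgnoreCase("robots")
                    && meta.attr("content").toUpperCase().contains("NOFOLLOW")) {
                return true;
            }
        }
        return false;   // no restriction declared
    }
}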
Collecting Twitter Data
• Collecting through the Twitter Streaming API
– https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema.
– Rate limit: you will get up to 1% of the whole Twitter traffic, so you can get about 4.3M tweets per day (about 2 GB).
– You need a Twitter account for that. Check https://dev.twitter.com/
Third-Party Library
• Twitter4J for Java.
• Support for other languages is also available.
• Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
Important Fields
• You should save at least the following fields (see the sketch below):
– Text
– Timestamp
– Geolocation
– User of the tweet
– Links
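A minimal Twitter4J sketch along these lines, assuming OAuth credentials are already configured (e.g., in twitter4j.properties); how the fields are persisted is left as a placeholder:

import twitter4j.GeoLocation;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;

public class TweetCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Pull out the fields we want to save.
                String text = status.getText();
                java.util.Date timestamp = status.getCreatedAt();
                GeoLocation geo = status.getGeoLocation();      // may be null
                String user = status.getUser().getScreenName();
                StringBuilder links = new StringBuilder();
                for (URLEntity url : status.getURLEntities()) {
                    links.append(url.getURL()).append(' ');
                }
                // TODO: write text, timestamp, geo, user, and links to your store.
                System.out.println(user + " @ " + timestamp + ": " + text);
            }
        });
        stream.sample();   // ~1% random sample of all public tweets
    }
}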