Why Crawling Sites is Hard and What You Can Do About It


Post on 16-Apr-2017






Why It's Hard to Crawl Your Website And What You Can Do To Help

Nick Gerner

Software Engineer and Web Crawler


Building a Crawler

Search Goals For Web Crawling
Requirements v1.0

Collect Data to Answer Queries like:

Canon SLR review

Vacation in the Caribbean

mod_rewrite apache

Send Searchers to Relevant Sites/Pages

Crawling The Web

while(morePages)
{
    FetchNextPage
    ParseXHTML
    AddNewLinksToCrawl
}

Crawling The Web

while(morePages) //duplicate content?
{
    //200 OK? 3xx? 5xx, 404? politeness?
    FetchNextPage
    ParseXHTML //hah!
    AddNewLinksToCrawl //but which ones?
}
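A minimal Python sketch of that loop, to make the questions in the comments concrete. It assumes a hypothetical in-memory `FAKE_WEB` table in place of real HTTP fetching and XHTML parsing; `crawl` and `FAKE_WEB` are illustrative names, not part of any real crawler.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (status code, outgoing links).
FAKE_WEB = {
    "http://example.com/": (200, ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": (200, ["http://example.com/", "http://example.com/gone"]),
    "http://example.com/b": (200, ["http://example.com/a"]),
    "http://example.com/gone": (404, []),
}

def crawl(seed, max_pages=100):
    """Breadth-first crawl that never queues the same URL twice."""
    frontier = deque([seed])
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < max_pages:       # while(morePages)
        url = frontier.popleft()
        status, links = FAKE_WEB.get(url, (404, []))   # FetchNextPage
        if status != 200:                              # 3xx? 5xx? 404?
            continue
        fetched.append(url)
        for link in links:                             # AddNewLinksToCrawl
            if link not in seen:                       # but which ones?
                seen.add(link)
                frontier.append(link)
    return fetched
```

Even this toy version has to pick a policy for non-200 responses and for links it has already seen; politeness and duplicate detection by content are left out entirely.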

Search Goals For Web Crawling
Requirements v2.0

Avoid Duplicate Content

Avoid Spam

Send Searchers to Highest Quality Sites

Don't crash websites with crawl

Keep content up to date

...iterate for reqs v3.0, 4.0...

What Can You Do To Help?

Build For Users Not Robots!

The formula is simple: design for people, be smart about robots, and you will achieve long-lasting success. -- Jane and Robot Philosophy

Build For Users Not Robots

Robots do not:

buy products

subscribe to services

listen to advice

click ads

Search engines like websites that have good usability

Search engines do NOT like websites that lie or confuse them

Investing in Usability once pays double!

Problems You Can Help to Solve

Duplicate Content

Foo Foo Foo Foo Foo Foo Foo Foo Foo Foo

A majority of the web is duplicate content -- Joachim Kupke, Google Engineer, on Duplicate Content

Duplicate Content

A Majority of SEOmoz Consulting is Duplicate Content!

Many sources:

Search Results

Sorting Lists

Session IDs

View State

Campaign tracking codes


Example: http://bit.ly/ZPkIE

Duplicate Content Solutions

Don't have dups!

Block the duplicates with robots.txt? Don't!

Redirect (301)

rel=canonical (at least for Google today)

sitemap.xml (www.sitemaps.org)

See http://bit.ly/NDYYp (Jane and Robot: URL Referrer Tracking)
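One way to picture the "single URL per piece of content" idea is a function that maps the duplicate variants onto one canonical form, which you could then use as the 301 target or the rel=canonical value. This is a rough sketch: the parameter names in `NOISE_PARAMS` are assumptions, so substitute whatever session, sorting, and tracking parameters your own site emits.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of parameters that do not change the content served.
NOISE_PARAMS = {"sid", "sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}

def canonical_url(url):
    """Strip session IDs, sort orders, and tracking codes from a URL."""
    parts = urlsplit(url)
    # Drop ";jsessionid=..." style path parameters.
    path = parts.path.split(";", 1)[0] or "/"
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in NOISE_PARAMS]
    query.sort()  # same parameters in a different order are the same page
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(query), ""))
```

For example, `canonical_url("http://www.example.com/list;jsessionid=0683s?pid=33&sid=123")` and `canonical_url("http://www.example.com/list?sort=price&pid=33")` both come out as `http://www.example.com/list?pid=33`.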

www vs non-www


www vs non-www

No one cares if you're www or not

...But choose one!

Stick to it throughout your site

301 EVERYTHING to the canonical form

My slight preference is for www:

many people will link to the www version

if you've screwed it up, that gives you less risk
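A small sketch of "301 EVERYTHING to the canonical form", written as a plain function rather than for any particular web server. `CANONICAL_HOST` and `ALIASES` are placeholder values and assume you picked the www form.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"   # assumption: the www form was chosen
ALIASES = {"example.com"}            # hosts that should redirect to it

def redirect_for(url):
    """Return (301, location) for a non-canonical host, or None if the URL is fine."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ALIASES:
        location = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, ""))
        return 301, location
    return None
```

The path and query string are carried over unchanged, so deep links keep working after the redirect.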

URL Structure

http://my.url.says/a/lot?about=me

Descriptive and Human Readable

Single name for a single piece of content

Think about usability:

Click-able



URL Structure


Not Great:

www.example.com/;jsessionid=0683s?pid=33&sid=123&cid=831



Site Volume and Indexation

Holy Crap that's a lot of crawling!

Time to prioritize what's important!

Site Volume

Organize Content Under a Hierarchy

Important stuff goes at the top

Good link navigation to important but deep content

Think about information architecture

Use a sitemap (www.sitemaps.org) to tell:

What's important

What's updated
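As a rough illustration, a sitemap can carry both signals: `priority` for what's important and `lastmod` for what's updated. The sketch below builds one with Python's standard library; the page list and the `build_sitemap` helper are made up for the example, and the element names follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of (loc, lastmod as 'YYYY-MM-DD', priority from 0.0 to 1.0)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod             # what's updated
        ET.SubElement(url, "priority").text = f"{priority:.1f}"  # what's important
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages: the home page matters most, an old archive page least.
SITEMAP_XML = build_sitemap([
    ("http://www.example.com/", "2009-06-01", 1.0),
    ("http://www.example.com/archive/2004", "2004-12-31", 0.2),
])
```

Priority is relative to your own pages only, so marking everything 1.0 tells the crawler nothing.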

Bad Architecture

Uh... what is important here?

Bad Architecture

Complex Navigation

Important Content is Deeply Hidden

Hard to tell important content from unimportant

Good Architecture

Oh, I can understand that!

Good Architecture

Intuitive Information Layout

Short Path to Relevant Content

HTTP Status Codes

YourSite.com

200 Not Found

Uh... That's not right

HTTP Status Codes

200 OK is great for:

Unique content that you want to deliver to users

Non-canonical content with rel=canonical

301 is great for non-canonical urls

302 is temporary!

304 for conditional get is great

404 for does not exist (NOT 200!)

5XX for temporarily down (NOT 200, 404!)

Familiarize yourself with RFC 2616 (HTTP/1.1; since superseded by RFC 9110)
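The rules above fit in one small decision function. This is only a sketch of the ordering, assuming your application already knows whether the backend is up, whether the page exists, and what its canonical URL is; `choose_status` and its parameters are hypothetical names.

```python
from http import HTTPStatus

def choose_status(*, backend_up=True, exists=True, requested_url=None,
                  canonical_url=None, not_modified=False):
    """Pick the response status for a request, following the slide's rules."""
    if not backend_up:
        return HTTPStatus.SERVICE_UNAVAILABLE   # 5XX for temporarily down
    if not exists:
        return HTTPStatus.NOT_FOUND             # 404, never a "200 Not Found"
    if canonical_url is not None and canonical_url != requested_url:
        return HTTPStatus.MOVED_PERMANENTLY     # 301 for non-canonical URLs
    if not_modified:
        return HTTPStatus.NOT_MODIFIED          # 304 for a conditional GET
    return HTTPStatus.OK                        # 200 for real, unique content
```

The order matters: an outage is checked first so that a down backend never turns into a wave of 404s for pages that do exist.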

Robots Exclusion Protocol &


Succinct (not 10MB!!)

Use sparingly and simply

Don't know your future url structure

Don't know future search engines

This is a sledgehammer

Forget robotstxt.org, use this tool as truth:

http://www.google.com/webmasters > your site > site configuration > crawler access > test robots.txt
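For a quick local sanity check before reaching for that tool, Python's standard `urllib.robotparser` can evaluate a robots.txt against sample URLs. Treat it as one parser's interpretation only; a given search engine may read the same file differently. The file contents and the `allowed` helper below are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A deliberately short robots.txt: two rules, one block.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /dev/
"""

def allowed(robots_txt, user_agent, url):
    """True if this parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `allowed(ROBOTS_TXT, "AnyBot", "http://www.example.com/search/widgets")` reports the URL as blocked, while a URL under `/products/` is allowed.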

Robots Exclusion Protocol &
Robots.txt Gotchas

URLs are (sometimes) case sensitive!

Easy to mess up special casing for different crawlers

Avoid duplicate blocks

Some guy's kid got pissed and de-indexed corp site
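Two of these gotchas are easy to reproduce with Python's standard `urllib.robotparser`. The robots.txt below is invented for the demonstration, and the results describe this parser's behaviour, which is not guaranteed to match every crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /Search/

User-agent: Googlebot
Disallow: /dev/
"""

def can_fetch(user_agent, path):
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)

# Case gotcha: the rule says /Search/, so /search/ is not blocked.
blocked_exact = not can_fetch("OtherBot", "/Search/widgets")
blocked_lower = not can_fetch("OtherBot", "/search/widgets")

# Special-casing gotcha: Googlebot has its own block, so in this parser
# the "*" rules no longer apply to it and /Search/ is open to Googlebot.
googlebot_sees_search = can_fetch("Googlebot", "/Search/widgets")
```

Here `blocked_exact` is true, `blocked_lower` is false, and `googlebot_sees_search` is true, which is probably not what the author of that file intended.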

Robots Exclusion Protocol &
Robots.txt Good Stuff

Stuff under development

But why is it in public?

Search Result Pages

Unless you're clever

But don't be clever in general!

Private Information (e.g. PII)

Access Restricted Sections

Crawlers can't get to 'em anyway

Better yet, 301 these to a better place

Robots Exclusion Protocol &
Meta robots

NOINDEX: don't serve this result

Different than robots.txt disallow

NOFOLLOW: rel=nofollow on all links

NOODP: don't use dmoz for my snippet

Just use meta description if you can
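To check which directives a page actually sends, you can read them back out of the HTML. The sketch below uses Python's standard `html.parser`; `MetaRobotsParser` and `robots_directives` are illustrative names, and it only looks at the generic `robots` meta tag, not crawler-specific ones.

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip())

def robots_directives(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives
```

For a page containing `<meta name="robots" content="NOINDEX, NOFOLLOW">`, `robots_directives` returns the set of `noindex` and `nofollow`. Note that a crawler can only see these tags if robots.txt lets it fetch the page in the first place.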



Getting Fancy (AJAX, Flash, etc.)

Progressive Enhancement

Graceful Degradation

Accessibility (e.g. for the blind, for the deaf)


Try your site out:

With Javascript Disabled (webdeveloper toolbar)

In Telnet
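A rough approximation of that experiment in code: parse the raw HTML, ignore everything inside script and style tags, and see which links and text are left. This is a sketch using Python's standard `html.parser`, with made-up names (`TextOnlyView`, `text_only_view`); real crawlers differ in how much script they execute.

```python
from html.parser import HTMLParser

class TextOnlyView(HTMLParser):
    """Collects the links and text visible without running any script."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())

def text_only_view(html):
    view = TextOnlyView()
    view.feed(html)
    return view.links, " ".join(view.text)
```

Any link that only exists after a script writes it into the page will be missing from the result, which is the point of progressive enhancement: the plain HTML should already lead to your important content.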


Google Webmaster Tools and Forums: http://www.google.com/webmasters


SEOmoz.org

Blog (free)

Beginner's Guide to SEO (free)

site:seomoz.org Site Architecture (free and paid)




[Chart: Your Site vs. The Other 70 Million Sites]