Why Crawling Sites Is Hard and What You Can Do About It
TRANSCRIPT
Why It's Hard to Crawl Your Website, and What You Can Do to Help
Nick Gerner
Software Engineer and Web Crawler
SEOmoz.org
Building a Crawler
Search Goals For Web Crawling
Requirements v1.0
Collect Data to Answer Queries like:
Canon SLR review
Vacation in the Caribbean
mod_rewrite apache
Send Searchers to Relevant Sites/Pages
Crawling The Web
while (morePages) {
    FetchNextPage
    ParseXHTML
    AddNewLinksToCrawl
}
Crawling The Web
while (morePages) {          // duplicate content?
    FetchNextPage            // 200 OK? 3xx? 5xx, 404? politeness?
    ParseXHTML               // hah!
    AddNewLinksToCrawl       // but which ones?
}
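The loop above, fleshed out to handle the concerns in the comments, might look like the following sketch. This is illustrative only: the politeness delay, page limit, and duplicate handling are simplistic stand-ins for what a real engine does.

```python
import time
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags (the ParseXHTML step)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100, delay=1.0):
    to_crawl = [start_url]
    seen = set()                     # avoid re-fetching known URLs
    while to_crawl and len(seen) < max_pages:
        url = to_crawl.pop(0)
        if url in seen:              # one (crude) duplicate check
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status != 200:
                    continue         # only index 200 OK content
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # 404, 5xx, network errors: skip for now
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            to_crawl.append(urljoin(url, link))  # resolve relative links
        time.sleep(delay)            # politeness: don't hammer the site
    return seen
```

Even this toy version has to decide what counts as a duplicate, which status codes to trust, and how fast to fetch, which is exactly the point of the slide.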
Search Goals For Web Crawling
Requirements v2.0
Avoid Duplicate Content
Avoid Spam
Send Searchers to Highest Quality Sites
Don't crash websites with crawl
Keep content up to date
...iterate for reqs v3.0, 4.0...
What Can You Do To Help?
Build For Users Not Robots!
"The formula is simple: design for people, be smart about robots, and you will achieve long-lasting success."
-- Jane and Robot Philosophy
Build For Users Not Robots
Robots do not:
buy products
subscribe to services
listen to advice
click ads
Search engines like websites that have good usability
Search engines do NOT like websites that lie or confuse them
Invest in usability once, and it pays double!
Problems You Can Help to Solve
Duplicate Content
"A majority of the web is duplicate content."
-- Joachim Kupke, Google engineer, on duplicate content
Duplicate Content
A Majority of SEOmoz Consulting is Duplicate Content!
Many sources:
Search Results
Sorting Lists
Session IDs
View State
Campaign tracking codes
...
Example: http://bit.ly/ZPkIE
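Several of the sources above (session IDs, campaign tracking codes) are just query-string noise, so one common mitigation is to normalize URLs before treating them as distinct pages. A sketch; the parameter names below (sid, jsessionid, utm_*) are common examples, not a complete list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that identify a session or campaign, not content.
TRACKING_PARAMS = {"sid", "sessionid", "jsessionid",
                   "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    scheme, netloc, path, query, _frag = urlsplit(url)
    # Drop tracking parameters; keep the ones that select content.
    kept = [(k, v) for k, v in parse_qsl(query)
            if k.lower() not in TRACKING_PARAMS]
    kept.sort()  # ?a=1&b=2 and ?b=2&a=1 collapse to one URL
    return urlunsplit((scheme, netloc.lower(), path, urlencode(kept), ""))

print(normalize("http://Example.com/list?utm_source=feed&sort=price"))
# http://example.com/list?sort=price
```

The same idea applies on the server side: if two URLs serve the same content, pick one canonical form and redirect the rest to it.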
Duplicate Content Solutions
Don't have dups!
Blocking duplicates out with robots.txt? Don't!
Redirect (301)
rel=canonical (at least for Google today)
sitemap.xml (www.sitemaps.org)
See http://bit.ly/NDYYp (Jane and Robot: URL Referrer Tracking)
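The rel=canonical hint above is a single link element in the page head, placed on every duplicate variant; a minimal illustration (the URL is made up):

```html
<head>
  <!-- Tells the engine which URL is the "real" one for this content -->
  <link rel="canonical" href="http://www.example.com/widgets/multi-purpose/blue-sprocket">
</head>
```

Unlike a 301, the duplicate page still renders for users; the tag only consolidates the URLs in the index (and, as noted, support is per-engine, Google first).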
www vs non-www
www.searchengineland.com and searchengineland.com: the same pages, duplicated under both hostnames
www vs non-www
No one cares if you're www or not
...But choose one!
Stick to it throughout your site
301 EVERYTHING to the canonical form
My slight preference is for www: many people will link to the www version anyway,
so if you've screwed up your redirects, that leaves you with less risk
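One common way to "301 EVERYTHING" to the canonical hostname is Apache's mod_rewrite (which even showed up earlier as a sample query). A sketch only; example.com stands in for your domain:

```apache
# In the vhost config or .htaccess: send non-www requests to www with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

If you prefer the non-www form, flip the condition and the target; the point is to pick one and redirect the other permanently.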
URL Structure
http://my.url.says/a/lot?about=me
Descriptive and Human Readable
Single name for a single piece of content
Think about usability:Click-able
Link-able
Memorable
URL Structure
Good: www.example.com/widgets/multi-purpose/blue-sprocket
Not Great: www.example.com/;jsessionid=0683s?pid=33&sid=123&cid=831
Worse: www.example.com/jsessionid/067x83s/pid/33/sid/123/cid/831
Terrible: http://www.serde.net/(X(1)A(o0Byn2nEyQEkAAAAZjUxMmI0ZGMtNDA4MS00ZDY0LTk5ZTAtNTA2MWY2M2Q5MTcxHreLE5sE1J_bXKfCHoBGWNFZTJ81)S(np4fn330ilbrliusxoe5k445))/GetThreadsRss.aspx?ForumID=5
Site Volume and Indexation
Holy Crap that's a lot of crawling!
Time to prioritize what's important!
Site Volume
Organize Content Under a Hierarchy
Important stuff goes at the top
Good link navigation to important but deep content
Think about information architecture
Use a sitemap (www.sitemaps.org) to tell:
What's important
What's updated
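The sitemaps.org format is plain XML, so generating one is straightforward; a minimal sketch using the standard library, with made-up URLs, dates, and priorities:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of (loc, lastmod, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod    # what's updated
        ET.SubElement(url, "priority").text = priority  # what's important
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    ("http://www.example.com/", "2009-06-01", "1.0"),
    ("http://www.example.com/widgets/blue-sprocket", "2009-05-20", "0.8"),
])
print(xml)
```

In practice you'd generate this from your content database and reference it from robots.txt or submit it via webmaster tools, so the crawler learns priority and freshness without guessing.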
Bad Architecture
Uh... what is important here?
Bad Architecture
Complex Navigation
Important Content is Deeply Hidden
Hard to discover important content from unimportant
Good Architecture
Oh, I can understand that!
Good Architecture
Intuitive Information Layout
Short Path to Relevant Content
HTTP status Codes
YourSite.com: 200 Not Found?
Uh... that's not right
HTTP Status Codes
200 OK is great for:
Unique content that you want to deliver to users
Non-canonical content with rel=canonical
301 is great for non-canonical URLs
302 is temporary!
304 for conditional get is great
404 for does not exist (NOT 200!)
5XX for temporarily down (NOT 200, 404!)
Familiarize yourself with RFC 2616
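The list above amounts to a decision table for the crawler. A simplified sketch of how a crawler might act on each code; the action strings are illustrative, not any particular engine's behavior:

```python
def crawl_action(status):
    """Map an HTTP status code to a (simplified) crawler decision."""
    if status == 200:
        return "index content"
    if status == 301:
        return "follow redirect; transfer to new URL"
    if status == 302:
        return "follow redirect; keep old URL (temporary)"
    if status == 304:
        return "not modified; keep cached copy"
    if status == 404:
        return "drop from index"
    if 500 <= status < 600:
        return "retry later; do not drop yet"
    return "unknown; handle conservatively"

print(crawl_action(404))  # drop from index
```

This is why serving 200 for a missing page (a "soft 404") is so damaging: the crawler indexes the error page instead of dropping the URL.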
Robots Exclusion Protocol & Robots.txt
Succinct (not 10MB!!)
Use sparingly and simply:
You don't know your future URL structure
You don't know future search engines
This is a sledgehammer
Forget robotstxt.org; use this tool as truth:
http://www.google.com/webmasters -> your site -> Site Configuration -> Crawler Access -> Test robots.txt
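You can also sanity-check robots.txt rules the way a crawler reads them, using Python's built-in parser. The robots.txt content and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block search results and a private section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "http://www.example.com/widgets/blue-sprocket"))  # True
print(rp.can_fetch("*", "http://www.example.com/search?q=foo"))           # False
```

Note that `Disallow: /search` is a prefix match, which is exactly how rules over-block when your URL structure changes later: the sledgehammer problem from the slide.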
Robots Exclusion Protocol & Robots.txt Gotchas
URLs are (sometimes) case sensitive!
Easy to mess up special casing for different crawlers
Avoid duplicate blocks
Some guy's kid got pissed and de-indexed corp site
Robots Exclusion Protocol & Robots.txt Good Stuff
Stuff under development
But why is it in public?
Search Result Pages
Unless you're clever
But don't be clever in general!
Private Information (e.g. PII)
Access Restricted Sections
Crawlers can't get to 'em anyway
Better yet, 301 these to a better place
Robots Exclusion Protocol & Meta Robots
NOINDEX: don't serve this result
Different from robots.txt Disallow
NOFOLLOW: rel=nofollow on all links
NOODP: don't use DMOZ for my snippet
Just use meta description if you can
http://searchengineland.com/meta-robots-tag-101-blocking-spiders-cached-pages-more-10665
http://googlewebmastercentral.blogspot.com/2007/03/using-robots-meta-tag.html
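The directives above are plain meta tags in the page head; a minimal illustration (the description text is made up):

```html
<head>
  <!-- NOINDEX: keep this page out of results. Unlike a robots.txt
       Disallow, the crawler can still fetch the page and follow links.
       NOODP: don't use the DMOZ description as the snippet. -->
  <meta name="robots" content="noindex, noodp">
  <!-- Better: supply your own snippet via meta description -->
  <meta name="description" content="Multi-purpose blue sprocket, in stock.">
</head>
```

Because these live in the page itself, they're per-URL and survive URL-structure changes better than robots.txt rules.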
Getting Fancy (AJAX, Flash, etc.)
Progressive Enhancement
Graceful Degradation
Accessibility (e.g. for the blind, for the deaf)
KISS
Try your site out:
With JavaScript disabled (Web Developer toolbar)
In telnet
Help!
Google Webmaster Tools and Forums:http://www.google.com/webmasters
www.JaneAndRobot.com/
SEOmoz.org Blog (free)
Beginner's Guide to SEO (free)
site:seomoz.org Site Architecture (free and paid)
Help!
Questions?
[Chart data: Your Site vs. The Other 70 Million Sites]