Why Crawling Sites is Hard and What You Can Do About It


Post on 16-Apr-2017






Why It's Hard to Crawl Your Website
And What You Can Do To Help
Nick Gerner
Software Engineer and Web Crawler
SEOmoz.org

Building a Crawler

Search Goals For Web Crawling
Requirements v1.0
- Collect Data to Answer Queries like:
  - Canon SLR review
  - Vacation in the Caribbean
  - mod_rewrite apache
- Send Searchers to Relevant Sites/Pages

Crawling The Web
    while(morePages)
    {
        FetchNextPage
        ParseXHTML
        AddNewLinksToCrawl
    }

Crawling The Web
    while(morePages) //duplicate content?
    {
        //200 OK? 3xx? 5xx, 404? politeness?
        FetchNextPage
        ParseXHTML //hah!
        AddNewLinksToCrawl //but which ones?
    }

Search Goals For Web Crawling
Requirements v2.0
- Avoid Duplicate Content
- Avoid Spam
- Send Searchers to Highest Quality Sites
- Don't crash websites with crawl
- Keep content up to date
- ...iterate for reqs v3.0, 4.0...

What Can You Do To Help?

Build For Users Not Robots!
"The formula is simple: design for people, be smart about robots, and you will achieve long-lasting success."
--Jane and Robot Philosophy

Build For Users Not Robots
- Robots do not:
  - buy products
  - subscribe to services
  - listen to advice
  - click ads
- Search engines like websites that have good usability
- Search engines do NOT like websites that lie or confuse them
- Investing in Usability once pays double!

Problems You Can Help to Solve

Duplicate Content
[Figure: the same "Foo" content repeated ten times]
"A majority of the web is duplicate content"
--Joachim Kupke, Google Engineer on Duplicate Content

Duplicate Content
- A Majority of SEOmoz Consulting is Duplicate Content!
- Many sources:
  - Search Results
  - Sorting Lists
  - Session IDs
  - View State
  - Campaign tracking codes
  - ...
- Example: http://bit.ly/ZPkIE

Duplicate Content Solutions
- Don't have dups!
- robots.txt stuff out (don't!)
- Redirect (301)
- rel=canonical (at least for Google today)
- sitemap.xml (www.sitemaps.org)
- See http://bit.ly/NDYYp (Jane and Robot: URL Referrer Tracking)

www vs non-www
[Figure: www.searchengineland.com and searchengineland.com, each shown with its own stack of pages marked "dup"]

www vs non-www
- No one cares if you're www or not...
- But choose one!
- Stick to it throughout your site
- 301 EVERYTHING to the canonical form
- My slight preference for www:
  - many people will link to the www version
  - if you've screwed it up, that gives you less risk

URL Structure
http://my.url.says/a/lot?about=me
- Descriptive and Human Readable
- Single name for a single piece of content
- Think about usability:
  - Click-able
  - Link-able
  - Memorable

URL Structure
- Good: www.example.com/wigets/multi-purpose/blue-sprocket
- Not Great: www.example.com/;jsessionid=0683s?pid=33&sid=123&cid=831
- Worse: www.example.com/jsessionid/067x83s/pid/33/sid/123/cid/831
- Terrible: http://www.serde.net/(X(1)A(o0Byn2nEyQEkAAAAZjUxMmI0ZGMtNDA4MS00ZDY0LTk5ZTAtNTA2MWY2M2Q5MTcxHreLE5sE1J_bXKfCHoBGWNFZTJ81)S(np4fn330ilbrliusxoe5k445))/GetThreadsRss.aspx?ForumID=5

Site Volume and Indexation
- Holy Crap that's a lot of crawling!
- Time to prioritize what's important!

Site Volume
- Organize Content Under a Hierarchy
  - Important stuff goes at the top
  - Good link navigation to important but deep content
  - Think about information architecture
- Use a sitemap (www.sitemaps.org) to tell:
  - What's important
  - What's updated

Bad Architecture
Uh... what is important here?

Bad Architecture
- Complex Navigation
- Important Content is Deeply Hidden
- Hard to tell important content from unimportant

Good Architecture
Oh, I can understand that!

Good Architecture
- Intuitive Information Layout
- Short Path to Relevant Content

HTTP Status Codes
YourSite.com: 200 Not Found
Uh... That's not right

HTTP Status Codes
- 200 OK is great for:
  - Unique content that you want to deliver to users
  - Non-canonical content with rel=canonical
- 301 is great for non-canonical URLs
- 302 is temporary!
- 304 for conditional GET is great
- 404 for does not exist (NOT 200!)
- 5XX for temporarily down (NOT 200, 404!)
- Familiarize yourself with RFC 2616

Robots Exclusion Protocol & Robots.txt

Robots Exclusion Protocol & Robots.txt
- Succinct (not 10MB!!)
- Use sparingly and simply
  - Don't know your future URL structure
  - Don't know future search engines
  - This is a sledgehammer
- Forget robotstxt.org, use this tool as truth:
  http://www.google.com/webmasters > your site > site configuration > crawler access > test robots.txt

Robots Exclusion Protocol & Robots.txt Gotchas
- URLs are (sometimes) case-sensitive!
- Easy to mess up special casing for different crawlers
  - Avoid duplicate blocks
- Some guy's kid got pissed and de-indexed the corp site

Robots Exclusion Protocol & Robots.txt Good Stuff
- Stuff under development
  - But why is it in public?
- Search Result Pages
  - Unless you're clever
  - But don't be clever in general!
- Private Information (e.g. PII)
- Access Restricted Sections
  - Crawlers can't get to 'em anyway
  - Better yet, 301 these to a better place

Robots Exclusion Protocol & Meta robots
- NOINDEX: don't serve this result
  - Different than robots.txt disallow
- NOFOLLOW: rel=nofollow on all links
- NOODP: don't use dmoz for my snippet
  - Just use meta description if you can
- http://searchengineland.com/meta-robots-tag-101-blocking-spiders-cached-pages-more-10665
- http://googlewebmastercentral.blogspot.com/2007/03/using-robots-meta-tag.html

Getting Fancy (AJAX, Flash, etc.)
- Progressive Enhancement
- Graceful Degradation
- Accessibility (e.g. for the blind, for the deaf)
- KISS
- Try your site out:
  - With JavaScript Disabled (webdeveloper toolbar)
  - In Telnet

Help!
- Google Webmaster Tools and Forums: http://www.google.com/webmasters
- www.JaneAndRobot.com/
- SEOmoz.org
  - Blog (free)
  - Beginner's Guide to SEO (free)
  - site:seomoz.org Site Architecture (free and paid)

Help!
Questions?

[Chart: "Your Site" compared with "The Other 70 Million Sites"; values not recoverable]
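
Sketch: picking one canonical URL

The "www vs non-www" and "URL Structure" slides both come down to the same rule: one name for one piece of content, and a 301 from every other spelling. A minimal Python sketch of that idea follows. It is an illustration, not the method used by SEOmoz or any search engine. The host www.example.com, the NOISE_PARAMS list, the choice to sort query parameters, and the function names canonicalize and redirect_target are all assumptions made for this example.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that never change what the page shows
# (session IDs, campaign tracking codes). A real site would list its own.
NOISE_PARAMS = {"sid", "sessionid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url: str, canonical_host: str = "www.example.com") -> str:
    """Return the single preferred spelling of `url` (hypothetical rules)."""
    scheme, host, path, query, _fragment = urlsplit(url)
    host = host.lower()

    # Fold the www and non-www spellings of our own site onto one host.
    bare = canonical_host[4:] if canonical_host.startswith("www.") else canonical_host
    if host in (bare, "www." + bare):
        host = canonical_host

    # Drop path parameters such as ";jsessionid=0683s".
    path = path.split(";", 1)[0] or "/"

    # Drop noise parameters and sort the rest so their order cannot
    # create two URLs for the same content.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    )
    return urlunsplit((scheme.lower(), host, path, urlencode(kept), ""))


def redirect_target(url: str) -> Optional[str]:
    """Return where to 301 the request, or None if `url` is already canonical."""
    canonical = canonicalize(url)
    return None if canonical == url else canonical


if __name__ == "__main__":
    print(redirect_target("http://example.com/widgets;jsessionid=0683s?utm_source=feed&pid=33"))
    print(redirect_target("http://www.example.com/widgets?pid=33"))
```

A request handler could call redirect_target on each incoming URL and answer with a 301 whenever it returns a value. Sorting the query string is a design choice here: it keeps two orderings of the same parameters from being treated as two pages.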