Better Sitemap (Mozilla Drumbeat)
DESCRIPTION
Project proposal on how Sitemap 0.90 can be improved.
TRANSCRIPT
Better Sitemap
U-Zyn Chua
December 12, 2009
Mozilla Drumbeat Challenge
Singapore
This work is licensed under a Creative Commons Attribution 3.0 License.
All other trademarks, logos and copyrights are the property of their respective owners.
Sitemap 0.90
• XML
• List of URLs
• For URL discovery
• Robot-friendly
• Max of 10 MB / 50k URLs per file
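The per-file limit means a large site has to spread its URLs over several sitemap files. A minimal sketch of that split follows, assuming only the 50k-URL limit quoted above is enforced (the 10 MB size limit is not checked here). The function name and the example.com URLs are hypothetical.

```python
from typing import Iterator, List

MAX_URLS_PER_FILE = 50_000  # URL-count limit quoted on the slide


def split_into_sitemaps(urls: List[str],
                        max_urls: int = MAX_URLS_PER_FILE) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each small enough for one sitemap file."""
    for start in range(0, len(urls), max_urls):
        yield urls[start:start + max_urls]


# Hypothetical site with 120,000 pages.
urls = [f"http://www.example.com/page/{i}" for i in range(120_000)]
chunks = list(split_into_sitemaps(urls))
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```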
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.google.com/</loc><priority>1.000</priority></url>
  <url><loc>http://www.google.com/3dwh_dmca.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/cpanel/domain</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/edu/</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/new.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/overview.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/privacy.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/program_policies.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/terms.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/testimonials.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/tour.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/administration.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/benefits.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/calendar.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/asu.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/pdfs/asu_success_story.pdf</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/details.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/features.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/gmail.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/pagecreator.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/startpage.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/talk.html</loc><priority>0.5000</priority></url>
• Messy
• Huge (google.com's is 3.9 MB)
• Useless (for humans)
Improvements
• For robots:
  – Faster
  – More efficient
• For humans:
  – More useful
  – At least readable by a human's web client, the browser
  – A browser uses about 5 KB of bandwidth to download a favicon. Why not use that bandwidth to download more useful material?
Aims
Sitemap
Hierarchical
• Parent page
• Sibling pages
• Child pages
• Parsable by web browsers
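One way to picture the hierarchical view is to derive a page's parent, siblings and children from URL path depth alone. The sketch below assumes the site hierarchy mirrors its URL paths, which the proposal does not state. The function name and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def neighbourhood(current, all_urls):
    """Return (parent, siblings, children) of `current`, judged by URL path segments."""
    def parts(url):
        return [segment for segment in urlsplit(url).path.split("/") if segment]

    here = parts(current)
    parent, siblings, children = None, [], []
    for url in all_urls:
        if url == current:
            continue
        p = parts(url)
        if here and p == here[:-1]:
            parent = url                      # one level up
        elif here and len(p) == len(here) and p[:-1] == here[:-1]:
            siblings.append(url)              # same level, same parent path
        elif len(p) == len(here) + 1 and p[:-1] == here:
            children.append(url)              # one level down
    return parent, siblings, children


pages = [
    "http://www.example.com/",
    "http://www.example.com/docs/",
    "http://www.example.com/docs/install.html",
    "http://www.example.com/docs/usage.html",
    "http://www.example.com/blog/",
    "http://www.example.com/blog/2009/",
]
parent, siblings, children = neighbourhood("http://www.example.com/docs/", pages)
print(parent)    # http://www.example.com/
print(siblings)  # ['http://www.example.com/blog/']
print(children)  # the two pages under /docs/
```

A page-specific sitemap built this way lists only a handful of URLs, however large the site is.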
Chronological
• <lastmod> is in Sitemap 0.90
• But entries are not sorted by it
• Present the sitemap in chronological order
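Because Sitemap 0.90 already carries <lastmod>, a chronological view can be produced from an existing file by sorting on that field. The sketch below assumes every entry has a date-only <lastmod> (YYYY-MM-DD). The sample URLs and dates are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical three-entry sitemap, deliberately not in date order.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/about.html</loc><lastmod>2009-03-01</lastmod></url>
  <url><loc>http://www.example.com/news.html</loc><lastmod>2009-12-10</lastmod></url>
  <url><loc>http://www.example.com/index.html</loc><lastmod>2009-11-20</lastmod></url>
</urlset>"""


def entries_newest_first(xml_text):
    """Parse a sitemap and return (lastmod, loc) pairs, most recently modified first."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((date.fromisoformat(lastmod), loc))
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries


for lastmod, loc in entries_newest_first(SITEMAP):
    print(lastmod, loc)
```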
Chronological
• Robots:
  – Do not have to download huge sitemap files every time
  – Only download the first few chunks
• Browsers:
  – Easily tell surfers where the newly updated content is located
  – Unlike RSS, not limited to blogs and blog-like sites
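The saving for robots can be illustrated with a reader that stops at the first entry older than its previous crawl. This sketch assumes the sitemap is already sorted newest first and that the robot remembers its last crawl date. The names, URLs and dates are hypothetical.

```python
from datetime import date
from itertools import takewhile

read_log = []  # records which entries the robot actually had to read


def stream(entries):
    """Simulate reading a newest-first sitemap one entry at a time."""
    for entry in entries:
        read_log.append(entry[1])
        yield entry


def changed_since(entries, last_crawl):
    """Return URLs modified after `last_crawl`, stopping at the first older entry."""
    recent = takewhile(lambda entry: entry[0] > last_crawl, entries)
    return [loc for _, loc in recent]


sitemap = [  # newest first
    (date(2009, 12, 10), "http://www.example.com/news.html"),
    (date(2009, 11, 20), "http://www.example.com/index.html"),
    (date(2009, 3, 1), "http://www.example.com/about.html"),
    (date(2008, 7, 4), "http://www.example.com/old.html"),
]
to_crawl = changed_since(stream(sitemap), last_crawl=date(2009, 11, 1))
print(to_crawl)       # the two pages changed after 2009-11-01
print(len(read_log))  # 3: the two new entries plus the first stale one
```

The fourth entry is never read, which is the point of serving the list in date order.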
More Efficient (Draft)
• Multiple versions
  – Chronological
    • Robots do not have to download the whole sitemap for each crawl
  – Hierarchical
• Seekable
  – With header index
  – Only download needed portions
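The draft does not specify what a header index would look like. As one possible reading, the sketch below uses a made-up layout: a single JSON header line mapping each section name to its byte offset and length, followed by the section bodies. The layout, function names and URLs are all assumptions for illustration, not part of the proposal.

```python
import io
import json


def build_seekable(sections):
    """Pack {section name: [urls]} into bytes: one JSON index line, then the bodies."""
    index, body = {}, b""
    for name, urls in sections.items():
        chunk = ("\n".join(urls) + "\n").encode("utf-8")
        index[name] = [len(body), len(chunk)]  # offset is relative to the end of the header
        body += chunk
    header = json.dumps(index).encode("utf-8") + b"\n"
    return header + body


def read_section(stream, name):
    """Read the header line, then seek straight to the requested section."""
    stream.seek(0)
    header = stream.readline()
    offset, length = json.loads(header)[name]
    stream.seek(len(header) + offset)
    return stream.read(length).decode("utf-8").splitlines()


blob = build_seekable({
    "/docs/": ["http://www.example.com/docs/install.html",
               "http://www.example.com/docs/usage.html"],
    "/blog/": ["http://www.example.com/blog/2009/"],
})
print(read_section(io.BytesIO(blob), "/blog/"))  # ['http://www.example.com/blog/2009/']
```

Over HTTP, the same idea would map onto fetching the header first and then requesting only the needed byte range.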
More Efficient (Draft)
• Smarter
  – Each page serves a sitemap based on where the client/user is.
  – Do not have to download the whole sitemap.
  – Do not have to parse the whole sitemap.
  – Able to keep file size small (approx. 5 KB) so browsers load it quickly.
• Switch away from XML?
Better Sitemap
U-Zyn Chua
Project Summary
• For robots and humans alike
• Chronological
• Hierarchical
• Seekable
• Smarter