Better Sitemap (Mozilla Drumbeat)

Better Sitemap U-Zyn Chua [email protected] December 12, 2009 Mozilla Drumbeat Challenge Singapore This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.

Upload: uzyn

Post on 15-Jun-2015

DESCRIPTION

Project proposal on how Sitemap 0.90 can be improved.

TRANSCRIPT

Page 1: Better Sitemap (Mozilla Drumbeat)

Better Sitemap

U-Zyn [email protected]

December 12, 2009
Mozilla Drumbeat Challenge

Singapore

This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.

Page 2: Better Sitemap (Mozilla Drumbeat)

Sitemap 0.90

Page 3: Better Sitemap (Mozilla Drumbeat)

• XML
• List of URLs
• For URL discovery
• Robot-friendly
• Max of 10MB/50k URLs per file

Page 4: Better Sitemap (Mozilla Drumbeat)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.google.com/</loc><priority>1.000</priority></url>
  <url><loc>http://www.google.com/3dwh_dmca.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/cpanel/domain</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/edu/</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/new.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/overview.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/privacy.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/program_policies.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/terms.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/testimonials.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/tour.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/administration.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/benefits.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/calendar.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/asu.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/pdfs/asu_success_story.pdf</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/details.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/features.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/gmail.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/pagecreator.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/startpage.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/talk.html</loc><priority>0.5000</priority></url>

• Messy

• Huge (google.com's sitemap is 3.9MB)

• Useless (for humans)
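To make the size concern concrete, here is a minimal Python sketch that counts the URLs in a Sitemap 0.90 file and checks it against the limits quoted on the earlier slide (10MB / 50k URLs per file). The function name summarize_sitemap and the two-entry SAMPLE are illustrative only, and reading "10MB" as 10 x 1024 x 1024 bytes is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap 0.90 documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

MAX_URLS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB


def summarize_sitemap(xml_bytes):
    """Return URL count, byte size and whether the file fits the limits."""
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return {
        "urls": len(locs),
        "bytes": len(xml_bytes),
        "within_limits": len(locs) <= MAX_URLS and len(xml_bytes) <= MAX_BYTES,
        "first": locs[0] if locs else None,
    }


# Two entries taken from the slide, closed off so the sample is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.google.com/</loc><priority>1.000</priority></url>
  <url><loc>http://www.google.com/a</loc><priority>0.5000</priority></url>
</urlset>"""

if __name__ == "__main__":
    print(summarize_sitemap(SAMPLE))
```

Note that the whole file has to be downloaded and parsed before even this simple summary is available, which is the inefficiency the following slides address.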

Page 5: Better Sitemap (Mozilla Drumbeat)

Improvements

Page 6: Better Sitemap (Mozilla Drumbeat)

Aims

• For robots:
  – Faster
  – More efficient

• For humans:
  – More useful
  – At least readable by a human's web client, i.e. the browser
  – A browser uses about 5KB of bandwidth to download favicons. Why not use the bandwidth to download more useful material?

Page 7: Better Sitemap (Mozilla Drumbeat)

Hierarchical

Sitemap
• Parent page
• Sibling pages
• Child pages
• Parsable by web browsers
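One way to picture the hierarchical view is to derive parent, sibling and child pages from URL paths. The sketch below is only an illustration under that assumption (the slides do not say how the hierarchy would be defined); the function name neighbours is hypothetical, and all URLs are assumed to be on the same host.

```python
from urllib.parse import urlsplit


def _segments(url):
    """Split a URL path into its non-empty segments."""
    return [s for s in urlsplit(url).path.split("/") if s]


def neighbours(urls, current):
    """Find the parent, siblings and children of `current` among `urls`.

    Assumption: the hierarchy mirrors the URL path, one level per segment.
    """
    cur = _segments(current)
    parent, siblings, children = None, [], []
    for url in urls:
        if url == current:
            continue
        seg = _segments(url)
        if cur and seg == cur[:-1]:
            parent = url
        elif len(seg) == len(cur) and seg[:-1] == cur[:-1]:
            siblings.append(url)
        elif len(seg) == len(cur) + 1 and seg[:-1] == cur:
            children.append(url)
    return {"parent": parent, "siblings": siblings, "children": children}


# URLs taken from the sitemap excerpt shown earlier.
URLS = [
    "http://www.google.com/",
    "http://www.google.com/3dwh_dmca.html",
    "http://www.google.com/a",
    "http://www.google.com/a/cpanel/domain",
    "http://www.google.com/a/edu/",
]

if __name__ == "__main__":
    print(neighbours(URLS, "http://www.google.com/a"))
```

A browser holding this small structure could show the user where the current page sits in the site without reading the full list of URLs.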

Page 8: Better Sitemap (Mozilla Drumbeat)

Hierarchical

The browser is able to tell users where they are within the site.

Page 9: Better Sitemap (Mozilla Drumbeat)

Chronological

• <lastmod> is in Sitemap 0.90
• But entries are not sorted by it
• Present the sitemap in chronological order
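As a rough sketch of chronological ordering, the Python below sorts the entries of a Sitemap 0.90 file by <lastmod>, newest first. The function name newest_first, the example.com URLs and the dates are made up for illustration. Only the date part (YYYY-MM-DD) of <lastmod> is compared, which is a simplification, and entries without <lastmod> are placed last by assumption.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def newest_first(xml_bytes):
    """Return the <loc> values ordered by <lastmod>, most recent first."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Compare by day only; undated entries sort to the end.
        day = date.fromisoformat(lastmod.strip()[:10]) if lastmod else date.min
        entries.append((day, loc.strip()))
    entries.sort(key=lambda e: e[0], reverse=True)
    return [loc for _, loc in entries]


# Hypothetical sample data, not taken from a real site.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/old</loc><lastmod>2009-01-05</lastmod></url>
  <url><loc>http://www.example.com/new</loc><lastmod>2009-12-12T10:00:00+08:00</lastmod></url>
  <url><loc>http://www.example.com/undated</loc></url>
</urlset>"""

if __name__ == "__main__":
    for loc in newest_first(SAMPLE):
        print(loc)
```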

Page 10: Better Sitemap (Mozilla Drumbeat)

Chronological

Browser showing newly updated pages

Page 11: Better Sitemap (Mozilla Drumbeat)

Chronological

• Robots:
  – Do not have to download huge sitemap files every time
  – Only download the first few chunks

• Browsers:
  – Easily tell surfers where the newly updated content is located
  – (Unlike RSS) not limited to blogs and blog-like sites
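The benefit for robots can be sketched as follows: if the sitemap is served newest first in chunks, a crawler can stop as soon as it reaches an entry it has already seen. This is a minimal illustration, not a defined protocol; updated_since, fetch_chunks and the chunk contents are all hypothetical, and the generator stands in for what would be separate downloads.

```python
from datetime import date


def updated_since(chunks, last_crawl):
    """Collect URLs modified after `last_crawl`.

    `chunks` is an iterable of lists of (lastmod, url) pairs, assumed to be
    ordered newest first across the whole sitemap. Iteration stops at the
    first entry that is not newer, so later chunks are never requested.
    """
    fresh = []
    for chunk in chunks:
        for lastmod, url in chunk:
            if lastmod <= last_crawl:
                return fresh
            fresh.append(url)
    return fresh


fetched = []  # records which chunks were actually "downloaded"


def fetch_chunks():
    """Stand-in for downloading sitemap chunks one at a time."""
    data = [
        [(date(2009, 12, 12), "/a"), (date(2009, 12, 10), "/b")],
        [(date(2009, 12, 3), "/c"), (date(2009, 11, 20), "/d")],
        [(date(2009, 10, 1), "/e")],
    ]
    for i, chunk in enumerate(data):
        fetched.append(i)
        yield chunk


if __name__ == "__main__":
    print(updated_since(fetch_chunks(), date(2009, 12, 1)))
    print("chunks fetched:", fetched)
```

In this example the third chunk is never fetched, because the crawler meets an already-known entry in the second one.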

Page 12: Better Sitemap (Mozilla Drumbeat)

More Efficient (Draft)

• Multiple versions
  – Chronological
    • Robots do not have to download the whole sitemap for each crawl
  – Hierarchical

• Seekable
  – With header index
  – Only download needed portions
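The "seekable with header index" idea could look something like the sketch below. The file layout here (one JSON header line mapping section names to byte offsets and lengths, followed by the section bodies) is purely an assumption for illustration; the slides do not define a format. An in-memory stream is used in place of a network fetch; over HTTP, a client might request only the needed byte range instead.

```python
import io
import json


def build_seekable(sections):
    """Serialize {section name: [urls]} as a header index line plus a body.

    Hypothetical layout: the first line is JSON mapping each section name to
    [offset, length] within the body that follows.
    """
    body = b""
    index = {}
    for name, urls in sections.items():
        blob = "\n".join(urls).encode("utf-8")
        index[name] = [len(body), len(blob)]
        body += blob
    header = json.dumps(index).encode("utf-8") + b"\n"
    return header + body


def read_section(stream, name):
    """Read the header index, then only the bytes of the requested section."""
    stream.seek(0)
    header = stream.readline()
    offset, length = json.loads(header)[name]
    stream.seek(len(header) + offset)
    return stream.read(length).decode("utf-8").split("\n")


if __name__ == "__main__":
    data = build_seekable({
        "newest": ["http://www.google.com/", "http://www.google.com/a"],
        "/a/edu/": ["http://www.google.com/a/edu/"],
    })
    print(read_section(io.BytesIO(data), "/a/edu/"))
```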

Page 13: Better Sitemap (Mozilla Drumbeat)

More Efficient (Draft)

• Smarter
  – Each page serves a sitemap based on where the client/user is at.
  – Do not have to download the whole sitemap.
  – Do not have to parse the whole sitemap.
  – Able to keep file size small (approx. 5KB) for browsers to load quickly.

• Switch away from XML?
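A per-page sitemap kept within roughly 5KB might be sketched like this. JSON is used only as one possible non-XML encoding (the slides leave the format open), and the function name page_sitemap, the field names and the trimming strategy are all assumptions made for illustration.

```python
import json

BUDGET_BYTES = 5 * 1024  # the "approx. 5KB" target from the slide


def page_sitemap(parent, siblings, children, budget=BUDGET_BYTES):
    """Serialize one page's neighbourhood, trimming entries to fit `budget`.

    Entries are dropped from the end of whichever list is longer until the
    encoded payload fits; "truncated" tells the client that this happened.
    If the lists are empty and the payload is still too large, it is
    returned as-is.
    """
    payload = {
        "parent": parent,
        "siblings": list(siblings),
        "children": list(children),
        "truncated": False,
    }

    def encode():
        return json.dumps(payload, separators=(",", ":")).encode("utf-8")

    while len(encode()) > budget and (payload["children"] or payload["siblings"]):
        longest = max(("children", "siblings"), key=lambda k: len(payload[k]))
        payload[longest].pop()
        payload["truncated"] = True
    return encode()


if __name__ == "__main__":
    many = ["/a/page-%04d.html" % i for i in range(2000)]
    blob = page_sitemap("/", ["/3dwh_dmca.html"], many)
    print(len(blob), "bytes, truncated:", json.loads(blob)["truncated"])
```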

Page 14: Better Sitemap (Mozilla Drumbeat)

Project Summary

• For robots and humans alike
• Chronological
• Hierarchical
• Seekable
• Smarter

Better Sitemap
U-Zyn Chua
[email protected]

This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.