Better Sitemap (Mozilla Drumbeat)

Better Sitemap U-Zyn Chua [email protected] December 12, 2009 Mozilla Drumbeat Challenge Singapore This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.

description

Project proposal on how Sitemap 0.90 can be improved.

Transcript of Better Sitemap (Mozilla Drumbeat)

Page 1: Better Sitemap (Mozilla Drumbeat)

Better Sitemap

U-Zyn [email protected]

December 12, 2009

Mozilla Drumbeat Challenge

Singapore

This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.

Page 2: Better Sitemap (Mozilla Drumbeat)

Sitemap 0.90

Page 3: Better Sitemap (Mozilla Drumbeat)

• XML
• List of URLs
• For URL discovery
• Robot-friendly
• Maximum of 10 MB or 50,000 URLs per file
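
To make the points above concrete, here is a minimal sketch of how a robot might read a Sitemap 0.90 file and check it against the per-file limits. It is an illustration only: the function names `read_sitemap` and `within_limits` and the `example.com` sample are assumptions, and 10 MB is taken to mean 10 * 1024 * 1024 bytes.

```python
import xml.etree.ElementTree as ET

# Namespace and limits as stated for Sitemap 0.90 on the slide.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 * 1024 * 1024 bytes


def read_sitemap(xml_bytes: bytes) -> list[str]:
    """Return every <loc> value found in a Sitemap 0.90 document."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
        if loc.text
    ]


def within_limits(xml_bytes: bytes) -> bool:
    """Check the per-file limits: at most 10 MB and at most 50,000 URLs."""
    return len(xml_bytes) <= MAX_BYTES and len(read_sitemap(xml_bytes)) <= MAX_URLS


# Hypothetical two-entry sitemap used only for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://www.example.com/about.html</loc><priority>0.5</priority></url>
</urlset>"""

urls = read_sitemap(SAMPLE)
```

Note that the whole file has to be downloaded and parsed before any URL is known, which is the cost the rest of the proposal tries to reduce.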

Page 4: Better Sitemap (Mozilla Drumbeat)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.google.com/</loc><priority>1.000</priority></url>
  <url><loc>http://www.google.com/3dwh_dmca.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/cpanel/domain</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/edu/</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/new.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/overview.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/privacy.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/program_policies.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/terms.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/testimonials.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/admins/tour.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/administration.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/benefits.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/calendar.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/asu.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/customers/pdfs/asu_success_story.pdf</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/details.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/features.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/gmail.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/pagecreator.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/seminars.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/startpage.html</loc><priority>0.5000</priority></url>
  <url><loc>http://www.google.com/a/help/intl/en/edu/talk.html</loc><priority>0.5000</priority></url>

• Messy
• Huge (google.com's sitemap is 3.9 MB)
• Useless (for humans)

Page 5: Better Sitemap (Mozilla Drumbeat)

Improvements

Page 6: Better Sitemap (Mozilla Drumbeat)

Aims

• For robots:
  – Faster
  – More efficient
• For humans:
  – More useful
  – At least readable by the human's web client: the browser
  – A browser already uses about 5 KB of bandwidth to download a favicon. Why not use that bandwidth to download more useful material?

Page 7: Better Sitemap (Mozilla Drumbeat)

Hierarchical

Sitemap

• Parent page
• Sibling pages
• Child pages
• Parsable by web browsers
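
As a rough illustration of the parent, sibling, and child relationships named on this slide, the sketch below derives them from URL paths. The slides do not say how the hierarchy is defined, so treating the URL path as the hierarchy is an assumption, as are the function names `segments` and `neighbourhood` and the `example.com` URLs.

```python
from urllib.parse import urlsplit


def segments(url: str) -> tuple[str, ...]:
    """Split a URL path into its non-empty segments; the root page is ()."""
    return tuple(part for part in urlsplit(url).path.split("/") if part)


def neighbourhood(url: str, all_urls: list[str]) -> dict[str, list[str]]:
    """Return the parent, siblings and children of `url` within `all_urls`."""
    here = segments(url)
    parent, siblings, children = [], [], []
    for other in all_urls:
        there = segments(other)
        if there == here:
            continue  # the page itself
        if here and there == here[:-1]:
            parent.append(other)  # one level up
        elif here and len(there) == len(here) and there[:-1] == here[:-1]:
            siblings.append(other)  # same level, same parent
        elif len(there) == len(here) + 1 and there[:-1] == here:
            children.append(other)  # one level down
    return {"parent": parent, "siblings": siblings, "children": children}


# Hypothetical site used only for illustration.
SITE = [
    "http://www.example.com/",
    "http://www.example.com/a",
    "http://www.example.com/b",
    "http://www.example.com/a/edu/",
    "http://www.example.com/a/cpanel/domain",
]

nav = neighbourhood("http://www.example.com/a", SITE)
```

A browser given only this small neighbourhood could already show the user a way up, across, and down from the current page without ever seeing the full URL list.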

Page 8: Better Sitemap (Mozilla Drumbeat)

Hierarchical

The browser can tell the user where he or she is within the site.

Page 9: Better Sitemap (Mozilla Drumbeat)

Chronological

• <lastmod> is already in Sitemap 0.90
• But entries are not sorted by it
• Proposal: present the sitemap in chronological order
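
A minimal sketch of the chronological ordering, assuming every entry carries a <lastmod> value in the date-only form YYYY-MM-DD and that "chronological" means newest first. The entries and the function name `newest_first` are hypothetical.

```python
from datetime import date

# Hypothetical (loc, lastmod) pairs, as they might be read from <url> elements.
ENTRIES = [
    ("http://www.example.com/about.html", "2009-03-01"),
    ("http://www.example.com/", "2009-12-12"),
    ("http://www.example.com/news.html", "2009-11-30"),
]


def newest_first(entries):
    """Sort (loc, lastmod) pairs so the most recently modified page comes first."""
    return sorted(entries, key=lambda e: date.fromisoformat(e[1]), reverse=True)


ordered = newest_first(ENTRIES)
```

With this ordering, anything a reader cares about after a given date sits in a contiguous run at the top of the file.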

Page 10: Better Sitemap (Mozilla Drumbeat)

Chronological

Browser showing newly updated pages

Page 11: Better Sitemap (Mozilla Drumbeat)

Chronological

• Robots:
  – Do not have to download huge sitemap files every time
  – Only download the first few chunks
• Browsers:
  – Can easily tell surfers where the newly updated content is located
  – Unlike RSS, not limited to blogs or blog-like sites
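
The robot-side benefit can be sketched as an early stop: if the sitemap is sorted newest first, a crawler can read entries as they arrive and quit at the first one that is not newer than its previous crawl. This is only an illustration under that assumption; `updated_since`, the sample data, and the requirement that every entry has a date-only <lastmod> are not part of the proposal.

```python
import io
import xml.etree.ElementTree as ET
from datetime import date

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def updated_since(stream, last_crawl: date) -> list[str]:
    """Read a newest-first sitemap; stop at the first entry not newer than last_crawl."""
    fresh = []
    # iterparse yields each element as soon as its closing tag has been read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "url":
            continue
        lastmod = date.fromisoformat(elem.findtext(NS + "lastmod"))
        if lastmod <= last_crawl:
            break  # everything after this entry is older, so stop reading
        fresh.append(elem.findtext(NS + "loc"))
        elem.clear()  # release memory held by the processed entry
    return fresh


# Hypothetical newest-first sitemap used only for illustration.
SAMPLE = io.BytesIO(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><lastmod>2009-12-12</lastmod></url>
  <url><loc>http://www.example.com/news.html</loc><lastmod>2009-11-30</lastmod></url>
  <url><loc>http://www.example.com/about.html</loc><lastmod>2009-03-01</lastmod></url>
</urlset>""")

fresh = updated_since(SAMPLE, date(2009, 11, 1))
```

On a real network stream the saving comes from closing the connection after the break, so the remaining chunks of a multi-megabyte file are never transferred.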

Page 12: Better Sitemap (Mozilla Drumbeat)

More Efficient (Draft)

• Multiple versions
  – Chronological
    • Robots do not have to download the whole sitemap for each crawl
  – Hierarchical
• Seekable
  – With a header index
  – Only download the needed portions
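
The draft does not define what the header index looks like, so the following is purely a hypothetical layout to show the idea: the first line is a JSON index mapping each section name to an offset and length, followed by the section bodies. The names `build_seekable` and `read_section` and the section keys are assumptions.

```python
import json


def build_seekable(sections: dict[str, list[str]]) -> bytes:
    """Pack sections into one blob: a JSON index line, then the section bodies."""
    index, body = {}, b""
    for name, urls in sections.items():
        chunk = "\n".join(urls).encode("utf-8") + b"\n"
        index[name] = [len(body), len(chunk)]  # offset and length within the body
        body += chunk
    header = json.dumps(index).encode("utf-8") + b"\n"
    return header + body


def read_section(blob: bytes, name: str) -> list[str]:
    """Read the header line, then only the byte range of the requested section."""
    header, _, body = blob.partition(b"\n")
    offset, length = json.loads(header)[name]
    # This slice stands in for a partial download; over HTTP the same bytes
    # could be requested with a Range header instead of fetching the whole file.
    return body[offset:offset + length].decode("utf-8").split()


# Hypothetical sections used only for illustration.
blob = build_seekable({
    "/a/": ["http://www.example.com/a/edu/"],
    "/b/": ["http://www.example.com/b/one.html", "http://www.example.com/b/two.html"],
})

section_b = read_section(blob, "/b/")
```

The client pays for the small header plus one section, rather than for the whole file.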

Page 13: Better Sitemap (Mozilla Drumbeat)

More Efficient (Draft)

• Smarter
  – Each page serves a sitemap based on where the client/user is.
  – Do not have to download the whole sitemap.
  – Do not have to parse the whole sitemap.
  – Able to keep the file size small (approx. 5 KB) so browsers can load it quickly.
• Switch away from XML?
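
One way to picture a per-page sitemap that stays near the 5 KB target is sketched below. The slide leaves the format open, so compact JSON is used here only as one possible non-XML choice; `page_fragment`, the field names, and the strategy of trimming trailing children to fit the budget are all assumptions.

```python
import json

BUDGET_BYTES = 5 * 1024  # the "approx. 5 KB" target from the slide


def page_fragment(parent, siblings, children, budget=BUDGET_BYTES) -> bytes:
    """Serialise one page's neighbourhood, dropping trailing children until it fits."""
    children = list(children)  # copy so the caller's list is not modified
    while True:
        payload = json.dumps(
            {"parent": parent, "siblings": siblings, "children": children},
            separators=(",", ":"),  # compact output without extra whitespace
        ).encode("utf-8")
        if len(payload) <= budget or not children:
            return payload
        children.pop()  # trim the last child and try again


# Hypothetical page with far more children than the budget allows.
fragment = page_fragment(
    "/a/",
    ["/a/x.html"],
    [f"/a/edu/page{i}.html" for i in range(500)],
)
```

How to choose which children to drop (for example, oldest first) is a design decision the draft does not address.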

Page 14: Better Sitemap (Mozilla Drumbeat)

Project Summary

• For robots and humans alike
• Chronological
• Hierarchical
• Seekable
• Smarter

Better Sitemap

U-Zyn Chua

[email protected]

This work is licensed under a Creative Commons Attribution 3.0 License. All other trademarks, logos and copyrights are the property of their respective owners.