Proposal for Setting Canonical Host via Robots.txt

This is a proposal to indicate a preferred host name (e.g. domain with or without "www") for search engine robots by adding a "Canonical-host" entry to the robots.txt file. Valid host values are as per RFC 2396 and RFC 2732, i.e. "hostname | IPv4address | [IPv6address]". For example:

User-agent: *

or

User-agent: *

or

User-agent: *

or

User-agent: *

or

User-agent: *

Rationale

It has long been common practice to make a web site accessible both with and without a "www" host name. This remains the way sites are almost always configured by default by an ISP under managed hosting plans. While potentially useful from a usability standpoint (both www.example.com and its shorter form, example.com, will work when typed in a browser's address field), this results in several problems as soon as the different URLs pointing to the same host are published on the web, despite the site maintainer's preference for one specific form. Known issues include:
Solutions to this are limited in part because:
Resorting to robots.txt to solve this problem comes naturally for several reasons:
While this discussion centers on the presence or lack of the "www" host name, which is a very practical and frequent issue, the aim is to propose a flexible solution that can be applied to other situations as well.

Conclusion

In consideration of the above, the proposal is made to define an extension token named "Canonical-host", allowing the maintainer of a web site to indicate a preferred host name value to be used by robots to access and index the site. More specifically:
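As a rough illustration only, and not part of the proposal text, the following Python sketch shows how a robot might read such an entry and rewrite a discovered URL to the preferred host. It assumes the entry is written as a "Canonical-host: value" line, that the first such line applies regardless of which User-agent record it appears in, and that only the host part of a URL is replaced. The function names canonical_host and canonicalize are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical robots.txt content; the "Canonical-host: value" line syntax
# is an assumption made for this sketch.
ROBOTS_TXT = """\
User-agent: *
Canonical-host: www.example.com
"""


def canonical_host(robots_txt):
    """Return the first non-empty Canonical-host value, or None if absent."""
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        # Split on the first colon only, so "[IPv6address]" values survive.
        field, value = line.split(":", 1)
        if field.strip().lower() == "canonical-host" and value.strip():
            return value.strip()
    return None


def canonicalize(url, host):
    """Replace the host of url with the preferred host.

    Scheme, explicit port, path, query and fragment are kept;
    any userinfo in the original URL is dropped (a simplification).
    """
    parts = urlsplit(url)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

With the sample file above, canonical_host(ROBOTS_TXT) yields "www.example.com", and canonicalize("http://example.com/page?id=1", "www.example.com") yields "http://www.example.com/page?id=1". A real robot would also need to decide which host's robots.txt is authoritative when the two hosts disagree, which this sketch does not address.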
Post Scriptum: Robots.txt vs. rel="canonical"

In 2009 the major search engines announced support for the rel="canonical" attribute. Although it works from a per-page rather than a per-site perspective, the new implementation addresses many of the needs covered by this proposal. At the same time, it requires adding a tag to each page, and it cannot be applied to scenarios where the content administrator has no control over the HTML headers, e.g. with many content management systems or with web services, not to mention non-HTML content (audio, video, images, etc.).

Feedback

Any feedback is most certainly appreciated.
Michael C. Battilana