Robots, keep out!

Search engines use software ‘robots’ to examine pages and follow links. These tools are designed to discover stuff, and they’re very good at it! If you leave a link anywhere Google can find it, the content will find its way into the index, sometimes with embarrassing results! A robots.txt file is simply a way of asking search engines to stay away from certain content, so that it doesn’t end up in the index. Well-behaved robots honour it, but it is a request, not a lock.

Why bother?

Good examples of content you might not want indexed are template files, test pages, archived content, and out-of-date content that still has to be publicly accessible. It’s worth keeping all of this out of the index, because the less irrelevant content you have in there, the easier it is for visitors to find the good content. Google may even penalise a site that has old, poor-quality, or duplicate content.

Setting it up

Setting up a robots.txt file is dead simple. It’s just a plain, unformatted text file that sits at the root of the website. Try typing /robots.txt after a few of your favourite domains and see what comes up; you can learn a lot from seeing what others do.
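Because the file always sits at the root, you can work out where it lives from any page address on the same site. Here is a minimal Python sketch of that rule; the function name robots_url and the example.com address are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt lives at the root,
    # never inside a subdirectory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_url("https://www.example.com/blog/2010/some-post.html"))
# prints https://www.example.com/robots.txt
```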

In most cases, however, you’ll be keeping it pretty simple. First, you’ll specify who you’re giving directions to. Most often that’s everyone, so you’ll use the wildcard asterisk character:

User-agent: *

and next, you simply list the things (individual files, or whole directories) you don’t want indexed, so your complete file is going to look something like:

User-agent: *
Disallow: /mytest/
Disallow: /template-assets/
Disallow: /old-home.html
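If your list of exclusions lives somewhere else (a settings file, say), a file like this can be generated rather than typed by hand. The following is only a sketch: build_robots_txt is a hypothetical helper, not part of any library, and it handles just the single rule group shown here.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build the text of a robots.txt file containing one rule group."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallowed_paths:
        # Paths are relative to the site root, so each must start with "/".
        if not path.startswith("/"):
            path = "/" + path
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Same three exclusions as the example file.
robots_txt = build_robots_txt(["/mytest/", "/template-assets/", "/old-home.html"])
print(robots_txt, end="")
```

Save the printed text as robots.txt at the root of the site.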

That’s it! If you want to test your robots.txt file, there are any number of free tools on the web. This one seemed to work quite smoothly, and there’s a good one in Google’s Webmaster Tools.
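You can also run a quick check yourself. Python’s standard library includes a robots.txt parser, urllib.robotparser, and the sketch below uses it to see which paths the example rules block. The helper name blocked_paths, the robot name ExampleBot, and the sample paths are all invented for illustration, and real search engines may read some rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example file from this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /mytest/
Disallow: /template-assets/
Disallow: /old-home.html
"""


def blocked_paths(robots_txt, paths, user_agent="ExampleBot"):
    """Return the paths that the given rules tell user_agent not to fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical paths to check against the rules.
sample = ["/", "/index.html", "/mytest/draft.html",
          "/template-assets/base.css", "/old-home.html"]
for path in blocked_paths(ROBOTS_TXT, sample):
    print("blocked:", path)
```

Everything under /mytest/ and /template-assets/ is reported as blocked, along with /old-home.html, while the home page and /index.html are left alone.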

