Controlling Search Engines with Robots.txt
Using a robots.txt file to control which pages are seen by search engines
by David Fitzgerald
Introduction
As mentioned in previous articles, search engines can be a great source of
traffic for a standard business or personal website. What would
happen though, if you didn't want to appear in them?
This is the purpose of robots.txt files.
While they generally do not help you get listed, they
can help ensure that you don't get listed if you wish not to be.
What is a robot?
A robot (also shortened to just "bot", or called a
spider) is a computer program that visits websites and collects
information from them.
Different bots do different things, depending on the
owner's reasons for running them. In the case of search engines, the
robot's purpose is to collect information about what your site
contains, ready for it to be included in the search engine.
So where does the robots.txt file fit in?
Search engines generally like to respect the owners of
websites. Most give site owners the option of keeping some or all of
their pages out of the search engine. The robots.txt file, a plain
text file placed in the top-level directory of your site, is how you
tell them.
Before the bot goes around your site looking at the
various pages you have, it will take a look inside your robots.txt file
first to see if it is allowed to.
If the bot doesn't find a robots.txt file, or the file
is blank, it will normally assume you don't want any robot blocked and
that the robot is free to roam around your site.
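The check described above can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration of how a well-behaved bot might make the decision, not how any particular search engine implements it. The helper name is_allowed and the www.example.com address are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text.

    is_allowed is a hypothetical helper written for this article.
    """
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A blank robots.txt places no restrictions on any robot.
print(is_allowed("", "googlebot", "http://www.example.com/greentree.html"))
```

With a blank file the parser finds no rules, so the page is reported as open to the robot.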
So how do I control where it can go?
Robots.txt files can either specify individual robots
that have to be restricted, or cover them all with the one command.
Commands for robots consist of two parts:
- User-agent: the name of the robot to control
- Disallow: the path the robot is banned from accessing, written
starting from the top of the site with a leading "/"
In the example below, we block the robot called
googlebot from accessing greentree.html. Googlebot is the name of
Google's search engine robot, and once it can no longer read this page,
the page should drop out of Google the next time they update their
results.
User-agent: googlebot
Disallow: /greentree.html
While this works great for that individual page, what
if we wanted to block more of the site? It would be highly inefficient
to list every page individually, but we can block a whole directory
with a single line:
User-agent: googlebot
Disallow: /greentree.html
Disallow: /frogs/
The above code would block googlebot from accessing
greentree.html and every page in the frogs directory.
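To see how rules like these are interpreted, the sketch below feeds them to Python's standard urllib.robotparser module and asks which pages each robot may fetch. The helper check_paths, the www.example.com address and the page names pond.html and index.html are assumptions made for illustration, and the paths are written with a leading "/", which this parser needs in order to match them.

```python
from urllib.robotparser import RobotFileParser

# Rules for googlebot: one page and one whole directory are blocked.
RULES = """\
User-agent: googlebot
Disallow: /greentree.html
Disallow: /frogs/
"""


def check_paths(robots_txt, user_agent, paths, site="http://www.example.com"):
    """Map each path to True (allowed) or False (blocked) for user_agent.

    check_paths and the example site are hypothetical, for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, site + path) for path in paths}


paths = ["/greentree.html", "/frogs/pond.html", "/index.html"]
print(check_paths(RULES, "googlebot", paths))  # first two blocked, index allowed
print(check_paths(RULES, "slurp", paths))      # no rules name slurp: all allowed
```

Only googlebot is affected here, because the rules sit under its User-agent line. Any other robot is still free to read every page.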
This still does not block the whole site, but we have
already significantly reduced the areas that can be seen. To block the
whole site we disallow the "/" directory. Every path on the site
starts with "/", so this covers absolutely everything on the site.
For example:
User-agent: googlebot
Disallow: /
You now have the ability to block as many bots as you
like by naming each one individually down the file. In the case below
we have banned googlebot and slurp (the name of Yahoo's robot) from the
site.
User-agent: googlebot
Disallow: /
User-agent: slurp
Disallow: /
If the same rules apply to all bots, we can specify
them with the "*" character instead of naming each bot.
User-agent: *
Disallow: /
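The difference between naming bots one by one and using "*" can be checked with a short Python sketch built on the standard urllib.robotparser module. The helper blocked_bots, the robot name examplebot and the www.example.com address are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Two robots banned by name.
NAMED_BOTS = """\
User-agent: googlebot
Disallow: /

User-agent: slurp
Disallow: /
"""

# Every robot banned with the "*" wildcard.
ALL_BOTS = """\
User-agent: *
Disallow: /
"""


def blocked_bots(robots_txt, bots, url="http://www.example.com/index.html"):
    """Return the bots from the list that may NOT fetch url.

    blocked_bots is a hypothetical helper written for this article.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


bots = ["googlebot", "slurp", "examplebot"]
print(blocked_bots(NAMED_BOTS, bots))  # ['googlebot', 'slurp']
print(blocked_bots(ALL_BOTS, bots))    # ['googlebot', 'slurp', 'examplebot']
```

The named version leaves any robot that is not listed free to roam, while the "*" version covers robots you have never heard of as well.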
Finally, it is worth mentioning that while almost every
bot likes to play nicely with the websites it visits, there are some
that do not. If you have pages that really shouldn't be seen by any
sort of robot, then perhaps you should use password protection.