Information about robots.txt file...

The robots.txt file is your invitation to the search engines when they send a spider to crawl through the pages of your website. These spiders are out and about on the web gathering information that contributes to your search engine ranking.

There are different kinds of robots (also called spiders or crawlers). Some are just taking inventory and some are looking for what's new. Whatever kind of robot they are, when they hit your site you have some control over what they do, and you need to give them some instructions.

You can block bots and tell them which pages they can and can't look at. Some robots also accept nonstandard hints about how often they should visit, but the basic syntax described below covers only blocking.

There are probably a thousand free robots.txt generators out there, and here is one you can use: 1-hit.com

If you want to generate your own file, read on...

The syntax is very limited and easy to understand. The first part specifies the robot we are referring to.

User-agent: BotName

Replace BotName with the robot name in question. To address all of them, simply use an asterisk.

User-agent: *

The second part tells the robot in question not to enter certain parts of your web site.

Disallow: /cgi-bin/

In this example, any path on our site starting with the string /cgi-bin/ is declared off limits. Multiple paths can be excluded per robot by using several Disallow lines.

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private

This robots.txt file would apply to all bots and instruct them to stay out of directories /cgi-bin/ and /temp/.

It also tells them any path/URL on your site starting with /private (files and directories) is off limits.
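
The prefix matching described above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the robot name ExampleBot and the test paths are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private
"""

def is_allowed(path, agent="ExampleBot"):
    """Return True if the rules let the (hypothetical) robot fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical paths, chosen to show how prefix matching behaves.
for path in ["/index.html", "/cgi-bin/form.pl", "/temp/",
             "/private.html", "/private/notes.txt"]:
    print(path, is_allowed(path))
```

Only /index.html comes back as allowed. Note that /private.html is blocked as well, because the rule /private has no trailing slash and so matches any path that starts with that string.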

To declare your entire website off limits to BotName, use the example shown below.

User-agent: BotName
Disallow: /
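
As a quick check, the same two lines can be fed to Python's standard urllib.robotparser module to confirm that only the named robot is shut out. OtherBot and the page path are placeholders invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: BotName", "Disallow: /"])

# BotName is refused everywhere on the site.
botname_ok = parser.can_fetch("BotName", "/any/page.html")
# A robot that is not named in the file (hypothetical OtherBot) is unaffected.
otherbot_ok = parser.can_fetch("OtherBot", "/any/page.html")

print(botname_ok, otherbot_ok)
```

This prints False for BotName and True for OtherBot, since a robot with no matching record is not restricted.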

To have a generic robots.txt file which welcomes every robot and does not restrict them, use this sample.

User-agent: *
Disallow:
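
If you would rather build the file from a script, the syntax is simple enough to generate in a few lines. The helper below is a minimal sketch, not a full generator: build_robots_txt and its rules dictionary are names invented for this example, and it handles only the User-agent and Disallow lines covered so far.

```python
def build_robots_txt(rules):
    """Build robots.txt text.

    rules maps a robot name to a list of path prefixes to disallow.
    """
    blocks = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty list becomes a bare "Disallow:", which restricts nothing.
        lines += [f"Disallow: {p}" for p in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # A blank line separates one robot's record from the next.
    return "\n\n".join(blocks) + "\n"

# Shut out BotName completely and welcome every other robot.
text = build_robots_txt({"BotName": ["/"], "*": []})
print(text)
```

Save the output as robots.txt in the top-level directory of your site, which is where robots look for it.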

Here's a list of common robot names. Many others exist...

Several different types of bots are always exploring the World Wide Web. The best known ones are classified here.

1. Search Engine Robots

Below is a list of the most popular and active search engine bots (also called spiders).

Google: "Googlebot"
Google Images: "Googlebot-Image"
Inktomi: "Slurp"
WiseNut/LookSmart: "ZyBorg"
Fast/AllTheWeb: "fast"
OpenFind: "Openbot"
Alta Vista: "Scooter"

If you exclude any part of your site from these robots, that part will not show up in their search results either.

2. Bots Used By Spammers

Unless you enjoy receiving lots of spam, you don't want these on your website. They look for email addresses on web pages to send their junk email to.

EmailSiphon
EmailWolf
ExtractorPro
CherryPicker
NICErsPRO
Teleport
EmailCollector

These will ignore the robots.txt file, as they want to find new email addresses by any means possible. There is a way to refuse them access to your site via an .htaccess file.
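
Refusing a robot on the server side comes down to inspecting the User-Agent header of each request. The snippet below is not .htaccess syntax. It is a Python sketch of the same matching idea, and the function name is_spam_bot and the sample header strings are assumptions made for illustration.

```python
# Robot names taken from the list above.
SPAM_BOTS = ["EmailSiphon", "EmailWolf", "ExtractorPro", "CherryPicker",
             "NICErsPRO", "Teleport", "EmailCollector"]

def is_spam_bot(user_agent):
    """Return True if the User-Agent header contains a listed robot name."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in SPAM_BOTS)

# Sample header values, made up for this example.
print(is_spam_bot("EmailSiphon"))
print(is_spam_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

The first call reports a match and the second does not. Keep in mind that a robot can send any User-Agent string it likes, so name matching only stops the ones that identify themselves honestly.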

3. Others

These claim to respect the robots.txt file and you can block them (if you wish) by robot name, as usual.

"TurnitinBot" detects Plagiarism
"NPBot" for Intellectual Property

The robots in the following list have to be banned from your web hosting account via .htaccess, if you wish to block them.

"LinkWalker" is a Link Directory Builder
"Zeus Link" is also a Directory Builder

These robots look for reciprocal link partners. If you're interested in that type of venture, do not block them.
