The Oft-Overlooked Robots.txt File
by Aaron Turpen of Aaronz WebWorkz
When a search engine spider accesses your website, it will usually look first for a file in the root directory of your site (where your website begins) called “robots.txt.” The robots.txt file tells the spider what it may spider (index/parse). The standard for all of this is called “The Robots Exclusion Standard.”
The format for this standard is very simple. It consists of records in a text file, each record consisting of two fields: a user-agent line and one or more disallow lines. These fields are formatted in a specific way so that the spider program can read them. You’ll see examples of this formatting later in this article.
The first field is the “User-agent” field, which is used to specify which robot the “Disallow” lines that follow apply to. Usually, this contains the wildcard character “*” to specify all robots. In some cases, however, you may wish to exclude only specific robots, such as the googlebot.
The second field is the “Disallow” field, which can actually contain several lines. You can specify that robots are to ignore specific files, whole directories, or combinations of these. Password-protected directories (such as those on a Unix system using .htaccess files) are usually inaccessible to robots anyway, but it’s a good idea to include them in the “Disallow” lines as well.
To create or edit your robots.txt file, you’ll need a text editor such as Notepad. Whatever you use, just make sure it saves the file as plain text and in no other format. Your HTML editor usually has this function.
Comments can be added using the “#” character, which marks the rest of the line as a comment. Since the file’s contents are pretty self-explanatory, comments are rarely used. The first line of your robots.txt file is the User-agent line, so the first line will probably look like this:
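Using the wildcard described earlier, that opening line would presumably be (a minimal sketch):

```
# Everything after the "#" on this line is a comment
User-agent: *
```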
You can replace the “*” with any robot’s name, if you wish. For a complete and up-to-date list of spider names, visit http://www.searchenginedictionary.com/spider-names.shtml.
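For instance, assuming the robot identifies itself as “googlebot” (the name mentioned earlier in this article; check the spider list for the exact name a given robot uses), the line would read something like:

```
# Assumes the robot's name is "googlebot"
User-agent: googlebot
```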
The next line or lines list the files or directories you wish to block from the spider or spiders you’ve specified in the User-agent line:
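For a single file in the root directory, such a line would presumably read:

```
Disallow: /dontindexthis.html
```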
This would block spiders from indexing the file “dontindexthis.html” in your root directory. To disallow a whole directory, just use the same format:
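As a sketch, with “/private/” standing in as a hypothetical directory name:

```
# "/private/" is a placeholder directory name
Disallow: /private/
```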
To disallow specific files in sub-directories, you would use a combination of these:
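A sketch of that combination, where both the directory and the file name are hypothetical placeholders:

```
# "/archive/" and "oldpage.html" are placeholder names
Disallow: /archive/oldpage.html
```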
A Disallow value matches any path that begins with it, so it works much like a wildcard. You can specify a file AND directory of the same name in the same line like this:
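Going by the name used in the explanation that follows, that line would presumably read:

```
# No trailing slash, so every path beginning with /notthisone is matched
Disallow: /notthisone
```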
This blocks both the directory /notthisone/ and any files named “notthisone.” (such as “notthisone.html” or “notthisone.cgi”). You can also block all files on the site by just putting a “/” in the Disallow line:
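Applied to every robot, that record would presumably be:

```
User-agent: *
# A lone "/" matches every path on the site
Disallow: /
```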
A completed robots.txt file will look something like this:
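The sketch below simply combines the lines discussed above into one record. The names “/private/” and “/archive/oldpage.html” are hypothetical placeholders; substitute the files and directories on your own site.

```
# Rules for all robots
User-agent: *
Disallow: /dontindexthis.html
# Placeholder directory and file names
Disallow: /private/
Disallow: /archive/oldpage.html
Disallow: /notthisone
```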
If you want to get really complicated with your robots.txt file, I’d suggest you look at some of the robots.txt files of the big boys of the Internet like Amazon.com or eBay. You can find these by simply typing in the URL followed by “/robots.txt” (as in: http://www.amazon.com/robots.txt). These files are universally accessible via the Web as a rule.
A missing robots.txt file and a blank robots.txt file are treated the same: either way, the spider will index everything on your site, whether you want it to or not. So implementing a robots.txt file is important to your site’s success.
Aaron Turpen is the proprietor of Aaronz WebWorkz and the author of several informative e-books, including “The Layman’s Handbook To Doing Business Online,” in which this article appears. His books are available from his website: http://www.AaronzWebWorkz.com