The Robots Exclusion Standard, also known as the Robots Exclusion Protocol, tells search engine spiders which directories on your website should be skipped or disallowed. The robots.txt file is as important as site structure, site content, search engine friendliness and meta descriptions. If it is implemented incorrectly, it can easily trip up a website: small errors in the robots.txt file can prevent your pages from being crawled by search engines, and they can change the way search engines index your site, which can have adverse effects on your SEO strategy. If you are interested in knowing more about the Robots Exclusion Protocol, see http://en.wikipedia.org/wiki/Robots_exclusion_standard.
The robots.txt file can be found in the root of the domain. If you open it in a text editor, you will find a list of directories that the site's webmaster asks search engines to skip. It is therefore important to ensure that the file does not ask search engines to skip important directories on your website. You can also use the robots.txt file to keep 'bad bots' from indexing your site.
General Robots.txt format
The robots.txt file has to be placed in the root of your domain (for example, domain.com/robots.txt). The general format used to exclude all robots from indexing certain parts of a website is given below.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
When the above syntax is used, search engine robots are told to avoid indexing the /cgi-bin/, /tmp/ and /junk/ directories on the website.
Some examples of Robots.txt
Example #1: Allow indexing of everything
User-agent: *
Disallow:
Example #2: Disallow indexing of everything
User-agent: *
Disallow: /
Example #3: Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/
Example #4: Disallow Googlebot from indexing a folder, except for one file in that folder
User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
Example #5: Allow only one specific robot access
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
Example #6: To exclude a single robot
User-agent: BadBot
Disallow: /
Why it is beneficial to use Robots.txt
- Using robots.txt, you can disallow directories that you do not want search engine robots to index, for example /cgi-bin/, /scripts/, /cart/, /wp-admin/ and other directories that may contain sensitive data (see the sample file after this list).
- Certain directories on your website may contain duplicate content, such as print versions of articles or web pages. You can use robots.txt to block the duplicate versions so that search engines index only one copy of the content.
- You can help ensure that search engine bots focus on indexing the main content of your website.
- You can prevent search engines from indexing certain files in a directory that may contain scripts, personal data or other kinds of sensitive information.
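To illustrate these points, the sample below combines the ideas above in a single robots.txt file. The directory names used here (/cgi-bin/, /wp-admin/, /print/) are only placeholders; replace them with the paths that actually exist on your site.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /print/
With these rules in place, compliant crawlers will skip the script, administration and duplicate print directories while the rest of the site remains crawlable.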
What to avoid in Robots.txt
- Avoid the use of comments in the ‘robots.txt’ file
- The robots.txt standard does not define a “/allow” command, so avoid using such commands in the file. Only some crawlers, such as Googlebot, support an Allow directive (as in Example #4 above), so do not rely on it when addressing all robots.
- Do not list every file you want to hide individually, as this tells others exactly which files you are trying to hide. Instead, put such files in a directory and disallow that directory, as shown in the sketch below.
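For example, instead of disallowing each sensitive file one by one, you could group them under a single directory and block that directory; the path /private/ used below is only a placeholder.
User-agent: *
Disallow: /private/
Keep in mind that robots.txt is publicly readable, so this approach reduces, but does not remove, the information it reveals about your site's structure.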