In the context of search engines, robots are programs that automatically travel across the World Wide Web to index a website’s content. Robots, or web robots, can also be used outside of search engines, with both good and bad intentions, such as scanning a website for e-mail addresses to spam.

What Is robots.txt?

A robots.txt file lives in the root directory of a website and gives instructions to web robots about what they can and cannot do. This convention is known as the Robots Exclusion Protocol.

The robots.txt file is the first file a web robot will look for before doing anything else. It should also be noted that not all web robots obey the robots.txt file, but the important ones, like Google’s, will.

Creating the File

Create a plain text file named robots with a .txt extension, so it appears as robots.txt in a file manager. The file name and extension are case sensitive and must be all lowercase. Once you are done with the robots.txt file, upload it to the root directory of your website, so that if you visit it in a web browser, the URL is:

http://yourdomain.com/robots.txt

Replace yourdomain.com with your domain name.

The Basics

A robots.txt file can define one or many rules. Each rule consists of a User-agent line, which names a web robot, with instructions underneath it; each instruction goes on its own line. Here is an example with just one instruction:

User-agent: *
Disallow:
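An empty Disallow value means nothing is disallowed, so the rule above lets every web robot index the entire website. To block the whole site instead, disallow the root path:

User-agent: *
Disallow: /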

An example with more than one instruction:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /stuff/

The asterisk, *, is a wildcard and means all. In the examples above, the asterisk follows User-agent, so all web robots are targeted. You can target a specific web robot by replacing the asterisk with its name, like so:

User-agent: BadBot
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /stuff/

In the above example, the rule targets the web robot named BadBot.
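A single robots.txt file can also hold several rules, one per User-agent, separated by a blank line; a web robot follows the rule that names it and otherwise falls back to the asterisk rule. For example, to shut out BadBot completely while leaving the website open to everyone else:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: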

While the examples above use the Disallow instruction, you can also use the Allow instruction, or combine the two, like so:

User-agent: *
Disallow: /cgi-bin/
Allow: /images/

In the above example, all web robots are disallowed from indexing the cgi-bin directory but are allowed to index the images directory.

You can target a specific file, such as file.html, by including it at the end of the path. So, if you want to disallow indexing of the gallery.html file in the images directory, the rule would look like this:

User-agent: *
Disallow: /images/gallery.html

It’s also important to note that all other files and directories within the images directory can still be indexed; only the gallery.html file is disallowed.

That said, if you provide a directory, all directories and files within it will be disallowed or allowed as well. For example, say you have the following file structure:

/images/
|--- /cats/
|--- /dogs/
|--- /birds/

Cats, dogs and birds are sub-directories within the images directory. If your robots.txt file contains the following rule:

User-agent: *
Disallow: /images/

All web robots will be disallowed from accessing the cats, dogs and birds sub-directories, as well as the images directory.
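If you want to block a directory but keep one of its sub-directories open, major crawlers such as Googlebot support pairing Disallow with a more specific Allow. Keep in mind that Allow is an extension to the original protocol, so not every web robot honors it. A sketch using the file structure above:

User-agent: *
Disallow: /images/
Allow: /images/cats/

With this rule, Googlebot would skip the dogs and birds sub-directories but still index the cats sub-directory.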

Sitemaps

Major search engines like Google, Bing and Yahoo support sitemaps, which help get your web content into search engines faster. Sitemaps are easy to add, and you can list multiple ones, one per line:

Sitemap: http://yourdomain.com/sitemap.xml
Sitemap: http://yourdomain.com/sitemap2.xml

WordPress plugins like Yoast’s WordPress SEO, Google XML Sitemaps and Google Sitemap can automatically generate sitemap files for you.

Robots.txt File for WordPress

The following is a “base” example you can use for your WordPress installation:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /wp-login.php

# Sitemaps
Sitemap: http://yourdomain.com/sitemap.xml
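The # Sitemaps line is a comment; web robots ignore everything from the # to the end of the line, so comments are handy for labeling sections of the file.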

I’ve disallowed the wp-content folder and everything in it because I host my uploads on a CDN rather than on the website’s account, and search engines don’t need access to my themes and plugins. You can remove this instruction, or change it to allow the uploads folder while disallowing the plugins and themes folders, like so:

User-Agent: *
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

If the path to your uploads folder is different, please make sure you adjust the above example.

For Yoast’s WordPress SEO plugin, I use the following for sitemaps:

Sitemap: http://yourdomain.com/sitemap_index.xml 
Sitemap: http://yourdomain.com/post-sitemap.xml 
Sitemap: http://yourdomain.com/page-sitemap.xml 
Sitemap: http://yourdomain.com/category-sitemap.xml