When it comes to SEO, most people know that a website must have content, "search engine friendly" site architecture/HTML, and meta data (title tags and meta descriptions).
Robots.txt is another element that can trip up a website if it is implemented improperly. I was recently reminded of this while reviewing the website of a large company that had spent a great deal of money building a mobile version of its website in a sub-directory. That’s fine, but a disallow statement in the robots.txt file (Disallow: /mobile/) meant that the mobile site wasn’t accessible to search engines.
Let’s review how to properly implement robots.txt to avoid search ranking problems and damage to your business, as well as how to correctly disallow search engine crawling.
What is a Robots.txt File?
Simply put, if you go to domain.com/robots.txt, you should see a list of the website's directories that the site owner is asking the search engines to "skip" (or "disallow"). However, if you aren’t careful when editing a robots.txt file, you could add rules that really hurt your business.
There's plenty of information about the robots.txt file available at the Web Robots Pages, including the proper usage of the disallow feature and how to block "bad bots" from crawling your website.
The general rule of thumb is to make sure a robots.txt file exists at the root of your domain (e.g., domain.com/robots.txt). To exclude all robots from indexing part of your website, your robots.txt file would look something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
The above syntax would tell all robots not to index the /cgi-bin/, the /tmp/, and the /junk/ directories on your website.
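One way to sanity-check rules like these before publishing them is to run them through a parser. The following is a minimal sketch using Python's standard-library urllib.robotparser. The rules are the three directories named above; the helper name is_allowed and the sample paths are made up for illustration, and a given search engine may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules described above: every robot is asked to skip three directories.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""


def is_allowed(rules: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Sample paths are assumptions, not taken from any real site.
    for path in ("/", "/about.html", "/cgi-bin/form.pl",
                 "/tmp/cache.html", "/junk/old.html"):
        url = "https://domain.com" + path
        status = "allowed" if is_allowed(RULES, url) else "blocked"
        print(path, status)
```

Under these assumptions, the home page and /about.html are reported as allowed, and the three paths inside the disallowed directories are reported as blocked.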
Other Real Life Examples of Robots.txt Gone Wrong
In the past, I reviewed a website that had a good amount of content and several high quality backlinks. However, the website had virtually no presence in the search engine results pages (SERPs).
What happened? Penalty? Well, no. The site's owner had included a disallow to "/". They were telling the search engine robots not to crawl any part of the website.
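A mistake like this is easy to catch automatically. As a rough sketch, again using Python's standard-library urllib.robotparser, the check below flags a rule set that blocks the home page for all robots. The function name blocks_homepage is hypothetical, and the check only reflects how this particular parser reads the file.

```python
from urllib.robotparser import RobotFileParser


def blocks_homepage(rules: str, user_agent: str = "*") -> bool:
    """Return True if the rules stop `user_agent` from fetching "/".

    Hypothetical helper: a blocked home page is a strong hint that a
    site-wide "Disallow: /" has slipped into the file.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    site_wide_block = "User-agent: *\nDisallow: /"
    # An empty Disallow value means "nothing is disallowed".
    nothing_blocked = "User-agent: *\nDisallow:"

    print(blocks_homepage(site_wide_block))
    print(blocks_homepage(nothing_blocked))
```

The two sample rule sets differ by a single character, which is why this kind of error is worth testing for: "Disallow: /" shuts robots out of the whole site, while an empty "Disallow:" shuts them out of nothing.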
In another case, an SEO company edited the robots.txt file to disallow crawling of all parts of a website after the site's owner stopped paying the SEO company.
I also remember reviewing a company's website and noticing that quite a lot of directories that were part of their former site were disallowed in their robots.txt file. The company should have set up 301 permanent redirects to pass the value from the old web pages to the new pages instead of blocking the search engines from the old legacy pages. As a result, all of that value was lost.
Robots.txt Dos and Don'ts
There are many good reasons to stop the search engines from indexing certain directories on a website while allowing others for SEO purposes. Let's look at some examples.
Here's what you should do with robots.txt:
Take a look at all of the directories in your website. Most likely, there are directories that you'd want to disallow the search engines from indexing, including directories like /cgi-bin/, /wp-admin/, /cart/, /scripts/, and others that might include sensitive data.
Stop the search engines from indexing certain directories of your site that might include duplicate content. For example, some websites have "print versions" of web pages and articles that allow visitors to print them easily. You should only allow the search engines to index one version of your content.
Make sure that nothing stops the search engines from indexing the main content of your website.
Look for certain files on your site that you might want to disallow the search engines from indexing, such as certain scripts, or files that might contain email addresses, phone numbers, or other sensitive data.
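The checklist above can be turned into a small audit: list the pages that must stay crawlable and report any that the current rules block. This is only a sketch built on Python's standard-library urllib.robotparser. The function name find_blocked, the sample rules, and the URLs are all assumptions for illustration, with the /mobile/ rule mirroring the mistake described at the start of this article.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(rules: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs from `urls` that the rules block for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules: two sensible exclusions plus one accidental one.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /print/
Disallow: /mobile/
"""

# Hypothetical pages that should always be reachable by search engines.
MUST_BE_CRAWLABLE = [
    "https://domain.com/",
    "https://domain.com/products/widget.html",
    "https://domain.com/mobile/index.html",
]

if __name__ == "__main__":
    for url in find_blocked(RULES, MUST_BE_CRAWLABLE):
        print("Blocked by robots.txt:", url)
```

With these sample inputs, only the /mobile/ page is reported. Running a check like this whenever robots.txt changes is one way to catch an accidental disallow before the search engines do.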
Here's what you should not do with robots.txt:
Don't use comments in your robots.txt file.
Don't list all your files in the robots.txt file. Listing the files allows people to find files that you don't want them to find.
Don't add an "Allow" line for everything you want crawled. The original robots.txt standard has no "Allow" directive, and anything that isn't disallowed can be crawled by default, so there's no need to add one.
By taking a good look at your website's robots.txt file and making sure that the syntax is set up correctly, you'll avoid search engine ranking problems. By stopping the search engines from indexing duplicate content on your website, you can potentially overcome duplicate content issues that might hurt your search engine rankings.
One last note: if you aren't sure whether you can do this correctly, please consult with an SEO professional.