How To Increase Your Site Speed By Blocking or Controlling Search Bots

In case you hadn’t heard, Google most definitely considers how fast your site runs when it decides how well to rank you in the search results. But, ironically enough, it is often the search engines themselves which slow our sites to a crawl.

Well, you can put some controls in on this. It is a bit on the geeky side, but this is something you should pay attention to if you want your blog to be humming along nicely.

What Are We Talking About Here?

Search bots are little software programs despatched by the search engines in order to scan your site and index it in their directories. We, as normal internet users, will typically think of the big, obvious players like Google, Bing, Yahoo and perhaps some regional search engines. However, there are TONS of different bots out there.

Thing is, these bots were programmed by people. And, some are better programmed than others. The result of this is that some bots are respectful of your server while others will hammer it with so many requests that the whole server slows down.

What can happen is that your site slows for regular readers. Or your web host could end up sending you a nastygram telling you you’re using up too many resources.

When I recently switched my sites to another host, the move triggered all the bots out there to have a damn field day and re-index everything. The result was that my server was being SLAMMED and my sites were very slow for end users. Needless to say, this was concerning to me.

How To Deal With This

The solution is to block a bunch of bad bots and place some controls over others. There are two ways to do this:

Request that they live by some rules. This is done using a robots.txt file.
Outright block them. This is best done using the .htaccess file (assuming you’re on a server which uses Apache, which most do).

The robots.txt file is just that – a text file. And it sits right in the root folder of your website. Using this file, you can tell bots to ignore certain folders, block them altogether, or even control how often they come back.

Thing is, the robots.txt file is essentially a request. It is you saying, “Hey, Mr. Bot, if you would be so kind, can you play by these rules?” Most of the legit bots will obey a robots.txt. But, some simply ignore it.

For those shadier bots, you have to basically block them. Then they have no choice in the matter. They’re just screwed. 🙂

Working With Robots.txt

So, it’s easy. Create a file in any text editor called robots.txt. Then, put your directives into it and upload it to the root folder of your blog.

The way you specify a bot is by the “user agent”. This is a little string of text which the bot uses to identify itself. It will appear in all your server log files, and you can use it to identify the bot and give it directions. You can view a big honker list of bots and their user agents at User-Agents.org, or a slightly less ugly version in the Robots Database at Robotstxt.org.

There are a few things you can do with a robots.txt file:

User-agent: msnbot
User-agent: bingbot
Crawl-delay: 3

What this does is identify two particular bots and throttle them. In this case, it is MSNBot and BingBot, two major bots deployed by Microsoft for the MSN and Bing search engines. It just so happens that these bots are fairly well known for being very aggressive and pounding on servers pretty hard. Some webmasters have even gotten so pissed off that they’ve outright blocked Bing. But, I don’t recommend you do that. Instead, give it a “crawl delay”. This example tells Bing that it can only ping your server one time every 3 seconds.

Now, you can also block search bots from indexing certain folders of your site. For example:

User-agent: *
Disallow: /members/

This will tell all bots (as signified by the wildcard “*” character) that they are not allowed to crawl anything in the /members/ directory of your website.

You can also block certain bots or user agents, like so:

User-agent: Zeus Link Scout
Disallow: /
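
If you want to double-check how a well-behaved bot would read rules like these, here is a minimal sketch using Python's built-in urllib.robotparser module. It combines the three examples above into one file. The example.com URLs are placeholders, and this only shows how Python's parser interprets the rules, which is not a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The three example rule groups from this post, combined into one file.
ROBOTS_TXT = """\
User-agent: msnbot
User-agent: bingbot
Crawl-delay: 3

User-agent: *
Disallow: /members/

User-agent: Zeus Link Scout
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder domain, not a real site from this post.
print(parser.crawl_delay("bingbot"))                                        # 3
print(parser.can_fetch("Googlebot", "https://example.com/members/profile")) # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))           # True
print(parser.can_fetch("Zeus Link Scout", "https://example.com/"))          # False
```

One thing this makes visible: a bot that has its own group follows only that group. In the combined file above, bingbot gets the crawl delay but is not bound by the /members/ rule unless you repeat that Disallow line inside the Bing group.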

Now, there happens to be a huge bunch of programs out there that people use to download entire websites and do other various nefarious things. You can block them via robots.txt. Click here to download a sample ROBOTS.TXT file which contains a ton of bots you can block.

Blocking Bots Via .HTACCESS

Now, as I said above, a robots.txt file is essentially a request. While most bots will obey it, there are some who ignore it and simply won’t do what you tell them to do. For these, you can consider blocking them via an .HTACCESS file.

The .HTACCESS file is a configuration file for the Apache web server, which is the server software that a TON of web hosts use – chances are, yours, too. So, when you put an .HTACCESS file in the root folder of your blog directory, it will tell Apache to do certain things. Now, a lot of different things go into this file. For example, your WordPress permalink settings are controlled here. If you are using a cache plug-in, it probably has a bunch of settings in this .HTACCESS file. But, you can also put directives for search bots in there.

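## begin code
## Start blocking potentially unwanted bots.
RewriteEngine On
## No leading ^ here: these names usually sit in the middle of the user agent string.
## [NC] makes the match case-insensitive; [OR] chains this condition to the next one.
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baidu [NC,OR]
## The last condition must NOT carry [OR].
RewriteCond %{HTTP_USER_AGENT} BlackWidow [NC]
## [F] returns 403 Forbidden; [L] stops processing further rules.
RewriteRule ^.* - [F,L]
## end code. bai bots.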

This code is an example. It uses URL rewriting in Apache to, essentially, spit out a big fat 403 Forbidden error if the user agent matches any of the listed names.
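
Before uploading patterns like these, it can help to try them against a few sample user agent strings. The sketch below uses Python's re module as a rough stand-in for Apache's regex matching, so treat it as a sanity check rather than an exact simulation. The is_blocked helper and the sample strings are illustrative; pull the real strings from your own logs.

```python
import re

# Bot names to block, taken from the example rules above.
BLOCKED_PATTERNS = ["AhrefsBot", "Baidu", "BlackWidow"]

def is_blocked(user_agent, patterns=BLOCKED_PATTERNS):
    """Return True if any pattern appears anywhere in the user agent (case-insensitive)."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)

# Sample user agent strings (illustrative only).
ahrefs = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
google = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(is_blocked(ahrefs))   # True
print(is_blocked(google))   # False

# A pattern that starts with ^ only matches when the user agent BEGINS with
# that name, so it misses bots that identify themselves as "Mozilla/5.0 (...)".
print(bool(re.search(r"^AhrefsBot", ahrefs)))   # False
```

The last line is the thing to watch for: a pattern anchored with ^ will quietly let through any bot whose name appears later in the string.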

Here is a TEXT file of some of the bots being blocked in my own .HTACCESS file.

You can insert this stuff into an existing .HTACCESS file you may have in your WordPress directory. Just ensure that anything else that is already there is left alone.
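
If your block list gets long, typing out the conditions by hand gets error-prone. Here is a rough sketch of a Python helper that builds the rules from a list of bot names. The function name build_block_rules is made up for this example, and the output has not been tested against every Apache setup, so review it before pasting it into a live .HTACCESS file.

```python
import re

def build_block_rules(bot_names):
    """Build .htaccess rewrite rules that return 403 for the given bot names."""
    if not bot_names:
        # With no conditions, the RewriteRule would block every visitor.
        raise ValueError("need at least one bot name")

    lines = ["RewriteEngine On"]
    last = len(bot_names) - 1
    for i, name in enumerate(bot_names):
        # re.escape backslash-escapes spaces and regex special characters.
        pattern = re.escape(name)
        # Every condition except the last gets OR; the last must not.
        flags = "NC" if i == last else "NC,OR"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} [{flags}]")
    lines.append("RewriteRule ^.* - [F,L]")
    return "\n".join(lines)

print(build_block_rules(["AhrefsBot", "Baidu", "BlackWidow"]))
```

The two design choices here are deliberate: the last condition is written without [OR] so the chain ends cleanly, and an empty list raises an error instead of producing a rule that would lock everybody out.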

How Do You Know Which Bots Are Hammering Your Server?

Well, first of all, don’t freak out. I don’t want anybody to read this post and then get all paranoid on me. 🙂 If your site is loading up nice and quick, then there is no need to worry about any of this. Let the bots be bots, so to speak. 🙂 You really only need to look into this if your site is loading up slowly.

Additionally, just because your site is loading slowly doesn’t necessarily mean that bots are the cause of it. It could be that you’re just loading up way too many plug-ins. It could be that the server your site is on is just low-powered. You could potentially speed things up by installing a cache plug-in like W3 Total Cache.

But, bots can be an issue, because they eat up a lot of server resources and that can degrade the experience for actual human beings on your site. And, being that some of these bots are just useless and will never result in any traffic to your blog, that’s just a waste. So, how do you tell what they are?

Well, you have to look at your log files. I mean, your RAW log files. We’re not talking about Google Analytics here. If you’re on a cPanel server, it is actually fairly straightforward:

Log into cPanel.
In the “Logs” section, click on “Raw Access Logs”.
Click on the domain of your blog and you will be prompted to download the log file to your computer.
The log file will be GZipped, with a .GZ extension. So, you will need to uncompress it. If you have a GZIP utility installed, you may be able to simply double-click the file and it will uncompress.
The resulting file can then be opened in a text editor. The file may not have a .txt extension, so it won’t be associated with a text editor on your computer. But, you can still manually open it in a text editor.

Now, what you’re going to see is a bunch of log entries. Here’s an example from my own PCMech.com log file:
66.249.**.*** - - [18/Mar/2013:08:01:58 -0400] "GET /forum/build-your-own-pc/137750-please-check-compatibility-issues-recommendations.html HTTP/1.1" 200 50341 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
So, first you’ll see the IP address (I blocked out a few numbers for privacy reasons). Then, the date/time of the hit on your server. Then, the request itself. In this instance, it was a request on a page of my forum over on PCMech.com. Then, you get the server response… in this case a 200 response, which indicates a successful request, followed by the size of the response in bytes (50341) and the referrer (“-” here, meaning none was sent). Then, the user agent, which is “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”. That’s the part you want to pay attention to. In this case, this hit was from the Google bot.

So, you can scroll through your log file and see what’s hitting your server, and what they’re indexing.
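
Scrolling works for a small log, but on a busy site it is easier to let a script count the hits for you. Below is a minimal Python sketch that tallies requests per user agent. It assumes your log uses the same layout as the example line above (Apache's "combined" format), and the function names are just ones I picked for this example. It reads the .GZ file directly, so there is no need to uncompress it first.

```python
import gzip
import re
from collections import Counter

# Matches one line in the layout shown above:
# IP - - [time] "request" status size "referrer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_user_agents(lines):
    """Count hits per user agent, skipping lines that don't fit the layout."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("agent")] += 1
    return counts

def top_user_agents(path, limit=10):
    """Return the busiest user agents in a gzipped raw access log."""
    # Text mode ("rt") lets us iterate over the compressed file line by line.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        return count_user_agents(log).most_common(limit)

# Usage, with a hypothetical file name:
# for agent, hits in top_user_agents("yourdomain.com.gz"):
#     print(hits, agent)
```

The user agents at the top of that list are the ones worth looking up. If one you have never heard of is responsible for a big share of your hits, it is a good candidate for a crawl delay or a block.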

OK, well there you have it. Hopefully this will give you a head start on controlling bots on your server and keeping them from adversely affecting your site speed.

Got A Question? Need Some Assistance?

Have a question about this article? Need some help with this topic (or anything else)? Send it in and I’ll get back to you personally. If you’re OK with it, I might even use it as the basis of future content so I can make this site most useful.
