How to block Baidu spiders
Most of my visits are from Baidu spiders. I don't think it helps search engines at all, so I'm thinking of how to block them. Could this be done via iptables? I'm using nginx as my web server.
Just decided to block Baidu, as the amount of traffic it was sending us was too negligible to justify their aggressive scanning. In addition, they now run an agent that impersonates a browser, executes JavaScript (such as Google Analytics), and messed up our statistics.
The nice way is to update your robots.txt with the following:
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-image
Disallow: /
But considering what others have written here, and that they also use a user agent that hides their presence, I'd block their IP addresses altogether. Here is how it's done in nginx:
# Baidu crawlers
deny 123.125.71.0/24;
deny 180.76.5.0/24;
deny 180.76.15.0/24;
deny 220.181.108.0/24;
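These deny lines belong inside the http, server, or location context of your nginx configuration. A minimal sketch of a server block, assuming example.com and the paths are placeholders for your own setup:
# Hypothetical server block; example.com and the root path are placeholders.
server {
    listen 80;
    server_name example.com;
    root /var/www/example;

    # Baidu crawler ranges from above; matching requests get 403 Forbidden.
    deny 123.125.71.0/24;
    deny 180.76.5.0/24;
    deny 180.76.15.0/24;
    deny 220.181.108.0/24;
    allow all;
}
If you want the list in one place, you can also put the deny lines in a separate file and pull it into each server block with nginx's include directive.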
Use .htaccess with
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Baidu [NC]
RewriteRule .* - [L,F]
The "RewriteEngine On" allows you that the following lines are parsed
correctly. The HTTP_USER_AGENT is the line where the spiders are identifying
themselves. The condition is true if the line contains "MJ12bot" or "Baidu".
NC means "not case-sensitive" and you can chain conditions with OR.
The last line must not contain "OR" or the rule does not work.
Baidu is particularly nasty because it tries to read WordPress paths ("fckeditor", "wp-content") for which it has absolutely no reason. MJ12bot is also one of the bad critters.
The RewriteRule blocks the spider from every file (.* is a regular expression matching any file) with a 403 Forbidden ([F]) and stops further evaluation ([L]) of the .htaccess.
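To check that the rule actually fires, you can request a page with a spoofed user agent and look at the status code; this is just a sketch, with example.com standing in for your own site:
curl -I -A "Baiduspider" http://example.com/
A matching user agent should come back with "HTTP/1.1 403 Forbidden", while a request with a normal browser user agent should still return 200.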
I have just successfully blocked the Chinese search bot Baiduspider from accessing any content on my site. I made the decision to do so for the following reasons.
Reasons for deciding to block
Approximately every 20th request to my server was from a Baidu bot. This is impolite behavior. Baidubot accounts for 5% of my site's bandwidth usage.
I make a lot of effort to keep the resources on my site small and leverage technology such as browser caching in order to make small wins in speed and bandwidth. It is logical to consider freeing up that 5% by blocking Baidubot.
The possibility of losing some Chinese traffic is an acceptable risk to the business, as the site's content is geographically specific to the UK, there is no Chinese-language version, and the revenue is generated from advertising targeted at the UK market.
So I hope Su' and others concerned about xenophobia will understand that this decision is a cool-headed response to an impolite number of requests.
Method
Baiduspider accesses my server using many different IP addresses but these addresses do fall inside certain ranges. So my .htaccess file now contains the following lines:
order allow,deny
allow from all
# Block access to Baiduspider
deny from 180.76.5.0/24 180.76.6.0/24 123.125.71.0/24 220.181.108.0/24
The bottom line describes 4 IP ranges in which I know Baiduspider, and only Baiduspider, accesses my server. Each of the 4 ranges is 256 consecutive addresses (1024 in total). Please note, the syntax for the IP ranges on the deny from... line can be very confusing if you have not read up on CIDR notation. Just understand that /24 means the first 24 bits of the address are fixed and the last 8 vary, giving a 256-address range, so 180.76.5.0/24 actually means every IP address between 180.76.5.0 and 180.76.5.255. Yeah, not particularly obvious! But if you want to learn why, or you just enjoy feeling confused, go to www.mediawiki.org/wiki/Help:Range_blocks
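To make that arithmetic concrete, here is the /24 calculation spelled out (nothing Baidu-specific, just standard CIDR):
An IPv4 address has 32 bits.
/24 fixes the first 24 bits (the network part, here 180.76.5).
That leaves 32 - 24 = 8 host bits, so 2^8 = 256 addresses.
Hence 180.76.5.0/24 covers 180.76.5.0 through 180.76.5.255.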
Summary
The internet should be free, open and fair. But that means organisations like Baidu learning to obey robots.txt and being less greedy with the frequency of their crawls. My solution involves tinkering with very powerful settings, so before you mess around with the .htaccess file, be sure to back up the original, ready to roll back if you take down your server in a blaze of glory. Proceed at your own risk.
WordPress solution (not the best, but it helps)
Same problem with the Baidu spider, so aggressive that my box showed a load average over 35 in top. Obviously even a fast computer cannot handle outside requests effectively with the load running at 35.
I traced the IPs (from that university building?) to several hundred addresses, mainly with two user agents.
The direct consequence? As I have a cloud server, I had to upgrade it to more memory in order to get a decent response.
Previous answer:
#Baiduspider
User-agent: Baiduspider
Disallow: /
Baidu seems totally unable to respect the robots.txt directives.
What I did:
I installed the WP-Ban plugin for WordPress (free) and banned the following:
User agents:
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Furthermore, using WP Super Cache I redirect the corresponding error page to a static page, so the whole WordPress installation does not hit the MySQL database, at least not for the banned user agents.
(This is standard WordPress stuff, so anyone able to install a WordPress plugin can do it; no coding or FTP access is required for this procedure.)
I agree with everyone: the Internet is free, and banning anyone or anything should be the absolute last resort, but Baidu currently costs me an extra USD 40/month just to spider a website written in Portuguese, and I doubt there are many Chinese visitors able to read and understand that language.
You can use the following directive in robots.txt to disallow the crawling of your site.
# robots.txt
User-agent: Baiduspider
Disallow: /
However, crawlers may decide to ignore the content of your robots.txt. Moreover, the file can be cached by search engines and it takes time before changes are reflected.
The most effective approach is to use your server's capabilities. Add the following rule to your nginx.conf file to block Baidu at the server level:
if ($http_user_agent ~* ^Baiduspider) {
    return 403;
}
Remember to restart or reload Nginx in order to apply the changes.
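A slightly more robust sketch, assuming you can edit the http {} context of nginx.conf: dropping the leading ^ anchor also catches the newer "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)" user agent quoted in another answer above, and a map keeps the check out of individual location blocks.
# The map must sit in the http {} context; $is_baidu is a hypothetical variable name.
map $http_user_agent $is_baidu {
    default 0;
    ~*baiduspider 1;   # case-insensitive match anywhere in the UA string
}

server {
    # ... your existing listen / server_name / root directives ...
    if ($is_baidu) {
        return 403;
    }
}
After editing, nginx -t checks the syntax and nginx -s reload (or your distribution's service reload command) applies the change without dropping existing connections.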
You can block by IP address using the ngx_http_access_module of nginx. To block a single IP you can add a line to the conf file like
deny 12.34.56.1;
To block a range, use CIDR notation: 12.34.56.0/24 is the 24-bit subnet block (of 256 IP addresses) which includes the 12.34.56.1 address. For more details see, for instance, this page.
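A small sketch of how these directives combine inside a location block; the ranges here are just the Baidu ones already quoted in other answers, so adjust them to whatever shows up in your own logs. nginx evaluates allow/deny rules in order and stops at the first match:
location / {
    deny 180.76.5.0/24;     # Baiduspider range (taken from the answers above)
    deny 123.125.71.0/24;   # another Baiduspider range
    allow all;              # everyone else gets through
}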
In your robots.txt add
#Baiduspider
User-agent: Baiduspider
Disallow: /
#Yandex
User-agent: Yandex
Disallow: /