How to block Baidu spiders

@Fox8124981

Posted in: #Baidu #Nginx #Traffic #WebCrawlers

Most of my visits are from Baidu spiders. I don't think it benefits my site in search engines at all, so I'm thinking of how to block them. Could this be done via iptables? I'm using nginx as my web server.


7 Comments


 

@Connie744

We just decided to block Baidu, as the traffic it sent us was negligible compared to its aggressive scanning. In addition, they now run an agent that impersonates a browser and executes JavaScript (such as Google Analytics), which messed up our statistics.

The polite version is to update your robots.txt with the following:

User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-image
Disallow: /


But considering what others have written here, and that they also crawl with a user agent that hides their presence, I would block their IP addresses altogether. The following is how it's done in nginx:

# Baidu crawlers
deny 123.125.71.0/24;
deny 180.76.5.0/24;
deny 180.76.15.0/24;
deny 220.181.108.0/24;
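For context, here is a minimal sketch of where such deny rules could sit in an nginx server block (the file path, server_name and root below are placeholders I have assumed, not part of the original answer):

# e.g. /etc/nginx/conf.d/block-baidu.conf (hypothetical location)
server {
    listen 80;
    server_name example.com;

    # Baidu crawler ranges from above; matching clients receive 403 Forbidden
    deny 123.125.71.0/24;
    deny 180.76.5.0/24;
    deny 180.76.15.0/24;
    deny 220.181.108.0/24;
    allow all;

    location / {
        root /var/www/html;
    }
}

The ngx_http_access_module checks deny/allow directives in order until the first match, so the denied ranges must come before allow all.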



 

@Miguel251

Use .htaccess with

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Baidu [NC]
RewriteRule .* - [L,F]


The "RewriteEngine On" allows you that the following lines are parsed
correctly. The HTTP_USER_AGENT is the line where the spiders are identifying
themselves. The condition is true if the line contains "MJ12bot" or "Baidu".
NC means "not case-sensitive" and you can chain conditions with OR.
The last line must not contain "OR" or the rule does not work.

Baidu is particularly nasty because it tries to read WordPress entries ("fckeditor", "wp-content") for which it has absolutely no reason. MJ12bot is also one of the bad critters.

The rewrite rule blocks the spider with a 403 Forbidden ([F]) for all files (.* is a regular expression matching any file) and stops further evaluation ([L]) of the .htaccess file.
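If mod_rewrite is not available, a comparable sketch using mod_setenvif together with Apache 2.4 authorization directives (this alternative is my assumption, not something the answer proposes):

# Tag requests whose User-Agent contains MJ12bot or Baidu (case-insensitive)
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
SetEnvIfNoCase User-Agent "Baidu" bad_bot

# Refuse tagged requests with 403 Forbidden, allow everyone else
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>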



 

@Gretchen104

I have just successfully blocked the Chinese search bot Baiduspider from accessing any content on my site. I made the decision to do so for the following reasons.

Reasons for deciding to block


Approximately every 20th request to my server was from a Baidu bot, which is impolite behavior. Baidubot accounts for 5% of my site's bandwidth usage.
I make a lot of effort to keep the resources on my site small and leverage technology such as browser caching in order to make small wins in speed and bandwidth. It is logical to consider freeing up that 5% by blocking Baidubot.
The possibility of losing some Chinese traffic is an acceptable risk to the business as the site's content is geographically specific to the UK, there is no Chinese language version and the revenue is generated from Advertising targeted at the UK market.


So I hope Su' and others concerned about xenophobia will understand that this decision is a cool-headed response to an impolite number of requests.

Method

Baiduspider accesses my server using many different IP addresses but these addresses do fall inside certain ranges. So my .htaccess file now contains the following lines:

order allow,deny
allow from all
# Block access to Baiduspider
deny from 180.76.5.0/24 180.76.6.0/24 123.125.71.0/24 220.181.108.0/24


The bottom line describes four IP ranges in which I know Baiduspider, and only Baiduspider, accesses my server. Each of the four ranges is 256 consecutive addresses (1024 in total). Please note, the syntax for the IP ranges on the deny from... line can be very confusing if you have not read up on CIDR notation. Just understand that /24 fixes the first three octets and lets the last one vary, so 180.76.5.0/24 means every IP address between 180.76.5.0 and 180.76.5.255. Not particularly obvious! If you want to learn why, or you just enjoy feeling confused, go to www.mediawiki.org/wiki/Help:Range_blocks
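Note that order/allow/deny is Apache 2.2 syntax. On Apache 2.4 with mod_authz_host, a roughly equivalent sketch (my assumption, not part of the original answer) would be:

# Apache 2.4: block the same four Baiduspider ranges, allow everything else
<RequireAll>
    Require all granted
    Require not ip 180.76.5.0/24 180.76.6.0/24 123.125.71.0/24 220.181.108.0/24
</RequireAll>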

Summary

The internet should be free, open and fair. But that means organisations like Baidu learning to obey robots.txt and being less greedy with the regularity of their crawls. My solution involves tinkering with very powerful settings, so before you mess around with the .htaccess file, be sure to back up your original, ready to roll back if you take down your server in a blaze of glory. Proceed at your own risk.



 

@Jennifer507

WordPress solution (not the best, but it helps)

Same problem with the Baidu spider, which was so aggressive that my box showed a load of over 35 in top. Obviously even a fast machine cannot handle outside requests effectively at a load of 35...

I traced the number of IPs (from that university building?) to several hundred, with mainly two user agents.

The direct consequence? As I have a cloud server, I had to upgrade it to more memory in order to keep response times decent.

Previous answer:
#Baiduspider
User-agent: Baiduspider
Disallow: /


Baidu seems totally unable to respect the robots.txt directives.

What I did:

I installed the WP-Ban plugin for WordPress (free) and banned the following:

USER AGENTS:


Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)


Furthermore, using WP Super Cache I point the corresponding error page to a static page, so the whole WordPress installation does not hit the MySQL database (or at least not for the banned user agents).

(This is standard WordPress stuff, so anybody who can install a WordPress plugin can do it; no coding or FTP access is required for this procedure.)

I agree with everyone: the internet is free, and banning whoever or whatever is absolutely the last thing anyone should do. But Baidu currently costs me USD 40 more per month, just to spider a website written in Portuguese, and I have some doubts whether there are many Chinese people and visitors able to read and understand this language.



 

@Miguel251

You can use the following directive in robots.txt to disallow the crawling of your site.

# robots.txt
User-agent: Baiduspider
Disallow: /


However, crawlers may decide to ignore the content of your robots.txt. Moreover, the file can be cached by search engines and it takes time before changes are reflected.

The most effective approach is to use your server capabilities. Add the following rule to your nginx.conf file to block Baidu at server level.

if ($http_user_agent ~* Baiduspider) {
    return 403;
}


Remember to restart or reload Nginx in order to apply the changes.
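As a rough usage sketch (the domain is a placeholder and the commands assume a systemd-based install, neither of which comes from the original answer), you can test the configuration, reload, and verify that the block works:

# Check the configuration, then reload nginx without dropping connections
sudo nginx -t
sudo systemctl reload nginx

# A request presenting Baidu's user agent should now return 403 Forbidden
curl -I -A "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" http://example.com/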



 

@Rivera981

You can block by IP address using the ngx_http_access_module of nginx. To block a single IP you can add a line to the conf file like

deny 12.34.56.1;


To block a range, use CIDR notation, like 12.34.56.0/24 for the 24-bit subnet block (of 256 IP addresses) which includes the 12.34.56.1 address. For more details see, for instance, this page.
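Putting the two forms together, a minimal sketch (the single address is the placeholder used in this answer; the /24 range is one of the Baidu ranges listed in other answers here):

# Inside an http, server or location block
deny 12.34.56.1;          # a single address
deny 220.181.108.0/24;    # a 256-address range
allow all;                # everything else stays reachable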



 

@Si4351233

In your robots.txt add
#Baiduspider
User-agent: Baiduspider
Disallow: /
#Yandex
User-agent: Yandex
Disallow: /


