XinRu657

Baiduspider is crawling my site even when forbidden by robots.txt, how do I prevent it?

@XinRu657

Posted in: #Apache #Htaccess #RobotsTxt

My site is getting heavy traffic from a bot. I checked access_log and found that Baiduspider hits my site 10-20 times per minute. I do not need Chinese traffic. I searched and read www.baidu.com/search/robots_english.html

I added a rule to robots.txt and then restarted Apache, but it doesn't work. Baiduspider still crawls my site.

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /xmlrpc.php


I found their feedback page, zhanzhang.baidu.com/feedback/index. I can translate the page into my language, but I cannot read and enter the captcha.

Then I searched and found this article: www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html But when I add its rule to .htaccess, I can no longer access my site ("you do not have permission to access this site"). Did I insert it in the wrong position? I need some help.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC,OR]
RewriteRule ^.* - [F,L]
#some custom rewrite rule
RewriteRule ^article/([^/.]+)/?$ /article/.php [L,QSA]

RewriteRule ^(.*)$ www.example.com/ [L,R=301]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>


BTW, my server is CentOS 7 with Apache 2.4.6. I also tried "httpd.conf", but I could not find any article that covers <IfModule setenvif_module> on Apache 2.4.6; all the articles use <IfModule mod_setenvif.c>... Apache 2.4 also replaces the old Order allow,deny rules, so I have no idea how to adapt those examples for my httpd.conf.

Anyway, I just want to block Baiduspider. Thanks.


2 Comments


 

@Heady270

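I think the problem with your rewrite rule is the OR flag. That flag tells Apache to combine the condition with a second rewrite condition that follows it, and you only have one condition. With a trailing OR and nothing after it, the rule can end up applying to every request, which would explain why you were locked out of your own site.

Here is a site that provides a similar rule for blocking Baiduspider with slightly different syntax: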

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
RewriteRule .* - [F]
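
As a sketch of how such a rule might sit in the .htaccess from the question: the block below places the bot rule above the # BEGIN WordPress marker, since WordPress can rewrite whatever sits between its own markers, and it drops the OR flag. The unanchored pattern is an assumption: Baiduspider's current user agent appears to start with Mozilla/5.0 and only contain the word Baiduspider, so a pattern anchored with ^ may never match it.

```apache
# Block Baiduspider before any other rewrite rule runs.
# Assumption: the user agent contains "Baiduspider" somewhere,
# not necessarily at the start, so the pattern is left unanchored.
<IfModule mod_rewrite.c>
RewriteEngine On
# Single condition, so no OR flag; NC makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
# F answers 403 Forbidden and stops rule processing for this request.
RewriteRule ^ - [F]
</IfModule>

# BEGIN WordPress
# (existing WordPress and custom rules stay below, unchanged)
```

After this change, requests from the bot should show status 403 in access_log instead of 200.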



 

@Shelley277

You can try blocking specific IP addresses in your .htaccess file. You can find the ranges here.
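
For the Apache 2.4 setup described in the question, a minimal sketch of the same idea without mod_rewrite could look like this. It uses the Require syntax that replaces Order allow,deny in 2.4. The variable name bad_bot is made up for illustration, and 192.0.2.0/24 is a documentation placeholder, not a real Baidu range; substitute the ranges you actually want to block.

```apache
# Tag requests whose User-Agent contains "Baiduspider" (case-insensitive).
# bad_bot is a hypothetical variable name chosen for this example.
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
</IfModule>

# Apache 2.4 access control: allow everyone except tagged bots
# and the listed address range.
<RequireAll>
Require all granted
Require not env bad_bot
# Placeholder range (RFC 5737 documentation block), not a Baidu range.
Require not ip 192.0.2.0/24
</RequireAll>
```

In httpd.conf this would go inside the <Directory> block for the site; in .htaccess it only works if AllowOverride permits these directives.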

In robots.txt you can also add the following

User-agent: Baiduspider
User-agent: baiduspider
User-agent: Baiduspider+
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /


Also, if you use a caching plugin or a CDN, make sure to clear all of your caches.
