Is it possible to slow the Baiduspider crawl frequency?
Much has been made of the Baidu spider crawl frequency. It's true: "Baiduspider crawls like crazy."
I've experienced this phenomenon at sites I work with. In at least one instance, I've found that Baiduspider crawls at about the same frequency as Googlebot, despite the fact that Baidu delivers about 0.1% as much traffic as Google.
I'd like to keep those visits on my site, as few as they are (maybe one day they'll grow?), but I can't justify allowing such a heavy load on my server.
The accepted answer to the question linked above suggests Baidu Webmaster Tools offers the opportunity to limit crawl rate, but I'm hesitant to open up that (Chinese-only) can of worms.
Does anybody have any experience limiting Baiduspider crawl rate with BWT? Is there another way to limit this load?
After a lot of research and experimentation with this, I finally bit the bullet and set up a Baidu Webmaster Tools account. It's quite straightforward to use when armed with Google Translate in another window. You may need to have Firebug activated in order to copy and paste Chinese text from buttons that you can't capture in normal browser mode.
Once you have set it up, you need to wait a few days for crawling data to appear, and then you can customize the crawl rate. It appears in a section called "Pressure", which you should be able to reach with this URL: zhanzhang.baidu.com/pressure/adjust?site=http%3A%2F%2Fwww.yourURL.com%2F (Note that you will only be able to use this URL if you have a Baidu Webmaster Tools account set up and have associated the URL of the website in question with your account.) Here you will see a slider with your current crawl rate in the center (in my case 12,676 requests per day). Slide it to the left to reduce the crawl rate.
I have no idea yet whether it actually respects your request. It gives you a warning which says something like this: "We recommend that you use the default Baidu crawl rate for your site. Only if your website has problems with our crawling should you use this tool to adjust it. To maintain normal crawling of your site, Baidu will weigh your crawl-rate adjustment against actual site conditions and therefore cannot guarantee to adjust according to your request."
Great question, and one many webmasters might be interested in since the Baidu spider is notoriously aggressive and can zap resources from servers...
As indicated in Baidu's Web Search news, the Baidu spider does not support the Crawl-delay directive in robots.txt; instead it requires you to register and verify your site with its Baidu Webmaster Tools platform, as stated here on its site.
This appears to be the only option to control the crawling frequency directly with Baidu.
The problem is that other spam bots use Baidu's user-agents (listed here under number 2) to spider your site, as indicated in their FAQ here under number 4. So requesting a slower crawl rate with Baidu may not solve everything.
Therefore, if you do decide to use Baidu's Webmaster Tools, it might be wise to also compare its user-agents with IPs known to be associated with them, by using a resource like the Bots vs Browsers Database or by doing a reverse DNS lookup, as sketched below.
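As a rough illustration of that reverse DNS check, here is a minimal sketch in Python using only the standard library. The *.baidu.com / *.baidu.jp hostname suffixes reflect what Baidu has published for genuine Baiduspider hosts, but treat them (and the example IP) as assumptions to verify against Baidu's current documentation:

import socket

def is_genuine_baiduspider(ip):
    # Reverse lookup: a genuine Baiduspider IP should resolve to a *.baidu.com or *.baidu.jp host.
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not (host.endswith(".baidu.com") or host.endswith(".baidu.jp")):
        return False
    # Forward-confirm: the hostname must resolve back to the original IP.
    try:
        _, _, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return ip in addresses

# Example: check an IP taken from your access log (this address is purely illustrative).
print(is_genuine_baiduspider("180.76.15.5"))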
The only other options are to either block all Baidu user-agents, and thus sacrifice potential traffic from Baidu, or attempt to limit excessive requests using something like mod_qos for Apache (an example configuration is sketched after the list below), which claims to manage:
The maximum number of concurrent requests to a location/resource (URL) or virtual host.
Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second (special request conditions).
It can also "detect" very important persons (VIP) which may access the web server without or with fewer restrictions.
Generic request line and header filter to deny unauthorized operations. Request body data limitation and filtering (requires mod_parp).
Limitations on the TCP connection level, e.g., the maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
Prefers known IP addresses when server runs out of free TCP connections.
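For illustration, here is a minimal sketch of what those two approaches might look like in an Apache configuration. The directive names come from mod_rewrite and the mod_qos documentation, but the values are arbitrary examples rather than recommendations, and mod_qos must be installed and loaded for the second part to take effect:

# Option 1: refuse requests from anything claiming to be Baiduspider
# (sacrifices all Baidu traffic; catches spoofing bots as well).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule . - [F,L]

# Option 2: throttle rather than block, via mod_qos.
<IfModule qos_module>
    # Cap concurrent TCP connections from any single source IP.
    QS_SrvMaxConnPerIP 10
    # Cap concurrent requests to any URL on the site at 50.
    QS_LocRequestLimitMatch "^/.*$" 50
</IfModule>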
I haven't found any reported experiences with Baidu Webmaster Tools, which is slow to load and has translation issues (there is no English version either). Such reports might be helpful, but they would be opinion-based, of course.
Yes, you can use the Crawl-delay directive in robots.txt to set the number of seconds to wait between successive requests to the same server.
User-agent: Baiduspider
Crawl-delay: 100
The first line tells only the Baidu web crawler to honor the rule. The second line is the time to wait, in seconds, between requests to the server. You can set whatever delay suits your needs.
You will need to add these lines to your existing robots.txt file. If you don't already have a robots.txt file, add the code above to a text file, save the file as robots.txt and upload it to the root folder of your website, so that it appears at the address below:
examplesite.com/robots.txt