Is it possible to slow the Baiduspider crawl frequency?
Much has been made of the Baidu spider crawl frequency. It's true: "Baiduspider crawls like crazy."
I've experienced this phenomenon at sites I work with. In at least one instance, I've found that Baiduspider crawls at about the same frequency as Googlebot, despite the fact that Baidu delivers about 0.1% as much traffic as Google.
I'd like to keep those visits on my site, as few as they are (maybe one day they'll grow?), but I can't justify allowing such a heavy load on my server.
The accepted answer to the question linked above suggests Baidu Webmaster Tools offers the opportunity to limit crawl rate, but I'm hesitant to open up that (Chinese-only) can of worms.
Does anybody have any experience limiting Baiduspider crawl rate with BWT? Is there another way to limit this load?
After a lot of research and experimentation with this, I finally bit the bullet and set up a Baidu Webmaster Tools account. It's quite straightforward to use when armed with Google Translate in another window. You may need to have Firebug activated in order to copy and paste Chinese text from buttons that you can't capture in normal browser mode.
Once you have set it up, you need to wait a few days for crawling data to appear, and then you can customize the crawl rate. It appears in a section called "Pressure", which you should be able to reach with this URL: zhanzhang.baidu.com/pressure/adjust?site=http%3A%2F%2Fwww.yourURL.com%2F (Note that you will only be able to use this URL if you have a Baidu Webmaster Tools account set up and have associated the URL of the website in question with your account.) Here you will see a slider with your current crawl rate in the center (in my case 12,676 requests per day). Slide it to the left to reduce the crawl rate.
I have no idea yet whether it actually respects your request. It gives you a warning which says something like this: "We recommend that you use the default Baidu crawl rate for your site. Only if your website has problems with our crawling should you use this tool to adjust it. To maintain normal crawling of your site, Baidu will weigh your crawl-rate adjustment against actual site conditions and therefore cannot guarantee to adjust according to your request."
Great question, and one many webmasters might be interested in since the Baidu spider is notoriously aggressive and can zap resources from servers...
As indicated in Baidu's Web Search news, the Baidu spider does not support the Crawl-delay directive in robots.txt; instead it requires you to register and verify your site with its Baidu Webmaster Tools platform, as stated here on its site.
This appears to be the only option to control the crawling frequency directly with Baidu.
The problem is that other spam bots use Baidu's user-agents (listed here under number 2) to spider your site, as indicated in their FAQ here under number 4. So requesting a slower crawl rate with Baidu may not solve everything.
Therefore, if you do decide to use Baidu's Webmaster Tools, it might be wise to also compare its user-agents with IPs known to be associated with them, by using a resource like the Bots vs Browsers Database or by doing a reverse DNS lookup, as sketched below.
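As a rough illustration of that reverse DNS check, here is a minimal sketch in Python using only the standard library. The *.baidu.com / *.baidu.jp hostname suffixes reflect what Baidu has published for genuine Baiduspider hosts, but treat them (and the example IP) as assumptions to verify against Baidu's current documentation:

import socket

def is_genuine_baiduspider(ip):
    # Reverse lookup: a genuine Baiduspider IP should resolve to a *.baidu.com or *.baidu.jp host.
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not (host.endswith(".baidu.com") or host.endswith(".baidu.jp")):
        return False
    # Forward-confirm: the hostname must resolve back to the original IP.
    try:
        _, _, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return ip in addresses

# Example: check an IP taken from your access log (this address is purely illustrative).
print(is_genuine_baiduspider("180.76.15.5"))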
The only other options are to either block all Baidu user-agents, and thus sacrifice potential traffic from Baidu, or attempt to limit excessive requests using something like mod_qos for Apache (an example configuration is sketched after the list below), which claims to manage:
The maximum number of concurrent requests to a location/resource (URL) or virtual host.
Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second (special request conditions).
It can also "detect" very important persons (VIP) which may access the web server without or with fewer restrictions.
Generic request line and header filter to deny unauthorized operations. Request body data limitation and filtering (requires mod_parp).
Limitations on the TCP connection level, e.g., the maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
Prefers known IP addresses when server runs out of free TCP connections.
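For illustration, here is a minimal sketch of what those two approaches might look like in an Apache configuration. The directive names come from mod_rewrite and the mod_qos documentation, but the values are arbitrary examples rather than recommendations, and mod_qos must be installed and loaded for the second part to take effect:

# Option 1: refuse requests from anything claiming to be Baiduspider
# (sacrifices all Baidu traffic; catches spoofing bots as well).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule . - [F,L]

# Option 2: throttle rather than block, via mod_qos.
<IfModule qos_module>
    # Cap concurrent TCP connections from any single source IP.
    QS_SrvMaxConnPerIP 10
    # Cap concurrent requests to any URL on the site at 50.
    QS_LocRequestLimitMatch "^/.*$" 50
</IfModule>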
I haven't found any reported experiences with Baidu Webmaster Tools, which is slow to load and has translation issues (there is no English version either). Such reports might be helpful, but they would be opinion-based, of course.
Yes, you can use the Crawl-delay directive in robots.txt to set the number of seconds to wait between successive requests to the same server.
User-agent: Baiduspider
Crawl-delay: 100
The first line tells only the Baidu web crawler to honor the rule. The second line is the time to wait, in seconds, between requests to the server. You can set whatever delay suits your needs.
You will need to add these lines to your existing robots.txt file. If you don't already have a robots.txt file, add the code above to a text file, save the file as robots.txt and upload it to the root folder of your website, so that it appears at the address below:
examplesite.com/robots.txt