Controlling robot crawling concurrency using robots.txt and the undocumented setting "host-load"

@Sims2060225

Posted in: #ConcurrentUsers #CrawlRate #Googlebot #RobotsTxt #WebCrawlers

I'm not much of a network guy, and I am troubleshooting an issue where one of our sites intermittently becomes unresponsive to a reverse proxy: it simply starts refusing connections, and then after 30 minutes or so everything is fine again.

The server doesn't appear overloaded and responds just fine to loopback traffic, but it is apparently refusing connections from the proxy...

This will likely be a multifaceted problem. Looking at the server as a whole, I noted we do get a lot of traffic from external web crawlers, and even more so from an internal Google Search Appliance (GSA).

I wanted to explore whether the degree of parallelism in the crawling is the problem. I am aware of the crawl-delay setting, which would help reduce overall traffic, but it would also reduce the frequency of crawling and delay indexing. It doesn't seem like the best way to control load from crawlers anyway.
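For reference, the crawl-delay syntax I mean is roughly the following (just a sketch; as far as I know Googlebot itself ignores Crawl-delay and Google's crawl rate is adjusted elsewhere, though some other crawlers do honor it; the 10-second value is only an example):

    User-agent: *
    # ask compliant crawlers to wait roughly 10 seconds between requests
    Crawl-delay: 10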

At any rate, the frequency of requests should not be the problem... If Google wants to crawl constantly, all day long, every second, that's fine. As long as it doesn't hold too many connections open at once, it shouldn't really impact our ability to service other connections.

It's hard to tell with netstat how many concurrent Googlebots are originating connections, as the only established connections I see at a given time are the ones that take a little longer to finish. At any rate, I'm not seeing more than two active connections at a time, plus many more in TIME_WAIT (i.e., Googlebot finished, the connection was closed, and the socket lingers on our side for a few minutes before being released... default TCP stuff).
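For what it's worth, here's roughly how I've been snapshotting it (a sketch assuming a Linux box serving HTTP on ports 80/443; adjust the ports for your own setup, and note the grouping by remote address only handles IPv4):

    # Tally connection states for traffic hitting the web ports
    netstat -tn | awk '$4 ~ /:(80|443)$/ {print $6}' | sort | uniq -c

    # Remote addresses holding the most ESTABLISHED connections
    netstat -tn | awk '$4 ~ /:(80|443)$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' \
        | sort | uniq -c | sort -rn | head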

Then there's this: Robots.txt Q&A with Matt Cutts

This page talks about an apparently undocumented setting (not part of the Robots Exclusion Protocol) called host-load. In theory, this setting should let me specify how many Googlebots may connect concurrently, which is perfect, because I could use one mechanism to tell our GSA and any other Googlebot-like crawlers hitting us the degree of parallelism we can handle...
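If it works the way the Q&A implies, I'd expect the syntax to mirror crawl-delay, something like the sketch below. To be clear, this is purely my guess: host-load isn't part of the Robots Exclusion Protocol, and I can't find any documentation that Googlebot or the GSA actually parses such a line:

    User-agent: Googlebot
    # hypothetical directive: allow at most 2 concurrent connections from this crawler
    Host-load: 2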

So that seems cool. However, since apparently no one but Matt Cutts (a Google engineer heading up their webspam team) has even mentioned this setting, I was curious whether anyone else is using it.

If I knew, for instance, that the default host-load for Googlebot is 2, I could completely rule out parallel web crawlers as part of the problem. So that's one question. The fact that he used 2 as the example in the Q&A seems to imply that 2 is the default host-load, and it correlates with what I've seen from netstat.

The bigger question is:

Is anyone aware of a list or reference of extended robots.txt properties that aren't formalized in the Robots Exclusion Protocol? I would assume that with so many crawlers out there, there would be all kinds of proprietary settings.


2 Comments


@Welton855

You can use the "report a problem with Googlebot" link in Webmaster Tools to let the Googlebot team know about your crawling preferences. You'll find it in the site's dashboard under the gear icon (top right): "Site settings", then "Crawl rate", then "Learn more." They may sometimes be able to tweak things, or it might make sense to just keep it on automatic.



 

@BetL925

ANY crawler can identify itself as "Googlebot" or any other search engine bot via its user agent, so the reason could be that some crawlers are "spamming" your site. I'd still advise you to set a crawl delay in your robots.txt for all bots.
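One way to check whether traffic claiming to be Googlebot really comes from Google is a reverse DNS lookup followed by a forward lookup to confirm (this is the method Google documents). A rough sketch, using an example IP you might pull from your access logs:

    # Reverse lookup: a genuine Googlebot IP resolves to a name
    # ending in googlebot.com or google.com
    host 66.249.66.1

    # Forward lookup of that name should return the same IP
    host crawl-66-249-66-1.googlebot.com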
