Understanding the maximum hit-rate supported by a web-server

@Heady270

Posted in: #CrawlRate #WebCrawlers #Webserver

I would like to crawl a publicly available site (one that is legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program sends the server a new HTTP request 8 times per second. At that rate, I estimate it would take about 60 full days of crawling to obtain the full set of data I need.

While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that inconveniences the site's regular traffic. What I'd like to understand is this: how heavy a load is 8 hits per second for the server I'm crawling? Could I run 4 instances of my crawler in parallel, roughly quadrupling the rate, and bring the total effort down to just 15 days instead of 60?

How do you find the maximum hit rate a web server supports? What would be the theoretical (and ethical) upper limit on the crawl rate, so as not to adversely affect the server's routine traffic?
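For reference, here is a minimal sketch of the kind of throttled, single-threaded crawler I have in mind. The interval and user-agent string are placeholders (an interval of 0.125 s would reproduce my current 8 requests per second), and the requests library usage is illustrative rather than my exact code:

```python
import time
import requests

# Placeholder values -- the real interval should come from robots.txt or the site owner.
REQUEST_INTERVAL = 2.0  # seconds between requests; 0.125 would match 8 requests/second
USER_AGENT = "personal-project-crawler (contact: me@example.com)"  # hypothetical contact info

def handle(response):
    # Stub: parse and store the page here.
    pass

def crawl(urls):
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    for url in urls:
        started = time.monotonic()
        handle(session.get(url, timeout=30))
        # Sleep off the remainder of the interval so the average rate stays bounded.
        time.sleep(max(0.0, REQUEST_INTERVAL - (time.monotonic() - started)))
```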


1 Answer


@Nimeshi995

There is no standard; you have to consider how robust the site is. A site spread across multiple servers with lots of bandwidth will be taxed far less than a single server with limited bandwidth. Also keep in mind that some sites pay for the bandwidth they actually use, rather than a set amount rated per second (as we are used to with DSL and cable), so your crawl can add a significant cost for the site owner.

Now for the bad news. A human visitor averages roughly one request every 0.8 seconds, and Google and Bing state that they try not to make more than one request every 2 seconds. Some systems also block users with high access rates: if your spidering is spotted and the webmaster decides it is inappropriate, you can be blocked extremely quickly and effectively.
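When a server does push back, it often does so with HTTP 429 (Too Many Requests) or 503 responses before resorting to an outright block, and a crawler that honours those signals (and any Retry-After header) is much less likely to end up blacklisted. A minimal sketch, assuming the Python requests library and made-up retry parameters:

```python
import time
import requests

def polite_get(session, url, base_delay=2.0, max_retries=5):
    """Fetch url, backing off whenever the server signals overload (429/503)."""
    for attempt in range(max_retries):
        response = session.get(url, timeout=30)
        if response.status_code not in (429, 503):
            return response
        retry_after = response.headers.get("Retry-After")
        # Honour a numeric Retry-After header if present, otherwise back off exponentially.
        wait = int(retry_after) if retry_after and retry_after.isdigit() else base_delay * 2 ** attempt
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```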

The best policy is to check the robots.txt file for a crawl-rate hint (such as a Crawl-delay directive) and respect the wishes of the site owner. Better still, contact the site owner and ask for permission. Just because the site is publicly available does not mean you can abuse it: there is no assumption under U.S. law that gives you automatic rights to a network or the resources it offers, which means a civil case can be filed over abuses that can be clearly documented.
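Those robots.txt hints (Crawl-delay and the less common Request-rate) are non-standard and not every site publishes them, but Python's standard library can read them directly. A short sketch, with the site URL and user-agent token as placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

agent = "personal-project-crawler"             # placeholder user-agent token
if not rp.can_fetch(agent, "https://example.com/some/page"):
    raise SystemExit("robots.txt disallows this path for this user-agent")

print("crawl-delay:", rp.crawl_delay(agent))     # seconds, or None if not specified
print("request-rate:", rp.request_rate(agent))   # RequestRate(requests, seconds), or None
```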

I can tell you this: most sophisticated spider blocks will not tolerate anything close to your current request rate, let alone four times it. You would be blocked, and a manual process would be needed to remove the block.

I will not even get into copyright infringement and your lack of rights to the original content. Get permission; you may well get what you want.
