
What do I need to be aware of when running a web crawler?

@Shanna517

Posted in: #CrawlRate #WebCrawlers

I am getting into more holistic data analysis and would like to use a web crawler to extract data from websites for use in my analytics. To be clear, I don't want to mirror data and republish it; at most I'd be aggregating it for proprietary use.

I imagine web crawling affects traffic differently than normal users do:


Is it a significant strain on the host?
Do hosts notice web crawlers accessing their pages, and do they cause problems?
What is the industry perception of web crawlers: are they malicious, annoyances, or reasonable utilities?
Are there any rules governing their use, or any industry faux pas to avoid?




@Berumen354

Is it a significant strain on the host?
It depends on how much data you crawl, how often you crawl it, and what the host would consider a significant strain. Search engine spiders crawl site content very frequently. As long as they do so safely and follow industry best practices, such as limiting the number of simultaneous bots running against a single site, they will show up in the logs but the server is unlikely to suffer from the crawling.
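As a rough illustration of keeping the load on any one host low, here is a minimal sketch of a per-host throttle. The class name `HostThrottle` and the 5-second default are assumptions made for this example, not an industry standard; pick a delay that suits the sites you crawl.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; the default delay is an assumption.
    """

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time of this request for the next call.
        self._last_request[host] = self._clock()
        return waited
```

Call `throttle.wait(url)` immediately before each request. Requests to different hosts are not delayed by one another, while repeated requests to the same host are spaced at least `min_delay` seconds apart.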

Do hosts notice web crawlers accessing their pages, and do they cause problems?
Hosts can see every connection made to their site in their web server logs. In addition, many sites use an analytics product such as Google Analytics to monitor traffic, and these services often flag unusual traffic, such as a search engine spider or web scraper that has swept the site. In some cases there are no issues and webmasters don't bat an eyelid, but that is generally when the crawler is a legitimate search engine spider updating its index. Other crawlers are generally disliked as unnecessary.

What is the industry perception of web crawlers: are they malicious, annoyances, or reasonable utilities?
This depends on the nature of the crawler. Search engine spiders are accepted as a necessary evil and a reasonable utility to websites. Private crawlers not affiliated with a recognized search engine, however, often raise eyebrows, because a malicious user could be using a crawler to look for vulnerabilities to exploit.

In short: be careful about your crawling and indexing practices, make sure the sites you index allow it for your use case, and don't crawl any one site too frequently.
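One common way a site states what it allows is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check whether a URL may be fetched and whether the site asks for a crawl delay. The user agent name `ExampleAnalyticsBot`, the sample rules, and the 5-second fallback are assumptions for illustration. Note that robots.txt only covers crawling rules; a site's terms of use may restrict your use case further.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAnalyticsBot"  # hypothetical crawler name

# Sample rules for illustration; a real crawler would download
# the robots.txt file from the target site instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_policy(parser, url, user_agent=USER_AGENT, default_delay=5.0):
    """Return (allowed, delay_seconds) for `url` under the parsed rules."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None if the site does not set one
    return allowed, float(delay) if delay is not None else default_delay


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(crawl_policy(rules, "https://example.com/private/report"))
print(crawl_policy(rules, "https://example.com/articles/1"))
```

For a live site, `RobotFileParser` can also fetch the file itself via `set_url()` followed by `read()`. Parsing a string here just keeps the example runnable offline.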


