Pope3001725

If .htaccess is used to block my bot from accessing a particular directory, will I know this?

@Pope3001725

Posted in: #Htaccess #WebCrawlers

I'm working on a research project and I have a question.

Say I would like to crawl all pages of a given site. If my bot is blocked from accessing a certain portion of the site, I would need to know for sure that it has been blocked, and that at least one portion of the site has not been crawled. Is this technically feasible under the current protocol? In other words, I don't want my bot to be blocked in a deceptive manner, which would lead me to believe that the entire site has been crawled when in fact it hasn't.


@Cody1181609

"I would need to know for sure that it has been blocked.... I don't want my bot to be blocked in a deceptive manner...."


It's not really possible to know "for sure" (i.e. with 100% certainty) whether your bot has been blocked if it has been blocked in a "deceptive manner".

The site could theoretically return a 200 OK status and what looks like a valid response body, yet you would still have been "blocked" from seeing the intended content. To detect this kind of "block", you could perhaps compare the response you get with a "known valid response" for a "non-blocked" request. But how do you get that "known valid response", and what if the expected response is dynamic in nature?
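As a rough illustration of that comparison idea, here is a minimal Python sketch. It assumes you have somehow obtained a baseline response for the same URL (for example from a browser-like client on a different network), which is exactly the hard part. The names `Response` and `classify`, the status codes treated as explicit blocks, and the 0.9 similarity threshold are all hypothetical choices for illustration, not an established method.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Response:
    """Just the parts of an HTTP response this sketch looks at."""
    status: int
    body: str


def classify(bot: Response, baseline: Response, threshold: float = 0.9) -> str:
    """Guess whether the bot's response differs from a trusted baseline.

    The result is only a heuristic label, never proof either way.
    """
    # An honest block usually announces itself through the status code.
    if bot.status in (401, 403, 429):
        return "blocked-explicit"
    # A different status than the baseline is suspicious but ambiguous.
    if bot.status != baseline.status:
        return "status-mismatch"
    # Same status: compare the bodies. autojunk=False keeps the ratio
    # meaningful for long pages, at the cost of slower comparison.
    matcher = difflib.SequenceMatcher(None, bot.body, baseline.body, autojunk=False)
    if matcher.ratio() >= threshold:
        return "looks-consistent"
    return "possibly-cloaked"


if __name__ == "__main__":
    baseline = Response(200, "<html>" + "real article text " * 20 + "</html>")
    decoy = Response(200, "<html>nothing here</html>")
    print(classify(decoy, baseline))
    print(classify(Response(403, ""), baseline))
```

Even under these assumptions, a "looks-consistent" result only means the two responses resemble each other. Dynamic pages will produce false alarms, and a site that serves the same decoy to both clients will go unnoticed.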

Google must do something of this nature to detect "cloaked" responses (when Googlebot is served something different from what an ordinary user sees), but I very much doubt this is 100% reliable.


"If .htaccess is used ..."


Why the mention of .htaccess? I would have thought that the exact method used to block the bot is irrelevant. But anyway, you could still block a bot "deceptively" with .htaccess alone.
