Why do many websites block requests from common HTTP libraries by user-agent?

@Pope3001725

Posted in: #Block #Filtering #Security #UserAgent #WebCrawlers

Writing spiders, I have noticed that many sites will return a 403 error if I hit them from popular HTTP software libraries, unless I manually override the default User-Agent header used by the library.
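To illustrate what I mean by overriding the header, here is a minimal sketch with the requests library (the status codes mirror the curl results below):

import requests

url = "http://www.economist.com/"

# With the library's default User-Agent (something like "python-requests/2.9.1"),
# the request is rejected:
print(requests.get(url).status_code)   # 403

# Overriding the header with essentially anything else gets through:
print(requests.get(url, headers={"User-Agent": "banana"}).status_code)   # 200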

For example, The Economist magazine blocks my requests if I use the default user agent headers of any Python HTTP library:

$ curl www.economist.com/ -A python-requests/2.9.1 --write-out "%{http_code}\n" --silent --output /dev/null
403
$ curl www.economist.com/ -A Python-urllib/2.7 --write-out "%{http_code}\n" --silent --output /dev/null
403


But if I fake a browser user agent, put in a nonsense user agent, or provide an empty user agent, they're happy to accept my request:

$ curl www.economist.com/ -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36" --write-out "%{http_code}n" --silent --output /dev/null
200
$ curl www.economist.com/ -A '' --write-out "%{http_code}n" --silent --output /dev/null
200
$ curl www.economist.com/ -A banana --write-out "%{http_code}n" --silent --output /dev/null
200


The Economist is the biggest site I've come across with this behaviour, but certainly not the only one - the behaviour seems to be common. Why? What purpose does this blocking serve from the website's perspective? Is it a (misguided and ineffective) security measure? An attempt to get more meaningful user agents from bots? (But for what purpose?) Or does something else motivate these filters?


1 Comment


@Berumen354

This is due to the number of people who embed these HTTP libraries in their own software to scrape content from other sites, often for the purpose of copyright infringement. Well-made, legitimate crawlers built for a specific purpose (such as archiver bots and search bots) send their own custom user agent strings to uniquely identify themselves. As a result, many webmasters who apply these restrictions to their own sites assume that any connection using a library's default user agent string has not been made for a legitimate purpose, and that any legitimate developer caught out by mistake will simply contact the webmaster.
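For example, a crawler that wants to be treated as legitimate typically sends a descriptive user agent that names the bot and gives the webmaster a way to contact its operator. A minimal sketch with the requests library (the bot name and contact URL are placeholders):

import requests

# Placeholder bot name and contact details -- substitute your own.
CRAWLER_UA = "ExampleSpider/1.0 (+https://example.com/bot; bot-admin@example.com)"

session = requests.Session()
session.headers.update({"User-Agent": CRAWLER_UA})

# Every request made through this session identifies the crawler,
# rather than advertising the library's default string.
response = session.get("http://www.economist.com/")
print(response.status_code)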
