Why do many websites block requests from common HTTP libraries by user-agent?

@Pope3001725

Posted in: #Block #Filtering #Security #UserAgent #WebCrawlers

Writing spiders, I have noticed that many sites will return a 403 error if I hit them from popular HTTP software libraries, unless I manually override the default User-Agent header used by the library.
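To illustrate what I mean by overriding the header, here is a minimal sketch with the requests library (the status codes mirror the curl results below):

import requests

url = "http://www.economist.com/"

# With the library's default User-Agent (something like "python-requests/2.9.1"),
# the request is rejected:
print(requests.get(url).status_code)   # 403

# Overriding the header with essentially anything else gets through:
print(requests.get(url, headers={"User-Agent": "banana"}).status_code)   # 200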

For example, The Economist magazine blocks my requests if I use the default user agent headers of any Python HTTP library:

$ curl www.economist.com/ -A python-requests/2.9.1 --write-out "%{http_code}\n" --silent --output /dev/null
403
$ curl www.economist.com/ -A Python-urllib/2.7 --write-out "%{http_code}\n" --silent --output /dev/null
403


But if I fake a browser user agent, put in a nonsense user agent, or provide an empty user agent, they're happy to accept my request:

$ curl www.economist.com/ -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36" --write-out "%{http_code}n" --silent --output /dev/null
200
$ curl www.economist.com/ -A '' --write-out "%{http_code}n" --silent --output /dev/null
200
$ curl www.economist.com/ -A banana --write-out "%{http_code}n" --silent --output /dev/null
200


The Economist is the biggest site I've come across with this behaviour, but certainly not the only one - the behaviour seems to be common. Why? What purpose does this blocking serve from the website's perspective? Is it a (misguided and ineffective) security measure? An attempt to get more meaningful user agents from bots? (But for what purpose?) Or does something else motivate these filters?


1 Comment


@Berumen354

This is due to the number of people who embed these HTTP libraries in their own software to scrape content from other sites, often for the purpose of copyright infringement. Well-made, legitimate crawlers built for a specific purpose (such as archiver bots and search bots) send their own custom user agent strings to uniquely identify themselves. As a result, many webmasters who apply these restrictions to their own sites assume that any connection using a library's default user agent string has not been made for a legitimate purpose, and that any legitimate developer caught out by mistake will simply contact the webmaster.
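For example, a crawler that wants to be treated as legitimate typically sends a descriptive user agent that names the bot and gives the webmaster a way to contact its operator. A minimal sketch with the requests library (the bot name and contact URL are placeholders):

import requests

# Placeholder bot name and contact details -- substitute your own.
CRAWLER_UA = "ExampleSpider/1.0 (+https://example.com/bot; bot-admin@example.com)"

session = requests.Session()
session.headers.update({"User-Agent": CRAWLER_UA})

# Every request made through this session identifies the crawler,
# rather than advertising the library's default string.
response = session.get("http://www.economist.com/")
print(response.status_code)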
