Are cookies and a user agent not enough to emulate a browser?
I am building a kind of search engine for online shops, and several websites are blocking me even though I check their robots.txt.
Let's take an example. I need to get this page:
www.aliexpress.com/item/1pcs-Easy-use-Pet-Animal-Dog-Grooming-Nail-Clippers-Scissors-Trimmer-wholesale-Newest/1339057466.html?spm=2114.10010108.0.64.MGPpDO
According to www.aliexpress.com/robots.txt:
# file: robots.txt,v 1.0 2002/09/23 created by Tsing Kong
# alibaba.com # <URL:http://www.robotstxt.org/wc/exclusion.html#robotstxt>
# Format is:
# User-agent: <name of spider>
# Disallow: <nothing> | <path>
# -----------------------------------------------------------------------------
User-agent: *
Disallow: /search/
Disallow: /productdetail/
User-agent: Baiduspider
Disallow: /promotion/
Disallow: /wholesale?
User-agent: baiduspider
Disallow: /promotion/
So the page URL above is allowed, and I should be able to use any user agent. But the requests are still blocked.
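To make that reading of robots.txt concrete, here is a minimal sketch of the check in PHP. It is simplified on purpose: it only does plain prefix matching on Disallow lines and ignores Allow lines and wildcards. The function names `disallowedPrefixes` and `isPathAllowed` are made up for this illustration.

```php
<?php
// Simplified robots.txt check (assumption: prefix matching only,
// no Allow lines, no wildcard support).
function disallowedPrefixes(string $robotsTxt, string $agent): array
{
    $rules = [];       // lowercased agent name => disallowed path prefixes
    $current = [];     // agent names of the group currently being read
    $inRules = false;  // true once the current group has seen a rule line
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if ($inRules) {        // a User-agent line after rules starts a new group
                $current = [];
                $inRules = false;
            }
            $current[] = strtolower($value);
        } elseif ($field === 'disallow') {
            $inRules = true;
            foreach ($current as $name) {
                if (!isset($rules[$name])) {
                    $rules[$name] = [];
                }
                if ($value !== '') {   // an empty Disallow forbids nothing
                    $rules[$name][] = $value;
                }
            }
        }
    }
    // A group naming this agent takes precedence over the "*" group.
    $agent = strtolower($agent);
    foreach ($rules as $name => $prefixes) {
        $name = (string) $name;
        if ($name !== '*' && $name !== '' && strpos($agent, $name) !== false) {
            return $prefixes;
        }
    }
    return $rules['*'] ?? [];
}

function isPathAllowed(string $robotsTxt, string $agent, string $path): bool
{
    foreach (disallowedPrefixes($robotsTxt, $agent) as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

With the rules quoted above, a path starting with /item/ passes this check for any agent name that is not Baiduspider, while /search/ and /productdetail/ do not.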
I've also tried the following, without luck.
I set a Chrome user agent:
curl_setopt ( $curl, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);
I set cookies:
curl_setopt ( $curl, CURLOPT_HTTPHEADER, array (
    // each header must be passed as a quoted string
    'Cookie: NAME1=OPAQUE_STRING1; NAME2=OPAQUE_STRING2ali_apach'
) );
I also save the cookies from each response in the database, so I can send them back with the next request.
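Roughly, the save-and-resend step looks like the sketch below. It assumes the response header lines have already been collected into an array, and it keeps only the name=value part of each Set-Cookie line, ignoring Path, Domain, Expires and similar attributes. The function names `extractCookies` and `buildCookieHeader` are illustrative, not from any library.

```php
<?php
// Collect name=value pairs from the Set-Cookie lines of a response.
// Cookie attributes (Path, Domain, Expires, ...) are ignored in this sketch.
function extractCookies(array $headerLines, array $jar = []): array
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the part before the first ';' (the name=value pair).
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $jar[trim($name)] = trim($value);   // later values overwrite earlier ones
    }
    return $jar;
}

// Build the Cookie request header to send with the next request.
function buildCookieHeader(array $jar): string
{
    $pairs = [];
    foreach ($jar as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}
```

The resulting string can be passed in the CURLOPT_HTTPHEADER array. cURL can also manage this by itself through CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE, which store and replay cookies from a file instead of a database.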
I've changed the IP address multiple times. I am using Amazon EC2 (Elastic Beanstalk) to do so.
What am I doing wrong?
More posts by @Merenda212
1 Comment
Here is a start:
Websites do not like being spidered and having their resources wasted.
You are using Amazon, which has an absolutely terrible reputation for abusive activities.
You are emulating a browser, which is being dishonest about who you are.
You are clearly a bot, yet you do not provide an agent name that sites can use to block access, nor an informational URL within the agent ID.
You may be accessing the site excessively.
You are assuming that you have the right to access the site, rather than assuming that you need to behave nicely.
Expect to be blocked.
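As a rough illustration of two of the points above (an honest agent ID with an informational URL, and not hitting the site excessively), here is a minimal PHP sketch. The bot name, the URL and the five-second delay are placeholders chosen for this example, and the helper names `botUserAgent` and `secondsToWait` are hypothetical. Doing this does not guarantee that a site will let you in.

```php
<?php
// Build an agent ID that names the bot and points to a page describing it.
// The "Name/version (+URL)" layout is a common convention, not a requirement.
function botUserAgent(string $name, string $version, string $infoUrl): string
{
    return sprintf('%s/%s (+%s)', $name, $version, $infoUrl);
}

// Return how many seconds to pause so that requests to one site
// are at least $minDelay seconds apart.
function secondsToWait(float $lastRequestAt, float $now, float $minDelay): float
{
    return max(0.0, $minDelay - ($now - $lastRequestAt));
}

// Hypothetical usage with the cURL handle from the question:
//   curl_setopt($curl, CURLOPT_USERAGENT,
//       botUserAgent('ExampleShopBot', '1.0', 'https://example.com/bot'));
//   usleep((int) (secondsToWait($lastRequestAt, microtime(true), 5.0) * 1000000));
```

A site owner who sees such an agent ID can look up what the bot does and either write a robots.txt rule for it or get in touch, which is the opposite of what a faked Chrome string allows.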