Are cookies and a user agent not enough to emulate a browser?

@Merenda212

Posted in: #Cookie #RobotsTxt #WebCrawlers

I am building a kind of search engine for online shops, and several websites are blocking me even though I check their robots.txt.

Let's take an example, I need to get this page:
www.aliexpress.com/item/1pcs-Easy-use-Pet-Animal-Dog-Grooming-Nail-Clippers-Scissors-Trimmer-wholesale-Newest/1339057466.html?spm=2114.10010108.0.64.MGPpDO
According to www.aliexpress.com/robots.txt:
# file: robots.txt,v 1.0 2002/09/23 created by Tsing Kong
# alibaba.com # <URL:http://www.robotstxt.org/wc/exclusion.html#robotstxt>
# Format is:
# User-agent: <name of spider>
# Disallow: <nothing> | <path>
# -----------------------------------------------------------------------------
User-agent: *
Disallow: /search/
Disallow: /productdetail/

User-agent: Baiduspider
Disallow: /promotion/
Disallow: /wholesale?

User-agent: baiduspider
Disallow: /promotion/


So the URL above is not disallowed, and no rule restricts which user agent I use. But the request still gets blocked.
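To make the robots.txt reasoning concrete, here is a minimal sketch of how the check could be done in PHP. It only understands `User-agent` and `Disallow` prefix rules (no `Allow`, wildcards, or `Crawl-delay`), and the function names and the bot name `ExampleShopBot` are hypothetical, not from any library.

```php
<?php
// Sketch only: a simplified robots.txt check with prefix matching.

/** Parse robots.txt text into [lowercased agent => list of Disallow prefixes]. */
function parseRobotsTxt(string $text): array
{
    $rules = [];
    $agents = [];      // agents of the group currently being read
    $inRules = false;  // true once the current group has seen a rule line

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) {        // a User-agent line after rules starts a new group
                $agents = [];
                $inRules = false;
            }
            $agent = strtolower($value);
            $agents[] = $agent;
            $rules[$agent] = $rules[$agent] ?? [];
        } elseif ($field === 'disallow') {
            $inRules = true;
            if ($value !== '') {   // an empty Disallow means "nothing is disallowed"
                foreach ($agents as $agent) {
                    $rules[$agent][] = $value;
                }
            }
        }
    }
    return $rules;
}

/** True if no Disallow prefix for this bot (or for "*") matches the path. */
function isAllowed(array $rules, string $botName, string $path): bool
{
    $disallowed = $rules[strtolower($botName)] ?? $rules['*'] ?? [];
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

$robots = <<<TXT
User-agent: *
Disallow: /search/
Disallow: /productdetail/

User-agent: Baiduspider
Disallow: /promotion/
Disallow: /wholesale?
TXT;

$rules = parseRobotsTxt($robots);
$itemPath = '/item/1pcs-Easy-use-Pet-Animal-Dog-Grooming-Nail-Clippers-Scissors-Trimmer-wholesale-Newest/1339057466.html';

var_dump(isAllowed($rules, 'ExampleShopBot', $itemPath));              // bool(true)
var_dump(isAllowed($rules, 'ExampleShopBot', '/search/dog-clippers')); // bool(false)
```

Under these rules the item page is indeed allowed, which is consistent with the block coming from something other than robots.txt.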

I have also tried the following, without luck.


I set a Chrome user agent:

curl_setopt ( $curl, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
);

I set cookies:

curl_setopt ( $curl, CURLOPT_HTTPHEADER, array (
'Cookie: NAME1=OPAQUE_STRING1; NAME2=OPAQUE_STRING2ali_apach'
) );


I also save the cookies from each response in the database and send them back with the next request.
I have changed the IP address several times; I am using Amazon EC2 (Elastic Beanstalk) to do so.
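As an alternative to storing cookies by hand, libcurl can keep them in a cookie jar file through the `CURLOPT_COOKIEFILE` and `CURLOPT_COOKIEJAR` options. The sketch below shows one way that might look; the helper names, jar path, and URL are placeholders, and it is not claimed that this alone avoids being blocked.

```php
<?php
// Sketch only: let libcurl persist cookies between requests in a jar file.
// Requires the PHP curl extension.

/** Build curl options that read and write cookies from the same jar file. */
function cookieJarOptions(string $jarPath, string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_COOKIEFILE     => $jarPath,   // cookies to send are read from here
        CURLOPT_COOKIEJAR      => $jarPath,   // received cookies are written here
    ];
}

/** Fetch a URL; returns the body as a string, or false on failure. */
function fetchWithCookies(string $url, string $jarPath, string $userAgent)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, cookieJarOptions($jarPath, $userAgent));
    $body = curl_exec($curl);
    // The jar is written when the handle is freed, here when $curl
    // goes out of scope as the function returns.
    return $body;
}

// Usage (placeholder values):
// $html = fetchWithCookies('https://www.example.com/', '/tmp/cookies.txt', 'ExampleShopBot/1.0');
```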


What am I doing wrong?



@Sherry384

Here is a start:


Websites do not like being spidered and having their resources wasted.
You are using Amazon, which has an absolutely terrible reputation as a source of abusive activity.
You are emulating a browser, which is being dishonest about who you are.
You are clearly a bot, yet you do not provide an agent name (with an informational URL in the agent ID) that the site could use to block you.
You may be accessing the site excessively.
You are assuming that you have the right to access the site, rather than assuming that you need to behave nicely.
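Two of the points above (identifying yourself honestly and not accessing the site excessively) can be sketched in PHP roughly as follows. The bot name, information URL, and the 10-second delay are hypothetical placeholders, and the helper functions are illustrative rather than part of any library.

```php
<?php
// Sketch only: an honest User-Agent string plus a minimum delay per host.

/** Build a User-Agent that names the bot and links to an information page. */
function botUserAgent(string $name, string $version, string $infoUrl): string
{
    return sprintf('%s/%s (+%s)', $name, $version, $infoUrl);
}

/**
 * Seconds to wait before the next request to $host, given the time each
 * host was last requested. Returns 0.0 if the host may be fetched now.
 */
function secondsToWait(array $lastRequestAt, string $host, float $now, float $minDelay): float
{
    if (!isset($lastRequestAt[$host])) {
        return 0.0;
    }
    $elapsed = $now - $lastRequestAt[$host];
    return max(0.0, $minDelay - $elapsed);
}

$userAgent = botUserAgent('ExampleShopBot', '1.0', 'https://example.com/bot.html');
echo $userAgent, PHP_EOL; // ExampleShopBot/1.0 (+https://example.com/bot.html)

// Last request to this host was at t=100s; it is now t=102s; minimum gap is 10s.
$lastRequestAt = ['www.example.com' => 100.0];
echo secondsToWait($lastRequestAt, 'www.example.com', 102.0, 10.0), PHP_EOL; // 8
```

The resulting string can be passed to `CURLOPT_USERAGENT`, and the crawler can `usleep()` for the returned duration before each request to the same host.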


Expect to be blocked.
