
Semalt ignores robots.txt; does their own removal form actually do what they promise?

@Annie201

Posted in: #RobotsTxt #Semalt #WebCrawlers

Semalt is blatantly ignoring robots.txt and the best way to block them as a webmaster seems to be blocking Semalt referral traffic through e.g. .htaccess.

I just found out that they also have a form on their own website at semalt.com/project_crawler.php on which they claim "YOUR WEBSITE WILL BE ELIMINATED FROM OUR BASE IN 30 MINUTES AFTER YOU FILL IN THE FORM." Considering the way they treat robots.txt, and the fact that some people claim this company even uses botnets to gather data, I have my doubts about these claims.

Has anyone had any luck with this? Does this form do what they promise?


@Sherry384

I have seen these people before and they are just what you describe. In my database, I see that they do read robots.txt, but they do not offer a bot name that you could use to block access to your site. This fits my definition of a bad bot (unwanted / unappreciated). There is plenty of evidence of this to be found with a simple Google search.

When in doubt, just block their crawler IP addresses:

ASN: AS49981 - WorldStream
IP address range: 91.212.229.0 - 91.212.229.255

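.htaccess code to block the AS49981 IP address range above and the semalt.com referrer:

RewriteEngine On
# Match requests whose referrer contains semalt.com
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
# Match client addresses in 91.212.229.0 - 91.212.229.255
RewriteCond %{REMOTE_ADDR} ^91\.212\.229\.
RewriteRule .* - [F,L]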


This is not a subscriber IP block, so you will not be blocking ordinary users.
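
If you would rather check addresses in a script (for example while going through your logs), here is a minimal sketch of the same range test using Python's standard ipaddress module. The name is_blocked is just something I made up for illustration, and the only range in it is the one quoted above.

```python
import ipaddress

# Range quoted above: 91.212.229.0 - 91.212.229.255
first = ipaddress.ip_address("91.212.229.0")
last = ipaddress.ip_address("91.212.229.255")

# summarize_address_range yields the CIDR network(s) covering first..last
BLOCKED_NETWORKS = list(ipaddress.summarize_address_range(first, last))


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

The range collapses to a single /24 network, which is why one anchored prefix match is enough on the .htaccess side.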

Further Details:

The Semalt.com IP address 217.23.11.15 has ignored robots.txt: it fell into a bot trap despite having read robots.txt. There is also hacker activity from this IP address, and it has requested images. The following user agents are tied to this IP address:

- (Yes. This is a dash and a common scraper tactic.)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/4.0 (compatible; Synapse)
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TencentTraveler)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TencentTraveler ; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 7.1; Trident/5.0)
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.1634 Safari/535.19 YE
Opera/9.80 (Windows NT 5.1; MRA 6.0 (build 5831)) Presto/2.12.388 Version/12.10
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36


Often, the use of multiple user agents is a tactic to disguise access patterns. It is not uncommon for a single IP address to use a few user agents over a period of time, but this list is concerning in its scope.
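
To build this kind of per-IP user agent list from your own logs, something along these lines should work. It is only a sketch: it assumes the common Apache "combined" log format, the helper name user_agents_by_ip is hypothetical, and the sample lines are made up for illustration (they are not from my database).

```python
import re
from collections import defaultdict

# Assumed "combined" format:
# ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def user_agents_by_ip(lines):
    """Map each client IP to the set of distinct user agents it sent."""
    agents = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # silently skip lines that do not fit the assumed format
            agents[m.group("ip")].add(m.group("ua"))
    return agents


# Made-up sample lines, for illustration only
sample = [
    '217.23.11.15 - - [01/Jan/2015:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '217.23.11.15 - - [01/Jan/2015:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; Synapse)"',
    '203.0.113.7 - - [01/Jan/2015:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for ip, uas in user_agents_by_ip(sample).items():
    print(ip, len(uas), sorted(uas))
```

An IP address that shows up with a long and varied set of user agents, including the bare dash, is worth a closer look.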

I also have activity for IP address 217.23.11.135, which I have tied to semalt.com as well, but nothing to report for it: no reads of robots.txt, no reads of images, and no user agents.

According to www.incapsula.com/blog/semalt-botnet-spam.html:

The perpetrators’ goal is to create backlinks to a certain URL by
abusing publicly-available access logs. Their first step is to locate
vulnerable websites. Offenders do this by using crawl bots, which
typically serve a double function—both as scanners locating vulnerable
targets and as spammers that exploit these vulnerabilities.

Coincidentally, the Semalt bot can execute JavaScript and hold
cookies, thereby enabling it to avoid common challenge-based bot
filtering methods (e.g., asking a bot to parse JavaScript). Because of
its ability to execute JavaScript, the bot appears in Google Analytics
reports as being “human” traffic.

Recently, substantial evidence revealed that Semalt isn’t running a
regular crawler. Instead, it appears to use a botnet generated by
malware hidden in a utility called “Soundfrost.”

Our data shows that, using this malware-infested utility, Semalt has
already infected hundreds of thousands of computers to create a large
botnet. This botnet has been incorporated in Semalt’s referrer spam
campaign and, quite possibly, several other malicious activities.

To put things in numbers, during the last 30 days we saw Semalt bots
attempting to access over 32% of all websites on our service with
spamming attempts originating over 290,000 different IP addresses
around the globe.
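
With that many source addresses, an IP range block is unlikely to cover everything, so filtering on the referrer is the other half of the picture. A minimal sketch of a referrer check, assuming you have the raw Referer header value as a string; the names is_spam_referrer and BLOCKED_REFERRER_DOMAINS are mine, not from any library:

```python
from urllib.parse import urlsplit

# Assumption: the only domain listed is the one discussed in this thread
BLOCKED_REFERRER_DOMAINS = {"semalt.com"}


def is_spam_referrer(referer: str) -> bool:
    """Return True if the Referer URL's host is a blocked domain or a subdomain of one."""
    # hostname is None when the value has no scheme/host part (e.g. "-")
    host = (urlsplit(referer).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in BLOCKED_REFERRER_DOMAINS
    )
```

This compares the host part only, so it is stricter than an unanchored pattern match on the whole referrer string: a URL that merely mentions semalt.com in its query string is not flagged.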


To answer your question specifically:


Probably the most antagonizing behavior of all is Semalt’s claims that
you can complete an online form to easily remove your website.
However, instead of stopping the flood of unwanted requests, it seems
that submitting the removal form actually results in increased spam
traffic.
