Debbie626

How to Keep "Good" Bots That Ignore Robots & Nofollow From Getting Into Honeypots?

@Debbie626

Posted in: #CrawlErrors #Honeypots #Nofollow #RobotsTxt #WebCrawlers

We have a self-generating honeypot that is designed to fill crawlers with junk data (among other things). It's protected with the right "treaties" and headers so 99% of good bots stay away. Today, though, it seems SEMrush has found its way in and has seen thousands of pages of trash data.

In theory, this would be awesome, since it would warp the stats shown to any competitors trying to sniff our site, but we actually use SEMrush. How do I keep useful bad bots like SEMrush from crawling into this honeypot? It seems robots.txt and nofollow have no effect. Here is how it's set up:


The honeypot file is named wp-admin (as in WordPress), so nothing legitimate should be hitting it (we don't use WP).
robots.txt tells all user agents not to visit example.com/wp-admin.
Every page has a hidden, off-UI display:none link with noindex/nofollow that points to example.com/wp-admin.
As the honeypot loads, a 403 Forbidden HTTP status is sent to the client.
In the honeypot, the page includes a meta tag for nofollow/noindex.
After the honeypot has loaded, a CSS overlay blocks certain things or explains what this is to any humans who find themselves there.
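
For reference, the robots.txt rule in the second item can be sanity-checked offline. This is only a minimal sketch using Python's standard urllib.robotparser. The robots.txt text, the helper name is_allowed, and the "SemrushBot" user-agent token are assumptions for illustration, not copied from the live site:

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: every crawler is told to stay out of /wp-admin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # What a compliant crawler would conclude for the honeypot and a normal page.
    print(is_allowed("SemrushBot", "https://example.com/wp-admin"))
    print(is_allowed("SemrushBot", "https://example.com/about"))
```

A check like this only shows what a crawler that obeys robots.txt would decide. It says nothing about a crawler that skips the file, which is the case being asked about.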


So how do I make sure that SEMrush or other tools like it do not get hung in the honey?


1 Comments


@Kimberly868

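If you're using the Apache web server, you could use an .htaccess rule that matches on the User-Agent header and keeps genuine bots from reaching your 'tarpit'. Scope the rule to the honeypot path; a catch-all pattern such as .* would return 403 to the bot for every URL the .htaccess file covers:

RewriteEngine On
# Applies only to requests whose User-Agent contains "SEMrush" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} SEMrush [NC]
# Answer 403 Forbidden for the honeypot path only; assumes this .htaccess sits in the document root
RewriteRule ^wp-admin - [F]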
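
If the web server can't be configured, the same user-agent test could run at the top of the honeypot script itself. The question doesn't say what language the honeypot is written in, so this is only a sketch in Python, and FRIENDLY_BOT_TOKENS, HONEYPOT_PREFIX, and should_serve_junk are hypothetical names:

```python
# Hypothetical list of lowercase user-agent substrings for crawlers
# that should be kept out of the honeypot.
FRIENDLY_BOT_TOKENS = ("semrush",)

# Path of the honeypot as described in the question.
HONEYPOT_PREFIX = "/wp-admin"


def should_serve_junk(path: str, user_agent: str) -> bool:
    """Return True only for honeypot requests from clients not on the friendly list."""
    if not path.startswith(HONEYPOT_PREFIX):
        return False  # not a honeypot URL, nothing to decide here
    ua = (user_agent or "").lower()
    # Case-insensitive substring match, mirroring the [NC] flag in the rewrite rule.
    return not any(token in ua for token in FRIENDLY_BOT_TOKENS)


if __name__ == "__main__":
    # Example user-agent strings, assumed for illustration.
    semrush_ua = "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
    print(should_serve_junk("/wp-admin", semrush_ua))
    print(should_serve_junk("/wp-admin", "SomeScraper/1.0"))
```

When the function returns False for a honeypot URL, the script would send the bare 403 and stop before generating any junk pages.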
