Spider-Trap on a GitHub Site

@Murray432

Posted in: #Github #RobotsTxt #WebCrawlers

I have a GitHub site and I hate web crawlers that disobey or ignore robots.txt. How would I set up a spider-trap on a GitHub site, with robots.txt disallowing the trap, so that the good bots aren't trapped but the bad ones are?

PS: I have no money for a server or a web host. (I'm 13.)

2 Comments


 

@Turnbaugh106

Kudos to you for being concerned about this at 13!

I'm going to ignore your lack of resources and pretend you could spend a few dollars a month on a Digital Ocean droplet (effectively a VPS of your own, where you could control all this stuff)...

Let's think about this in another light: What if you hated traffic from social media?

You'd make a list of social media sites whose referral traffic was banned from your site, right?

OK, so what sites are on that list?

Ignore the first 20 sites that came to mind; those are already on the blocked list. What other sites are on that list? Dogster? Ning? LifeAt?

Now do you get an idea of how hard it would be to block social traffic?

There are lists of user agents dedicated to blocking simple bots; I regularly use one via a WordPress plugin.
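
The list itself isn't reproduced here, but the idea is simple: compare the User-Agent header of every request against a blocklist and refuse matches. Here's a minimal sketch in Python (Flask), assuming a server you control rather than GitHub Pages; the substrings are examples only, not the plugin's actual list.

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Illustrative substrings only -- not a real, maintained blocklist.
    BAD_AGENT_SUBSTRINGS = ["httrack", "python-requests", "curl"]

    @app.before_request
    def block_bad_agents():
        ua = (request.headers.get("User-Agent") or "").lower()
        if any(bad in ua for bad in BAD_AGENT_SUBSTRINGS):
            abort(403)  # refuse the request before serving any page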

In the context of a webform, do you know what a 'honeypot' is?
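
In case it isn't familiar: a honeypot is a form field hidden from humans (usually with CSS) but visible to bots, so anything that fills it in gives itself away. A rough sketch of the server-side check, assuming a Flask handler and a hypothetical hidden field named "website":

    from flask import Flask, abort, request

    app = Flask(__name__)

    @app.route("/contact", methods=["POST"])
    def contact():
        # Humans never see or fill the hidden "website" field;
        # a naive bot fills every field it finds.
        if request.form.get("website"):
            abort(400)
        # ...process the legitimate fields here...
        return "Thanks!"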



 

@Si4351233

This is incredibly complex and not something you will be able to do on a zero budget: it requires access to server-side data (which GitHub Pages does not support), and it will need ongoing tweaking.

Unfortunately, as a rule, once you put content on the web it is accessible to all. A robots.txt file tells legitimate, known crawlers (like Googlebot, Bingbot, etc.) which parts of the site they are blocked from accessing. It is loosely based on the honour system: these companies agree to have their bots comply with robots.txt directives, and you trust them to do so; there is no easy way to enforce it.
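
That honour system is also why a classic spider trap only works with server-side code: you disallow a trap URL in robots.txt, and anything that fetches it anyway has ignored the rule and can be banned. A minimal sketch of the idea in Python (Flask), assuming a host that can run code (so not GitHub Pages); the /trap/ path and the in-memory ban set are illustrative only.

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()  # in-memory for the sketch; a real setup would persist this

    @app.before_request
    def enforce_ban():
        if request.remote_addr in banned_ips:
            abort(403)

    @app.route("/trap/")
    def trap():
        # robots.txt would contain "Disallow: /trap/", so only crawlers that
        # ignore the rule (or curious humans) ever reach this handler.
        banned_ips.add(request.remote_addr)
        abort(403)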

Some techniques that have been tried in the past, with limited or no success, are...

IP Address Blocking
Even most legitimate crawlers come from extensive IP netblocks, and those netblocks change frequently, so trying to block based on IP can be a never-ending battle. On top of that, illegitimate scrapers (private people or companies that scrape your site content for their own nefarious purposes) come from ordinary IP addresses which change frequently and are not associated with any known bots. As such, the only bots you could effectively block, even in the short term, are the ones that would respect your robots.txt file anyway.
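
For what it's worth, here is what netblock blocking looks like as a Python sketch; the ranges are RFC 5737 documentation networks standing in for a real (and constantly changing) blocklist.

    import ipaddress

    # Example test networks only -- not real bot netblocks.
    BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]

    def is_blocked(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_NETS)

    print(is_blocked("192.0.2.55"))   # True  -- inside 192.0.2.0/24
    print(is_blocked("203.0.113.9"))  # False -- not on the list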

Rate Limiting
You can implement rate limiting on your site so that a single IP address can't access it more than a predetermined number of times per second. This is reasonably good, as it won't block human users unless they are speeding through your site; on the flip side, it can only be done with server-side software and configuration, so it won't work in the GitHub Pages environment.
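
A minimal sliding-window sketch of the idea, assuming server-side Python; the numbers are arbitrary examples.

    import time
    from collections import defaultdict, deque

    MAX_HITS = 5   # requests allowed...
    WINDOW = 1.0   # ...per this many seconds, per IP
    hits = defaultdict(deque)

    def allow(client_ip: str) -> bool:
        now = time.monotonic()
        recent = hits[client_ip]
        while recent and now - recent[0] > WINDOW:
            recent.popleft()          # forget hits older than the window
        if len(recent) >= MAX_HITS:
            return False              # over the limit: reject, delay, or log
        recent.append(now)
        return True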

Monitor Logs For Unusual Activity
Regularly check your server logs for unusual activity, such as a large number of pages being accessed from the same IP in a relatively short amount of time. Once again this depends on access to server logs, and it is not a hard-and-fast, dependable method: IP addresses can change frequently, and many companies funnel a large number of users through a very small number of IP addresses, so a group of a few hundred users can appear to the rest of the world to share a single IP address.
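
A rough sketch of that kind of check, assuming a common/combined-format access log where the client IP is the first field on each line; the threshold is arbitrary.

    from collections import Counter

    def busiest_ips(log_path: str, threshold: int = 1000):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                ip = line.split(" ", 1)[0]  # first field of a combined-format line
                counts[ip] += 1
        return [(ip, n) for ip, n in counts.most_common() if n >= threshold]

    # Example: busiest_ips("access.log") lists IPs with 1000+ hits in the log.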

Monitor Time For User Activity
As IP address monitoring is not the best approach, you need to depend on other indicators. One you can use is how long it takes to fill out a form, from when it is loaded to when it is filled in and submitted. You can also check whether a button was actually pressed to submit the form or whether the submit was done programmatically.
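
A sketch of the timing check, assuming the page embeds a server-issued timestamp in a hidden field when the form is rendered; a real version would sign that timestamp so it can't be forged, which is omitted here.

    import time

    MIN_SECONDS = 3.0  # arbitrary: humans rarely complete a form this quickly

    def submitted_too_fast(rendered_at: float) -> bool:
        return (time.time() - rendered_at) < MIN_SECONDS

    # Example: submitted_too_fast(float(form["rendered_at"])), where
    # "rendered_at" is the hypothetical hidden timestamp field.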

Use a CAPTCHA To Protect Forms
If your concern is forms being submitted by bots, then you can add a CAPTCHA to protect those forms. It is not a guarantee of a human, but it is a reasonably good test that blocks most of the poorly written scrapers out there.
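
As a sketch, server-side verification with a service like Google reCAPTCHA comes down to posting the token the browser submitted back to the service's verification endpoint and checking the result; the secret key below is a placeholder for the one the service issues you.

    import requests

    RECAPTCHA_SECRET = "your-secret-key"  # placeholder issued by the CAPTCHA service

    def captcha_passed(token: str, client_ip: str) -> bool:
        resp = requests.post(
            "https://www.google.com/recaptcha/api/siteverify",
            data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": client_ip},
            timeout=5,
        )
        return bool(resp.json().get("success"))

    # The token normally arrives with the form as the "g-recaptcha-response" field.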
