Is there a program to prevent downloading a whole site?
I want to avoid reinventing the wheel, so I am wondering if there is an (open source) program that you can use to protect a web server with paid content by preventing a user from downloading all of its content to their hard drive, for example by detecting atypical behavior (e.g. saving every page, rapidly working through all content) and reacting to it automatically.
Most larger sites have something like this, throwing up a CAPTCHA every now and then, so I assume there is an implementation in the open-source world as well. However, my searches, e.g. for ‘outlier detection’, ‘behavior analysis’ or ‘intrusion detection’, turned up nothing except Snort, which seems to belong in a firewall rather than on a web server. I would have expected the solution to be something like a proxy server, an Apache module or a Typo3 extension. Maybe my searches failed because I lack the precise English term for such a component.
Do you know a way to protect against this with some sort of behaviour analysis or outlier detection, i.e. something more robust than checking the easy-to-fake user-agent and referrer headers?
You can, but it's a lot of work with the potential for no pay-off.
There is no off-the-shelf software called "protect example.com", and there definitely isn't one on the open-source sites. Because website content differs from server to server, there are only a few general things at your disposal. Pay-per-content is the only way to stop this from happening entirely, and even then you should look at how sites like Apple and Amazon deliver content to get an idea of how your customer base will feel about it.
Basically, if anyone devotes enough time to your server, they will find a way around any of these protections.
You can create a table of bot user agents, write them to a file, and block their access from Apache using mod_rewrite. To get around this, a scraper will simply change its user-agent string to something like "Firefox".
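A minimal .htaccess sketch of that idea, assuming mod_rewrite is enabled; the agent names are only examples and the list would need to grow as you find offenders:

    # Refuse requests from known download tools (example names, not exhaustive)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (HTTrack|wget|WebZIP|WebCopier) [NC]
    RewriteRule .* - [F,L]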
Depending on how the person is accessing your site, whether with a greedy bot or a browser plug-in like DownThemAll for Firefox, you can (provided you have access to the actual server itself) use something like the iptables firewall on Linux to block clients that keep multiple connections open at the same time. They can still get around this block by limiting their download speed.
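A sketch of that iptables rule on Linux; the port and the threshold of 20 simultaneous connections are assumptions you would have to tune:

    # Reject clients holding more than 20 parallel connections to HTTPS
    iptables -A INPUT -p tcp --syn --dport 443 \
      -m connlimit --connlimit-above 20 --connlimit-mask 32 \
      -j REJECT --reject-with tcp-reset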
Verify bots with a honeypot. Bots (programs) will typically follow links that most users cannot see (white on a white background, a div under another div, etc.). If someone is downloading a link that you've deliberately not made visible to a typical user, you can block them. For my own attempts at staving off this behavior, I've looked at blocking IP ranges, blocking specific user agents, and blocking bots that I've found with honeypots, specifically by tracking them with a cookie.
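A rough Python sketch of the honeypot idea (Flask is used purely for illustration; the trap URL /promo-archive.html is hypothetical and should be linked only from an invisible anchor and disallowed in robots.txt):

    from flask import Flask, request, abort

    app = Flask(__name__)
    trapped_ips = set()  # in practice, persist this (database, ipset, shared cache)

    @app.route("/promo-archive.html")
    def honeypot():
        # Only bots that follow hidden links ever reach this handler
        trapped_ips.add(request.remote_addr)
        abort(404)  # pretend the page does not exist

    @app.before_request
    def block_trapped():
        # Every later request from a trapped address is refused
        if request.remote_addr in trapped_ips:
            abort(403)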
Use hashed strings passed in the URL to verify a user. You can append a string to the URL containing a hash required for each authenticated user. If the user makes multiple connections to the site (or connections from multiple IP addresses at the same time), you can close and lock their session and then explain your download policy to them. (This needs to be coded into your web application.)
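A minimal Python sketch of such signed, per-user URLs; the secret, parameter names and expiry window are assumptions:

    import hashlib, hmac, time

    SECRET = b"server-side secret"  # hypothetical; load from configuration, not source code

    def sign_url(path, user_id, ttl=300):
        # Append an expiry time and an HMAC so the URL is only valid for this user, briefly
        expires = int(time.time()) + ttl
        msg = f"{path}|{user_id}|{expires}".encode()
        sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return f"{path}?uid={user_id}&exp={expires}&sig={sig}"

    def verify(path, user_id, expires, sig):
        # Recompute the hash, compare in constant time and reject expired links
        msg = f"{path}|{user_id}|{expires}".encode()
        expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, sig) and int(expires) > time.time()

Locking the session on concurrent use would sit on top of this: count active sessions or distinct IP addresses per user and invalidate the session when the limit is exceeded.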
CAPTCHAs are easily compromised and can annoy your users. If your regular users find they can't reach content because of a CAPTCHA, it may erode your user base or increase the number of help desk calls from people who can't get to content they're supposed to be able to access.
A paid option might be to host your content with a service that offers DLP (data loss prevention) features.
Check for off-site referrers. Users normally reach content by following a link on your own pages, so you can check whether the request carries a referrer from your site and deny access otherwise, using something like .htaccess in Apache. Most well-behaved clients will supply the referrer, but it can be spoofed with an application like cURL.
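A sketch of that referrer check in .htaccess, again assuming mod_rewrite; example.com and the file extensions are placeholders:

    # Deny requests for paid files that do not come from our own pages
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    RewriteRule \.(pdf|zip|mp4)$ - [F]

The first condition lets through requests with no referrer at all; drop it if you want to be stricter.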
What are the downsides?
False positives are always a risk when implementing any precaution that catches people based on Bayesian-style matching: you can inadvertently block your real users. You can also cause your site to stop showing up in web searches if you block the crawlers sent by Google or other search engines.
What can you do about it... legally?
If you find your content republished elsewhere on the web, you can report it to sites like Google and they will delist the offending pages from their rankings. Beyond that, you would have to report the people to whichever government agency in your country handles copyright theft and wait for them to get to your case.
You could use fail2ban to monitor for unusual user agents or too many page requests in a short period of time and automatically block the requesting IP address for a set period. The default configuration already has settings to watch the Apache logs for too many 404 requests; you could easily alter that to watch all requests. Too many in a very short timeframe might indicate a scripted download.
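A sketch of such a setup; the filter name, thresholds and log path are assumptions to adapt:

    # /etc/fail2ban/filter.d/apache-scrape.conf (hypothetical filter name)
    [Definition]
    # Match every request line in the Apache access log for the client IP
    failregex = ^<HOST> .* "(GET|POST) .* HTTP.*"
    ignoreregex =

    # /etc/fail2ban/jail.local
    [apache-scrape]
    enabled  = true
    port     = http,https
    filter   = apache-scrape
    logpath  = /var/log/apache2/access.log
    # more than 120 requests within 60 seconds earns a 10-minute ban
    findtime = 60
    maxretry = 120
    bantime  = 600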
I would generally recommend against this. You'll probably anger some users, and unless you ban the IP forever they'll eventually be able to download everything anyway. If you're going to do this at all, it should probably be built into your application itself so you can provide ways to bypass it or show friendly warning pages.
What about www.copyscape.com/online-copyright-protection/
It can be used so that if someone copies content from your site, the service emails you.
I put this on my site after someone linked to it from a porn site. I then changed my terms to read like this:
Terms:
Sites that contain obscene, hateful, pornographic or otherwise objectionable materials, including linking to or back linking to our site without permission is illegal and will not be allowed.
Within 2 days the link was gone.
This might help, and it would not hurt to try. Just rewrite the message to suit your needs.
I hope you or someone else will find this helpful.