Wget not respecting my robots.txt. Is there an interceptor?
I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping it pretty hard, and I was wondering how to curb that, even if only a little.
I have implemented a robots.txt policy; I posted it below:
User-agent: wget
Disallow: /
User-agent: libwww
Disallow: /
User-agent: *
Disallow: /
Issuing a wget from my totally independent Ubuntu box shows that the robots.txt just doesn't seem to have any effect; this still downloads the file:
wget myserver.com/file.csv
Anyway, I don't mind people just grabbing the info; I just want to implement some sort of flood control, like a wrapper or an interceptor.
Does anyone have a thought about this, or could you point me in the direction of a resource? I realize that it might not even be possible. I'm just after some ideas.
Janie
If you decide you want to block wget and libwww, you can either redirect them to a page telling them why you're blocking them with this code:
RewriteEngine on
# Match User-Agents that start with wget or libwww, case-insensitively,
# but skip requests for the blocked page itself to avoid a redirect loop
RewriteCond %{HTTP_USER_AGENT} ^wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww [NC]
RewriteCond %{REQUEST_URI} !/blocked\.html$
RewriteRule ^.*$ http://www.example.com/blocked.html [R=302,L]
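If you want to sanity-check the redirect before relying on it, one quick way (assuming curl is installed, and using example.com as a stand-in for your own domain) is to spoof wget's User-Agent and look at the response headers; a working rule should answer with a 302 pointing at blocked.html:
# Any User-Agent string starting with "wget" will do; -I fetches headers only
curl -I -A "Wget/1.21.2" "http://www.example.com/file.csv"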
Or you can flat out reject their requests with this code:
# Flag any request whose User-Agent starts with wget or libwww...
SetEnvIfNoCase User-Agent "^wget" bad_bot=1
SetEnvIfNoCase User-Agent "^libwww" bad_bot=1
# ...and deny flagged requests for every file
<FilesMatch ".*">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</FilesMatch>
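One caveat if the server happens to be on Apache 2.4: Order, Allow, and Deny only keep working there through mod_access_compat. A rough equivalent in 2.4's native syntax would be something like:
SetEnvIfNoCase User-Agent "^wget" bad_bot=1
SetEnvIfNoCase User-Agent "^libwww" bad_bot=1
<FilesMatch ".*">
    # Let everyone in except requests flagged as bad_bot
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</FilesMatch>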
Just place whichever snippet you choose in a .htaccess file in your document root, or in the directory that holds the files they are downloading.
I have used the second snippet to block bots from a site I had that was getting scraped a lot. I haven't used the first snippet but it looks like it should work well if you choose to go that route.
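As for the flood-control part of the question: blocking or redirecting is all-or-nothing, so if the goal is just to slow heavy downloaders rather than shut them out, and the server is Apache 2.4 or newer, mod_ratelimit can throttle the transfer rate of the CSV files. This is only a sketch; it assumes mod_ratelimit is enabled, and the 400 KiB/s figure is an arbitrary example:
<IfModule mod_ratelimit.c>
    <FilesMatch "\.csv$">
        # Cap each response at roughly 400 KiB/s
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 400
    </FilesMatch>
</IfModule>
It won't stop a determined scraper, but it does cap how much bandwidth any single wget run can use.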