
Wget not respecting my robots.txt. Is there an interceptor?

@Vandalay111

Posted in: #RobotsTxt

I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping it pretty hard, and I was wondering how to curb that, even if only a little.

I have implemented a robots.txt policy, posted below:

User-agent: wget
Disallow: /

User-agent: libwww
Disallow: /

User-agent: *
Disallow: /


Issuing a wget from my totally independent Ubuntu box shows that the policy just doesn't seem to work against wget, like so:
myserver.com/file.csv

Anyway, I don't mind people just grabbing the info. I just want to implement some sort of flood control, like a wrapper or an interceptor.

Does anyone have a thought about this, or could anyone point me in the direction of a resource? I realize that it might not even be possible. I'm just after some ideas.

Janie


@Kristi941

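robots.txt is only advisory, and wget consults it only during recursive downloads, so a single-file fetch ignores it. If you decide you want to block wget and libwww, you can either redirect them to a page explaining why they are blocked, using this code: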

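RewriteEngine on
# Wget sends "Wget/<version>", so match case-insensitively with [NC]
RewriteCond %{HTTP_USER_AGENT} ^wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww [NC]
# Skip the notice page itself to avoid a redirect loop
RewriteCond %{REQUEST_URI} !^/blocked\.html$
RewriteRule ^ http://www.example.com/blocked.html [R=302,L]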

Or you can flat out reject their request with this code:

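# No trailing space after "wget": the agent string looks like "Wget/<version>"
SetEnvIfNoCase User-Agent "^wget" bad_bot=1
SetEnvIfNoCase User-Agent "^libwww" bad_bot=1
<FilesMatch "(.*)">
# Apache 2.2 access syntax; Apache 2.4 needs mod_access_compat for these directives
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</FilesMatch>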
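
Before deploying a user-agent pattern, it can help to check it against the kind of strings these clients actually send. The sketch below is only an illustration in Python, not part of the Apache setup: the sample agent strings and the `is_blocked` helper are assumptions made up for this example. It also shows that a pattern with a trailing space, such as "^wget ", does not match an agent string of the form Wget/1.21.2.

```python
import re

# Hypothetical user-agent strings, for illustration only.
SAMPLE_AGENTS = [
    "Wget/1.21.2",
    "libwww-perl/6.05",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

# Anchored, case-insensitive prefixes for the two clients discussed here.
BLOCKED = [
    re.compile(r"^wget", re.IGNORECASE),
    re.compile(r"^libwww", re.IGNORECASE),
]


def is_blocked(user_agent: str) -> bool:
    """Return True if the user-agent starts with one of the blocked prefixes."""
    return any(pattern.search(user_agent) for pattern in BLOCKED)


def trailing_space_pattern_matches(user_agent: str) -> bool:
    """Check the stricter pattern '^wget ' (note the trailing space)."""
    return re.search(r"^wget ", user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    for agent in SAMPLE_AGENTS:
        verdict = "blocked" if is_blocked(agent) else "allowed"
        print(f"{agent!r}: {verdict}")
```

Keep in mind that wget can send any agent string via its --user-agent option, so a check like this only stops clients that identify themselves honestly.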


Place either of the two Apache snippets in a .htaccess file, either in your root directory or in the directory that holds the files being downloaded.

I have used the second snippet to block bots from a site I had that was getting scraped a lot. I haven't used the first snippet but it looks like it should work well if you choose to go that route.
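
Neither snippet is flood control as such, since both block by user agent instead of limiting request rates. If the downloads pass through application code, one possible approach is a per-client sliding-window limiter. The following is a minimal sketch under that assumption; the class name `SlidingWindowLimiter`, its parameters, and the idea of keying on the client's IP address are all hypothetical choices for illustration, not something from the setup described in the question.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        # Maps a client key (e.g. an IP address) to timestamps of recent requests.
        self._hits = defaultdict(deque)

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        """Record a request and return True if it is within the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[client]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # Over the limit: the caller could answer with HTTP 429.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=2, window=10.0)
    for second in (0.0, 1.0, 2.0, 10.0):
        print(second, limiter.allow("203.0.113.7", now=second))
```

A request handler would call `allow()` with the client's address before serving a file and refuse the request when it returns False. This keeps state in memory only, so it suits a single process; anything larger would need shared storage.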
