5 Comments


 

@Cofer257

Every time I read questions like this I think of Kevin Spacey's character in Henry & June: the fellow who was always writing his greatest novel, but was so worried about someone stealing his ideas that he kept it locked away in a briefcase, carried close to his chest...

Every Linux user is a "legit" wget user. I use it often for grabbing debs, videos, binaries, whatever. It's easy to drive from the command line, so yes, it makes a great scraper. But that's definitely not its only use, and making it appear as Firefox or MSIE is just one --user-agent parameter away, so you're wasting your time blocking it. If anything, doing that will attract the attention of anyone passing by; they'll change the user-agent string and start digging for what you have "hidden."
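
For example, impersonating a browser takes a single option (a sketch; the user-agent string and URL here are just placeholder values):

# Fetch a page while presenting a browser-style user-agent string
# (the exact string below is only an example)
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0" https://example.com/page.html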



 

@Margaret670

wget has legitimate uses, yes, but it's also quite useful for Web scraping. However, I don't think you should try to block it (or any other agent) by using the user agent string.

wget respects your robots.txt file by default. It's true that a scraper can simply switch that option off, but guess what: it's just as easy to pass --user-agent "MSIE(blahblah)" and impersonate Internet Explorer if you start blocking at the HTTP level. I've written scraping scripts before, and you'd better believe that changing the UA is one of the first steps (and if that doesn't work, you can always switch gears and simply write a script that automates IE).
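
To illustrate the robots.txt point, the check can be disabled for a single run (a sketch; the URL and recursion depth are placeholders):

# Tell wget to ignore robots.txt for this invocation and crawl one level deep
wget -e robots=off -r -l 1 https://example.com/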

If you're really concerned, you'll need to try and catch bot-like behavior -- pages without referrers, too many requests in too short a time, etc. However, I'm afraid you'll quickly find that it's pretty trivial for someone who wants to scrape your site to bypass any measure you could possibly take (short of those that would be too onerous to your users, like only allowing one page view per hour or something). This is also likely to be a big time sink.
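
If you do want to hunt for that kind of behavior, a crude starting point is to mine the access log for noisy clients. A rough sketch, assuming an Apache combined-format log at /var/log/apache2/access.log (the path and format will vary):

# Top 20 client IPs by request count -- a blunt signal for bot-like traffic
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

# Requests that arrived with an empty Referer header (field 11 in the combined format)
awk '$11 == "\"-\"" {print $1, $7}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20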

Essentially, if legitimate users can see your page, there's not much you can do to keep scrapers from seeing it too.



 

@Sherry384

Why another answer? Because both of the others are right, but with caveats. I study these things as a subject area related to security.

WGet is an offline web browser. It is not interactive like Lynx, which is a text-based browser that works much like any other browser. WGet is most often used to capture resources such as downloads, videos, and audio files, and to create a localized copy of a website. It can be used by scrapers, but it would be awkward to do and would require hand-written code to parse the files. I have done this for customers. I would not recommend it.

WGet is not a scraping tool. It is generally not used by scrapers directly, because there are so many dedicated scraping tools out there that it would not make sense. However, WGet does allow the user to do things that would not otherwise be easy. For example, I can create a copy of a page, part of a site, or a whole site with its resource URLs rewritten, so that I can browse it offline from my hard drive. It is also possible for WGet to download a page, part of a site, or a whole site corrected for redeployment somewhere else, but in my experience that rarely happens.
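
As a sketch of that kind of offline copy (the URL is a placeholder; check your wget version's manual for the exact flags):

# Mirror part of a site for offline browsing, rewriting links to the local copies
wget --mirror --convert-links --page-requisites --adjust-extension \
     --wait=2 --no-parent https://example.com/docs/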

WGet is not, generally speaking, a tool that anyone needs in order to use your site normally.

And that is exactly my point. Anyone using WGet is not a casual user and often has something nefarious in mind, but not always. I have used WGet to download research papers, data, and other resources where I would otherwise have had to click a pile of links one at a time and pay attention to the process. For example, I could specify which page to look at and which resources I wanted to download, and then trigger the job to run on my robot server completely unattended. I use WGet for valid work. But users like me are extremely rare, and I would not be using it against your site without your knowing it.
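
A sketch of that kind of unattended batch job (the URL, file type, and output directory are placeholders):

# Grab every PDF linked from a listing page, one level deep,
# politely spaced out and saved into a flat local directory
wget -r -l 1 -A pdf -nd -P papers/ --wait=5 https://example.com/publications/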

I would block this agent. I have it blocked on all the sites I control. Unless you are offering videos, audio files, e-books, or other similar resources for open download, block this agent. The user is likely up to no good. In fact, during my study, outside of resource-download sites I have not seen a valid WGet user.



 

@Rivera981

Wget is just a command-line tool for Linux that fetches resources over HTTP. All this tells you is that someone accessed your site from a command line. It could have been a bot scraping you, but there's no way of knowing for sure.

If your site is password protected properly, there shouldn't be any need to block particular user agents :) x



 

@Tiffany637

wget is often used for scraping. It's a command-line tool for downloading webpages and their assets. If your website isn't being publicized, you can be almost sure it's a bot doing the scraping. So yes, you could block it, but be aware that you may need to do something more sophisticated than a robots.txt rule, since wget can easily be told to ignore robots.txt.

To block this particular user agent in .htaccess, you could add the following:

# Set the "wget" environment variable for any Wget user agent, not just version 1.12
BrowserMatchNoCase "^Wget" wget
# Apache 2.2 syntax; on Apache 2.4, use "Require not env wget" inside a <RequireAll> block instead
Order Deny,Allow
Deny from env=wget
