How do I deal with content scrapers?

@Twilah146

Posted in: #Heroku #RobotsTxt #Spam

Possible Duplicate:
How to protect SHTML pages from crawlers/spiders/scrapers?




My Heroku (Bamboo) app has been getting a bunch of hits from a scraper identifying itself as GSLFBot. Googling for that name turns up various people who've concluded that it doesn't respect robots.txt (e.g., www.0sw.com/archives/96).
I'm considering updating my app to keep a list of banned user-agents, serve a 400 or similar to all requests from those user-agents, and add GSLFBot to that list. Is that an effective technique, and if not, what should I do instead?
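The banned-list idea can be sketched as a small piece of middleware. The question doesn't say what framework the app uses, so this is an illustrative WSGI sketch, not the app's actual code; the `BANNED_AGENTS` list and the choice of a 403 response are assumptions:

```python
# Illustrative WSGI middleware that rejects requests from banned user-agents.
# BANNED_AGENTS and the 403 status are assumptions, not from the original app.
BANNED_AGENTS = ("gslfbot",)  # lowercase substrings to match against

class BlockScrapers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bad in agent for bad in BANNED_AGENTS):
            # Short-circuit: banned agents never reach the wrapped app.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```

Note that this only helps against scrapers that keep sending a distinctive user-agent; a scraper can trivially change that header.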

(As a side note, it seems weird to have an abusive scraper with a distinctive user-agent.)


1 Comment


@Turnbaugh106

Perishable Press has a good rundown on dealing with content scrapers, as does Chris Coyier at CSS-Tricks. The general view is: do nothing, and take advantage of it where you can. A summary of the good advice from Perishable Press is below...


How to Deal with Content Scrapers

So what is the best strategy for dealing with content-scraping
scumbags? My personal three-tiered strategy includes the following
levels of action:


Do nothing
Always include lots of internal links
Stop them with a well-placed slice of htaccess


These are the tools I use when dealing with content scrapers. For
bigger sites like DigWP.com, I agree with Chris that no action is
really required. As long as you are actively including plenty of
internal links in your posts, scraped content equals links back to
your pages. For example, getting a link in a Smashing Magazine article
instantly provides hundreds of linkbacks thanks to all of the thieves and
leeches stealing Smashing Mag’s content. Sprinkling a few internal
links throughout your posts benefits you in some fantastic ways:


Provides links back to your site from stolen/scraped content
Helps your readers find new and related pages/content on your site
Makes it easy for search engines to crawl deeply into your site


So do nothing if you can afford not to worry about it; otherwise, get
in the habit of adding lots of internal links to take advantage of the
free link juice. This strategy works great unless you start getting
scraped by some of the more sinister sites. In which case...
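The "well-placed slice of htaccess" mentioned above might look something like this on an Apache server (the bot names other than GSLFBot are made-up examples; note this only applies where you control the web server, which is generally not the case on Heroku):

```apache
# Return 403 for requests whose User-Agent matches known scrapers.
# Bot names here are illustrative; adjust to the agents seen in your logs.
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GSLFBot|BadScraper|EvilBot) [NC]
  RewriteRule .* - [F,L]
</IfModule>
```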


The Stack Exchange network is content-scraper city, so it would be interesting to hear the advice of some of the high-level admins on this topic...
