Eichhorn148: Rule out third party scraping, but allow Google crawling

@Eichhorn148

Posted in: #Google #Googlebot #ScraperSites #WebCrawlers

How can I make scraping of my own content with wget, HTTrack, etc. impossible, while still allowing crawling by Googlebot?

This should be done without showing Googlebot different content than other user agents see (no cloaking).

And please avoid IP recognition in your answers, if that is at all possible!

In the current setup this already works via IP recognition, but the server periodically goes down. The setup is:


first layer: nginx as a cache,
second layer: Apache with mod_security (mod_security does the IP recognition and manages traffic),
third layer: Tomcat with the CMS.


The main bottleneck is currently mod_security and, partly, the path from mod_security to Tomcat. Changing the setup itself is outside the range of viable solutions.
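For reference, the first layer in a stack like this is typically an nginx caching reverse proxy along these lines (a minimal sketch, not the poster's actual configuration; the upstream address, cache path, and zone name are assumptions):

# nginx caching layer in front of Apache/mod_security
# (second layer assumed to listen on 127.0.0.1:8080).
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=cms_cache:10m
                 max_size=1g inactive=60m;

server {
    listen 80;

    location / {
        proxy_cache cms_cache;
        proxy_cache_valid 200 10m;          # cache successful responses briefly
        proxy_pass http://127.0.0.1:8080;   # second layer: Apache + mod_security
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}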





1 Comment


@Megan663

It is almost impossible to rule out third-party scraping entirely. The first line of defense is a robots.txt file:

# Allow Googlebot to crawl everything (an empty Disallow permits all).
User-agent: Googlebot
Disallow:

# Block every other robots.txt-respecting crawler from the whole site.
User-agent: *
Disallow: /


That blocks every crawler that obeys robots.txt, while leaving Googlebot free to crawl. Keep in mind that compliance is voluntary: wget and HTTrack honor robots.txt by default, but both can be told to ignore it, as can any custom scraper.
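Beyond robots.txt, the default User-Agent strings of the tools named in the question can be rejected at the nginx caching layer, before a request ever reaches mod_security (a sketch only, and trivially bypassed, since any scraper can send a browser User-Agent instead):

# Inside the server block of the nginx caching layer (assumed placement).
# Rejects the default User-Agent strings of wget and HTTrack with 403.
if ($http_user_agent ~* (wget|httrack)) {
    return 403;
}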


