Eichhorn148: Rule out third party scraping, but allow Google crawling

@Eichhorn148

Posted in: #Google #Googlebot #ScraperSites #WebCrawlers

How can I make scraping of my own content with wget, HTTrack, etc. impossible, while still allowing crawling by Googlebot?

This should be done without showing Googlebot different content than other user agents see (no cloaking).

And please avoid IP recognition in your answers, if that is at all possible!

In the current setup this already works via IP recognition, but the server periodically goes down. The setup is:


first layer: nginx as a cache,
second layer: Apache with mod_security (mod_security does the IP recognition and manages traffic),
third layer: Tomcat with the CMS.


The main bottleneck is currently mod_security and, partly, the path from mod_security to Tomcat. Changing the setup itself is outside the range of viable solutions.
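For reference, the first layer in a stack like this is typically an nginx caching reverse proxy along these lines (a minimal sketch, not the poster's actual configuration; the upstream address, cache path, and zone name are assumptions):

# nginx caching layer in front of Apache/mod_security
# (second layer assumed to listen on 127.0.0.1:8080).
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=cms_cache:10m
                 max_size=1g inactive=60m;

server {
    listen 80;

    location / {
        proxy_cache cms_cache;
        proxy_cache_valid 200 10m;          # cache successful responses briefly
        proxy_pass http://127.0.0.1:8080;   # second layer: Apache + mod_security
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}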





1 Comment


@Megan663

It is almost impossible to rule out third-party scraping entirely. The first line of defense is a robots.txt file:

# Allow Googlebot to crawl everything (an empty Disallow permits all).
User-agent: Googlebot
Disallow:

# Block every other robots.txt-respecting crawler from the whole site.
User-agent: *
Disallow: /


That blocks every crawler that obeys robots.txt, while leaving Googlebot free to crawl. Keep in mind that compliance is voluntary: wget and HTTrack honor robots.txt by default, but both can be told to ignore it, as can any custom scraper.
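Beyond robots.txt, the default User-Agent strings of the tools named in the question can be rejected at the nginx caching layer, before a request ever reaches mod_security (a sketch only, and trivially bypassed, since any scraper can send a browser User-Agent instead):

# Inside the server block of the nginx caching layer (assumed placement).
# Rejects the default User-Agent strings of wget and HTTrack with 403.
if ($http_user_agent ~* (wget|httrack)) {
    return 403;
}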


