
@Cody1181609

Posted in: #Bingbot #Googlebot #RobotsTxt #Seo #WebCrawlers

Having written a number of bots, and seen the massive amounts of random bots that happen to crawl a site, I am wondering as a webmaster, what bots are really worth letting onto a site?

My first thought is that allowing bots onto the site can potentially bring real traffic to it. Is there any reason to allow bots that are not known to be sending real traffic onto a site, and how do you spot these "good" bots?


2 Comments


@Frith620

I had problems with Baidu bots slowing down my server while the search engine was sending almost no traffic. These bots do not respect the robots.txt file, so to block Baidu's bots, paste the following into your .htaccess file:

# User-agent: Baiduspider
# Baiduspider+(+http://www.baidu.com/search/spider_jp.html)
# Baiduspider+(+http://www.baidu.com/search/spider.htm)

# IP range: 180.76.0.0/16

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^180\.76\. [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F,L]
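If you would rather do the filtering in application code than in .htaccess, the same IP-range test is straightforward with Python's stdlib `ipaddress` module. A minimal sketch (the function name is my own):

```python
import ipaddress

# The Baidu crawler range cited above, written as a CIDR network
BAIDU_NET = ipaddress.ip_network("180.76.0.0/16")

def is_baidu_ip(ip: str) -> bool:
    """True if the address falls inside the 180.76.0.0/16 range."""
    return ipaddress.ip_address(ip) in BAIDU_NET
```

Blocking at the web-server layer is still cheaper per request; this is mainly useful for log analysis or app-level rate limiting.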


I've also had problems with Bing/Microsoft spiders crawling too fast. Unlike Baidu, they do respect the robots.txt file, so:

User-agent: bingbot
Crawl-delay: 1

User-agent: msnbot
Crawl-delay: 1


@Alves908

Within the realm of normal bots, it all depends on what you appreciate, and only you can decide that. Of course there are Google, Bing/MSN/Yahoo!, Baidu, and Yandex. These are the major search engines. There are also the various SEO and backlink sites. Right or wrong, I allow a couple of the big ones to have access to my site, but generally they are useless sites. I block archive.org not only in robots.txt, but also by domain name and IP address, because they ignore robots.txt big time! This is something that you need to get a feel for. Do not get fooled by agent names; they are often forged by bad actors. Nowadays I am getting thousands of page requests from sources claiming to be Baidu that are not. Get to know these spiders by domain names and IP address blocks and learn to deal with them on that level. The good ones obey robots.txt.
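One practical way to avoid being fooled by forged agent names is a double reverse-DNS check: resolve the visiting IP to a hostname, check that the hostname belongs to the claimed crawler's domain, then resolve that hostname back and confirm it returns the original IP. A minimal sketch in Python (the helper names are mine; the googlebot.com/google.com suffixes are the domains Google documents for Googlebot):

```python
import socket

def host_matches_google(host: str) -> bool:
    """Genuine Googlebot reverse-DNS names end in googlebot.com or google.com."""
    return host.endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip: str) -> bool:
    """Double reverse-DNS check: IP -> hostname -> back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse lookup
    except OSError:
        return False
    if not host_matches_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward lookup must match
    except OSError:
        return False
```

The same pattern works for other major crawlers that publish their reverse-DNS domains (e.g. search.msn.com for Bingbot); a forged user agent fails the hostname check even though the header looks legitimate.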

But I must warn you, there are a TON of stealth bots, rogue bots, scrapers, and so on, so you will want to search your log analysis frequently and block them. This 5uck5! But it has to be done. The largest threat from them these days is low-quality links to your site. The updated anti-bot security code I implemented this year has automatically dropped 7,700 low-quality links. Of course, my code still needs work, but you get the point. The bad bots still steal site potential.
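Searching your logs for heavy requesters can be partly automated. A small sketch, assuming an Apache combined-format access log (the regex and function names are my own):

```python
import re
from collections import Counter

# Captures the client IP (group 1) and user-agent (group 2)
# from an Apache combined-format log line.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def top_clients(lines, n=10):
    """Count requests per (IP, user-agent) pair to spot aggressive bots."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts.most_common(n)
```

Review the top entries against known crawler IP ranges and reverse-DNS names before blocking anything; legitimate spikes (e.g. a CDN health check) look similar in raw counts.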

It won't be long before you get the hang of it.
