Preventing high bandwidth usage from Yandex

@Cooney921

Posted in: #RobotsTxt #SearchEngines #UserAgent #WebCrawlers #Yandex

Although I added a rule for Yandex to my robots.txt file, Yandex sometimes crawls my website aggressively. So I hard-coded a check for the user agent and serve a cached file when the user agent looks like this: "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
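For illustration, a minimal PHP sketch of that kind of hard-coded check (the cache directory and file naming are assumptions, not from the original post):

<?php
// Minimal sketch of the hard-coded user-agent check described above.
// The cache directory and file naming scheme are assumptions for illustration.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'YandexBot') !== false) {
    $cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
    if (is_file($cacheFile)) {
        header('Content-Type: text/html; charset=utf-8');
        readfile($cacheFile); // serve the pre-built copy instead of rendering the page
        exit;
    }
}
// ...otherwise fall through to normal page generation.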

But when I checked my StatCounter logs recently, I saw that other Yandex-related visitors also crawl my site frequently. They look similar to the following entries, taken from my cPanel log:

Beeline (128.69.243.12)
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.2)
Referer: yandex.ru/yandsearch?text=example.com&lr=213

Beeline (89.178.108.247)
Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2)
Referer: yandex.ru/yandsearch?text=example.com&lr=213


How can I block these bots, or serve cached pages to them?

When I check $_SERVER['HTTP_USER_AGENT'], I can't see "yandex.ru" in it, and the referrer comes up empty. Is it possible for the referrer to appear in the cPanel log while I can't get it from HTTP_USER_AGENT?
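For what it's worth, the referrer (when the client sends one) lives in $_SERVER['HTTP_REFERER'], separate from HTTP_USER_AGENT, so one way to compare against the cPanel entries is to log both values. A minimal sketch (the log path is only a placeholder):

<?php
// Sketch: record IP, user agent and referrer so they can be compared with the cPanel log.
$ip      = isset($_SERVER['REMOTE_ADDR'])     ? $_SERVER['REMOTE_ADDR']     : '-';
$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
$referer = isset($_SERVER['HTTP_REFERER'])    ? $_SERVER['HTTP_REFERER']    : '-';

// Message type 3 appends to the given file.
error_log(date('c') . "\t$ip\t$ua\t$referer\n", 3, '/tmp/ua-referer.log');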

Also, I don't want to ban IPs, because too many IPs are involved and they change periodically. So how can I identify this bot?

Does anybody have a similar issue?
Thank you.


1 Comment


@Sue5673885

Use a robots.txt crawl delay, as described at help.yandex.com/search/?id=1112639
Example:

User-agent: Yandex
Crawl-delay: 2 # specifies a 2 second timeout


Before you start banning this bot, you should first verify that the hits in your logs are actually Yandex and not someone else spoofing the user agent to look like Yandex. That is a tactic competitors use to hustle you into blocking or delaying a bot so they can outrank you. Perform a reverse DNS lookup as described at help.yandex.com/search/?id=1112029.
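A minimal PHP sketch of that check (forward-confirmed reverse DNS: resolve the IP to a hostname, make sure it ends in yandex.ru, yandex.net or yandex.com, then resolve the hostname back and confirm it matches the original IP):

<?php
// Sketch of forward-confirmed reverse DNS for a suspected Yandex IP (IPv4).
function is_real_yandex_bot($ip)
{
    $host = gethostbyaddr($ip);                          // reverse lookup, e.g. spider-xx-xx.yandex.com
    if ($host === false || $host === $ip) {
        return false;                                    // malformed IP or no PTR record
    }
    if (!preg_match('/\.yandex\.(ru|net|com)$/i', $host)) {
        return false;                                    // hostname is not in a Yandex domain
    }
    return gethostbyname($host) === $ip;                 // forward lookup must point back to the same IP
}

// Example usage:
if (!is_real_yandex_bot($_SERVER['REMOTE_ADDR'])) {
    // treat the visitor as a possible impostor
}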
You can serve a cached copy depending on the user agent in a number of ways. If you use Apache, you can do it with mod_rewrite rules. If you use PHP, you can do it by sniffing the $_SERVER['HTTP_USER_AGENT'] variable, or even by using the get_browser() function. How you build the cache also varies and can be done in any number of ways. Honestly, though, for the best performance you should always be using caching.
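As a sketch of the get_browser() route (it needs a browscap file configured in php.ini, and the 'crawler' capability is only present in the fuller browscap data sets; the cache path below is a placeholder):

<?php
// Sketch: use browscap data to flag crawlers generically instead of matching one UA string.
// Requires the browscap directive in php.ini.
$caps = get_browser(null, true);   // capabilities of the current user agent, as an array; false if browscap is missing

if (is_array($caps) && !empty($caps['crawler'])) {
    $cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
    if (is_file($cacheFile)) {
        readfile($cacheFile);      // hand crawlers the pre-built copy
        exit;
    }
}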
