How to identify whether the client is a search robot?
I have built my entire site using AJAX (it's GWT, in fact). I have also implemented the AJAX crawling scheme proposed by Google. However, after the implementation, I found that neither Yahoo, Bing, nor Baidu implements that scheme!
I'm wondering if there is a way to identify whether a web client is a search robot. If it is, it will be shown the HTML snapshot I created.
It would be best if I could identify them at the Apache level, so I could just use mod_rewrite. But it's still OK if I can do it in PHP or GWT.
You can check the User-Agent HTTP header. www.user-agents.org/ is a good place to find out which user agents are crawlers.
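If you want to act on that header at the Apache level, a minimal mod_rewrite sketch could look like the following. The crawler tokens and the /snapshots/ path are illustrative assumptions, not a complete list:

    # Sketch for httpd.conf / .htaccess: serve the pre-rendered snapshot
    # when the User-Agent contains a known crawler token.
    # The token list and the /snapshots/ path are assumptions.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Slurp|Baiduspider) [NC]
    # Avoid a rewrite loop once the request already points at a snapshot.
    RewriteCond %{REQUEST_URI} !^/snapshots/
    RewriteRule ^/?(.*)$ /snapshots/$1 [L]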
You can also read more about logging in Apache; for example, you can write requests from a list of user agents (bots) to a separate log.
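A sketch of that idea, using mod_setenvif to tag bot requests and a conditional CustomLog (the tokens and log path are assumptions):

    # Tag requests whose User-Agent matches a crawler token, then write
    # only those requests to a dedicated log file.
    SetEnvIf User-Agent "(Googlebot|bingbot|Slurp|Baiduspider)" is_bot
    CustomLog /var/log/apache2/bots.log combined env=is_bot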
Search engine robots are, as far as the server is concerned, no different from any other user agent. It is worth noting that many search engines (Google in particular) can get unhappy if their robots are served different content than regular visitors, so they tend to use generic, browser-like user agent strings, but usually with an identifying detail buried deeper, as in Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
The best way of detecting such robots is to use an IP filter. You'll need to either compile your own list of crawler IP ranges or rely on a publicly maintained one.
Using such a list should let you handle all the major search engine robots. Adding rewrite rules based on IP is also fairly simple, so it should meet your requirement; just be sure to update the list every once in a while.
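As a rough sketch of an IP-based rule: the 66.249.64.0/19 block is one range Google has published for Googlebot, but a real deployment needs the full, current list for every crawler you care about. The /snapshots/ path is again an assumption:

    # Rewrite requests from a known Googlebot range (66.249.64.0/19,
    # i.e. third octet 64-95) to the snapshot tree.
    RewriteEngine On
    RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
    RewriteCond %{REQUEST_URI} !^/snapshots/
    RewriteRule ^/?(.*)$ /snapshots/$1 [L]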