
AWStats: Visits from IP address vs Crawlers

@Ann8826881

Posted in: #Awstats

I use AWStats in cPanel to see the stats for my website. Under the Hosts section I see one IP address that has visited 150 pages. I doubt that one person using a browser would have visited 150 pages. But if those 150 pages were visited by a software application, shouldn't it be listed under the Robots/Spiders section instead?

So how do I determine if I should block a certain IP address that has visited several hundred pages of my website?

Thanks


@Sherry384

This is right up my alley; however, I use Sawmill and other tools and do not use AWStats. There are better site analytics products out there that are free, and I suggest that you install one of them. It will help you get a better picture of what is going on.

Check out these links: en.wikipedia.org/wiki/Web_analytics http://en.wikipedia.org/wiki/List_of_web_analytics_software

I recommend looking at: www.openwebanalytics.com/ http://piwik.org/ (seems to be the best)

Now to answer your question.

Yes, there are a bunch of bots out there, and it can seem like a full-time job controlling their behavior. But how do you know which ones are bad? That is tough. Generally speaking, begin by researching what search engines there are and what agents, domain names, and IP address blocks they use, so that you become familiar with them. You may want to keep a list. Obviously there is Google, Bing, Yandex, Baidu, and so on, but there are valid smaller ones too, and you will have to decide whether to allow them or not.
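If you do keep such a list, a small script can check a visiting IP against it. This is only a sketch: the crawler names and address ranges below are made-up placeholders (documentation-only ranges), not the real ranges of any search engine, and `identify_crawler` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical list. These are documentation-only ranges (RFC 5737),
# NOT real crawler ranges. Replace them with ranges you have verified.
KNOWN_CRAWLERS = {
    "example-search-engine": ["192.0.2.0/24"],
    "example-archive-bot": ["198.51.100.0/25", "203.0.113.16/28"],
}

def identify_crawler(ip, known=KNOWN_CRAWLERS):
    """Return the first listed crawler whose range contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for name, blocks in known.items():
        for block in blocks:
            if addr in ipaddress.ip_network(block):
                return name
    return None
```

An IP that is not on your list is not automatically bad; it just means you have not researched it yet.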

Each one can be blocked using robots.txt. www.robotstxt.org/ tells you how to use the robots.txt file. Each valid spider/bot will include, as part of its agent name, a URL that explains how to block it using robots.txt.
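To check what a robots.txt file would tell a well-behaved bot, Python's standard `urllib.robotparser` module can evaluate the rules locally. A minimal sketch, assuming a made-up agent name `ExampleBot` and made-up paths:

```python
from urllib.robotparser import RobotFileParser

# Example rules only: ExampleBot is a placeholder agent name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, agent, url):
    """Return True if the given rules permit this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(is_allowed(ROBOTS_TXT, "ExampleBot", "/index.html"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "/index.html"))
```

Keep in mind that robots.txt is only a request. It shows what a polite bot should do, not what any bot actually does.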

Any bad bot will, however, try and just plain spider your site anyway. Many of them are data miners and content scrapers that will use your content in some form or other to monetize their own site. These will give themselves away. Here is what to look for:

Does the agent name contain a referring URL? No = bad.
Does the agent name change over time? Yes = bad.
Does the OS change over time? Yes = bad.
Does the bot access many pages very quickly? Yes = bad. (avg. speed = one access per 2 sec.)
Does the bot obey robots.txt? No = bad.
Does the bot access images? No = bad.
Does the bot access javascript? No = bad. Yes = user.
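Several of these checks can be roughed out from the raw access log. The sketch below assumes the Apache "combined" log format with English month names, reads the speed note above as a threshold of roughly one request per 2 seconds, and treats "never requested robots.txt" as a weak stand-in for "does not obey robots.txt". The function names and sign labels are my own, not part of AWStats or any other tool.

```python
import re
from collections import defaultdict
from datetime import datetime

# Apache "combined" log format (an assumption about your server config).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
# Images, stylesheets and scripts that a real browser normally fetches.
ASSET = re.compile(r"\.(?:png|jpe?g|gif|css|js)(?:\?|$)", re.IGNORECASE)

def summarize(lines):
    """Group requests by IP; lines that do not match are skipped."""
    hosts = defaultdict(
        lambda: {"times": [], "agents": set(), "assets": 0, "robots": False}
    )
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        host = hosts[m["ip"]]
        host["times"].append(
            datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        )
        host["agents"].add(m["agent"])
        if ASSET.search(m["path"]):
            host["assets"] += 1
        if m["path"].startswith("/robots.txt"):
            host["robots"] = True
    return hosts

def warning_signs(host, min_interval=2.0):
    """Return the warning signs (from the checklist above) seen for one host."""
    signs = []
    times = sorted(host["times"])
    if len(times) > 1:
        average_gap = (times[-1] - times[0]).total_seconds() / (len(times) - 1)
        if average_gap < min_interval:
            signs.append("requests faster than one per 2 sec")
    if len(host["agents"]) > 1:
        signs.append("agent name changes")
    if not any("http" in agent for agent in host["agents"]):
        signs.append("no URL in agent name")
    if not host["robots"]:
        signs.append("never requested robots.txt")
    if host["assets"] == 0:
        signs.append("no images or scripts requested")
    return signs
```

Treat the result as hints rather than a verdict. An ordinary browser user will also have no URL in the agent name and will never request robots.txt, so it is the combination of signs, especially speed and missing images or scripts, that is worth a closer look.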

There are more giveaways, but they require more work.

You can also look up the domain name and IP address on the web and get an opinion. This just happens to be part of what my site is about.

Keep in mind that some of these spiders are not necessarily bad. It can be archive.org, or any of the many backlink research sites that try to determine some SEO statistics. It's not all bad. You have to decide which ones you like and which you do not.

Another consideration is this. Many accesses can be hacker landscaping or hack attempts. Landscaping is where the hacker tries to determine what site tools you are using and any vulnerabilities that may exist. Hack attempts are just that. Attempts. There is some blurring between the two and either one is bad.

You will want to become familiar with your web server's blocking methods. For Apache, you would use the .htaccess file. This link will get you started: httpd.apache.org/docs/2.2/howto/htaccess.html
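Once you have decided which addresses to block, the .htaccess rules can be generated rather than typed by hand. This is a sketch, assuming your host allows these directives in .htaccess (that depends on its AllowOverride setting); `htaccess_block` is a hypothetical helper and the addresses are placeholders. It emits `Order`/`Allow`/`Deny` rules for Apache 2.2, or `Require not ip` rules for Apache 2.4.

```python
import ipaddress

def htaccess_block(addresses, apache_24=True):
    """Build .htaccess text that denies the given IPs or CIDR ranges.

    Raises ValueError if an entry is not a valid address or network.
    """
    targets = []
    for entry in addresses:
        net = ipaddress.ip_network(entry, strict=False)
        # Write single hosts without a /32 (or /128) suffix.
        single = net.num_addresses == 1
        targets.append(str(net.network_address) if single else str(net))

    if apache_24:
        lines = ["<RequireAll>", "    Require all granted"]
        lines += [f"    Require not ip {t}" for t in targets]
        lines.append("</RequireAll>")
    else:
        # Apache 2.2 syntax, matching the 2.2 documentation linked above.
        lines = ["Order Allow,Deny", "Allow from all"]
        lines += [f"Deny from {t}" for t in targets]
    return "\n".join(lines) + "\n"

# Placeholder addresses (documentation ranges), not real offenders.
print(htaccess_block(["203.0.113.5", "198.51.100.0/24"], apache_24=False))
```

Check which Apache version your host runs before pasting the output, since the 2.2 directives only work on 2.4 when the compatibility module is loaded.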
Again, this is what my website is about. If you want to give us some example accesses, I can update the answer with something more specific. This happens to be one of the subject areas I research. If I know something, I will let you know.
