Piwik visitor log spam

@Margaret670

Posted in: #Analytics #Matomo #Spam

I use Piwik to track people visiting my website. Unfortunately, a lot of the visitors that appear in the visitor log are not real people but spam bots. Normally, these bots use a provider from a country that differs from my target audience, and they always list a spam site in the referrer column.



Because my website is relatively small, these bots distort statistics (visits over time etc.) and make the visitor map almost useless. Is there anything I can do to block them?





3 Comments


 

@Vandalay111

Since this question was asked, Piwik has gained a feature that by default ignores visits whose referrers are known sources of this kind of referrer spam: piwik.org/blog/2015/05/stopping-referrer-spam/
If you come across new domains, you can submit them to the community-contributed list of referrer spammers: github.com/piwik/referrer-spam-blacklist
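
If you want to check your own raw access log against that list, a minimal Python sketch is shown below. The file names spammers.txt and access.log, and the assumption that the log is in Apache combined format, are illustrative.

import re
from urllib.parse import urlparse

# spammers.txt: a downloaded copy of the blacklist, one domain per line (assumed name)
with open("spammers.txt") as f:
    spam_domains = {line.strip().lower() for line in f if line.strip()}

# In the Apache combined log format, the Referer is the second-to-last quoted field.
referer_re = re.compile(r'"([^"]*)" "[^"]*"$')

with open("access.log") as log:
    for line in log:
        m = referer_re.search(line.rstrip())
        if not m:
            continue
        host = (urlparse(m.group(1)).hostname or "").lower()
        # Flag the blacklisted domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in spam_domains):
            print("referrer spam:", line.rstrip())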



 

@Alves908

If your website has been online for any length of time, you will find abusive bots in your log files. For some sites it is a huge problem, while others experience less abuse.

Being able to edit the log file is an appealing idea, and perhaps there should be a tool for it. Unless you can write code, though, it is not really practical to edit the log file to remove entries, and there is no tool I am aware of that does this for you. Most people instead block these accesses to keep their log files clean.

It is not exactly easy to determine whom to block and when. I work in security research, this is a topic area of mine, and I can tell you it is always a judgement call. But I will give you some clues.

When you look at your log file or log file analysis, you want to look for a few things:


Accesses that do not request images.
Accesses that do not request robots.txt.
Accesses that do not obey robots.txt.
Accesses that occur rapidly, within a time interval that is unlikely for a human.
Accesses that change browser or operating system at any point.


There are more clues of course, but it does get complicated.


A bad bot may or may not request images. The fact that a page view is followed by image requests is not necessarily an indication of a human. However, if accesses never include image requests, it is a bot.
A bad bot may or may not request robots.txt. Just because a bot requests robots.txt does not mean it is a well-behaved bot.
If a bot requests robots.txt and then attempts to access areas restricted by robots.txt, it should be blocked. You can create a small image link to a restricted area. It can be a page, a directory without an index enabled, another image; it does not matter. Just make sure it is something that a human would not likely follow. Do not use a 1-pixel link; make it a small image. If any access to this area occurs, you should block that visitor.
Bad bots often access sites at a pace that is impossible for a human. A human can click links at best just under a second apart. If you see at least three accesses within 2 seconds, it is likely a bot.
Some bad bots change browsers and operating systems over time, though not all do. If this occurs, it is safe to block them. (A sketch of how to scan a log file for these signals follows this list.)
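
Here is a minimal sketch of how these clues could be checked against an Apache combined-format access log. The log path, the /trap/ honeypot path and the three-requests-in-two-seconds threshold are assumptions for illustration; treat the output as candidates to investigate, not an automatic block list.

import re
from collections import defaultdict
from datetime import datetime

LOG = "/var/log/apache2/access.log"   # assumed location of a combined-format log
TRAP_PATH = "/trap/"                  # hypothetical area disallowed in robots.txt
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".svg", ".webp")

# Apache combined format: ip ident user [time] "request" status size "referer" "agent"
line_re = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \S+ \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"')

hits = defaultdict(list)              # ip -> list of (timestamp, path, user agent)

with open(LOG) as f:
    for line in f:
        m = line_re.match(line)
        if not m:
            continue
        t = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        hits[m.group("ip")].append((t, m.group("path"), m.group("agent")))

for ip, rows in hits.items():
    rows.sort(key=lambda r: r[0])
    paths = [p for _, p, _ in rows]
    agents = {a for _, _, a in rows}
    reasons = []
    if not any(p.lower().endswith(IMAGE_EXT) for p in paths):
        reasons.append("never requests images")
    if any(p.startswith(TRAP_PATH) for p in paths):
        reasons.append("entered an area disallowed by robots.txt")
    if len(agents) > 1:
        reasons.append("changed user agent")
    times = [t for t, _, _ in rows]
    # three or more requests inside any two-second window
    if any((times[i + 2] - times[i]).total_seconds() <= 2
           for i in range(len(times) - 2)):
        reasons.append("burst of requests")
    if reasons:
        print(ip, "->", "; ".join(reasons))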


This is an area where you need to use your best judgement. You can Google domain names and IP addresses to see what experiences other people have had and whether anyone else is blocking what you have found. Use the list above to make a judgement for yourself. You will begin to see some patterns:


Bad spiders come from similar bad neighborhoods.
Bad spiders use a block of similar IP addresses.
Bad spiders use subscriber sub-domains from telcos.


It depends on which web server you have, of course. I have not worked with IIS in a long time, nor have I used any of the newer web servers. I know Apache, so I will give some examples that you can use in your .htaccess file if you run Apache (mod_rewrite must be enabled and RewriteEngine On set).

# Deny requests whose remote host name ends in example.com
RewriteCond %{REMOTE_HOST} example\.com$ [NC]
RewriteRule .* - [F,L]


-and-

# Deny requests from a single IP address
RewriteCond %{REMOTE_ADDR} ^10\.0\.1\.101$
RewriteRule .* - [F,L]



 

@Kevin317

You can block them by IP in Piwik:

To exclude all traffic from a given IP or IP range, log in to Piwik as the Super User.
Click on Settings > Websites. Below the list of websites, you will find the option to
specify a "Global list of Excluded IPs". You can define a given IP address, or IP ranges
(132.4.3.* or 143.2.*.* for example) to be excluded from being tracked on all websites.
Each Piwik admin user can also specify the list of IPs or IP ranges to exclude for
specific websites.


piwik.org/faq/how-to/#faq_80
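
If it helps to see what such a wildcard exclusion means, here is a small sketch of one plausible interpretation (wildcard octets match anything, fixed octets must be equal). Piwik applies its own matching internally, so this is only for intuition.

def ip_excluded(ip, patterns):
    """Return True if the dotted-quad IP matches any wildcard pattern."""
    octets = ip.split(".")
    for pattern in patterns:
        parts = pattern.split(".")
        if len(parts) == len(octets) and all(
                p == "*" or p == o for p, o in zip(parts, octets)):
            return True
    return False

print(ip_excluded("132.4.3.77", ["132.4.3.*"]))   # True
print(ip_excluded("143.2.9.9", ["143.2.*.*"]))    # True
print(ip_excluded("132.5.3.77", ["132.4.3.*"]))   # False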

There are probably better programmatic solutions, such as only emitting the tracking code for visitors from specific countries, or not tracking any IP found in the Project Honeypot database.
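
As a rough illustration of the Project Honeypot idea, the sketch below checks a visitor's IPv4 address against the http:BL DNS interface before you decide whether to emit the tracking code. It assumes you have a free http:BL access key and uses the documented query form <key>.<reversed-ip>.dnsbl.httpbl.org; the key and the decision rule here are placeholders.

import socket

HTTPBL_KEY = "yourAccessKey"   # placeholder: a free key from projecthoneypot.org

def looks_like_spam_bot(ip):
    """Return True if the IPv4 address is listed in http:BL as something other than a search engine."""
    query = ".".join([HTTPBL_KEY] + ip.split(".")[::-1] + ["dnsbl.httpbl.org"])
    try:
        answer = socket.gethostbyname(query)
    except socket.gaierror:
        return False               # not listed (NXDOMAIN) or the lookup failed
    octets = answer.split(".")
    # A valid answer is 127.<days>.<threat score>.<type>; type 0 means "search engine".
    return octets[0] == "127" and octets[3] != "0"

# Server side, you would only emit the Piwik tracking snippet when this returns False.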

However, a simpler solution may be to remove the <noscript> part of your tracking code. Robots rarely execute JavaScript, but almost all human visitors do. While you would then no longer track human users who have JavaScript disabled, removing it should increase the overall accuracy.


