Massive 404 attack with non-existent URLs. How to prevent this?

@Gloria169

Posted in: #CrawlErrors #GoogleSearchConsole

The problem is a whole load of 404 errors, as reported by Google Webmaster Tools, for pages and queries that have never existed on the site. One of them is viewtopic.php, and I've also noticed a scary number of attempts to check whether the site runs WordPress (wp_admin) and to find the cPanel login. I already block TRACE, and the server is equipped with some defenses against scanning/hacking, yet this doesn't seem to stop. According to Google Webmaster Tools, the referrer is totally.me.

I have looked for a solution to stop this, because it certainly isn't good for real users, let alone the SEO concerns.

I am using the Perishable Press mini blacklist (found here), a standard referrer blocker (for porn, herbal, and casino sites), and even some software to protect the site (XSS blocking, SQL injection filtering, etc.). The server uses other measures as well, so one would assume that the site is safe (hopefully), but it isn't ending.

Does anybody else have the same problem, or am I the only one seeing this? Is it what I think, i.e., some sort of attack? Is there a way to fix it, or better, to prevent this useless waste of resources?

EDIT
I've never used the question itself to thank people for the answers, but I hope this is acceptable. Thank you all for your insightful replies, which helped me find my way out of this. I have followed everyone's suggestions and implemented the following:


a honeypot
a script that listens for suspect URLs on the custom 404 page and emails me the user agent/IP, while still returning a standard 404 header (a sketch follows below)
a script that rewards legitimate users on the same custom 404 page, in case they end up clicking on one of those URLs

In less than 24 hours I have been able to isolate some suspect IPs, all of them listed in Spamhaus. All the IPs logged so far belong to spammy VPS hosting companies.
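For reference, the 404 script is roughly along these lines; the file name, log path, alert address, and pattern list below are illustrative placeholders rather than my exact setup:

    <?php
    // Custom 404 handler, wired up with something like "ErrorDocument 404 /404.php" in .htaccess.
    // Always send a genuine 404 status so probing bots learn nothing useful.
    header('HTTP/1.0 404 Not Found');

    $uri   = isset($_SERVER['REQUEST_URI'])     ? $_SERVER['REQUEST_URI']     : '';
    $ip    = isset($_SERVER['REMOTE_ADDR'])     ? $_SERVER['REMOTE_ADDR']     : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Request paths typical of vulnerability probes (illustrative list).
    $suspect = array('wp-admin', 'wp-login', 'viewtopic.php', 'cpanel', 'phpmyadmin');

    foreach ($suspect as $needle) {
        if (stripos($uri, $needle) !== false) {
            $line = date('c') . " $ip \"$agent\" $uri\n";
            file_put_contents(dirname(__FILE__) . '/suspect-404.log', $line, FILE_APPEND);
            mail('admin@example.com', 'Suspect 404 probe', $line); // placeholder address
            break;
        }
    }
    ?>
    <h1>Page not found</h1>
    <p>Sorry, that page doesn't exist. <a href="/">Back to the homepage</a>.</p>

The "reward" for legitimate users is simply the friendly HTML at the bottom; the logging and the email only fire when one of the suspect patterns is matched.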


Thank you all again; I would have accepted all answers if I could.


6 Comments


 

@Mendez628

Like many have already said, this is not an attack but an attempt to probe or scan your site/app and/or your server's capabilities. The best way to filter out all of this useless traffic and these potentially dangerous scans is to implement a WAF (Web Application Firewall). It will catch the various attempts, flag them, and only then send real, legitimate, clean traffic to your servers and web app.

You can use a cloud-based DNS WAF or dedicated devices. I personally use Incapsula and F5 ASM for different client sites. Costs are as low as $0 a month, and it helps tremendously. It also gives better protection to your clients and reduces the load on the web servers themselves, which will save you money and increase speed; plus these devices help with PCI 6.6 compliance and provide reviews with reports.

Hope this helps.



 

@Sent6035632

Indeed, it sounds like a bot frenzy. We have been getting hammered as well by thousands of IPs across many hosts, most likely unbeknownst to the operators of those hosts. Before I offer some helpful solutions, one question back that I have is:

Q: How are you seeing 404s from your site as a whole in Google Webmaster Tools? GWT is the output of Googlebot's findings, not the output of other bots. Also, those other bots don't run JS for analytics... do you have some kind of API integration with GWT where you can see your server stats? If not, it may be cause for alarm, since this is Googlebot itself finding the errors.


If these are JUST Googlebot errors, it could indicate that someone has planted links to your site on forums and the like as targets for malicious bots running on real people's PCs. Think of a harvester+planter running on some exploited server, setting up a ton of targets for future "spam contracts" to portal through.
If you do indeed know that it's reporting your full server stats, then you need some tools. A few apps and services may help you trim it down. Assuming you're running a Linux server:


1) Start adding offending IPs to an .htaccess blacklist. It looks like "deny from 192.168.1.1" and will 403 Forbidden them. Don't get carried away; just block the big ones. Check them against the sites in step 4) to make sure they aren't real people's ISPs. You can copy this file and stick it on any account/app, even beyond the firewall (see the sketch after this list).

2) Install APF. It's really easy to manage the firewall via SSH on Linux. As you build the .htaccess list, add the same IPs in APF like so: "apf -d 192.168.1.1". The .htaccess list seems redundant because of APF, but it is portable.

3) Install cPHulk (cPanel's brute force protection) and make sure to whitelist your own IPs so it never locks you out if you forget a password. It will also be a nice source of IPs to add to .htaccess + APF. It has some smarts to it, so it can intelligently mitigate brute force login attempts.

4) Hook up with stopforumspam.com and projecthoneypot.org and get their modules running. Both help a lot to deny known bad requests and to identify and report new brute-forcers, botnets, and spam networks. There are email filters you can use too, but Gmail is owning it when it comes to spam filtering.

5) Since the bots never let up, protect your admin paths. If you run WordPress, change the admin path, add a captcha, etc. If you use SSH, change the login port to something unused, then turn off SSH root login. Create a "radmin" user you must log into first, then su for root.
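To make steps 1, 2, and 5 a bit more concrete, here is roughly what those pieces look like; the IP addresses and the port number are placeholders, not recommendations:

    # .htaccess blacklist (step 1) -- placeholder IPs, Apache 2.2 mod_access syntax
    order allow,deny
    allow from all
    deny from 192.0.2.10
    deny from 203.0.113.0/24

    # the same IPs added to APF from an SSH session (step 2)
    apf -d 192.0.2.10
    apf -d 203.0.113.7

    # /etc/ssh/sshd_config hardening (step 5) -- pick your own unused port
    Port 2222
    PermitRootLogin no

Remember to restart sshd after editing sshd_config, and keep your own IP in APF's allow list before you start denying.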


A note about captchas: if you run your own captcha on a high-volume site and don't deny the bot frenzy at the firewall/.htaccess level, the bots may hammer your CPU cycles with the image generation in all those "antispam" widgets.
A note about load: if you run CentOS on your server and have VPS capabilities, CloudLinux is fantastic for hardening and load control. Say a bot gets through: CageFS is there to limit it to a single account. Say they decide to DDoS: LVE is there to keep the account's (site's) load capped so as not to crash your server. It's a good addition to round out the whole system of "misintent entity management". :)


Just some thoughts; I hope that helps you out.



 

@Ann8826881

There are tons of scripts out there that optimistically scan random IP addresses on the internet looking for known vulnerabilities in various kinds of software. 99.99% of the time they find nothing (as on your site), and the other 0.01% of the time the script will pwn the machine and do whatever the script controller wants. Typically, these scripts are run by anonymous botnets from machines that have previously been pwned, not from the actual machine of the original script kiddie.

What should you do?


Make sure that your site is not vulnerable. This requires constant vigilance.
If this generates so much load that normal site performance is impacted, add an IP-based blocking rule to stop accepting connections from the offending addresses.
Learn to filter out scans for CMD.EXE or cPanel or phpMyAdmin or tons of other vulnerabilities when looking through your server logs (see the example below).
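For instance, something along these lines hides the most common probe noise while you read an Apache access log; the log path and the pattern list are just examples to adapt:

    # hide the usual probe noise while reading the access log (path and patterns are examples)
    grep -viE 'cmd\.exe|wp-login|wp-admin|phpmyadmin|cpanel|viewtopic\.php' /var/log/apache2/access.log | less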


You seem to believe that any 404 returned from your server to anyone will impact what Google thinks about your site. This is not true. Only 404s returned by Google crawlers, and perhaps Chrome users, will affect your site. As long as all links on your site are proper links, and you don't invalidate links you have previously exposed to the world, you will not see any impact. The script bots don't talk to Google in any way.

If you are getting attacked in a real way, you will need to sign up with some kind of DoS mitigation provider. Verisign, Neustar, CloudFlare, and Prolexic are all vendors that have various kinds of plans for various kinds of attacks -- from simple web proxying (which may even be free from some providers), to DNS-based on-demand filtering, to full BGP-based point-of-presence swings that send all of your traffic through "scrubbing" data centers with rules that mitigate attacks.

But from what you're saying, it sounds like you're just seeing the normal vulnerability scripts that any IP on the Internet will see if it's listening on port 80. You can literally put up a new machine, start an empty Apache, and within a few hours you'll start seeing those lines in the access log.



 

@Chiappetta492

Explanation of the problem

First of all, you are not the only one having this problem - everyone is. What you have seen is the result of automated bots crawling every IP address and looking for common vulnerabilities. They basically try to find out what software you are using, and if you use phpMyAdmin, for example, they will later try a bunch of standard username/password combinations.

I am surprised that you have only noticed this sort of thing now (maybe you just started your server). The problem is that you cannot block their IP addresses forever (most probably each one is an infected computer whose actual user is not aware of what it is doing, and there are a lot of such IPs).

SEO effect

It has no effect at all. It just means that someone tried to access something on your server and it was not there.

Does it really matter?

Sure, these people are probing you for weaknesses. Moreover, they are wasting your resources (your server needs to react in some way) and polluting your log files.

How should I fix it?

I had the same problem and tried to fix it; the best tool I was able to find (weighing simplicity of use against what I can do with it) is fail2ban.

You are also in luck, because I have already found a way to fix this same problem and even documented it here (so you do not need to figure out how to install it and how to make it work). Check my question on ServerFault. But please read a little bit about fail2ban to know how it works.
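As a rough illustration only (not the exact configuration from my ServerFault question), a fail2ban filter plus jail for this kind of probing could look like the following; the filter name, log path, and regex are assumptions you would adapt to your own server:

    # /etc/fail2ban/filter.d/apache-probe.conf  (hypothetical filter name)
    [Definition]
    failregex = ^<HOST> .* "(GET|POST) [^"]*(wp-login\.php|wp-admin|phpmyadmin|viewtopic\.php)[^"]*" 404
    ignoreregex =

    # appended to /etc/fail2ban/jail.local
    [apache-probe]
    enabled  = true
    filter   = apache-probe
    port     = http,https
    logpath  = /var/log/apache2/access.log
    maxretry = 3
    bantime  = 86400

After a few matching 404 probes from the same address, fail2ban bans that IP at the firewall for a day, which stops the resource waste without you having to maintain the list by hand.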



 

@Candy875

This probably isn't actually an attack but a scan or probe.

Depending on the scanner/prober, it might be benign, meaning it is just looking for issues in some kind of research capacity, or it could have a function to automatically attack if it finds an opening.

Web browsers send valid referrer information, but other programs can simply make up whatever referrer they like.

The referrer is just a piece of information that is optionally provided by programs accessing your web site. It can be anything they choose to set it to, such as totally.me or random.yu. It can even be a real web site that they simply selected.
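For example, anyone can claim an arbitrary referrer with a one-line request; the URLs below are placeholders:

    # curl's -e/--referer flag sets the Referer header to whatever you like
    curl -e "http://totally.me/some-page" http://example.com/viewtopic.php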

You can't really fix or prevent this. If you tried to block every request of this type, you would end up having to maintain a very large list, and it isn't worth it.

As long as your host keeps up with patches and prevents vulnerabilities, this should not cause you any actual problems.



 

@Heady270

I often see another site linking to tons of pages on my site that don't exist. Even if you click through to that page and don't see the link:


The site might previously have had those links
The site may be cloaking and serving those links only to Googlebot and not to visitors


It is a waste of resources, but it won't confuse Google and it won't hurt your rankings. Here is what Google's John Mueller (who works on Webmaster Tools and Sitemaps) has to say about 404 errors that appear in Webmaster tools:


HELP! MY SITE HAS 939 CRAWL ERRORS!!1

I see this kind of question several times a week; you’re not alone - many websites have crawl errors.


404 errors on invalid URLs do not harm your site’s indexing or ranking in any way. It doesn’t matter if there are 100 or 10 million, they won’t harm your site’s ranking. googlewebmastercentral.blogspot.ch/2011/05/do-404s-hurt-my-site.html

In some cases, crawl errors may come from a legitimate structural issue within your website or CMS. How do you tell? Double-check the origin of the crawl error. If there's a broken link on your site, in your page's static HTML, then that's always worth fixing. (thanks +Martino Mosna)

What about the funky URLs that are “clearly broken?” When our algorithms like your site, they may try to find more great content on it, for example by trying to discover new URLs in JavaScript. If we try those “URLs” and find a 404, that’s great and expected. We just don’t want to miss anything important (insert overly-attached Googlebot meme here). support.google.com/webmasters/bin/answer.py?answer=1154698

You don’t need to fix crawl errors in Webmaster Tools. The “mark as fixed” feature is only to help you, if you want to keep track of your progress there; it does not change anything in our web-search pipeline, so feel free to ignore it if you don’t need it. support.google.com/webmasters/bin/answer.py?answer=2467403

We list crawl errors in Webmaster Tools by priority, which is based on several factors. If the first page of crawl errors is clearly irrelevant, you probably won’t find important crawl errors on further pages. googlewebmastercentral.blogspot.ch/2012/03/crawl-errors-next-generation.html

There’s no need to “fix” crawl errors on your website. Finding 404’s is normal and expected of a healthy, well-configured website. If you have an equivalent new URL, then redirecting to it is a good practice. Otherwise, you should not create fake content, you should not redirect to your homepage, you shouldn’t robots.txt disallow those URLs -- all of these things make it harder for us to recognize your site’s structure and process it properly. We call these “soft 404” errors. support.google.com/webmasters/bin/answer.py?answer=181708

Obviously - if these crawl errors are showing up for URLs that you care about, perhaps URLs in your Sitemap file, then that’s something you should take action on immediately. If Googlebot can’t crawl your important URLs, then they may get dropped from our search results, and users might not be able to access them either.


