
Why is Google downloading binaries from my web site and using bandwidth?

@Connie744

Posted in: #Bandwidth #Google #Proxy

Since about mid-August 2014, several Google servers have been downloading all of the (very) large binary files on my web site, about once a week. The IPs all show as owned by Google, and look like this: google-proxy-66-249-88-199.google.com. These are GET requests, and they are greatly affecting my server traffic.

Prior to this, I didn't see any traffic from these Google proxy IPs, so this seems to be something relatively new. I do see all kinds of traffic from other Google IPs, all of them googlebot and HEAD requests only.

I wouldn't be worried about this except that all of these files are being downloaded by Google about every week or so. The bandwidth used is starting to get excessive.

I've speculated that since many of these files are Windows executables, perhaps Google is downloading them to perform malware scans. Even if that's true, does that really need to happen every week?

Example traffic from google proxy IPs in November so far:

google-proxy-64-233-172-95.google.com: 8.09 GB
google-proxy-66-102-6-104.google.com: 7.50 GB
google-proxy-66-249-83-245.google.com: 3.35 GB
google-proxy-66-249-84-131.google.com: 1.54 GB
google-proxy-66-249-83-131.google.com: 4.98 GB
google-proxy-66-249-83-239.google.com: 2.48 GB
google-proxy-66-249-88-203.google.com: 2.94 GB
google-proxy-66-249-88-201.google.com: 2.58 GB
google-proxy-66-249-88-199.google.com: 4.89 GB
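
Per-host totals like the ones above can be reproduced from the raw access log. The following is a minimal sketch, assuming an Apache "combined" log format with hostname lookups enabled (so the first field is a hostname, not an IP); the regular expression and the function name `bytes_by_proxy_host` are illustrative and not taken from the original post.

```python
import re
from collections import defaultdict

# Assumed layout: Apache "combined" log format with HostnameLookups on,
# so the first field is the client's hostname rather than its IP.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def bytes_by_proxy_host(lines):
    """Sum response bytes per google-proxy-* host, counting GET requests only."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match["method"] != "GET":
            continue
        host = match["host"]
        if not (host.startswith("google-proxy-") and host.endswith(".google.com")):
            continue
        # A size of "-" means no body was sent, so there is nothing to add.
        if match["size"] != "-":
            totals[host] += int(match["size"])
    return dict(totals)


if __name__ == "__main__":
    # "access.log" is a placeholder path.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for host, total in sorted(bytes_by_proxy_host(log).items()):
            print(f"{host}: {total / 1024 ** 3:.2f} GB")
```

Filtering on the GET method keeps the HEAD requests from the regular googlebot hosts out of the totals.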


Update #1 : I forgot to mention that the files in question are already in the site's robots.txt file. To make sure the robots.txt configuration is working properly, I also used the robots.txt tester in Google Webmaster Tools, which shows that the files are definitely being blocked for all Google bots, with one exception: Adsbot-Google. I'm not sure what that's about either. I also searched Google for some of the files, and they do NOT appear in search results.
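
The same kind of check can be run locally. This is a rough sketch using Python's standard `urllib.robotparser`; the robots.txt content and the `/downloads/` path are hypothetical, since the real file is not shown here. One caveat: Python's parser applies a `User-agent: *` group to every agent, whereas Google's documentation states that AdsBot crawlers ignore wildcard groups and must be named explicitly, which would be consistent with the Adsbot-Google exception reported by the tester.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the site's real file is not shown in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
"""


def is_blocked(robots_txt, user_agent, url):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/downloads/setup.exe", "/index.html"):
        print(path, "blocked:", is_blocked(ROBOTS_TXT, "Googlebot", path))
```

Keep in mind that robots.txt only affects crawlers that choose to honor it; it does nothing for requests relayed on behalf of a user.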

Update #2 : Example: between 5:12am and 5:18am PST on November 17, about half a dozen IPs (all google-proxy) did GETs on all of the binary files in question, 27 in total. On November 4 between 2:09pm and 2:15pm PST, those same IPs did basically the same thing.

Update #3 : At this point it seems clear that although these are valid Google IPs, they are part of Google's proxy service, and not part of Google's web crawling system. Because these are proxy addresses, there's no way to determine where the GET requests are actually originating, or whether they are coming from one place or many. Based on the sporadic nature of the GETs, it doesn't appear that there is anything nefarious going on; it's likely just someone deciding to download all the binaries while using Google's proxy service. Unfortunately, that service seems to be completely undocumented, which doesn't help. From a site administrator's standpoint, proxies are rather annoying. I don't want to block them, because they have legitimate uses. But they can also be misused.
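
One way to confirm that an address really belongs to Google, rather than merely presenting a Google-looking hostname, is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, then resolve that hostname back and check that it returns the same IP. The sketch below assumes the `google-proxy-*.google.com` naming pattern seen in the logs above; the function name `is_google_proxy` and the injectable resolver parameters are illustrative only.

```python
import socket


def is_google_proxy(ip,
                    reverse_lookup=socket.gethostbyaddr,
                    forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a google-proxy-* address.

    The resolver arguments default to the real socket functions and can be
    swapped for stubs when testing without network access.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Assumed naming pattern, taken from the hostnames in the logs.
    if not (hostname.startswith("google-proxy-") and hostname.endswith(".google.com")):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The reverse record alone can be set by whoever controls the IP block,
    # so only trust it if the forward lookup points back to the same IP.
    return ip in addresses


if __name__ == "__main__":
    print(is_google_proxy("66.249.88.199"))
```

This only establishes that the request passed through Google's network; it says nothing about who originated it.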


1 Comments


@Smith883

I did some research for this question and found some interesting things, such as:

1. Is it a fake crawler? -> stackoverflow.com/questions/15840440/google-proxy-is-a-fake-crawler-for-example-google-proxy-66-249-81-131-google-c
Conclusion from the user:


These 'crawlers' are not crawlers but are part of the live website
preview used in the Google search engine.

I have tried this, to show one of my websites in the preview and yes,
there it is, received a blockedIP message.

If you want users to be able to view a preview of your website, you
have to accept these 'crawlers'.

Like others said: "the root domain of that URL is google.com and that
can't be easily spoofed".

Conclusion: You can trust these bot's or crawlers and it is used to
show a preview in google search.


We know the live preview is not downloading your files, so let's jump to question 2.

2. Is it part of Google services? -> Is this Google proxy a fake crawler: google-proxy-66-249-81-131.google.com?

Conclusion:


I think, some people are using Google services (like Google translate,
Google mobile, etc.) for accessing (blocked) websites (in schools
etc.) but also for DOS attacks and similar activity.


My guess is the same as the above: someone is using a Google service, such as Google Translate, to access your files.

If, as you say, the files are already being blocked by the robots.txt, this can only be a manual request.

EDIT: To address the OP's comment in more detail:

Can the crawlers ignore the robots.txt? Yes. Here's a list
I don't think Google does that, which means it could be other bots using Google proxies.

Can it be a bad bot? Yes, and for that I recommend:

.htaccess banning:

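RewriteEngine On
# Match client IPs in 209.133.111.* (dots escaped; REMOTE_ADDR always holds the IP)
RewriteCond %{REMOTE_ADDR} ^209\.133\.111\. [OR]
# Match user agents containing "Spider" or "Slurp"
RewriteCond %{HTTP_USER_AGENT} Spider [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
# Serve X.html to matching clients instead of the requested file
RewriteRule ^.*$ X.html [L]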


This code can ban IPs or user agents.

Or use a Spider Trap, featured here

I still think this is a manual request.
