Is this Google proxy a fake crawler: google-proxy-66-249-81-131.google.com?
Recently I discovered that some variants of a Google proxy have been visiting my sites. I doubt these are legitimate Google crawlers, because they are NOT always behind a proxy (as the hostname suggests) and they identify themselves as a browser. The hostname is formatted like Googlebot's, but with the string 'proxy' added to it.
My PHP blocking class blocks these crawlers, but is it correct to block them? What are they, and are they from Google or are they fake?
Here is some info about one of these crawlers:
BlockedIp Notifier Report - IP:66.249.81.131:: has been blocked
Ticket ID : {EVNT_136877_2013040520130402_33147_10348}
Event type : Access blocked
Event date : 04/05/2013 - 19:17:47 (server date-time)
Event counter : First occurring
Processed url : streambutler.net/
From url : www.google.com/search
Domain : streambutler.net
Domain IP : 95.170.70.213
Visitor IP : 66.249.81.131
Proxy IP : 66.249.81.131
Critical : Yes
Action required : No
Additional information
Problem : Bad Proxy - via 66.249.81.131
Hostname : google-proxy-66-249-81-131.google.com
Block : Yes
Refferer : www.google.com/search
AgentString : Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like G...
Browser : Chrome 22.0.1229
Platform : Linux
Robot : No
Mobile : No
Tablet : No
Console : No
Crawler : No
Agent_type : browser
Agent_name : chrome
Agent_version : 22.0.1229
Os_type : linux
Os_name : linux
Agent_languagetag : en
Status : ok
Request : 66.249.81.131
Languagecode : us
Country : United States
Region : California
City : Mountain View
Zipcode : 94043
Latitude : 37.406
Longitude : -122.079
Timezone : -07:00
Available from : 'http
Areacode : 0
Dmacode : 0
Continentcode : na
Currencycode : USD
Currencysymbol : $
Currencysymbol_utf8 : $
Currencyconverter : 1
Extended : 1
Organization : NULL
Other variants found:
google-proxy-66-249-81-131.google.com (identifies itself as Firefox 6.0?)
google-proxy-66-249-81-148.google.com (tries to access a JavaScript file)
google-proxy-66-249-81-131.google.com
google-proxy-66-249-81-111.google.com (tries to access a JavaScript file)
google-proxy-66-249-81-164.google.com
The first one in the list is a weird one: Firefox 6.0 on Windows 7, with the same IP as in the example above, yet not reported as a proxy in the next log. If it is a mobile proxy, isn't that very strange?
Ticket ID : {EVNT_164838_2013040520130402_33147_10348}
Event type : Access blocked
Event date : 04/05/2013 - 19:19:07 (server date-time)
Event counter : First occurring
Processed url : streambutler.net/
From url : Unknown or direct link
Domain : streambutler.net
Domain IP : 95.170.70.213
Visitor IP : 66.249.81.131
Proxy IP : (not present)
Critical : Yes
Action required : No
Additional information
Problem : Blocked Server IP address (analysis) - 66.249.81.131
Hostname : google-proxy-66-249-81-131.google.com
Block : Yes
Refferer : (direct access)
AgentString : Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 ...
Browser : Firefox 6.0
Platform : Windows 7
Robot : No
Mobile : No
Tablet : No
Console : No
Crawler : No
Agent_type : browser
Agent_name : firefox
Agent_version : 6.0
Os_type : windows
Os_name : windows 7
Agent_languagetag : en
Status : ok
Request : 66.249.81.131
Languagecode : us
Country : United States
Region : California
City : Mountain View
Zipcode : 94043
Latitude : 37.406
Longitude : -122.079
Timezone : -07:00
Available from : 'http
Areacode : 0
Dmacode : 0
Continentcode : na
Currencycode : USD
Currencysymbol : $
Currencysymbol_utf8 : $
Currencyconverter : 1
Extended : 1
Organization : NULL
Does anyone have info about these?
When your servers are getting hit by bots, always Google their IP address before blocking them.
A search for "IP address 66.249.81.131" shows that this is an IP that is owned by Google.
When a search for an IP address doesn't return a company that you want crawling your site, it's most likely safe to block it.
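Another way to check, one that doesn't rely on a web search, is a reverse DNS lookup on the IP followed by a forward lookup on the returned hostname; the two must match and the hostname must end in google.com (or googlebot.com). A minimal PHP sketch of that check (the helper name is mine; the IP is the one from the question):

function is_google_ip($ip)
{
    // Reverse lookup: returns the hostname, the unchanged IP on failure, or false on bad input.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) return false;

    // The PTR record must point into a Google-owned domain.
    if (!preg_match('/\.(google\.com|googlebot\.com)$/i', $host)) return false;

    // The forward lookup of that hostname must resolve back to the same IP.
    return gethostbyname($host) === $ip;
}

var_dump(is_google_ip('66.249.81.131')); // expected: bool(true) if the PTR record is genuine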
I came upon this thread while researching a handful of unusual log entries. They are logged as a Google proxy in the same fashion as in the posted question, but the referer in the IIS log states google.com/search and a UserAgent is included that looks real. However, if this were a real crawler it would not need to imitate a browser agent.
BUT the clincher is that this site is not live and is not in Google's index yet. In fact, I suspected I had downloaded a virus a day or two ago, and I must have hand-typed this complete address while testing the site. So here is someone using a keystroke tracker and following up on all of my activity, apparently trying to hide behind the Google search proxy? I like the hypothesis about the 197 address.
The stem /monitor/getAccount is simply a task endpoint which I hit occasionally to verify a new code build for testing. No user, and no Google crawler, would ever find this:
2018-03-09 06:56:29 10.138.0.4 GET /monitor/getAccount - 80 - 66.249.80.26 Mozilla/5.0+(X11;+Linux+x86_64)+AppleWebKit/537.36+(KHTML,+like+Gecko;+Google+Web+Preview)+Chrome/41.0.2272.118+Safari/537.36 - www.google.com/search app.tru-stats.com 200 0 1236 0 426 31203
I have also found that the Google proxy accessed my website many times (30+) within the very same second:
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
...
and raising my server load. This was strange because in robots.txt I had set:
Crawl-delay: 1
(the crawler should access the site at a maximum frequency of roughly one request per second; Google does NOT ignore this setting).
So I tried to create a PHP script to block any IP (Google's included) that exceeds 30 requests in a second, but I discovered something different. With this code, I was searching for the visitor's IP address:
function get_visitor_ip_address($server)
{
    // Check the proxy-related headers first and fall back to REMOTE_ADDR.
    foreach (array('HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR', 'HTTP_X_FORWARDED', 'HTTP_X_CLUSTER_CLIENT_IP', 'HTTP_FORWARDED_FOR', 'HTTP_FORWARDED', 'REMOTE_ADDR') as $key)
    {
        //if (array_key_exists($key, $_SERVER) === true)
        if ($server->testIp($key))
        {
            // A forwarding header may contain a comma-separated chain of addresses.
            //foreach (explode(',', $_SERVER[$key]) as $ip)
            foreach (explode(',', $server->getEscaped($key)) as $ip)
            {
                $ip = trim($ip); // just to be safe

                // Return the first public (non-private, non-reserved) IPv4 or IPv6 address found.
                if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4 | FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE) !== false) return $ip;
                if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 | FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE) !== false) return $ip;
            }
        }
    }

    return null; // nothing usable found
}
but this code returned a different IP address (usually from the Middle East, Africa, or similar locations, e.g. 197.132.255.244). This is from my PHP logs:
IP address 197.132.255.244 banned at 2013-08-30 01:26:35 for the 1. time exceeding 30 visits in a second, banned for 30 minutes
Interestingly, my Apache server stored the Google proxy IP address in my access logs, not 197.132.255.244. See the Apache logs at the beginning: same date and time, etc. Tested several times.
My PHP script searches for the IP address in several ways; notice the different server parameters in the PHP code:
'HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR', 'HTTP_X_FORWARDED', 'HTTP_X_CLUSTER_CLIENT_IP', 'HTTP_FORWARDED_FOR', 'HTTP_FORWARDED', 'REMOTE_ADDR'
This finds and logs the "correct" IP address, 197.132.255.244 (tested several times with various attackers):
whois.domaintools.com/197.132.255.244
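For reference, here is a minimal sketch of the kind of ban logic described above, reusing the get_visitor_ip_address() function and the $server request wrapper shown earlier; the file-based counter, the paths, and the 30-hits-per-second threshold are only illustrative:

$ip        = get_visitor_ip_address($server);   // function shown above
$threshold = 30;                                 // maximum hits allowed per second
$counter   = sys_get_temp_dir() . '/hits_' . md5($ip . date('YmdHis'));  // one counter file per IP per second

// Increment the hit counter for this IP and this second.
$hits = (int) @file_get_contents($counter);
file_put_contents($counter, ++$hits);

if ($hits > $threshold) {
    error_log(sprintf('IP address %s banned at %s for exceeding %d visits in a second',
        $ip, date('Y-m-d H:i:s'), $threshold));
    header('HTTP/1.1 429 Too Many Requests');
    exit('you have exceeded our limits');
}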
My conclusion:
I think some people are using Google services (like Google Translate, Google Mobile, etc.) to access blocked websites (in schools, etc.), but also for DoS attacks and similar activity. How?
This way:
www.gmodules.com/ig/proxy?url=http://www.yoursite.com
http://www.google.com/translate?langpair=de|en&u=www.yoursite.com
(replace yoursite.com with your own website)
or in other ways:
www.tech-recipes.com/rx/1322/use_google_proxy_bypass_blocked_site/
I think it's up to you whether you find and block the original IP address (197.132.255.244) with the help of this PHP function, which works even when the attacker is using a Google proxy, and show them a short message such as "you have exceeded our limits" or an empty/error page, as I do...
or whether you block the Google proxy IP (66.249.81.106 or similar), for example directly in the .htaccess file, when the proxy exceeds your allowed limits. You will not block the Google crawler with this, but you may disable the functionality for real users (not attackers) who want to translate your webpage, etc.
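For example, a possible .htaccess rule for that second option (Apache 2.2 style syntax; the range is only an illustration, adjust it to the addresses you actually see in your logs):

# deny the 66.249.81.x proxy range while leaving everything else open
Order Allow,Deny
Allow from all
Deny from 66.249.81.0/24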
FYI, I also received a DoS attack from a Google proxy:
host 66.249.82.43
43.82.249.66.in-addr.arpa domain name pointer google-proxy-66-249-82-43.google.com.
log extracts:
Jun 9 21:19:43 gemelos kernel: PAX: From 66.249.82.43: execution attempt in: (null), 00000000-00000000 00000000
Jun 9 21:19:43 gemelos kernel: PAX: terminating task: /usr/sbin/apache2(apache2):25541, uid/euid: 81/81, PC: 00795f72, SP: b01666ec
Jun 9 21:19:43 gemelos kernel: PAX: bytes at PC: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
Jun 9 21:19:43 gemelos kernel: PAX: bytes at SP-4: a2be3e6c a262fa66 1d620670 00000000 0000000c a3315e34 00000005 1d9ba84c a2be3e6c a2614bb8 1d9baadc 00000003 00010006 a2b222c7 a3316880 a12977a8 1d9baaf0 36183700 1d9ba84c a31a7d26 a1939349
Jun 9 21:19:43 gemelos kernel: grsec: From 66.249.82.43: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/sbin/apache2[apache2:25541] uid/euid:81/81 gid/egid:81/81, parent /usr/sbin/apache2[apache2:29657] uid/euid:0/0 gid/egid:0/0
Jun 10 00:03:40 gemelos kernel: grsec: denied resource overstep by requesting 18 for RLIMIT_NICE against limit 0 for /usr/bin/namecoind[namecoind:27085] uid/euid:105/105 gid/egid:122/122, parent /sbin/init[init:1] uid/euid:0/0 gid/egid:0/0
Jun 10 03:58:05 gemelos kernel: PAX: From 66.249.82.43: execution attempt in: <anonymous mapping>, 00000000-0001f000 00000000
Jun 10 03:58:05 gemelos kernel: PAX: terminating task: /usr/sbin/apache2(apache2):27985, uid/euid: 81/81, PC: (nil), SP: b01666ec
Jun 10 03:58:05 gemelos kernel: PAX: bytes at PC: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
Jun 10 03:58:05 gemelos kernel: PAX: bytes at SP-4: 1d746620 a262fa66 1d6f34c0 00000000 0000000c a3315e34 00000005 1d7bb4dc a2be3e6c a2614bb8 1d7bb76c 00000003 00010006 a2b222c7 a3316880 a07a23c8 1d7bb780 36183700 1d7bb4dc a31a7d26 a1939349
Jun 10 03:58:05 gemelos kernel: grsec: From 66.249.82.43: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/sbin/apache2[apache2:27985] uid/euid:81/81 gid/egid:81/81, parent /usr/sbin/apache2[apache2:29657] uid/euid:0/0 gid/egid:0/0
Got it! These 'crawlers' are not crawlers but are part of the live website preview used in the Google search engine.
I tried this by loading one of my websites in the preview, and yes, there it is: I received a BlockedIP message.
If you want users to be able to view a preview of your website, you have to accept these 'crawlers'.
Like others said: "the root domain of that URL is google.com and that can't be easily spoofed".
Conclusion: you can trust these bots or crawlers; they are used to show a preview in Google Search.
Here is the issue with the UserAgent, and the reason why this is most likely a legitimate crawler:
Web servers can be configured to respond differently depending on any of the headers in a request, including the UserAgent. If Google webbots all looked the same, then I could run a shady website that presents itself as an encyclopedia of useful information to webbots while delivering complete crap to general users with various other UserAgents, the goal being to score highly on all sorts of Google searches. I'm pretty sure Google is smarter than that: it will use all sorts of bots with all sorts of UserAgents to verify content. Google can also use information gleaned from scans of that sort to detect smart sites that flavour their content for different browsers in useful ways.
After all, the root domain of that URL is google.com and that can't be easily spoofed.
Also, accessing JavaScript files is completely normal for a Google webbot; it's looking for more URLs in the JavaScript. You can prevent Google from scanning your scripts by using <script rel="nofollow" src="code.js"></script>.
These are not fake and they are in active use; they are private proxies used by staff members for various manual tasks/audits/reviews and should not be blocked...