Spiders crawling broken links

Posted in: #Apache #CrawlErrors #Debian #Url #WebCrawlers

TL;DR
An unknown bot keeps crawling the same broken (HTTP 400) URL over and over again, each time with a different user agent and from a different country of origin.

The Problem

At least once a week we get a large burst of HTTP 400 errors on our site (our logging alerts us to them). When we check the logs in the morning, there are anywhere between 50 and 200 hits on this single URL: /foo/bar/item/.

What We Know

This URL appears on almost every page of our site (product listings) but is always formed as /foo/bar/item/857398, with an integer item ID on the end. When it's hit without an ID, it correctly returns an HTTP 400 Bad Request.

It seems this is a spider of some sort:

It hits with different user agents, which seem to vary between IE6, Firefox 5 and Opera 8
It hits in small bursts of 2 to 10 requests every 30 minutes
It doesn't run JavaScript, as I can't find any trace of it in Google Analytics
It doesn't request any images linked on the page; the logs just list page after page, with no image requests in between
It's very often proxied through lots of different countries (we use GeoIP to trace as far as possible from the header information)
It doesn't send any HTTP_REFERER header, so we can't trace which page it picked the URL up from
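
The pattern described above (small bursts every 30 minutes, rotating user agents) can be checked directly against the access log. The following is only a rough sketch: it assumes Apache's "combined" log format, and the sample log lines, IP addresses and dates are made up for illustration. The summarize function and its parameters are hypothetical names, not part of any existing tool.

```python
import re
from collections import Counter
from datetime import datetime

# Assumes Apache's "combined" LogFormat; adjust if the server logs differently.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines, path="/foo/bar/item/", window_minutes=30):
    """Count 400 hits on `path` per time window and per user agent.

    `window_minutes` is assumed to divide 60 evenly.
    """
    hits_per_window = Counter()
    hits_per_agent = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["path"] != path or m["status"] != "400":
            continue
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        # Round the timestamp down to the start of its window.
        bucket = ts.replace(minute=ts.minute - ts.minute % window_minutes, second=0)
        hits_per_window[bucket] += 1
        hits_per_agent[m["agent"]] += 1
    return hits_per_window, hits_per_agent

# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [12/Mar/2012:02:01:10 +0000] "GET /foo/bar/item/ HTTP/1.1" 400 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '198.51.100.7 - - [12/Mar/2012:02:03:44 +0000] "GET /foo/bar/item/ HTTP/1.1" 400 512 "-" "Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01"',
    '203.0.113.5 - - [12/Mar/2012:02:31:02 +0000] "GET /foo/bar/item/857398 HTTP/1.1" 200 9120 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    windows, agents = summarize(SAMPLE)
    for start, count in sorted(windows.items()):
        print(start.isoformat(), count)
    for agent, count in agents.most_common():
        print(count, agent)
```

If the same few windows and agent strings keep recurring, that would support the idea of a single automated client behind rotating proxies, though it would not prove it.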

We've disallowed this URL in robots.txt via /foo/, because none of that URL subset should be indexable (almost all of it requires login).
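
To confirm that a rule like this really covers the URL, the file can be tested with Python's standard urllib.robotparser. This is a minimal sketch: the robots.txt content below is an assumption about what the rule looks like (the real file isn't shown here), and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the site's real file may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo/
"""

def is_allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/foo/bar/item/"))
    print(is_allowed(ROBOTS_TXT, "http://www.example.com/"))
```

Keep in mind that robots.txt is purely advisory. A well-behaved crawler will honour the rule, but a bot that ignores the file will keep requesting the URL regardless.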

I'm lost after that. It's still hitting this same URL over and over. My guess is that it picks the URL up from each individual page and tries to fetch it every time; there doesn't seem to be any intelligence in remembering which URLs don't work.

I know this is almost impossible to stop, as it's a public-facing website that anyone can access, but does anyone have any suggestions?

I also can't understand what they're achieving with such an inefficient crawling algorithm. Could this be some other kind of bot?

Update

Here is the $_SERVER dump, with identifying information redacted; everything else is intact.

$_SERVER=array (
'REDIRECT_AC_HEADERS' => '',
'REDIRECT_SCRIPT_URL' => '/foo/bar/item/',
'REDIRECT_SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
'REDIRECT_STATUS' => '200',
'AC_HEADERS' => '',
'SCRIPT_URL' => '/foo/bar/item/',
'SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
'HTTP_HOST' => 'www.example.com',
'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01',
'HTTP_ACCEPT' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'HTTP_COOKIE' => 'frontend=sfasdfasdfasdfasdfasdfdsf; frontend=sdfasdfasdfasdfasdfa',
'HTTP_VIA' => '1.1 localhost',
'HTTP_CONNECTION' => 'Keep-Alive',
'PATH' => '/usr/local/bin:/usr/bin:/bin',
'SERVER_SIGNATURE' => '<address>Apache/2.2.16 (Debian) Server at example.com Port 80</address>
',
'SERVER_SOFTWARE' => 'Apache/2.2.16 (Debian)',
'SERVER_NAME' => 'www.example.com',
'SERVER_ADDR' => '**.**.**.**',
'SERVER_PORT' => '80',
'REMOTE_ADDR' => '**.**.**.**',
'DOCUMENT_ROOT' => '/var/www/example.com/website/',
'SERVER_ADMIN' => 'webmaster@example.com',
'SCRIPT_FILENAME' => '/var/www/example.com/website/index.php',
'REMOTE_PORT' => '51735',
'REDIRECT_URL' => '/foo/bar/item/',
'GATEWAY_INTERFACE' => 'CGI/1.1',
'SERVER_PROTOCOL' => 'HTTP/1.1',
'REQUEST_METHOD' => 'GET',
'QUERY_STRING' => '',
'REQUEST_URI' => '/foo/bar/item/',
'SCRIPT_NAME' => '/index.php',
'PATH_INFO' => '/foo.bar/item/',
'PHP_SELF' => '/index.php/foo/bar/item/'
)


@Alves908


1 Comment


@Courtney195

Ideally you'd set a 301 Redirect from /foo/bar/item/ to either your homepage or to the main list of products (like a category-type page). This means:

Any robots will be automatically taken to a valid page
Any users will be automatically taken to a usable page
Your error log should be much cleaner
Search engines will stop picking up broken pages

If you have Webmaster Tools set up then these probably show under Crawl Errors, so you could click the "Linked From" tab and see if any pages have linked to that URL directly and fix the links. Even after fixing any broken links, the 301 Redirect option is still worth keeping in place.
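
As a rough illustration of the rule being suggested, here is a small Python sketch of the routing decision only. On the real site this would be done in the web server or application configuration rather than like this. The route function is a hypothetical name, and /products/ is a placeholder for whatever listing page you choose as the redirect target.

```python
import re

ITEM_WITH_ID = re.compile(r"/foo/bar/item/\d+")
ITEM_WITHOUT_ID = re.compile(r"/foo/bar/item/?")
FALLBACK = "/products/"  # hypothetical listing page, not named in the question

def route(path):
    """Return (status, location) for item URLs, or None for any other path."""
    if ITEM_WITH_ID.fullmatch(path):
        return 200, None      # valid item URL: serve the page as usual
    if ITEM_WITHOUT_ID.fullmatch(path):
        return 301, FALLBACK  # missing ID: redirect permanently instead of a 400
    return None               # not an item URL: leave to normal handling

if __name__ == "__main__":
    for p in ("/foo/bar/item/857398", "/foo/bar/item/", "/about"):
        print(p, route(p))
```

Only the ID-less form is redirected, so valid item pages and every other URL behave exactly as before.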
