Mobile app version of vmapp.org

Spiders crawling broken links

@Alves908

Posted in: #Apache #CrawlErrors #Debian #Url #WebCrawlers

TL;DR
An unknown bot keeps crawling the same broken (HTTP 400) URL over and over again, each time with a different user agent and a different country of origin.

The Problem

At least once a week we get a large burst of HTTP 400 errors on our site (we have logging that alerts us). When we check the logs in the morning, there are anywhere between 50 and 200 hits on the single URL /foo/bar/item/.

What We Know

This URL appears on almost every page of our site (product listings) but is always formed as /foo/bar/item/857398, with an integer item ID on the end. When it's hit without an ID, it correctly returns an HTTP 400 (Bad Request).

It seems this is a spider of some sort:


It hits with different user agents, varying between IE6, Firefox 5 and Opera 8
It hits in small bursts of 2 to 10 requests every 30 minutes
It doesn't run JavaScript, as I can't find any trace of it in Google Analytics
It doesn't request any of the images linked on the page; the logs just list page after page, with no image requests in between
It's very often proxied through lots of different countries (we use GeoIP to trace as far as possible from the header information)
It doesn't send a Referer header (HTTP_REFERER), so we can't trace which page it picked the URL up from
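The traits above can be checked against the access log directly. Below is a minimal Python sketch that pulls out requests for the bare URL and groups them into bursts. It assumes Apache's "combined" log format; the sample lines, IP addresses (documentation ranges) and the 5-minute gap are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Apache "combined" log format (an assumption; adjust to your LogFormat).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

BARE_URL = "/foo/bar/item/"  # the ID-less URL from the question


def bare_hits(lines):
    """Yield (timestamp, ip, user_agent) for each request to the bare URL."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == BARE_URL:
            ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            yield ts, m.group("ip"), m.group("agent")


def bursts(hits, gap=timedelta(minutes=5)):
    """Group hits into bursts; a new burst starts after a pause longer than `gap`."""
    groups = []
    for hit in sorted(hits):
        if groups and hit[0] - groups[-1][-1][0] <= gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


# Hypothetical sample data, not real log entries.
sample = [
    '192.0.2.10 - - [10/Oct/2011:03:00:01 +0000] "GET /foo/bar/item/ HTTP/1.1" 400 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '198.51.100.7 - - [10/Oct/2011:03:00:04 +0000] "GET /foo/bar/item/ HTTP/1.1" 400 512 "-" "Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01"',
    '192.0.2.10 - - [10/Oct/2011:03:00:09 +0000] "GET /foo/bar/item/857398 HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '203.0.113.5 - - [10/Oct/2011:03:30:02 +0000] "GET /foo/bar/item/ HTTP/1.1" 400 512 "-" "Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01"',
]

hits = list(bare_hits(sample))
agents = Counter(agent for _, _, agent in hits)
for group in bursts(hits):
    print(group[0][0].isoformat(), len(group), "requests")
```

Counting distinct IPs and user agents per burst should show whether the bursts share a pattern (for example, the same small pool of user agents) even though the source addresses change.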


We've disallowed this URL in robots.txt via /foo/, because nothing under that path should be indexable (almost all of it requires login).
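To confirm that the rule covers the URL for crawlers that do honor robots.txt, the file can be checked with Python's standard urllib.robotparser. This is only a sketch: the rules below are assumed from the description (the real robots.txt is not shown), and /products/ is a hypothetical public path.

```python
from urllib.robotparser import RobotFileParser

# Rules assumed from the description; the actual file may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="*"):
    """Return True if `agent` may fetch `path` according to the parsed rules."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("/foo/bar/item/"))        # bare URL
print(allowed("/foo/bar/item/857398"))  # URL with an item ID
print(allowed("/products/"))            # hypothetical public path
```

A passing check only says the rule is written correctly. robots.txt is advisory, so a bot that ignores it, as this one appears to, will keep requesting the URL regardless.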

I'm lost after that. It's still hitting this same URL over and over. I'm guessing it picks the URL up from each individual page and tries to fetch it every time; there doesn't seem to be any intelligence in remembering which URLs don't work.

I know this is almost impossible to stop, as it's a public-facing website that anyone can access, but does anyone have any suggestions?

I also can't see what they gain from such an inefficient crawling algorithm. Could this be some other kind of bot?

Update

Here is the $_SERVER dump, with identifying information redacted; everything else is intact.

$_SERVER = array (
  'REDIRECT_AC_HEADERS' => '',
  'REDIRECT_SCRIPT_URL' => '/foo/bar/item/',
  'REDIRECT_SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
  'REDIRECT_STATUS' => '200',
  'AC_HEADERS' => '',
  'SCRIPT_URL' => '/foo/bar/item/',
  'SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
  'HTTP_HOST' => 'www.example.com',
  'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01',
  'HTTP_ACCEPT' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'HTTP_COOKIE' => 'frontend=sfasdfasdfasdfasdfasdfdsf; frontend=sdfasdfasdfasdfasdfa',
  'HTTP_VIA' => '1.1 localhost',
  'HTTP_CONNECTION' => 'Keep-Alive',
  'PATH' => '/usr/local/bin:/usr/bin:/bin',
  'SERVER_SIGNATURE' => '<address>Apache/2.2.16 (Debian) Server at example.com Port 80</address>
',
  'SERVER_SOFTWARE' => 'Apache/2.2.16 (Debian)',
  'SERVER_NAME' => 'www.example.com',
  'SERVER_ADDR' => '**.**.**.**',
  'SERVER_PORT' => '80',
  'REMOTE_ADDR' => '**.**.**.**',
  'DOCUMENT_ROOT' => '/var/www/example.com/website/',
  'SERVER_ADMIN' => 'webmaster@example.com',
  'SCRIPT_FILENAME' => '/var/www/example.com/website/index.php',
  'REMOTE_PORT' => '51735',
  'REDIRECT_URL' => '/foo/bar/item/',
  'GATEWAY_INTERFACE' => 'CGI/1.1',
  'SERVER_PROTOCOL' => 'HTTP/1.1',
  'REQUEST_METHOD' => 'GET',
  'QUERY_STRING' => '',
  'REQUEST_URI' => '/foo/bar/item/',
  'SCRIPT_NAME' => '/index.php',
  'PATH_INFO' => '/foo.bar/item/',
  'PHP_SELF' => '/index.php/foo/bar/item/'
)


1 Comment


@Courtney195

Ideally you'd set up a 301 redirect from /foo/bar/item/ to either your homepage or your main list of products (a category-type page). This means:


Any robots will be automatically taken to a valid page
Any users will be automatically taken to a usable page
Your error log should be much cleaner
Search engines will stop picking up broken pages


If you have Webmaster Tools set up, these errors probably show under Crawl Errors, so you could click the "Linked From" tab to see whether any pages link to that URL directly, and fix those links. Even after fixing any broken links, the 301 redirect is still worth keeping in place.
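One detail matters when writing the redirect: it should match only the bare URL, so that valid item pages such as /foo/bar/item/857398 are left alone. The matching rule can be sketched in Python as below. This is an illustration of the rule, not the server configuration itself; on Apache the same pattern would typically go into a RedirectMatch (mod_alias) or mod_rewrite rule. The homepage target "/" is one of the two options mentioned above, and the function name is made up for this example.

```python
import re

# Matches only the bare URL, with or without the trailing slash.
BARE_ITEM_URL = re.compile(r"/foo/bar/item/?")

REDIRECT_TARGET = "/"  # homepage; a product-list page would work equally well


def redirect_for(path):
    """Return the 301 target for `path`, or None if it should be served normally."""
    return REDIRECT_TARGET if BARE_ITEM_URL.fullmatch(path) else None


for path in ("/foo/bar/item/", "/foo/bar/item", "/foo/bar/item/857398"):
    print(path, "->", redirect_for(path))
```

Using a full match rather than a prefix match is the design choice here: a prefix rule on /foo/bar/item/ would also redirect every real product page.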
