Spiders crawling broken links
TL;DR
An unknown bot keeps crawling the same broken (HTTP 400) URL over and over again, each time with a different User-Agent and from a different country of origin.
The Problem
At least once a week our site gets a large burst of HTTP 400 errors (we have logging that alerts us). When we check the logs in the morning, there are anywhere between 50 and 200 hits on a single URL: /foo/bar/item/.
What We Know
This URL appears on almost every page of our site (product listings), but always in the form /foo/bar/item/857398, with an integer item ID on the end. When it's requested without an ID, it correctly returns an HTTP 400 (Bad Request).
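The check described above could look roughly like the following. This is only a minimal sketch: the function name `itemIdFromPath` is hypothetical, and it assumes the front controller validates the request path with a regular expression, which the question does not confirm.

```php
<?php
/**
 * Hypothetical helper: extract the integer item ID from a path such as
 * /foo/bar/item/857398. Returns null when the ID is missing or is not a
 * plain integer, which is the case that should be answered with HTTP 400.
 */
function itemIdFromPath(string $path): ?int
{
    // Accept only digits after /foo/bar/item/, with an optional trailing slash.
    if (preg_match('#^/foo/bar/item/(\d+)/?$#', $path, $matches) === 1) {
        return (int) $matches[1];
    }

    return null;
}
```

A caller would send `http_response_code(400)` whenever the function returns null, which is exactly what happens for the bare /foo/bar/item/ path.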
It seems this is a spider of some sort:
It hits with different user agents, varying between IE6, Firefox 5 and Opera 8
It hits in small bursts of 2 to 10 requests every 30 minutes
It doesn't run JavaScript, as I can't find any trace of it in Google Analytics
It doesn't request any images linked on the page; the logs just list page after page, with no image requests in between
It's very often proxied through lots of different countries (we use Geo IP to trace as far as possible from the header information)
It doesn't send an HTTP_REFERER header, so we can't trace which page it picked the URL up from
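To keep these hits apart from genuine errors in the logs, the traits listed above can be turned into a simple check. The following is a rough sketch rather than a reliable bot detector: the function names `isBareItemProbe` and `probeLogFields` are hypothetical, and the only signals used are the bare URL and the missing Referer header.

```php
<?php
/**
 * Hypothetical helper: true when a request matches the pattern described
 * above, i.e. the bare item URL requested without a Referer header.
 * $server is expected to look like PHP's $_SERVER array.
 */
function isBareItemProbe(array $server): bool
{
    $path = parse_url($server['REQUEST_URI'] ?? '', PHP_URL_PATH);
    $hasReferer = !empty($server['HTTP_REFERER']);

    return $path === '/foo/bar/item/' && !$hasReferer;
}

/**
 * Hypothetical helper: collect the fields worth logging so that separate
 * bursts can be compared later (IP, user agent, proxy Via header).
 */
function probeLogFields(array $server): array
{
    return [
        'ip'         => $server['REMOTE_ADDR'] ?? '',
        'user_agent' => $server['HTTP_USER_AGENT'] ?? '',
        'via'        => $server['HTTP_VIA'] ?? '',
    ];
}

// Example use inside the existing 400 handler.
if (isBareItemProbe($_SERVER)) {
    error_log('bare-item probe: ' . json_encode(probeLogFields($_SERVER)));
}
```

Tagging the log line this way makes it easier to count the probes per burst and to see whether the Via header or IP ranges repeat.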
We've disallowed /foo/ in robots.txt, which covers this URL, because nothing under that path should be indexable (almost all of it requires login).
I'm lost after that. It's still hitting this same URL over and over. I'm guessing it picks the URL up from each individual page and tries to fetch it every time; there doesn't seem to be any intelligence in remembering which URLs don't work.
I know this is almost impossible to stop, as it's a public-facing website that anyone can access, but does anyone have any suggestions?
I also can't understand what they're achieving with such an inefficient crawling algorithm. Or could this be some other kind of bot?
Update
Here is the $_SERVER dump, with identifying information redacted; everything else is intact.
$_SERVER=array (
'REDIRECT_AC_HEADERS' => '',
'REDIRECT_SCRIPT_URL' => '/foo/bar/item/',
'REDIRECT_SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
'REDIRECT_STATUS' => '200',
'AC_HEADERS' => '',
'SCRIPT_URL' => '/foo/bar/item/',
'SCRIPT_URI' => 'http://www.example.com/foo/bar/item/',
'HTTP_HOST' => 'www.example.com',
'HTTP_USER_AGENT' => 'Mozilla/5.0 (Windows NT 5.1; U; en) Opera 8.01',
'HTTP_ACCEPT' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'HTTP_COOKIE' => 'frontend=sfasdfasdfasdfasdfasdfdsf; frontend=sdfasdfasdfasdfasdfa',
'HTTP_VIA' => '1.1 localhost',
'HTTP_CONNECTION' => 'Keep-Alive',
'PATH' => '/usr/local/bin:/usr/bin:/bin',
'SERVER_SIGNATURE' => '<address>Apache/2.2.16 (Debian) Server at example.com Port 80</address>
',
'SERVER_SOFTWARE' => 'Apache/2.2.16 (Debian)',
'SERVER_NAME' => 'www.example.com',
'SERVER_ADDR' => '**.**.**.**',
'SERVER_PORT' => '80',
'REMOTE_ADDR' => '**.**.**.**',
'DOCUMENT_ROOT' => '/var/www/example.com/website/',
'SERVER_ADMIN' => 'webmaster@example.com',
'SCRIPT_FILENAME' => '/var/www/example.com/website/index.php',
'REMOTE_PORT' => '51735',
'REDIRECT_URL' => '/foo/bar/item/',
'GATEWAY_INTERFACE' => 'CGI/1.1',
'SERVER_PROTOCOL' => 'HTTP/1.1',
'REQUEST_METHOD' => 'GET',
'QUERY_STRING' => '',
'REQUEST_URI' => '/foo/bar/item/',
'SCRIPT_NAME' => '/index.php',
'PATH_INFO' => '/foo/bar/item/',
'PHP_SELF' => '/index.php/foo/bar/item/'
)
1 Comment
Ideally you'd set up a 301 redirect from /foo/bar/item/ to either your homepage or the main list of products (such as a category-type page). This means:
Any robots will be automatically taken to a valid page
Any users will be automatically taken to a usable page
Your error log should be much cleaner
Search engines will stop picking up broken pages
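One way to sketch the 301 redirect is in the PHP front controller, before the normal routing runs. The function name `redirectTargetFor` and the target `/products/` are assumptions made for illustration; the real target would be whichever homepage or listing page suits the site, and a redirect rule in the web server configuration would work just as well.

```php
<?php
/**
 * Hypothetical helper: return the redirect target for the bare item URL,
 * or null for every other path (including real item pages with an ID).
 */
function redirectTargetFor(string $path): ?string
{
    // Treat /foo/bar/item and /foo/bar/item/ the same way.
    if (rtrim($path, '/') === '/foo/bar/item') {
        return '/products/'; // assumed listing page, not from the original post
    }

    return null;
}

// Run early in index.php, before the request is routed.
$path = parse_url($_SERVER['REQUEST_URI'] ?? '/', PHP_URL_PATH);
$target = is_string($path) ? redirectTargetFor($path) : null;

if ($target !== null) {
    // The third argument sets the response code, making the redirect permanent.
    header('Location: ' . $target, true, 301);
    exit;
}
```

Keeping the path check in its own function means it can be tested without sending headers, and real item URLs such as /foo/bar/item/857398 pass through untouched.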
If you have Webmaster Tools set up, these requests probably show up under Crawl Errors, so you could click the "Linked From" tab to see whether any pages link to that URL directly, and then fix those links. Even after fixing any broken links, the 301 redirect is still worth keeping in place.