
Looking for explanation of Apache behavior on manipulating headers

@Heady270

Posted in: #403Forbidden #Apache #Htaccess #HttpHeaders

I tried to get an answer at ServerFault, but nobody there knew; all the real gurus seem to be here.

The background of the issue: Googlebot creates non-existent URLs and tries to crawl them. For some URLs Apache responds with 404 (correctly), for others with 403 (wrong). I can't match the URLs where Apache responds with 403 using a regex, so I can't properly rewrite them to force a 404.

I created the following workaround to force a 404 instead of a 403. I added this to .htaccess:

ErrorDocument 403 /404.php
ErrorDocument 404 /404.php


So both cases use the same file.

Then, to force the correct header, I added <?php http_response_code(404); ?> at the beginning of 404.php.
This way I show Googlebot a 404 even where Apache tries to answer with a 403.

The question is: could somebody explain in detail how this workaround actually works? How am I able to manipulate the header this way? I always thought Apache decides which response code to serve before it looks into .htaccess...


1 Comment


@Alves908

how this workaround actually works


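PHP runs later in the request: an ErrorDocument that points to a local URL makes Apache internally redirect the failed request to /404.php, which then runs like any other script. So most of the time your PHP code can simply override any headers (including the status code) that Apache has already set. That's pretty much it.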
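
One way to watch this happen is to log both status codes side by side. The following is only a sketch, assuming you have access to the server or virtual-host config (these directives are not allowed in .htaccess); the format nickname statuscompare and the log file name are made up for illustration. According to the mod_log_config documentation, %s is the status of the original request when there has been an internal redirect, and %>s is the final status, so a request that Apache wanted to refuse should show up as original=403 final=404.

```apache
# Server or virtual-host context only (not .htaccess).
# %s  = status of the original request, before any internal redirect
# %>s = final status that is actually sent to the client
# "statuscompare" and the log file name are hypothetical.
LogFormat "%h %t \"%r\" original=%s final=%>s" statuscompare
CustomLog "logs/status_compare.log" statuscompare
```

A log like this would also give you the exact list of URLs that originally triggered the 403.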

(Aside: Sending 403s through your 404 handler in this way obviously makes it harder to trigger a real 403 from your Apache config/.htaccess, if you should need to.)


most of the time


However, if there is a serious error (things are not working normally) then the server might respond with a 500 Internal Server Error - this is something that you may not be able to trap in your own code.

Also, by default, Apache is configured to return a (system generated) 404 for requests that contain an encoded slash (%2F) - this is also something that you cannot override (without disabling this feature, which is controlled by the AllowEncodedSlashes directive).
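
For reference, a minimal sketch of what changing that behaviour would look like, assuming you can edit the server or virtual-host config (AllowEncodedSlashes is not permitted in .htaccess). Whether NoDecode or On is the right value depends on how your application reads the path, so treat this as an illustration and not a recommendation.

```apache
# Server or virtual-host context only; not permitted in .htaccess.
# Default is Off: URLs containing an encoded slash (%2F) are refused with a 404.
# NoDecode accepts such URLs and leaves the encoded slashes undecoded.
AllowEncodedSlashes NoDecode
```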

There are other situations where Apache will take over (mod_security etc.), but otherwise, if things are running normally, you should be able to manipulate all of the response headers.


I always thought Apache decides which response code to serve before it looks into .htaccess...


It does, but any code in .htaccess will override this. (Provided there are no restrictions in the server config, such as AllowOverride, preventing this.)
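
As an illustration of the kind of restriction meant here: ErrorDocument (and mod_rewrite directives) in .htaccess are only honoured if the server config allows the FileInfo override for that directory. A sketch, where the directory path is a placeholder and not taken from the question:

```apache
# Server or virtual-host context.
# "/var/www/example" is a hypothetical document root.
<Directory "/var/www/example">
    # FileInfo permits ErrorDocument and RewriteRule in .htaccess;
    # with "AllowOverride None" the .htaccess file is ignored entirely.
    AllowOverride FileInfo
</Directory>
```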


Googlebot creates non-existent URLs and tries to crawl them.


A lot of people see this behaviour. However, I don't think Googlebot is "creating" these URLs out of nowhere. It is more likely that these URLs are being found somewhere. (Or it's not actually a real Googlebot.)


For some URLs Apache responds with 404 (correctly), for others with 403 (wrong). I can't match the URLs where Apache responds with 403 using a regex, so I can't properly rewrite them to force a 404.


Apache will trigger a 403 when requesting a directory that doesn't contain an index document (mod_dir finds no DirectoryIndex file) and where server-generated directory indexes (mod_autoindex) are forbidden (hence the "403 Forbidden" response). mod_dir will also try to "fix" these URLs by appending a trailing slash (if omitted) - you will not be able to match the URL unless you include the trailing slash in your pattern (mod_dir fires early). So, this does sound like it might be a mod_dir issue. However, we'd need to see the URLs in question (and probably ask more questions about the server config / .htaccess files) to check this out.

Unless there is something else going on, you should still be able to trap/rewrite these URLs. Changing all 403s to 404s is not a particularly desirable workaround.
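
If the 403s do turn out to be directory requests, something along these lines in .htaccess might be enough. This is a sketch only, untested against your server: it assumes mod_rewrite is enabled, that .htaccess overrides are allowed, and that index.php and index.html are your only index documents (adjust to match your DirectoryIndex).

```apache
RewriteEngine On

# The request maps to a physical directory...
RewriteCond %{REQUEST_FILENAME} -d
# ...that contains none of the (assumed) index documents
RewriteCond %{REQUEST_FILENAME}/index.php !-f
RewriteCond %{REQUEST_FILENAME}/index.html !-f
# Respond with 404 instead of letting Apache answer 403
RewriteRule ^ - [R=404,L]
```

Because the pattern ^ matches every URL, it does not matter whether the trailing slash is present, and the 404 still goes through your existing ErrorDocument 404. Real 403s from elsewhere in the config are left alone.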
