"Index of /" Pages > Created By Host? > Block from Indexing
I have a small static site hosted on a shared server.
Following Duplicate Title Tag warnings in GWT > HTML Improvements, and a subsequent Google site:domain search, I discovered that pages are being indexed which I did not create. In fact, heaps of pages (far more than my small site contains) are now indexed.
The pages:
- seem to be auto-generated by my host;
- have URLs that correlate to each directory in the site, some with a dynamic part appended at the end in one of several consistent forms ("?MA", "?NA", etc.) and some without the dynamic part (e.g. my_site/assets/css);
- have a title tag of "Index of /[the corresponding URL without the dynamic part]";
- list, in a table format, the assets in that directory (.css, .jpg, etc.) with the headings Last Modified, Size and Description;
- could be a LiteSpeed thing, because every page says "Proudly served by LiteSpeed Web Server ...".
I am guessing I want to block these pages because they:
- don't contain any real content;
- are not mobile friendly; and
- are causing duplicate title tag errors, because several dynamic versions are being created for some directories (perhaps those directories have different image types in them?).
I note that this seemingly related question suggests that blocking might be done in the .htaccess file. It mentions inserting Options -Indexes somewhere in .htaccess.
However, I have a few questions:
I am not sure what is causing this or why it is appearing only now; the site has been around a little while and this is new, though I have made recent changes;
I am not sure if that is all there is to it. Do I simply put that line (Options -Indexes) anywhere in my .htaccess file?
Should I also be adding some sort of robots.txt directive?
Any help would be greatly appreciated.
I recommend the following:
Communicate with your host and see if they can correct the problem on their end at the server level, using httpd.conf or by creating a .htaccess file for you
If not, or if it still poses a problem, create an .htaccess file with just the one statement, namely Options -Indexes, with a blank line following it, and upload it to the root folder
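A minimal .htaccess for this (assuming your host allows Options to be overridden in .htaccess) would be just:

```apache
# Disable mod_autoindex directory listings for this site
Options -Indexes
```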
Either way, also create a sitemap.xml file that you can submit to Google using Google Search Console. The format of the sitemap.xml file in its simplest form is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.domain_name.com/webpage_name.html</loc></url>
</urlset>
ANSWER UPDATED TO AVOID CONFUSION: Thanks to @w3dk, who spurred me to look into this more and give a (slightly) better explanation.
ANSWERS TO MY QUESTIONS (PARAPHRASING):
1. What caused this?
Apache's Indexes option was turned on.
I think I caused this when implementing extensionless URLs, but I am not sure.
In any event, having the Indexes option turned on meant that:
a. when a directory URL (ending in a "/") was requested (in this case by googlebot which does this as part of learning/crawling through a site);
b. and there was no index.html or other expected file (as specified via DirectoryIndex, which tells the server what files to look for when the requested URL is a directory: index.html, index.php, index.htm, etc.);
c. my server would create an index.html file listing the files in the directory (using mod_autoindex) and give it to Googlebot;
d. then Googlebot would in turn add that page (the server-created index.html file) to search, resulting in the unexpected appearance of those pages (mainly from my assets directories, where no index.html lives) in the Google index.
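The chain above hinges on two standard Apache directives, which LiteSpeed also honours. A sketch of the relevant configuration (the file list here is only an example):

```apache
# What the server looks for when a directory URL ("/...") is requested
DirectoryIndex index.html index.php index.htm

# If none of those files exist and this option is on, mod_autoindex
# generates a listing page on the fly and serves it
Options +Indexes
```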
2. How to Fix?
First, ask yourself if you need to turn it off.
If you're not hiding anything and don't care about it, just leave it.
I decided I wanted to turn it off because I thought it might hurt SEO, as it:
- meant Google was indexing pages without much valuable content;
- was causing title tag conflicts; and
- was potentially wasting crawl budget.
But in all likelihood it probably doesn't affect SEO at all. Perhaps having them even helps a little? Who knows?
I didn't see any changes either way.
You turn it off by adding Options -Indexes to your .htaccess file, which essentially says: don't generate an index if one doesn't exist. If you have access to the server config file (I didn't, as I am on shared hosting) you could probably do it there instead.
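In the main server config, the same directive would typically sit inside a Directory block (a sketch; the path here is a hypothetical example):

```apache
<Directory "/var/www/my_site">
    # Suppress auto-generated directory listings under this path
    Options -Indexes
</Directory>
```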
HOWEVER: Note that doing this does not result in a 404 (Not Found) response when those directory URLs (ones without an index.html file in them) are requested.
Somewhat counterintuitively, turning off Indexes makes the server return a 403 (Forbidden) error instead!
I don't know why. The proper response, in my view, is a 404. I haven't told the server "restrict access"; I've said "don't make an index file".
Anyway, the good news is that the 403 (Forbidden) response will cause Google to remove the URL from its index quickly, more quickly than a 404 (Not Found) response.
The bad news is that your custom 404 page may not display for those URLs. Instead, users might get the generic 403 Forbidden error.
If so, you can go back to your .htaccess file and add a new line, ErrorDocument 403 /404.html, which says: if a 403 error is encountered, return the 404.html document located in the root.
Users should now see the custom file.
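Putting the two directives together, the .htaccess would look something like this (assuming your custom error page lives at /404.html):

```apache
# Stop mod_autoindex from generating directory listings
Options -Indexes

# Serve the custom 404 page body for the resulting 403 responses
ErrorDocument 403 /404.html
```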
However, I suspect that this would still return the page with a 403 status code.
I don't know, though; I got lazy and didn't set this up.
I briefly looked for a better solution (stopping the index.html from being generated without triggering a 403 error), but couldn't find one.
If the 404.html is still being returned with a 403 response code, though, and you want to get rid of that, one thought might be to try a specific redirect to the 404.html instead of returning it directly.
That could get cumbersome if you have lots of asset directories without an index.html file in them (like I do).
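One hypothetical way to do this is a mod_alias RedirectMatch rule per directory (the path below is only an example; the anchored regex matches the bare directory URL without also redirecting the files inside it):

```apache
# Redirect only the bare directory URL, not the assets under it
RedirectMatch 302 ^/assets/css/$ /404.html
```

As noted, this means one rule for every asset directory, which is why it gets cumbersome.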
3. Anything else?
I thought about a robots.txt, but there is technically no content to restrict.
Also, the 403 responses removed the URLs pretty quickly, and listed crawl errors apparently don't hurt SEO; they are meant more as a warning of potential problems than anything else.
For the record, I did consider keeping the index pages and using a noindex tag to prevent indexing, but I couldn't figure out how to access the index.html file created by the server.
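For what it's worth, mod_autoindex does offer a directive that injects markup into the head of the generated listing page, which could carry the noindex tag without you ever touching the file (this assumes Apache 2.4+, or a LiteSpeed build that supports the directive, and that the host permits the override):

```apache
# Keep the auto-generated listings but mark them noindex
IndexHeadInsert "<meta name=\"robots\" content=\"noindex\">"
```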
Hope this helps.