Google not indexing site, robots being blocked
I have a small static website that lets customers purchase a PDF containing data based on their input. The PDFs are a page or two long and follow a standard template; various values change based on what the customer searched for, so I expect them to be largely unique per customer (i.e. a 99% chance of a unique PDF, although the majority of the content within each PDF may be similar).
Previously I haven't made the PDFs searchable, but I now want to start including them, so I've written some code that dynamically finds all the available PDFs and generates a page listing them with hyperlinks to open them, using this PHP code embedded in an HTML page:
$directory = "../logs/reports/";
// Collect all PDFs in the reports directory
$files = glob($directory . "*.pdf");
foreach ($files as $file) {
    $filename = basename($file);
    // Link each report through the retrieval script
    echo '<a href="/includes/getreport.php?file=' . urlencode($filename) . '">' . $filename . '</a><br>';
}
My sitemap reflects the new page with a weekly change frequency:
<url>
<loc>https://www.myurl.com/reportlist.html</loc>
<changefreq>weekly</changefreq>
</url>
I initially forgot to add an allow rule to my robots.txt, since I had a disallow on /includes/, where the PHP file that retrieves the report is located. I've since updated it, and it now looks like this:
User-agent: *
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
Allow: /includes/getreport.php
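As a quick local sanity check on rules like these, Python's standard-library `urllib.robotparser` can evaluate a robots.txt against a URL. This is only a sketch: Python's parser applies rules in file order (first match wins), whereas Googlebot prefers the most specific (longest) matching path, so the `Allow` line is moved to the top here to get equivalent behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical local check of the robots.txt rules from the question.
# Note: Python applies rules in file order (first match wins), unlike
# Googlebot, which uses the longest matching path, so Allow comes first.
rules = """\
User-agent: *
Allow: /includes/getreport.php
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A report URL matches the Allow prefix, including its query string
print(rp.can_fetch("*", "https://www.myurl.com/includes/getreport.php?file=report.pdf"))  # True
# Anything else under /includes/ is still disallowed
print(rp.can_fetch("*", "https://www.myurl.com/includes/other.php"))  # False
```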
My Search Console is now showing the following under Index Status,
so the links to the PDFs are being crawled, but they are reported as blocked from crawling by robots.txt. If I check a URL in the robots.txt Tester I get the following.
I originally tried putting the allow statement at the top of the robots.txt, and then at the bottom as in the current setup, and Google appears to have picked up the changes.
If I run Fetch as Bingbot I get the following,
which suggests that robots.txt allows access to the page and the PDF can be retrieved. Although it looks like there is no visible text, I'm not overly worried, as according to webmasters.googleblog.com/2011/09/pdfs-in-google-search-results.html these PDFs may be made searchable via OCR, and I can cut and paste text from them in a PDF reader.
Lastly, if I check my search results via site:mysite.com, I see a number of links with "A description for this result is not available because of this site's robots.txt", so Google seems to think they aren't searchable.
So my question is two-fold. Firstly, why are these appearing as blocked in Search Console? I'm not sure whether waiting longer is the answer: the Index Status is changing and the number of indexed pages is dropping, but the blocked-by-robots count remains the same. Secondly, will indexing this sort of content be problematic?
1 Answer
It appears you're using query strings on a PHP page to generate the file links. Try adding a wildcard to the end of your Allow rule, since the query string forms part of the URI.
Also, have you checked the URL Parameters section to see how Google is treating these variables? You can also explicitly set the behaviour there, to make sure Google understands how to handle the query strings when it's indexing.
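For example, the Allow rule with the trailing wildcard suggested above might look like this (a sketch based on the robots.txt in the question; `*` wildcards are a Googlebot extension, not part of the original robots.txt standard):

```
User-agent: *
Disallow: /includes/
Allow: /includes/getreport.php?*
```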