Google not indexing site, robots being blocked

@Caterina187

Posted in: #GoogleSearchConsole #Pdf #RobotsTxt

I have a small static website which allows customers to purchase a PDF containing some data based on their input. The PDFs are a page or two long and follow a standard template; various values change based on what the customer is searching for, and I expect them to be unique per customer (i.e. a 99% chance of a unique PDF, although the majority of the content within each PDF may be similar).

Previously I haven't made the PDFs searchable, but I want to start including them, so I've written some code that dynamically gets all the available PDFs and generates a page listing them with hyperlinks to open them, using this PHP code embedded in an HTML page:

$directory = "../logs/reports/";

// Collect all PDF files in the reports directory
$files = glob($directory . "*.pdf");

foreach ($files as $file)
{
    $filename = basename($file);
    echo '<a href="/includes/getreport.php?file=' . urlencode($filename) . '">' . $filename . '</a><br>';
}


My sitemap reflects the new page with a weekly change frequency:

<url>
<loc>https://www.myurl.com/reportlist.html</loc>
<changefreq>weekly</changefreq>
</url>


I initially forgot to update my robots.txt with an Allow rule, as I had a Disallow on /includes/ where the PHP file that retrieves the report is located. I've since updated it and it now looks like this:

User-agent: *
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
Allow: /includes/getreport.php
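As an aside, how that Allow line interacts with the Disallow above it depends on the crawler. Google's documented behaviour is to apply the most specific (longest) matching rule, so `Allow: /includes/getreport.php` should win over `Disallow: /includes/` regardless of order, but parsers following the original robots.txt convention use the first matching rule, so order matters to them. Python's standard-library `urllib.robotparser` behaves the first-match way, which makes it a convenient sketch of the difference (the domain and file name below are just placeholders taken from the question):

```python
from urllib import robotparser

URL = "https://www.myurl.com/includes/getreport.php?file=report1.pdf"

def allowed(robots_txt: str) -> bool:
    """Check URL against a robots.txt body using Python's stdlib parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", URL)

# Allow listed after Disallow: a first-match parser stops at
# "Disallow: /includes/" and blocks the URL.
allow_last = """\
User-agent: *
Disallow: /includes/
Allow: /includes/getreport.php
"""

# Allow listed first: the same parser now permits the URL.
allow_first = """\
User-agent: *
Allow: /includes/getreport.php
Disallow: /includes/
"""

print(allowed(allow_last))   # blocked under first-match semantics
print(allowed(allow_first))  # allowed
```

This only models first-match parsers; it is not how Googlebot itself resolves the rules, but it illustrates why the position of the Allow line can change the outcome for some crawlers.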


My Search Console now shows the following under Index Status:

[screenshot omitted]
so the links to the PDFs are being crawled, but Search Console reports them as blocked from crawling by robots.txt. If I check one of the URLs in the robots.txt Tester I get the following:

[screenshot omitted]
I originally tried putting the Allow statement at the top of robots.txt, and then at the bottom as per the current setup, and Google appears to have picked up the changes.

If I run Fetch as Bingbot I get the following:

[screenshot omitted]

which suggests the crawler can access the page and retrieve the PDF. Although it looks like there is no visible text, I'm not overly worried, as according to webmasters.googleblog.com/2011/09/pdfs-in-google-search-results.html these PDFs may be made searchable via OCR, and I can cut and paste text from them in a PDF reader.

Lastly, if I check my search results via site:mysite.com I see a number of links with "A description for this result is not available because of this site's robots.txt", so Google seems to think they aren't crawlable.

So my question is two-fold. Firstly, why are these appearing as blocked in Search Console? I'm not sure if waiting longer is the answer: the Index Status seems to be changing and the number of indexed pages is dropping, but the blocked-by-robots count seems to remain the same. Secondly, will indexing this sort of content be problematic?


1 Comment


@Kimberly868

It appears you're using query strings on a PHP page to generate the file links. Try adding a wildcard to the end of your Allow rule, as the query string forms part of the URI.
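For example, a sketch based on the robots.txt shown in the question (whether the trailing wildcard is strictly needed depends on the crawler, since Google already treats an Allow rule as a prefix match):

User-agent: *
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
Allow: /includes/getreport.php*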

Also, have you checked the URL Parameters section to see how Google is treating these variables? You can also explicitly set the behaviour there to make sure Google understands how to use the query strings when it's indexing.


