Google not indexing site, robots being blocked
I have a small static website that lets customers purchase a PDF containing data based on their input. The PDFs are a page or two long and follow a standard template; various values change based on what the customer searched for, so I expect them to be largely unique per customer (i.e. a 99% chance of a unique PDF, although the majority of the content within each PDF may be similar).
Previously I haven't made the PDFs searchable, but I now want to start including them, so I've written some code that dynamically finds all the available PDFs and generates a page listing them with hyperlinks to open them, using this PHP code embedded in an HTML page:
$directory = "../logs/reports/";
// Collect all PDFs in the reports directory
$files = glob($directory . "*.pdf");
foreach ($files as $file) {
    $filename = basename($file);
    // Link each report through the retrieval script
    echo '<a href="/includes/getreport.php?file=' . urlencode($filename) . '">' . $filename . '</a><br>';
}
My sitemap reflects the new page with a weekly change frequency:
<url>
<loc>https://www.myurl.com/reportlist.html</loc>
<changefreq>weekly</changefreq>
</url>
I initially forgot to add an allow rule to my robots.txt, since I had a disallow on /includes/, where the PHP file that retrieves the report is located. I've since updated it, and it now looks like this:
User-agent: *
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
Allow: /includes/getreport.php
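As a quick local sanity check on rules like these, Python's standard-library `urllib.robotparser` can evaluate a robots.txt against a URL. This is only a sketch: Python's parser applies rules in file order (first match wins), whereas Googlebot prefers the most specific (longest) matching path, so the `Allow` line is moved to the top here to get equivalent behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical local check of the robots.txt rules from the question.
# Note: Python applies rules in file order (first match wins), unlike
# Googlebot, which uses the longest matching path, so Allow comes first.
rules = """\
User-agent: *
Allow: /includes/getreport.php
Disallow: /font/
Disallow: /includes/
Disallow: /js/
Disallow: /wdsl/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A report URL matches the Allow prefix, including its query string
print(rp.can_fetch("*", "https://www.myurl.com/includes/getreport.php?file=report.pdf"))  # True
# Anything else under /includes/ is still disallowed
print(rp.can_fetch("*", "https://www.myurl.com/includes/other.php"))  # False
```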
My Search Console is now showing the following under Index Status,
so the links to the PDFs are being crawled, but they are reported as blocked from crawling by robots.txt. If I check a URL in the robots.txt Tester I get the following.
I originally tried putting the allow statement at the top of the robots.txt, and then at the bottom as in the current setup, and Google appears to have picked up the changes.
If I run Fetch as Bingbot I get the following,
which suggests that robots.txt allows access to the page and the PDF can be retrieved. Although it looks like there is no visible text, I'm not overly worried, as according to webmasters.googleblog.com/2011/09/pdfs-in-google-search-results.html these PDFs may be made searchable via OCR, and I can cut and paste text from them in a PDF reader.
Lastly, if I check my search results via site:mysite.com, I see a number of links with "A description for this result is not available because of this site's robots.txt", so Google seems to think they aren't searchable.
So my question is two-fold. Firstly, why are these appearing as blocked in Search Console? I'm not sure whether waiting longer is the answer: the Index Status is changing and the number of indexed pages is dropping, but the blocked-by-robots count remains the same. Secondly, will indexing this sort of content be problematic?
1 Answer
It appears you're using query strings on a PHP page to generate the file links. Try adding a wildcard to the end of your Allow rule, since the query string forms part of the URI.
Also, have you checked the URL Parameters section to see how Google is treating these variables? You can also explicitly set the behaviour there, to make sure Google understands how to handle the query strings when it's indexing.
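For example, the Allow rule with the trailing wildcard suggested above might look like this (a sketch based on the robots.txt in the question; `*` wildcards are a Googlebot extension, not part of the original robots.txt standard):

```
User-agent: *
Disallow: /includes/
Allow: /includes/getreport.php?*
```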