
Easy way to identify pages which are indexed, but not linked to on my site

@Dunderdale272

Posted in: #Google #Indexing #Links

I'm just wondering if there is an easy way to find out what pages Google has indexed that are not directly linked on my site. For example, mysite.com/skype.html originally had a link in my site menu, but that link was removed. The page is still available when the URL is typed directly, but there is no longer any link to it. I don't want Google indexing this page any more and want to put a 'Disallow' in my robots.txt.
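
What I had in mind for robots.txt is something along these lines (using the /skype.html example from above):

# robots.txt - block crawling of the orphaned page
User-agent: *
Disallow: /skype.html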





2 Comments


 

@Odierno851

In most cases, I wouldn't worry about finding that kind of content - it's usually handled well enough by the search engine algorithms (if it's not relevant, it's not shown).

That said, if you still want to find this kind of content, you can try to determine the difference between the URLs that you know you want to have indexed, and those that you see the search engines crawling (and indexing).

Finding the URLs that you know you want to have indexed can be quite a job. One way to do that is to use a crawler, another way is to extract them from your CMS. If you're working on a larger website, you might already have that in the form of a Sitemap file.
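
For example, if you already have a Sitemap file, pulling the URLs out of it can be a rough one-liner like this (just a sketch, assuming a standard XML Sitemap saved as sitemap.xml - adjust the filename to your setup):

# Extract the <loc> entries from a standard XML Sitemap into a sorted list of URLs
grep -o '<loc>[^<]*</loc>' sitemap.xml | sed -e 's/<loc>//' -e 's|</loc>||' | sort -u > wanted-urls.txt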

Finding the URLs that search engines have indexed is a bit harder, but you can approximate that by looking at your server logs to see which URLs are successfully crawled (returning "200 OK"). Generally, indexed content is recrawled regularly - somewhere between several times a day and once every few weeks or months. If you can look at your server logs for a longer period of time, you should be able to get a reasonable approximation of the URLs that search engines have been crawling (and therefore, potentially indexing).
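
As a rough sketch, something like this would pull those URLs out of an access log (assuming the common combined log format, a file called access.log, and Googlebot as the crawler of interest - the field numbers will differ if your log format does; mysite.com is just the example domain from the question):

# Googlebot requests that returned "200 OK", as a sorted list of unique URLs
# (combined log format: the request path is field 7, the status code is field 9)
grep 'Googlebot' access.log | awk '$9 == 200 {print "https://mysite.com" $7}' | sort -u > crawled-urls.txt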

Depending on how your site is structured, you will probably have to filter some of those URLs to remove common cruft that search engines already ignore (session-IDs come to mind). Keep in mind that crawled does not necessarily mean indexed, but if it was crawled successfully, at least there's a chance that it can be indexed.
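
That filtering can be as simple as dropping query strings before you compare anything, for example (a sketch; sessionid is just a stand-in for whatever parameters your site appends):

# Collapse URL variants like /page?sessionid=abc123 down to /page
sed 's/?.*$//' crawled-urls.txt | sort -u > crawled-clean.txt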

Afterwards you can just compare the lists of URLs and get the differences (stackoverflow.com/questions/4544709 has some suggestions for Unix/Linux command lines to do that). To be complete, you could double-check by fetching the final URLs to make sure that they still return "200 OK", and that they don't use a noindex robots/googlebot meta tag (or HTTP header, if you use that).
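
As a concrete sketch of that comparison (building on the hypothetical wanted-urls.txt and crawled-clean.txt files above, both already sorted as comm requires):

# URLs that were crawled but that aren't in the list you want indexed
comm -13 wanted-urls.txt crawled-clean.txt > unexpected-urls.txt

# Double-check each one: print its status code and do a rough grep for a robots meta tag
while read -r url; do
  echo "== $url"
  curl -s -o /dev/null -w '%{http_code}\n' "$url"
  curl -s "$url" | grep -i '<meta name="robots"'
done < unexpected-urls.txt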

I'm not aware of any tool that does this whole process for you.



 

@Ravi8258870

I think the short answer is no, though a lot depends on how much content we're talking about, whether or not you have duplication problems, etc.

The site: operator will show you what Google has indexed (although its accuracy can't be guaranteed), and there are certain tools that will export Google searches to Excel files and such. You could then crawl your site with something like Screaming Frog or Xenu, which will only find pages if they're linked to, and compare that to the index.
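
As a rough illustration (indexed-export.txt and crawler-export.txt are hypothetical files holding the exported site: results and the crawler's list of linked pages, both sorted):

# site:mysite.com in Google lists what it has indexed for the domain;
# after exporting that and the crawl, this prints URLs that are indexed but that the crawl never found
comm -23 indexed-export.txt crawler-export.txt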

However, if you've got even a moderately large amount of content, lots of duplicates, etc., it may be a prohibitively onerous task.

As an aside – and this is covered elsewhere on this site so I won't get into it – robots.txt isn't the right tool for the job. Better to use a noindex tag.
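
For completeness, that usually looks like one of these (the page has to stay crawlable - i.e. not blocked in robots.txt - for Google to actually see it):

<meta name="robots" content="noindex">

or, for non-HTML resources, the equivalent HTTP response header:

X-Robots-Tag: noindex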


