
Easy way to identify pages which are indexed, but not linked to on my site

@Dunderdale272

Posted in: #Google #Indexing #Links

I'm just wondering if there is an easy way to find out what pages Google has indexed that are not directly linked on my site. For example, mysite.com/skype.html originally had a link in my site menu, but that link was removed. The page is still available when the URL is typed directly, but there is no longer any link to it. I don't want Google indexing this page any more and want to put a 'Disallow' in my robots.txt.
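
What I had in mind for robots.txt is something along these lines (using the /skype.html example from above):

# robots.txt - block crawling of the orphaned page
User-agent: *
Disallow: /skype.html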





2 Comments


 

@Odierno851

In most cases, I wouldn't worry about finding that kind of content - it's usually handled well enough by the search engine algorithms (if it's not relevant, it's not shown).

That said, if you still want to find this kind of content, you can try to determine the difference between the URLs that you know you want to have indexed, and those that you see the search engines crawling (and indexing).

Finding the URLs that you know you want to have indexed can be quite a job. One way to do that is to use a crawler, another way is to extract them from your CMS. If you're working on a larger website, you might already have that in the form of a Sitemap file.
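
For example, if you already have a Sitemap file, pulling the URLs out of it can be a rough one-liner like this (just a sketch, assuming a standard XML Sitemap saved as sitemap.xml - adjust the filename to your setup):

# Extract the <loc> entries from a standard XML Sitemap into a sorted list of URLs
grep -o '<loc>[^<]*</loc>' sitemap.xml | sed -e 's/<loc>//' -e 's|</loc>||' | sort -u > wanted-urls.txt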

Finding the URLs that search engines have indexed is a bit harder, but you can approximate that by looking at your server logs to see which URLs are successfully crawled (returning "200 OK"). Generally, indexed content is recrawled regularly - somewhere between several times a day and once every few weeks or months. If you can look at your server logs for a longer period of time, you should be able to get a reasonable approximation of the URLs that search engines have been crawling (and therefore, potentially indexing).
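
As a rough sketch, something like this would pull those URLs out of an access log (assuming the common combined log format, a file called access.log, and Googlebot as the crawler of interest - the field numbers will differ if your log format does; mysite.com is just the example domain from the question):

# Googlebot requests that returned "200 OK", as a sorted list of unique URLs
# (combined log format: the request path is field 7, the status code is field 9)
grep 'Googlebot' access.log | awk '$9 == 200 {print "https://mysite.com" $7}' | sort -u > crawled-urls.txt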

Depending on how your site is structured, you will probably have to filter some of those URLs to remove common cruft that search engines already ignore (session-IDs come to mind). Keep in mind that crawled does not necessarily mean indexed, but if it was crawled successfully, at least there's a chance that it can be indexed.
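
That filtering can be as simple as dropping query strings before you compare anything, for example (a sketch; sessionid is just a stand-in for whatever parameters your site appends):

# Collapse URL variants like /page?sessionid=abc123 down to /page
sed 's/?.*$//' crawled-urls.txt | sort -u > crawled-clean.txt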

Afterwards you can just compare the lists of URLs and get the differences (stackoverflow.com/questions/4544709 has some suggestions for Unix/Linux command lines to do that). To be complete, you could double-check by fetching the final URLs to make sure that they still return "200 OK", and that they don't use a noindex robots/googlebot meta tag (or HTTP header, if you use that).
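
As a concrete sketch of that comparison (building on the hypothetical wanted-urls.txt and crawled-clean.txt files above, both already sorted as comm requires):

# URLs that were crawled but that aren't in the list you want indexed
comm -13 wanted-urls.txt crawled-clean.txt > unexpected-urls.txt

# Double-check each one: print its status code and do a rough grep for a robots meta tag
while read -r url; do
  echo "== $url"
  curl -s -o /dev/null -w '%{http_code}\n' "$url"
  curl -s "$url" | grep -i '<meta name="robots"'
done < unexpected-urls.txt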

I'm not aware of any tool that does this whole process for you.



 

@Ravi8258870

I think the short answer is no, though a lot depends on how much content we're talking about, whether or not you have duplication problems, etc.

The site: operator will show you what Google has indexed (although its accuracy can't be guaranteed), and there are certain tools that will export Google searches to Excel files and such. You could then crawl your site with something like Screaming Frog or Xenu, which will only find pages if they're linked to, and compare that to the index.
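
As a rough illustration (indexed-export.txt and crawler-export.txt are hypothetical files holding the exported site: results and the crawler's list of linked pages, both sorted):

# site:mysite.com in Google lists what it has indexed for the domain;
# after exporting that and the crawl, this prints URLs that are indexed but that the crawl never found
comm -23 indexed-export.txt crawler-export.txt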

However, if you've got even a moderately large amount of content, lots of duplicates, etc., it may be a prohibitively onerous task.

As an aside – and this is covered elsewhere on this site so I won't get into it – robots.txt isn't the right tool for the job. Better to use a noindex tag.
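
For completeness, that usually looks like one of these (the page has to stay crawlable - i.e. not blocked in robots.txt - for Google to actually see it):

<meta name="robots" content="noindex">

or, for non-HTML resources, the equivalent HTTP response header:

X-Robots-Tag: noindex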


