
How does Google discover and index URLs that aren't linked or in the sitemap?

@Reiling115

Posted in: #Google #GoogleIndex #Indexing #Seo

I can see that multiple URLs from my website have been crawled by Google; I can see this by using the site: operator in Google Search.

I was wondering: what are all the possible places from which Google picks up these URLs? I checked, and many of my crawled URLs are not in the sitemap, and we haven't linked to these URLs from any other page either. How would Google discover such content?

Is there any way I can see all of my Google-indexed URLs and get information about how Google discovered those pages?




2 Comments


 

@Cugini213

I recently had the same issue and was puzzled about how Google knew about an internal URL on my site.

The directory in question for me was /piwik (an open-source alternative to Google Analytics).

Google also crawls URLs it finds in your source files (such as your HTML). If there are URLs in there, for example in <meta> tags or inside <script> blocks, Google will crawl and index them.
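As a rough sketch of what that looks like, a page's source can expose URLs that never appear as visible links. Everything below (the domain, paths, and variable name) is a placeholder, not taken from the original post:

```html
<head>
  <!-- Googlebot can pick up URLs from places like these, even though
       no clickable <a> link points at them: -->
  <meta property="og:url" content="https://example.com/hidden-page/">
  <link rel="alternate" href="https://example.com/feed.xml">
  <script>
    // URL strings embedded in JavaScript can also be discovered
    var trackerUrl = "https://example.com/piwik/piwik.php";
  </script>
</head>
```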



 

@Nimeshi995

There are many places Google can go to find your site's pages. Your sitemap, and what's linked on your live site, are only a small part of it. Your XML sitemap is merely a signal to Google, Bing, and other search engines to index your most important pages and to take note of new content (if you're using a CMS with a plugin that automatically updates the sitemap).
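For context, a minimal XML sitemap is just a list of the URLs you want search engines to prioritize; the domain and date below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page/</loc>
    <lastmod>2016-01-15</lastmod>
  </url>
</urlset>
```

Crucially, it is only a hint: Google can, and routinely does, index URLs that never appear in it.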

When Google gets into your site, it follows all kinds of links, not just page-level links. It can index files, taxonomies, multiple versions of pages... In a CMS like Drupal, where everything is a node, it can even index portions of pages.

This is why it's important to know your CMS and how it works on the back end. You have to use a combination of noindex meta tags, canonicalization, redirects, robots.txt, and Google Search Console / Bing Webmaster Tools to control what is and isn't crawled and indexed.
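As a sketch of how those controls fit together (the domain and paths are placeholders, not a recommendation for any particular site):

```
# robots.txt — blocks crawling of a path, but does NOT guarantee
# de-indexing: Google may still index a blocked URL it learns of elsewhere
User-agent: *
Disallow: /piwik/
```

And on pages that should stay out of the index:

```html
<!-- noindex: the page may be crawled, but engines are asked not to index it -->
<meta name="robots" content="noindex">
<!-- canonical: consolidates duplicate versions onto one preferred URL -->
<link rel="canonical" href="https://example.com/preferred-page/">
```

Note that noindex only works if the page is crawlable; blocking a page in robots.txt while also giving it a noindex tag means Google may never see the tag.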

Using Search Console to look at inbound links, Moz's Open Site Explorer to analyze the link profile of any individual page, and a tool like Screaming Frog SEO Spider (the first is free; the second and third are freemium) will let you analyze both internal and external links. Between these, you should be able to diagnose the source.


