Does the Google crawler really guess URL patterns and index pages that were never linked to? I'm experiencing problems with indexed pages that were (probably) never linked to. Here's the setup:
1. Data server: application with a RESTful interface that provides the data
2. Website A: presents the data of (1) at website-a.example.com/?id=RESOURCE_ID
3. Website B: presents the data of (1) at website-b.example.com/?id=OTHER_RESOURCE_ID
So all the non-private data is stored on (1), and the websites (2) and (3) fetch and display representations of this data, with additional cross-linking between them.
In fact, the URL /?id=1 on website-a points to the same resource as /?id=1 on website-b. However, the resource id:1 is useless on website-b. Unfortunately, the Google index for website-b now contains several links to resources belonging to website-a, and vice versa.
I've "heard" that the Google crawler tries to determine the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and, furthermore, guesses other URLs by trying different values ("I know that id 1 exists, so let's try 2, 3, 4, ...").
Is there any evidence that the Google crawler really behaves that way? (I doubt it.) My guess is that the Google crawler submitted an HTML form and somehow got links to those unwanted resources.
I found some similar questions on this topic, including "Google Webmaster Central: indexing and posting false pages" [link removed]; however, none of them provides any evidence.
Our experience is that Google does seem to "guess" URL parameters.
We used to have a legacy URL structure (main.php?id=1, etc.) and changed it a year ago to a more SEO-friendly structure.
We noticed that recently added items were still being indexed by Google at main.php?id=1234 rather than at our spiffy new SEO-optimised URLs, even though those pages never existed when we had the old legacy structure. Nowhere else did we link to these pages using the old URLs.
We reviewed our server logs and noticed someone requesting our pages sequentially via the old legacy URLs, i.e. main.php?id=1, 2, 3, etc. They would go upwards in batches of about 150, then come back a few hours later and do another 150. We traced the IP address of the requests to a standard Googlebot IP.
The old legacy URLs still worked, as we had never disabled them; we just never referred to them and never thought anyone would try them.
We solved the problem by adding a 301 redirect in our index.php whenever a legacy URL for a page was requested. A few hours of coding, but it seems to have resolved the issue: pages newly added to Google's index now use our SEO URLs, and we have seen no attempted use of the old legacy URLs for several weeks.
We can only conclude that Googlebot is aware of URL parameters and does indeed try them, even when no real link exists.
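The redirect logic described above can be sketched as follows. This is a Python stand-in (the poster's site used index.php), and the new URL scheme "/items/&lt;id&gt;" is a hypothetical placeholder, not their actual structure: map each legacy main.php?id=N request to its new URL and answer it with a 301 Moved Permanently.

```python
# Sketch of the legacy-URL-to-301 mapping described above.
# Assumptions: legacy URLs look like /main.php?id=N, and the new
# (hypothetical) scheme is /items/N.
from urllib.parse import urlparse, parse_qs

def legacy_redirect(request_path):
    """Return the new URL for a legacy /main.php?id=N request, else None."""
    parts = urlparse(request_path)
    if parts.path != "/main.php":
        return None          # not a legacy URL, serve normally
    ids = parse_qs(parts.query).get("id")
    if not ids or not ids[0].isdigit():
        return None          # malformed legacy request
    # The caller would respond with status 301 and this as the Location header.
    return f"/items/{ids[0]}"

print(legacy_redirect("/main.php?id=1234"))  # → /items/1234
print(legacy_redirect("/items/42"))          # → None (already new-style)
```

A permanent (301) redirect, rather than a temporary (302) one, is what tells Google to replace the legacy URL with the new one in its index.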
If you have Webmaster Tools set up, I would go into the URL Parameters section under Site Configuration and see which parameters it has detected for your site. My site runs WordPress, and Google recognises some of its parameters; interestingly, it has also guessed some of the parameters of the live-chat script and a few other random scripts I use.
I would use noindex/nofollow as well as robots.txt where possible.
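A minimal sketch of that combination, assuming the unwanted URLs are the parameterised ones (the paths here are placeholders):

```
# robots.txt — block crawling of the parameterised legacy URLs
User-agent: *
Disallow: /*?id=
```

```html
<!-- on pages that must not appear in the index -->
<meta name="robots" content="noindex, nofollow">
```

One caveat worth knowing: robots.txt blocks crawling, so a page disallowed there will never have its noindex tag seen. For URLs that are already indexed, allowing crawling and relying on noindex alone is often the safer way to get them removed.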