Does the Google crawler really guess URL patterns and index pages that were never linked to? I'm experiencing problems with indexed pages that were (probably) never linked to. Here's the setup:
1. Data server: application with a RESTful interface that provides the data
2. Website A: presents the data of (1) at website-a.example.com/?id=RESOURCE_ID
3. Website B: presents the data of (1) at website-b.example.com/?id=OTHER_RESOURCE_ID
So all the non-private data is stored on (1), and the websites (2) and (3) fetch and display representations of this data, with additional cross-linking between them.
In fact, the URL /?id=1 on website-a points to the same resource as /?id=1 on website-b. However, the resource id:1 is useless on website-b. Unfortunately, the Google index for website-b now contains several links to resources belonging to website-a, and vice versa.
I've "heard" that the Google crawler tries to determine the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and, furthermore, guesses other URLs by trying different values ("I know that id 1 exists, so let's try 2, 3, 4, ...").
Is there any evidence that the Google crawler really behaves that way? (I doubt it.) My guess is that the Google crawler submitted an HTML form and somehow got links to those unwanted resources.
I found some similar questions on this topic, including "Google Webmaster Central: indexing and posting false pages" [link removed]; however, none of them provides any evidence.
Our experience is that Google does seem to "guess" URL parameters.
We used to have a legacy URL structure (main.php?id=1, etc.) and changed it a year ago to a more SEO-friendly structure.
We noticed that recently added items were still being indexed by Google at main.php?id=1234 rather than at our spiffy new SEO-optimised URLs, even though those pages never existed when we had the old legacy structure. Nowhere else did we link to these pages using the old URLs.
We reviewed our server logs and noticed someone requesting our pages sequentially via the old legacy URLs, i.e. main.php?id=1, 2, 3, etc. They would go upwards in batches of about 150, then come back a few hours later and do another 150. We traced the IP address of the requests to a standard Googlebot IP.
The old legacy URLs still worked, as we had never disabled them; we just never referred to them and never thought anyone would try them.
We solved the problem by adding a 301 redirect in our index.php whenever a legacy URL for a page was requested. A few hours of coding, but it seems to have resolved the issue: pages newly added to Google's index now use our SEO URLs, and we have seen no attempted use of the old legacy URLs for several weeks.
We can only conclude that Googlebot is aware of URL parameters and does indeed try them, even when no real link exists.
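The redirect logic described above can be sketched as follows. This is a Python stand-in (the poster's site used index.php), and the new URL scheme "/items/&lt;id&gt;" is a hypothetical placeholder, not their actual structure: map each legacy main.php?id=N request to its new URL and answer it with a 301 Moved Permanently.

```python
# Sketch of the legacy-URL-to-301 mapping described above.
# Assumptions: legacy URLs look like /main.php?id=N, and the new
# (hypothetical) scheme is /items/N.
from urllib.parse import urlparse, parse_qs

def legacy_redirect(request_path):
    """Return the new URL for a legacy /main.php?id=N request, else None."""
    parts = urlparse(request_path)
    if parts.path != "/main.php":
        return None          # not a legacy URL, serve normally
    ids = parse_qs(parts.query).get("id")
    if not ids or not ids[0].isdigit():
        return None          # malformed legacy request
    # The caller would respond with status 301 and this as the Location header.
    return f"/items/{ids[0]}"

print(legacy_redirect("/main.php?id=1234"))  # → /items/1234
print(legacy_redirect("/items/42"))          # → None (already new-style)
```

A permanent (301) redirect, rather than a temporary (302) one, is what tells Google to replace the legacy URL with the new one in its index.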
If you have Webmaster Tools set up, I would go into the URL Parameters section under Site Configuration and see which parameters it has detected for your site. My site runs WordPress, and Google recognises some of its parameters; interestingly, it has also guessed some of the parameters of the live-chat script and a few other random scripts I use.
I would use noindex/nofollow as well as robots.txt where possible.
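A minimal sketch of that combination, assuming the unwanted URLs are the parameterised ones (the paths here are placeholders):

```
# robots.txt — block crawling of the parameterised legacy URLs
User-agent: *
Disallow: /*?id=
```

```html
<!-- on pages that must not appear in the index -->
<meta name="robots" content="noindex, nofollow">
```

One caveat worth knowing: robots.txt blocks crawling, so a page disallowed there will never have its noindex tag seen. For URLs that are already indexed, allowing crawling and relying on noindex alone is often the safer way to get them removed.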