“Invented” URL parameters

@Heady270

Posted in: #Pagination #Seo #UrlParameters

This is a question on the best action to take when Google is indexing pages that don't really exist.

I have a pretty simple pagination system on a set of news pages where the file is referenced news.php?page=X.

In my Google sitemap I specify the total number of valid pages of this type (currently up to news.php?page=13).
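
For illustration, each sitemap entry is just a plain URL entry along these lines (example.com stands in for my real domain):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/news.php?page=1</loc></url>
      <!-- ...entries for pages 2 through 12... -->
      <url><loc>https://example.com/news.php?page=13</loc></url>
    </urlset>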

The on-screen pagination is a standard "1,2,3... Next/Previous" layout.

However, Google Search Console reveals that it is monitoring 14,846 pages in this format. For example, news.php?page=7556 and others like it are turning up in search results.

The way the pagination works, news.php?page=7556 will show the same content as news.php?page=13. In other words, the oldest few news items. Needless to say, there are no links anywhere to any news pages other than 1-13.

I don't know for sure if this is having a negative impact on search but I wouldn't want legitimate content to suffer.

So, my question is, what's the best way to stop Google indexing thousands of non-pages? Should I just create a 404 or 301 redirect for any page that doesn't contain legitimate content? If a 301 redirect, what should it redirect to?

UPDATED Monday Nov 13th:

As advised by Ilmari Karonen I have added rel=canonical to the page headers so that a request for news.php?page=7556 shows that the canonical URL is news.php?page=13. I have not added 301 redirects or redirects to 404 error pages for now. I'll monitor the results on Search Console and report back on anything useful.
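
For the record, the change amounts to something like this minimal sketch ($totalPages, $page and example.com are illustrative stand-ins, not my actual code):

    <?php
    // Clamp any out-of-range page number to the last real page,
    // then emit that as the canonical URL.
    $totalPages    = 13;
    $page          = isset($_GET['page']) ? (int)$_GET['page'] : 1;
    $canonicalPage = max(1, min($page, $totalPages));
    ?>
    <link rel="canonical" href="https://example.com/news.php?page=<?php echo $canonicalPage; ?>">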


2 Comments


@Deb1703797

You must be running a script of some sort that generates content based on the page number given to it. The sad truth is that whoever designed the script didn't take error handling seriously.

The script should be edited so that when the requested page number exceeds the actual number of pages on your site (13?), it returns an HTTP 404 error indicating the page is not found. Since Google has already tried to index these fictional pages, the better choice is HTTP 410 (Gone), which tells Google the page does not exist and that it should stop requesting that specific URL.
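
In PHP, that check might look something like this minimal sketch ($totalPages is an illustrative name for however your script counts its real pages):

    <?php
    $totalPages = 13; // however your script derives the real page count
    $page = isset($_GET['page']) ? (int)$_GET['page'] : 1;

    if ($page < 1 || $page > $totalPages) {
        // 410 "Gone" tells crawlers to drop the URL for good;
        // use 404 if you prefer a plain "not found".
        http_response_code(410);
        echo 'This page does not exist.';
        exit;
    }
    // ...otherwise render the requested page as usual...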

If you have basic programming experience, you will be able to correct this issue yourself. Otherwise, you will need to report the problem to the developers of the script you're using and get a fixed version that handles out-of-range page numbers correctly.

I wouldn't recommend a redirect (301 or 302) for the fictional pages, because no value is being offered and it also adds server load: search robots end up crawling both the fictional page numbers and the URLs they are redirected to. However, if you believe a visitor may be trying to access a fictional page number, you may want to include a link to a valid page on the error page.


@Hamaas447

If there's no legitimate content at those URLs, just return a 404 status. That's what it's for.

You may also want to include a rel=canonical link in your script's HTML output, to make sure that any other unexpected URL manipulation (like, say, adding extra URL parameters) won't accidentally introduce duplicate content into search engine indexes.



Optionally, you could also do a 301 redirect to the canonical URL, if you detect that your script was accessed via some other URL, but there's no real advantage in SEO terms to doing so. However, if you expect that, for some reason, your users may regularly end up at the same page via several different URLs, then setting up 301 redirects can help ensure that your users will always use those canonical URLs in bookmarks and links.
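
For example, the redirect could look something like this sketch (illustrative names, not your actual code):

    <?php
    // Permanently redirect any non-canonical page number to its
    // canonical equivalent.
    $totalPages    = 13;
    $page          = isset($_GET['page']) ? (int)$_GET['page'] : 1;
    $canonicalPage = max(1, min($page, $totalPages));

    if ($page !== $canonicalPage) {
        header('Location: https://example.com/news.php?page=' . $canonicalPage, true, 301);
        exit;
    }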

(For example, Stack Exchange uses both methods: the URL for your question and the URLs for my answer to it are different, but there's a rel=canonical link from the latter to the former. On the other hand, if SE detects that the URL slug doesn't match the question title, it does a 301 redirect.)



P.S. From your description it sounds like your pagination is set up so that, every time a new item is added, it appears at the top of page 1 and the last item on every page gets pushed to the next page. The problem with such a scheme is that, in order to keep their index up to date, Google needs to recrawl all your pages every time a new item is added. If they don't do that fast enough, you may end up with some items missing from Google's index entirely, and some appearing twice, or with stale Google results pointing to pages that no longer actually contain the item the user was searching for.

As long as each of your news items has its own stable canonical URL, with the paginated list only serving as a directory and linking to the stable item URLs, that's not really a major problem. (For example, Stack Exchange's question list works like that, and Google handles it just fine.) In fact, in that case, you might even consider adding a "noindex" robots meta tag to the list pages (or, at least, to all but the first page) to encourage Google to send visitors directly to the item pages instead.
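
That could be as simple as the following, keeping page 1 indexable while still letting crawlers follow the links to the items (same illustrative $page as in the sketch above):

    <?php if ($page > 1): ?>
    <meta name="robots" content="noindex, follow">
    <?php endif; ?>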

However, if your news items only appear on the numbered list pages, then you should really consider redesigning your site so that each item has a single, stable URL. That will make it much more likely that Google will actually index your news items correctly, and that visitors coming to your site from Google results will actually find what they were looking for.
