Mobile app version of vmapp.org
Login or Join
Candy875

: How can I get a list of all indexed pages for my domain? We're in the process of migrating an outdated e-commerce cart to Magento. Part of this process will entail configuring proper 301

@Candy875

Posted in: #301Redirect #GoogleIndex #GoogleSearchConsole #Indexing #Magento

We're in the process of migrating an outdated e-commerce cart to Magento. Part of this process will entail configuring proper 301 redirects. I'm aiming to get a list of all indexed pages to throw our solution against and resolve as many potential issues as possible before going live.

Ideally I'd just need a CSV containing the URIs indexed by the Search Engines for our domain.

Looking at a similar question here, it seems there is no simple way to export this data from either Google or Bing's Webmaster tools, given that this cart has tens of thousands of products (and thus tens of thousands of indexed pages).

I've run across a few other 3rd party utilities such as Screaming Frog and web-based ones like searchenginegenie and internetmarketingninjas, but I've never used any of them and am hesitant to start throwing additional traffic at our site unless I know we'll get what we need out of it.

Has anyone out there used these tools to do something similar, or found some way to retrieve more than the top 1000 records from GWT (or something similar from Bing)?

10.06% popularity Vote Up Vote Down


Login to follow query

More posts by @Candy875

6 Comments

Sorted by latest first Latest Oldest Best

 

@Yeniel560

As Google never will return more than 1000 results, my key was from standalone Perl script to query ( with the help of Lynx --accept-cookies ) several segments
for

site:myweb.xxx in the way www.google.es/search?q=site:www.955170000.com+%2B+ AA"&num=50&filter=0

The script calculates the string for search, now is "AA" , next will seek "AB" and so on until "ZZ", but you can select at your way, include numbers, and other characters.

Then each search result ( in my case are only 50 returned results) is filtered to seek for each link for each indexed page. All them are logged to a file.
Now we need to pass though | sort | uniq this file to wipe out repeated links.
I added up to 120 seconds between queries , otherwise Google will claim you re using a robot.

That means in this way ( form AA to AZ ) and 100 results per page I can collect up to 78K indexed pages in 26 hours of processing ( running from unique IP, but you can put 2 or more machines with different IP and save time ).

If you need to collect over 78K ( as not over 100 results per query, and maximum is 1000 total results for each search ) of course you can try up to 1000 for each search string, and in theory you would be able to catch over 7,8 Millions of pages.

Notice many may be duplicated so once you fetch all possible results from Google you need to sort nd filter unique results ( I used sort and uniq *nix commands to do that )

Next step , i.e. to detect duplicate content, or other issues is now easy, or put all collected url into next script to do an URL removal in GWT ( again limited to around 1000 per day ) or re-subscribe uploading to re-index ( limited to around 30K links by Google )

10% popularity Vote Up Vote Down


 

@Bryan171

Use this function in a spreadsheet:
=importXml("http://google.com/search?q=site:YOUR_SITE.com&num=100&start=1","//cite")

That will return indexed pages 1 to 100
repeat with start=101 to get the next 100.

10% popularity Vote Up Vote Down


 

@Martha676

You could also search for site:mydomain.com on Google to get a list of all their indexed pages from your domain, including subdomains.

10% popularity Vote Up Vote Down


 

@Martha676

Whatever outdated e-commerce cart you are using now, if you are able to programatically generate all the product and category URLs in the old platform, you don't need to use 301 redirect. You can use the same URLs in Magento (by updating the core_url_rewrite table). This is a special feature of Magento.

I used to do SEO for a Magento website, which earlier used an outdated e-commerce cart just like yours. The migration was done to Magento keeping the same old URLs.

A word of caution about Magento. There is too much hype about Magento. Although it uses MVC architecture, it is one of the worst e-commerce platforms. This is especially true for large catalogs (true in your case, as you say you have tens of thousands of products and categories). Unless you are using the Enterprise Edition, which uses advanced caching to speed up things (and is very expensive), the Community edition will not help your cause.

They say Magento is SEO friendly. Nothing can be farther from the truth. Magento's own URLs (related to products and product reviews) are a complete mess as regards SEO. It generates multiple URLs (paths) for the same product, if it is assigned to multiple categories. In that case, you need to use Catalog Url Rewrite management anyway, which can be a headache.

Magneto's URL problems were solely responsible for causing a lot of damage to the business of the website I am talking about by hurting their search engine rankings.

Anyway, you shouldn't be bothering too much about which pages are in Google's index (which is multi-layered). Assume that all of them are indexed and 301 redirect all of them, if you decide to use 301 redirect.

Thanks,
Satyabrata Das

10% popularity Vote Up Vote Down


 

@Bryan171

You can download as .csv indexed webpages by Google in your Google Webmaster Tools account.

Just see Download chart data button below chart of indexed webpages (Menu Google Index => Indexation state).

10% popularity Vote Up Vote Down


 

@Lengel546

I have used Screaming Frog, today in fact, and I really love that tool. You get a lot of information in quite a short amount of time. You'll get the metadata in a csv and you can manipulate it easily in Excel. Export everything and then use filters for each column to only display text/html and not images or CSS files.

I am doing that for a site migration right now and used it for one in the past. How many pages are you talking? Here's a comparison on Moz of Xenu vs SF.

10% popularity Vote Up Vote Down


Back to top | Use Dark Theme