Batch removal of URLs from Google index

@Deb1703797

Posted in: #GoogleIndex #Url

We accidentally created a number of links to the same page resulting in Google indexing what should be about 5 000 pages as over 100 000.

We have subsequently fixed this up, returning 404s, but Google is still testing these URLs - several hundred a day - and reporting them as 404 errors in Webmaster Tools. I assume that if we waited, Google would eventually sort this out, but as this is not good for our Google ranking - we effectively have tens of URLs for each page - I would like to accelerate this removal.

We are manually removing these URLs using Google Webmaster Tools, but this is very time consuming. Unfortunately the URLs are not easy to cover with a single directory as described here.

Does anyone have suggestions for tools to batch-remove URLs?


5 Comments


 

@Vandalay111

I took a different approach, which works with any browser and makes it easy to remove URLs fetched live from the index at up to about one per second. Let me explain how I did it.

First of all, you need to collect the indexed URLs you actually want to remove, rather than guessing at what is indexed and wasting removal requests on pages that are not in the index.

You can write a small script to collect that information: open an SSL socket to Google and send a GET request for your search … or just call lynx --insecure www.google.com….. Don't forget to add "filter=0" to the query!
You can collect up to 100 results per request from Google (if you ask for more than 100 you will still get just 100).
The search string should be something like "site:yourdomain.xxx + SOME_STRING", where SOME_STRING depends on what you are trying to remove.
Then repeat this to fetch as many URLs from Google as possible, typically well over 1,000 (I usually got around 10K URLs on each pass).
You need to add some delay, of course, or Google will decide you are a robot; use a delay of at least 45 seconds.
Or maybe you have enough public IPs and can run "ifconfig $new_ip" for each search.
I spent around 24 hours (one search per second, 255 public IPs) collecting a large set of URLs from Google's index for my site.

Well, now filter the results, sort them and de-duplicate them, so you end up with a clean list of the URLs you want to remove.
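For reference, here is a rough sketch of that collection-and-dedup step, assuming LWP::UserAgent and URI::Escape are available. The domain, the SOME_STRING filter, the result-link pattern and the output file name are all placeholders, and scraping Google results like this is against their terms of service, so expect to be blocked if you go too fast.

#!/usr/bin/perl
# Sketch only: collect indexed URLs for a site by scraping Google "site:" results
# with &filter=0 and &num=100, then write a de-duplicated, sorted list to list_url.txt.
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw/uri_escape/;

my $domain = 'yourdomain.xxx';                 # placeholder
my $query  = "site:$domain SOME_STRING";       # SOME_STRING narrows the search to the URLs to remove

my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');
my %seen;

for (my $start = 0; $start < 1000; $start += 100) {    # page through the results
    my $url = 'https://www.google.com/search?q=' . uri_escape($query)
            . "&num=100&filter=0&start=$start";
    my $res = $ua->get($url);
    last unless $res->is_success;

    # Crude extraction of result links that point at our domain; Google's result
    # HTML changes over time, so this pattern will likely need adjusting.
    while ($res->decoded_content =~ m{href="(https?://\Q$domain\E[^"]*)"}g) {
        $seen{$1} = 1;
    }
    sleep 45;                                   # at least 45 s between queries, or Google blocks you
}

# Sorted, de-duplicated list for the removal script below.
open my $out, '>', 'list_url.txt' or die "Cannot write list_url.txt: $!";
print {$out} "$_\n" for sort keys %seen;
close $out;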
Now the interesting part … I use Xlib (via X11::GUITest) to:
open the preferred browser (Firefox, Chromium, SeaMonkey) with the removal-request URL, tab to the Submit button, press it, wait 2 seconds and close the window.

With 3 more or less fast computers you can submit around 1 URL per second, which means that within about 15 seconds you will get the annoying "you have reached the limit" message.
With a faster computer you can save that page, search it for "reached", and then put your script to sleep for an hour.

You can run the fetching part and the removal part in a continuous loop (just make sure your software is doing the right thing!).

You can also use the collected URLs to build a sitemap.xml or some page.html to send to Google; if you have already removed those pages from your site or return 401 for them, this speeds things up not just for search and cache removal, but for index removal as well.

I did all of this when, due to a mistake, I needed to remove around 240,000 pages as fast as possible.

Here is my working Perl script for the removal part:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw/sleep/;     # fractional sleeps (plain sleep would truncate 0.2 to 0)
use X11::GUITest qw/
    StartApp
    WaitWindowViewable
    SendKeys
/;

# Base of the Webmaster Tools removal form; the URL to remove is appended to it.
my $pre_string = "https://www.google.com/webmasters/tools/removals-request?hl=en&authuser=0&siteUrl=http://YOURSITEHERE/&url=";

my $counter = 0;

open(my $in, '<', 'list_url.txt') or die "Cannot open list_url.txt: $!";
while (my $url = <$in>) {
    chomp $url;
    # Minimal URL-encoding of characters that would break the query string.
    $url =~ s/=/%3D/g;
    $url =~ s/ /%20/g;
    $url =~ s/&/%26/g;
    my $request = $pre_string . $url;
    next if $request =~ /'/;   # a quote would break the shell command below
    print "$counter $request\n";
    submit_removal($request);
    exit if $counter > 999;    # Google allows fewer than 1000 removal requests per day
    $counter++;
}
close $in;
exit;

sub submit_removal {
    my ($request) = @_;
    StartApp("/YOURBROWSERPATH/seamonkey -new-window '$request'");
    sleep 2;
    # The window title differs per browser; adjust if you use Firefox or Chromium.
    my ($win_id) = WaitWindowViewable('Search Console - Remove URLs - YOURSITE.com/ - Seamonkey');
    die "Couldn't find the window in time!" unless $win_id;

    # Tab through the form until the Submit button has focus.
    SendKeys("{TAB}") for 1 .. 22;
    sleep 0.2;
    SendKeys(" ");             # press SPACE on the Submit button

    sleep 2;
    SendKeys('^(w)');          # close the window with Ctrl+W
}


Now I can fetch the live index from Google, split the list across 3 fast computers and submit a full set of removal requests in less than 15 minutes, since the maximum allowed per day is fewer than 1,000 requests (a limit imposed by Google).

All of this runs automatically, as a continuous process of fetching and removing, building index-removal lists to submit, and, on the faster computer, watching for responses like "limit reached. Try Later" and keeping track of the URLs already submitted for removal.



 

@Looi9037786

We just had this problem.

It was really ugly: 60,000 pages of spam created because of a loophole in our spam filter. We manually deleted all the pages, leaving Google to see a 404 error for each of them.

Months later the ugly pages were still in Google SERPs.

We searched for a bulk removal tool in Google Webmaster Tools with no luck; one-by-one removal using Google's tool clearly wasn't practical.

We decided the best approach was to add all of the 404 pages to the disallow list in our robots.txt file (read up on this if you're not sure what it is).

The trick is to do this within minutes. We did the following:


Go to Google Webmaster Tools, open Crawl Errors, and download the errors as a .csv file.
Open the .csv and highlight the column containing the URLs.
Paste the URLs into a plain text editor (this strips the spreadsheet formatting). You'll get a list of full URLs.
Now you need to change www.yoursite.co.uk/page-you-want-to-delete into Disallow: /page-you-want-to-delete.
So paste the list into MS Word or a similar word processor.
In the Edit menu, use Find and Replace: find www.yoursite.co.uk and replace it with Disallow: .
Tweak until you've got it right, then paste the results into a basic text editor to strip any fancy text formatting.
Bang: you have the URLs in a format ready to copy and paste directly into your robots.txt. (A scripted version of this conversion is sketched below.)
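If you would rather script the conversion than do it in Word, a minimal sketch could look like the following; the input file name urls.txt is hypothetical (one full URL per line, as pasted from the .csv column).

#!/usr/bin/perl
# Sketch: turn a list of full URLs (as exported from the crawl-errors .csv)
# into robots.txt "Disallow:" lines by keeping only the path of each URL.
use strict;
use warnings;

open my $in, '<', 'urls.txt' or die "Cannot open urls.txt: $!";
while (my $url = <$in>) {
    chomp $url;
    next unless $url =~ m{^(?:https?://)?[^/]+(/\S*)};   # capture the path part
    print "Disallow: $1\n";
}
close $in;

Run it and paste the output into robots.txt under the relevant User-agent line.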



 

@Pierce454

As I see it, you basically have two problems:


Google keeps returning the mistakenly created duplicate URLs in search results, confusing users, and
Googlebot keeps trying to recrawl the duplicate URLs, slowing down the crawling of your actual pages.


The first problem is more serious, since you're actually losing visitors. The best way to solve it would be to return HTTP 301 permanent redirects from the accidentally created URLs to the correct ones, so that your visitors will be sent to the page they were looking for. This will also eventually result in search engines dropping the redirected URLs from their index.

(Using 301 instead of 302 redirects here is important, since search engines interpret a 301 redirect as an instruction to index the target URL instead of the source.)
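The answer itself doesn't include code, but as a purely illustrative sketch, a plain Perl CGI handler issuing such a 301 might look like this, assuming Apache-style CGI (where REQUEST_URI is set) and a hypothetical mapping rule in which the duplicates differ from the canonical URL only by a stray query string:

#!/usr/bin/perl
# Illustrative only: send a 301 from an accidentally created duplicate URL to the
# canonical page. The mapping rule (dropping the query string) is a placeholder;
# replace it with whatever actually distinguishes your duplicate URLs.
use strict;
use warnings;

my $request = $ENV{REQUEST_URI} // '/';
(my $canonical = $request) =~ s/\?.*$//;       # hypothetical rule: strip the query string

print "Status: 301 Moved Permanently\r\n";
print "Location: http://www.example.com$canonical\r\n";
print "\r\n";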

For the second problem, I'd recommend creating (and regularly updating) an XML sitemap of your actual pages. This will not completely stop Googlebot from crawling the accidentally created URLs (and you don't want it to, since the bot needs to find the 301 redirects), but it will keep the bot informed about updates to your actual pages, so that it can recrawl them faster.
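A minimal sketch of generating such a sitemap, assuming a hypothetical pages.txt that lists your real page URLs one per line:

#!/usr/bin/perl
# Sketch: emit a minimal sitemap.xml for the URLs listed in pages.txt.
use strict;
use warnings;
use POSIX qw/strftime/;

my $today = strftime('%Y-%m-%d', localtime);

open my $in, '<', 'pages.txt' or die "Cannot open pages.txt: $!";
print qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
while (my $url = <$in>) {
    chomp $url;
    next unless $url;
    $url =~ s/&/&amp;/g;                        # escape ampersands for XML
    print "  <url><loc>$url</loc><lastmod>$today</lastmod></url>\n";
}
print "</urlset>\n";
close $in;

Redirect the output to sitemap.xml and submit it in Webmaster Tools.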

As long as your servers can keep up with the load, you may also want to temporarily increase the maximum crawl rate for your site in Webmaster Tools.



 

@Murphy175

You can remove individual links from within Webmaster Tools, but there is no way to do it for a lot of links at once.

If you know the addresses of the pages that were created by accident, you could add them to your robots.txt file so that they won't be indexed by Google.



 

@LarsenBagley505

Use status code 410 as described here: support.google.com/webmasters/bin/answer.py?hl=en&answer=1663419
If the page no longer exists, make sure that the server returns a 404 (Not found) or 410 (Gone) HTTP status code. Non-HTML files (like PDFs) should be completely removed from your server.
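As an illustration only (not from the linked documentation), a catch-all Perl CGI script that answers 410 Gone could be as small as this, assuming your web server is configured to route the removed URLs to it:

#!/usr/bin/perl
# Sketch: reply 410 Gone for pages that have been permanently removed.
use strict;
use warnings;

print "Status: 410 Gone\r\n";
print "Content-Type: text/html\r\n\r\n";
print "<html><body><h1>410 Gone</h1><p>This page has been permanently removed.</p></body></html>\n";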
