
Preventing robots from crawling specific parts of a page

@Kimberly868

Posted in: #Forum #Html #SearchEngines #WebCrawlers

As a webmaster in charge of a tiny site that has a forum, I regularly receive complaints from users that both the internal search engine and external searches (like Google) are totally polluted by my users' signatures. They use long signatures, and that's part of the forum's experience, because signatures make a lot of sense on my forum.

So basically I'm seeing two options as of now:


Rendering the signature as a picture: when a user clicks on the "signature picture", they get taken to a page that contains the real signature (with the links in the signature, etc.), and that page is set as non-crawlable by search engine spiders. This would consume some bandwidth and need some work (because I'd need an HTML renderer producing the picture, etc.), but it would obviously solve the issue. There are small gotchas, in that the signature wouldn't respect the font/color scheme of the users, but my users are very creative with their signatures anyway, using custom fonts/colors/sizes, so it's not much of an issue. (A sketch of such a rendering endpoint follows this list.)

Marking every part of the webpage that contains a signature as non-crawlable.
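
As a rough illustration of the first option, here is a minimal sketch of a signature-rendering endpoint using PHP's GD extension. The sig-image.php name, the user parameter, and get_signature_text() are all hypothetical; a real renderer would load the signature from the forum database.

<?php
// Hypothetical sig-image.php?user=123 endpoint using PHP's GD extension.
// get_signature_text() is an assumed helper that loads the plain-text
// signature from the forum database; it is not a real API.
$sigText = get_signature_text((int)($_GET['user'] ?? 0));

$img = imagecreatetruecolor(400, 40);           // canvas for the signature
$bg  = imagecolorallocate($img, 255, 255, 255); // white background
$fg  = imagecolorallocate($img, 0, 0, 0);       // black text
imagefill($img, 0, 0, $bg);
imagestring($img, 3, 5, 12, $sigText, $fg);     // GD built-in font #3

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);

The resulting image would then be wrapped in a link to the page holding the real, clickable signature, with that page disallowed in robots.txt.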


However, I'm not sure about the latter: is this something that can be done? Can you mark specific parts of a webpage as non-crawlable?


7 Comments


 

@Candy875

You can wrap the page in a PHP if with an else branch that leads to a captcha; solving the captcha gives the key that satisfies the if.

I don't really worry about crawlers here, because if the user's credentials don't match on my page, they get a blank page or are sent to the login page.

<?php
session_start();

// $key must match whatever captcha.php stores; see the note below.
$key = hash('sha256', date('Y-m-d') . 'site-secret');

if (empty($_SESSION['captcha']) or $_SESSION['captcha'] != $key) {
    header("Location: captcha.php");
    exit; // without this, the rest of the page is still sent to the client
}

// Only reached with a valid captcha token in the session.
echo "the page";
?>


$key should be a hash of the current day, or something else that changes, so that simply adding a stale value to the session is not sufficient.

Write in the comments if you want me to add an example captcha, because I don't have one at hand right now.
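
In that spirit, here is a minimal sketch of what captcha.php might look like under the assumptions above (the day-based $key and a gated page named page.php, both hypothetical); a trivial arithmetic question stands in for a real image captcha:

<?php
// Hypothetical captcha.php paired with the gate above. On a correct answer
// it stores the shared $key in the session and sends the visitor back.
session_start();
$key = hash('sha256', date('Y-m-d') . 'site-secret'); // must match the gate's $key

if (isset($_POST['answer']) && $_POST['answer'] === ($_SESSION['expected'] ?? null)) {
    $_SESSION['captcha'] = $key;   // unlock the gated page
    header('Location: page.php');  // assumed name of the gated page
    exit;
}

// Generate a new challenge and remember the expected answer.
$a = random_int(1, 9);
$b = random_int(1, 9);
$_SESSION['expected'] = (string)($a + $b);
echo "<form method='post'>What is $a + $b? <input name='answer'>"
   . "<input type='submit' value='Go'></form>";
?>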



 

@Lengel546

One way to do this is to use an image of text rather than plain text.

It is possible that Google will eventually be smart enough to read the text out of the image, so it might not be completely future-proof, but it should work well for a while yet.

There are a number of disadvantages to this approach: it's bad for visually impaired visitors, it's bad if you want your content to adapt to mobile devices versus desktop computers, and so on.

But it is a method that currently (somewhat) works.



 

@BetL925

Here is the same answer I provided to "noindex tag for Google" on Stack Overflow:

You can prevent Google from seeing portions of the page by putting those portions in iframes that are blocked by robots.txt.

robots.txt

User-agent: *
Disallow: /iframes/


index.html

This text is crawlable, but now you'll see
text that search engines can't see:
<iframe src="/iframes/hidden.html" width="100%" height="300" scrolling="no"></iframe>


/iframes/hidden.html

Search engines cannot see this text.


Instead of using iframes, you could load the contents of the hidden file using AJAX. Here is an example that uses jQuery's $.get to do so:

This text is crawlable, but now you'll see
text that search engines can't see:
<div id="hidden"></div>
<script>
$.get(
    "/iframes/hidden.html",
    function(data){ $('#hidden').html(data); }
);
</script>



 

@Kimberly868

This is easy.

Before you serve your page, you need to know whether it is going to a bot, a desktop computer, or a phone, and set the content accordingly. This is standard practice in this day and age and core functionality of some CMSs.

There are plenty of solutions on Stack Exchange for doing redirection based on the user agent that can be put in your .htaccess. If this suits your forum software, then you can run different code off the same DB to deliver what Google needs without the chaff and trimmings.

Alternatively, you can put a little line in your PHP code that says "if the user agent is Googlebot, then don't show signatures" (a sketch follows).
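
A minimal sketch of that check, assuming a hypothetical render_signature() template helper (not a real forum API):

<?php
// Skip signature output when the User-Agent identifies itself as Googlebot.
// render_signature() is an assumed template helper, not a real API.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (stripos($ua, 'Googlebot') === false) {
    render_signature($post); // normal visitors still see the signature
}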

If you really cannot do that, you can get mod_proxy to serve the bot, and use it to strip out anything your PHP code generates that the bot does not need to see.

Technically, Google does not approve of their search engine being shown a different page from what the normal site visitor sees. However, to date, they have not taken the BBC and others that provide browser/IP/visitor-specific content off their search engine results. They also have limited means to tell whether their bot has been 'conned'.

The alternative solution of hiding content with CSS, to be re-enabled by a script, is also a bit of a grey area. According to their own Webmaster Tools guidelines of 20/6/11, this is not a good idea:
www.google.com/support/webmasters/bin/answer.py?answer=66353
That may not be a tablet cast in stone, but it is up to date and comes from Google.

The hide-the-content trick will not work for the minority of people who do not have JavaScript; that may not be a huge concern. However, waiting for the document to load and then showing the signatures will not make for a satisfying viewing experience: you will think the page has loaded, and then it will jump about as the hidden signatures appear and push the content down the page. This kind of page load can be irritating on a low-end nettop, though it may not be noticeable on a fast developer machine with a fast internet connection.



 

@Cofer257

No, there is no way to prevent robots from crawling parts of pages. It's the whole page or nothing.

The snippets in Google's search results are usually taken from the meta description on the page. So you could make Google show a specific part of the page by putting that in the meta description tag. With user-generated content it's difficult to get good snippets, but taking the first post of the thread would probably work.
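
A sketch of that idea as a template fragment, assuming the thread's first post is available in a hypothetical $firstPost variable; 155 characters is roughly the snippet length Google tends to display:

<?php
// Build the meta description from the thread's first post ($firstPost is
// assumed to hold its raw text); strip markup and truncate for the snippet.
$snippet = mb_substr(trim(strip_tags($firstPost)), 0, 155);
?>
<meta name="description" content="<?= htmlspecialchars($snippet, ENT_QUOTES) ?>">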

The only other way I can think of is to use JavaScript. Something like what paulmorriss suggested may work, but I think search engines would still index the content if it's in the HTML. You could remove it from the HTML, store it in a JavaScript string, and then add it back on page load (see the sketch below). This gets a bit complex, though.
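
A sketch of that idea as a PHP template fragment. The $sig and $post_id variables are hypothetical stand-ins for data from the forum database; json_encode() does the escaping, and JSON_HEX_TAG keeps a literal </script> inside a signature from breaking out of the script block:

<?php
// Keep the signature out of the crawlable HTML and inject it on page load.
// $sig and $post_id are assumed to come from the forum database.
$sigJs = json_encode($sig, JSON_HEX_TAG | JSON_HEX_AMP);
?>
<div class="sig" id="sig-<?= (int)$post_id ?>"></div>
<script>
document.getElementById('sig-<?= (int)$post_id ?>').innerHTML = <?= $sigJs ?>;
</script>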

Finally, one thing to keep in mind: if Google is showing users' signatures in its snippets, it has decided that the signature is the part of the page most relevant to the user's query.



 

@Gretchen104

Another solution is to wrap the sig in a span or div with its style set to display:none, and then use JavaScript to remove that style so the text displays for browsers with JavaScript on. Search engines know it isn't going to be displayed, so they shouldn't index it.

This bit of HTML, CSS and JavaScript should do it:

HTML:

<span class="sig">signature goes here</span>


CSS:

.sig {
display:none;
}


javascript:

<script type="text/javascript">
$(document).ready(function()
{
    $(".sig").show();
});
</script>


You'll need to include the jQuery library.


