Plus sign in front of URLs in user agents

@Megan663

Posted in: #Http #UserAgent #WebCrawlers

I run a small web crawler and had to decide on what user agent to use for it. Lists of crawler agents as well as Wikipedia suggest the following format:

examplebot/1.2 (+http://www.example.com/bot.html)


However, some bots omit the plus sign in front of the URL. I wonder what it means in the first place, but I couldn't find any explanation. RFC 2616 treats everything in parentheses as a comment and doesn't restrict its format. Yet browsers commonly put a semicolon-separated list of tokens in the comment to advertise the browser's version and capabilities. I don't think this is standardized in any way, other than most browsers formatting it similarly. And I couldn't find anything concerning URLs in the comment.

My question is: Why the plus sign? Do I need it?
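
For illustration, here is a minimal Python sketch of how a comment in a user agent could be inspected for a URL and a leading plus sign. It is only a sketch under assumptions: the names `COMMENT_RE`, `URL_RE` and `comment_urls` are made up for this example, and it handles only non-nested comments, whereas the RFC 2616 grammar also allows nested comments and quoted pairs.

```python
import re

# Matches one non-nested parenthesized comment; nesting and quoted pairs,
# which RFC 2616 also permits, are deliberately ignored in this sketch.
COMMENT_RE = re.compile(r"\(([^()]*)\)")

# Matches an http(s) URL inside a comment, with an optional leading plus.
URL_RE = re.compile(r"(\+?)(https?://[^\s;)]+)")


def comment_urls(user_agent):
    """Return a list of (url, has_plus) pairs found inside comments."""
    found = []
    for comment in COMMENT_RE.findall(user_agent):
        for plus, url in URL_RE.findall(comment):
            found.append((url, plus == "+"))
    return found


if __name__ == "__main__":
    print(comment_urls("examplebot/1.2 (+http://www.example.com/bot.html)"))
    print(comment_urls("examplebot/1.2 (http://www.example.com/bot.html)"))
```

Both forms parse as an ordinary comment; the only difference the sketch reports is whether the plus is present.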


2 Comments


@Ravi8258870

The earliest use of this I could find was in the Heritrix crawler. In its user manual, I found the following:


6.3.1.3.2. user-agent

The initial user-agent template you see when you first start heritrix will look something like the following:

Mozilla/5.0 (compatible; heritrix/0.11.0 +PROJECT_URL_HERE)

You must change at least the PROJECT_URL_HERE and put in place a website that webmasters can go to to view information on the organization or person running a crawl.

The user-agent string must adhere to the following format:

[optional-text] ([optional-text] +PROJECT_URL [optional-text]) [optional-text]

The parenthesis and plus sign before the URL must be present. Other examples of valid user agents would include:

my-heritrix-crawler (+http://mywebsite.com)

Mozilla/5.0 (compatible; bush-crawler +http://whitehouse.gov)

Mozilla/5.0 (compatible; os-heritrix/0.11.0 +http://loc.gov on behalf to the Library of Congress)
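
As a rough sketch, the format quoted above can be checked with a regular expression. The pattern below is my own approximation of the template, not the expression Heritrix itself uses, and the names `HERITRIX_STYLE_RE` and `looks_like_heritrix_format` are hypothetical. It additionally assumes the project URL starts with http:// or https://, which the quoted text does not state.

```python
import re

# Approximation of:
#   [optional-text] ([optional-text] +PROJECT_URL [optional-text]) [optional-text]
# Assumes exactly one non-nested comment and an http(s) project URL.
HERITRIX_STYLE_RE = re.compile(
    r"^[^()]*\([^()]*\+https?://[^\s()]+[^()]*\)[^()]*$"
)


def looks_like_heritrix_format(user_agent):
    """Return True if the string has a comment containing a +URL."""
    return HERITRIX_STYLE_RE.match(user_agent) is not None


if __name__ == "__main__":
    examples = [
        "my-heritrix-crawler (+http://mywebsite.com)",
        "Mozilla/5.0 (compatible; bush-crawler +http://whitehouse.gov)",
        "examplebot/1.2 (http://www.example.com/bot.html)",
    ]
    for ua in examples:
        print(looks_like_heritrix_format(ua), ua)
```

The three examples from the manual pass this check, while a user agent with a plain link and no plus does not.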


@Heady270

I downloaded all the user agents from www.user-agents.org/ and ran a script to count the number of them that used the + style links vs plain links. I excluded the "non-standard" user agent strings that don't match RFC 2616.

Here are the results:

Total: 2471
Standard: 2064
Non-standard: 407
No link: 1391
With link: 673
Plus link: 145
Plain link: 528
Plus link only: 86
Plain link only: 174


So of the 673 user agents that include a link, only about 21.5% (145) include the plus. Of the 260 user agents whose comment is just a link, only 33% (86) include the plus.
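
The percentages follow directly from the counts above. A small Python sketch of the arithmetic, with the counts copied from the results and the variable names chosen only for this example:

```python
# Counts copied from the results listed above.
counts = {
    "with_link": 673,
    "plus_link": 145,
    "plus_link_only": 86,
    "plain_link_only": 174,
}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of all linked user agents that put a plus in front of the URL.
plus_share = share(counts["plus_link"], counts["with_link"])

# Among user agents whose comment is only a link, the share with a plus.
link_only_total = counts["plus_link_only"] + counts["plain_link_only"]
plus_only_share = share(counts["plus_link_only"], link_only_total)

print(f"Plus among linked agents: {plus_share:.1f}%")
print(f"Plus among link-only comments: {plus_only_share:.1f}%")
```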

Based on this analysis, the plus is common, but the majority of user agents choose not to use it. It is fine to leave it out, but it is common enough that it would also be fine to include it.

Here is the Perl script that performed this analysis if you want to run it yourself.

#!/usr/bin/perl

use strict;
use warnings;

# Read the saved HTML listing from STDIN or from files named on the command line.
my $doc = "";

while (my $line = <>) {
    $doc .= $line;
}

# Each user agent sits in a <td class="left"> cell and is followed by &nbsp;.
my @agents = $doc =~ /<td class="left">[ \t\r\n]+(.*?)&nbsp;/gs;

my $total = 0;
my $standard = 0;
my $nonStandard = 0;
my $noHttp = 0;
my $http = 0;
my $plusHttp = 0;
my $noPlusHttp = 0;
my $linkOnly = 0;
my $plusLinkOnly = 0;

for my $agent (@agents) {
    $total++;
    # "Standard": one or more product[/version] tokens, each optionally
    # followed by a parenthesized comment, as in RFC 2616.
    if ($agent =~ /^(?:[a-zA-Z0-9.\-_]+(?:\/[a-zA-Z0-9.\-_]+)?(?: \([^)]+\))?[ ]*)+$/) {
        print "Standard: $agent\n";
        $standard++;
        if ($agent =~ /http/i) {
            print "With link: $agent\n";
            $http++;
            # Does a plus sign directly precede the link?
            if ($agent =~ /\+http/i) {
                print "Plus link: $agent\n";
                $plusHttp++;
            } else {
                print "Plain link: $agent\n";
                $noPlusHttp++;
            }
            # Is the whole comment nothing but the link?
            if ($agent =~ /\(http[^ ]+\)/i) {
                print "Plain link only: $agent\n";
                $linkOnly++;
            } elsif ($agent =~ /\(\+http[^ ]+\)/i) {
                print "Plus link only: $agent\n";
                $plusLinkOnly++;
            }
        } else {
            print "No link: $agent\n";
            $noHttp++;
        }
    } else {
        print "Non-standard: $agent\n";
        $nonStandard++;
    }
}

print "
Total: $total
Standard: $standard
Non-standard: $nonStandard
No link: $noHttp
With link: $http
Plus link: $plusHttp
Plain link: $noPlusHttp
Plus link only: $plusLinkOnly
Plain link only: $linkOnly
";
