Recent statistics on HTML usage in the wild?

@Nickens628

Posted in: #Html #Research

I have seen some fascinating statistics about what HTML markup is actually in use on the web.

There is Web Authoring Statistics from Google in 2005, and MAMA from Opera in 2008.

Have there been any similar studies more recently than this?


2 Comments


@Samaraweera270

I recently needed to come up with some HTML element statistics. I downloaded a 600k-page web corpus from 2009 and counted how often each tag is used. The list below shows every tag with at least 1,000 uses in that corpus.


a 39923975
td 29707781
br 17266040
div 15644761
tr 15340012
img 13539744
option 9319944
li 8531300
span 7573226
table 5999791
font 5329688
b 4419975
p 3762731
input 3413040
script 3368258
strong 1855448
meta 1809099
link 1262896
ul 1246231
hr 712094
form 703914
dd 521143
i 503215
center 483503
h2 477122
title 446862
body 445094
head 436058
html 435397
h3 400073
th 393534
em 363613
dt 363085
label 342376
h1 318542
select 311449
style 292716
tbody 275069
nobr 251825
small 242097
noscript 227474
u 227338
area 221226
param 204798
h4 162877
dl 142403
iframe 126827
o 86064
sup 72331
h5 64141
fieldset 55447
textarea 54044
object 53226
embed 51864
cite 47406
scr 47050
tt 45748
big 44210
optgroup 43329
blockquote 42869
base 42147
map 41401
col 40191
wbr 36995
legend 32738
d 29939
ol 28879
thead 27083
spacer 26848
pre 26118
h6 24766
s 24510
button 23417
code 21190
rdf 20960
abbr 20356
acronym 20186
w 16818
noindex 14038
dfn 12974
marquee 11350
v 11165
strike 10522
address 9725
description 9058
sc 8677
caption 8633
st1 8374
colgroup 8191
item 8090
pubdate 7160
layer 6156
sub 5634
ins 5433
category 5106
guid 4936
document 4678
del 4639
frame 4548
dc 4323
image 4306
var 3729
variable 3292
tfoot 3282
x 3120
xs 2963
zeroboard 2870
js 2825
ilayer 2823
frameset 2738
media 2669
rx 2658
author 2581
c 2563
h 2227
ifr 2213
xml 2134
m 2007
csobj 1972
n 1948
set 1872
l 1870
permits 1848
samp 1767
menuitem 1738
requires 1732
noframes 1714
z 1686
this 1655
q 1638
t 1629
f 1622
content 1612
license 1533
left 1501
srch 1488
kbd 1482
sfels 1466
hs 1438
para 1434
comments 1394
changeimages 1393
list 1362
actinic 1358
csaction 1352
skype 1338
myarr 1337
index 1283
blip 1279
scri 1262
mlp 1205
e 1199
url 1199
basefont 1167
channel 1153
if 1152
u1 1151
g 1137
xsl 1137
literal 1108
rss 1105
itemtemplate 1053
j 1036
blink 1022
len 1018
id 1001
align 1000

Here are selected tags, from HTML5 and elsewhere, that appear fewer than 1,000 times.


bdo 425
quote 123
time 61
mark 23
bdi 5




Methodology

I didn't try anything particularly fancy. I just took the whole corpus (2.4GB gzipped), decompressed it, and ran it through Perl:

gzip -dc web200904.gz | perl extractTags.pl > /tmp/tagCounts


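where the Perl script does:

use strict;

my %tags = ();
my $count = 0;

# Print the partial counts as "tag:count" lines, then reset the table.
sub emit() {
    foreach my $key (keys(%tags)) {
        print "$key:$tags{$key}\n";
    }
    %tags = ();
    $count = 0;
}

# Record one occurrence of a tag name (lowercased); flush every million tags.
sub count($) {
    my $tagname = $_[0];
    $tagname = "\L$tagname";
    $tags{$tagname} += 1;
    if (++$count >= 1000000) {
        emit();
    }
    return "";
}

while (<STDIN>) {
    # Anything that looks like "<name" is treated as an opening tag.
    s/<([a-zA-Z][a-zA-Z0-9-]*)/count($1)/ge;
}

emit();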


I then aggregated the output using a Python script:

import re

counts = {}

# Each input line looks like "tag:count".
pattern = re.compile(r'^([^:]+):([0-9]+)\n?$')

# Sum the partial counts emitted by the Perl script.
with open('/tmp/tagCounts', 'r') as f:
    for line in f:
        match = pattern.match(line)
        if match:
            name, count = match.groups()
            counts[name] = counts.get(name, 0) + int(count)

# Sort by descending count, then by tag name.
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for name, count in ordered:
    print("%s%s%d" % (name, ' ' * max(1, 40 - len(name)), count))
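For comparison, the two steps above could probably be folded into a single Python pass. This is only a rough sketch of the same "<name" heuristic, not the code I actually ran: the function names are made up for illustration, and reading the corpus as UTF-8 with undecodable bytes replaced is an assumption about the data.

```python
import gzip
import re
import sys
from collections import Counter

# Same heuristic as the Perl script: "<" followed by a tag-like name.
TAG_START = re.compile(r"<([a-zA-Z][a-zA-Z0-9-]*)")


def count_tags_regex(lines):
    """Count lowercased tag names across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for name in TAG_START.findall(line):
            counts[name.lower()] += 1
    return counts


def format_counts(counts):
    """Return report lines sorted by descending count, then by tag name."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%s%s%d" % (name, " " * max(1, 40 - len(name)), n)
            for name, n in ordered]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the corpus is one gzipped text file, decoded leniently.
    with gzip.open(sys.argv[1], "rt", encoding="utf-8", errors="replace") as corpus:
        print("\n".join(format_counts(count_tags_regex(corpus))))
```

Saved under a name of your choosing, it would be run with the corpus path as its only argument. A Counter keeps everything in memory, which should be fine here since the number of distinct tag names is small compared to the corpus.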




Caveats

This script uses a very simple heuristic to look for tags, so it will be confused by a '<' followed by a word in some contexts:


Comparisons in JS or CSS: <script>if (x<notatag) ...
Commented-out tags: <!-- <notatag> -->
Non-tag content in <title>, <textarea>, <noscript>, or similar elements.
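One way to avoid the first two problems would be to count start tags with a real tokenizer instead of a regex. The sketch below uses html.parser.HTMLParser from the Python standard library, which skips comments and treats the contents of <script> and <style> as raw text. The class and function names are made up for illustration, I have not run this over the corpus, and how <title> and <textarea> content is treated may depend on the Python version.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags; comments and script/style bodies are not scanned for tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already lowercased.
        self.counts[tag] += 1


def count_tags_parsed(html):
    """Return a Counter of start-tag names found in one HTML document."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

With this approach the notatag in <script>if (x<notatag) ...</script> and in <!-- <notatag> --> is not counted. It will be slower than the regex pass, and it needs whole documents rather than single lines.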


@Goswami781

These statistics date from January 2011, though the sample isn't that broad: try.powermapper.com/demo/statsversions.aspx
