: How should I deal with user agent parsing in logs? My web app project includes logging functionality so we can see where visitors are coming from (referrer URL), what the popular user agents
My web app project includes logging functionality so we can see where visitors are coming from (referrer URL), what the popular user agents are, what pages are most popular, etc. The log is stored in SQL Server, and when I query the user agents I use a large (almost 100 lines) and growing CASE statement to separate the user agents using string matching (i.e. if the user agent contains the string "Firefox/9" then it's Firefox 9). Is there a better way to do this so I don't have to continually add to that CASE statement to deal with new browser releases?
Also, how should I deal with less common, weird/unknown user agents? I've seen the following in the logs and been unable to find good information online about what they are:
WordPress/3.3.1; www.facecolony.org Mozilla/4.0 ( www.hairirons.org redips; <a href=http://hairirons.org/>chi hair iron</a>)
I'd guess they're bots/crawlers, but the sites they point to don't appear to reference web crawlers (or even be available sometimes). I've seen other user agents aren't familiar to me, but I know they're bots because they include "bot" or "spider" or something similar in them.
More posts by @Shanna517
4 Comments
Sorted by latest first Latest Oldest Best
I would use the PHP function get_browser but if you are logging this data from access logs and visits to the website. I would just hook into www.openwebanalytics.com which is a great free open source analytics program which you host yourself. It's database schema is available in the wiki for ideas on how it works. Using case is not at all efficient.
Completely agree that using a SQL query is not the most effective approach to this. However, if the logs you happen to need are in SQL already, and it's a one off or data sizes a not large enough to warrant improving the approach, here's a handy expression that can be used as a starting point...
Based on Sarath's post here: www.consultsarath.com/contents/articles/KB000009-tsql-function-how-to-parse-useragent-to-extract-browser-name.aspx, but refined to handle the various IE user agent string variations when IE's Compatibility mode is enabled, as documented here msdn.microsoft.com/en-us/library/ms537503(v=vs.85).aspx.
SELECT CASE
WHEN PATINDEX('Mozilla/5.0+(compatible;+MSIE+10.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 10.0 (standards mode)'
WHEN PATINDEX('%Trident/6.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 10.0 (compat mode)'
WHEN PATINDEX('Mozilla/5.0+(compatible;+MSIE+9.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 9.0 (standards mode)'
WHEN PATINDEX('%Trident/5.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 9.0 (compat mode)'
WHEN PATINDEX('Mozilla/4.0+(compatible;+MSIE+8.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 8.0 (standards mode)'
WHEN PATINDEX('%Trident/4.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 8.0 (compat mode)'
WHEN PATINDEX('Mozilla/4.0+(compatible;+MSIE+7.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 7.0'
WHEN PATINDEX('Mozilla/4.0+(compatible;+MSIE+6.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 6.0'
WHEN PATINDEX('Mozilla/4.0+(compatible;+MSIE+5.0%',[cs(User-Agent)]) > 0 THEN 'Internet Explorer 5.0'
WHEN PATINDEX('%Firefox%',[cs(User-Agent)]) > 0 THEN 'Mozilla ' + REPLACE(SUBSTRING([cs(User-Agent)], PATINDEX('%Firefox%',[cs(User-Agent)]), 100),'/', ' ')
WHEN PATINDEX('%Chrome%',[cs(User-Agent)]) > 0 THEN 'Google ' + REPLACE(SUBSTRING([cs(User-Agent)], PATINDEX('%Chrome%',[cs(User-Agent)]), PATINDEX('%Safari%',[cs(User-Agent)]) - PATINDEX('%Chrome%',[cs(User-Agent)])),'/', ' ')
WHEN PATINDEX('%Safari%',[cs(User-Agent)]) > 0 THEN 'Safari ' + REPLACE(SUBSTRING([cs(User-Agent)], PATINDEX('%Version%',[cs(User-Agent)]), PATINDEX('%Safari%',[cs(User-Agent)])- PATINDEX('%Version%',[cs(User-Agent)])),'Version/', '')
WHEN PATINDEX('%Opera%',[cs(User-Agent)]) > 0 THEN REPLACE(SUBSTRING([cs(User-Agent)], PATINDEX('%Opera%',[cs(User-Agent)]), PATINDEX('%(%',[cs(User-Agent)])-2),'/', ' ')
WHEN PATINDEX('%bot%',[cs(User-Agent)]) > 0 THEN 'bot'
ELSE 'Unknown Browser: ' + [cs(User-Agent)] END
AS Browser FROM Log
As per Sarath's original post, this can be wrapped in a Function (if you have that level of permissions in the database you're querying).
Using the programming functions in SQL is a pretty bad way to deal with this. A decent method would probably be to add an extra field to the table for the browser name (or ID from another table), and every time you store a row in the database, use your programming language there (e.g. PHP) to decide on the user agent and store that. Then you can do a select and group without too much trouble.
If you change your browser definitions in the future you could run a script that will go through and update the entire database at once.
For unknown user agents, you could try extracting a URL from the string with regular expressions and storing that, e.g. www.facecolony.org or www.hairirons.org in your example. If you can't find a URL then I would just put "Unknown" and you can go and look at those individually.
Note, the ones you posted look suspiciously like referrer spam to me.
I'd reccommend that you consider implementing a free analytics product like google analytics (information on alternatives available here) it would make your task much easier and you'd get a whole lot more useful and accurate information about your visitors without the lengthy work!
Only one of the user agents you list is actually a user agent Mozilla/4.0 looks like a partial string for Internet Explorer 6 take a look at this for an explanation.
The other WordPress/3.3.1 is actually a content management system (no idea why it appears here!).
www.user-agents.org/ and www.useragentstring.com/ have lists of user agent strings, the former offers an xml download, though I don't know how up to date they are.
Terms of Use Create Support ticket Your support tickets Stock Market News! © vmapp.org2024 All Rights reserved.