Extracting Browser Version Statistics from Nginx Logs: Parsing User-Agent Strings for Market Share Analysis


When analyzing web traffic, nginx logs contain valuable User-Agent strings that look something like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1

We need to extract just the browser name and major version. Here's how we can approach this with command-line tools:

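Assuming the default combined log format, the User-Agent is the sixth double-quote-delimited field. This command extracts and counts browser versions; it needs gawk, because the three-argument form of match() is a GNU extension:

gawk -F'"' '{
       ua = $6   # User-Agent: 6th quote-delimited field in the combined format
       # Test the most specific tokens first: Edge and Opera also advertise Chrome and Safari
       if      (match(ua, /(Edge?|OPR)\/([0-9]+)/, m))      { b = (m[1] == "OPR") ? "Opera" : "Edge"; v = m[2] }
       else if (match(ua, /(Firefox|Chrome)\/([0-9]+)/, m)) { b = m[1]; v = m[2] }
       else if (match(ua, /Version\/([0-9]+).*Safari/, m))  { b = "Safari"; v = m[1] }
       else if (match(ua, /MSIE ([0-9]+)/, m))              { b = "IE"; v = m[1] }
       else if (match(ua, /Trident\/.*rv:([0-9]+)/, m))     { b = "IE"; v = m[1] }  # IE11 reports its version in rv:
       else next                                            # unrecognized User-Agent
       versions[b " " v]++
     }
     END {
       for (k in versions) print versions[k], k
     }' access.log | sort -nr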
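
The command above locates the User-Agent by field position. As a minimal sketch of making that assumption explicit, the following Python regex describes nginx's default combined format and pulls out the User-Agent by name. The names COMBINED and extract_user_agent, and the sample log line, are illustrative; adjust the pattern if your log_format directive differs.

```python
import re

# Regex for nginx's default "combined" log format (an assumption about your config):
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent
# "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def extract_user_agent(line):
    """Return the User-Agent field of a combined-format line, or None."""
    m = COMBINED.match(line)
    return m.group("user_agent") if m else None

# Made-up sample line for illustration only.
sample = (
    '203.0.113.7 - - [10/Jul/2021:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 612 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"'
)
print(extract_user_agent(sample))
```

Returning None for lines that do not match lets the caller skip malformed entries instead of miscounting them.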

For more robust parsing, consider these specialized tools:

  • GoAccess: Real-time web log analyzer with built-in UA parsing
  • AWStats: Advanced log file analyzer with detailed browser reports
  • ELK Stack: For large-scale log analysis with User-Agent processor plugin

For more control, here's a Python script using the user-agents library:

from collections import defaultdict
import user_agents  # third-party package (user-agents on PyPI)

counts = defaultdict(int)

with open('access.log') as f:
    for line in f:
        # The User-Agent is the 6th quote-delimited field in the combined
        # log format (adjust the index if your log format differs)
        fields = line.split('"')
        if len(fields) < 6:
            continue  # skip malformed lines
        ua = user_agents.parse(fields[5])
        # Keep only the major version, e.g. "91" from "91.0.4472"
        major = ua.browser.version_string.split('.')[0]
        key = f"{ua.browser.family} {major}"
        counts[key] += 1

for browser, count in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{count} {browser}")

Some User-Agent strings require special handling:

# Microsoft Edge (Chromium-based)
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.74 Safari/537.36 Edg/79.0.309.43"

# Internet Explorer 11
"Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"

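For these cases, the order of the checks matters. Test for the Edg token before Chrome, since Chromium-based Edge also advertises Chrome and Safari, and read the IE 11 version from rv: rather than from Trident/7.0, which is the rendering-engine version.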
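
One way to express such rules is an ordered list where the first matching pattern wins. This is a sketch, not a complete User-Agent parser: RULES and classify are hypothetical names, and the token list covers only the browsers discussed here.

```python
import re

# Ordered rules: the first pattern that matches wins, so specific tokens
# (Edg, OPR, Trident) are tested before the generic Chrome and Safari ones.
RULES = [
    ("Edge",    re.compile(r"Edg(?:e|A|iOS)?/(\d+)")),
    ("Opera",   re.compile(r"OPR/(\d+)")),
    ("IE",      re.compile(r"Trident/.*rv:(\d+)")),      # IE 11
    ("IE",      re.compile(r"MSIE (\d+)")),              # IE 10 and older
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome",  re.compile(r"Chrome/(\d+)")),
    ("Safari",  re.compile(r"Version/(\d+).*Safari/")),  # Version/ holds the real Safari version
]

def classify(ua):
    """Return (browser, major_version) or None if no rule matches."""
    for name, pattern in RULES:
        m = pattern.search(ua)
        if m:
            return name, int(m.group(1))
    return None

edge = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.74 Safari/537.36 Edg/79.0.309.43")
ie11 = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
print(classify(edge))  # ('Edge', 79)
print(classify(ie11))  # ('IE', 11)
```

Keeping the rules in a list makes the priority explicit and lets you add a browser by inserting one line at the right position.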

To get clean, sorted output limited to the most common entries, pipe the results through additional Unix tools:

your_parsing_command | sort -nr | head -20
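
If the counting already happens in Python, the same top-N selection can be done without the shell. A minimal sketch using collections.Counter from the standard library; top_browsers and the sample keys are made up for illustration.

```python
from collections import Counter

def top_browsers(keys, n=20):
    """Return the n most common browser keys as (key, count) pairs."""
    return Counter(keys).most_common(n)

# Made-up sample keys; in practice these come from the parsing step.
sample_keys = ["Chrome 91", "Chrome 91", "Safari 14",
               "Chrome 91", "Firefox 89", "Safari 14"]

for key, count in top_browsers(sample_keys, n=2):
    print(count, key)
```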

For market share analysis, we usually only care about the browser family and major version (e.g., Chrome 91), not minor versions or operating systems.

Here are three effective methods to extract and count browser major versions:

1. Using awk and sed

awk -F\" '{print $6}' access.log | \
sed -nE -e 's#.*(Firefox|Edge?|Chrome|MSIE|Trident).*#\1#p' -e t \
        -e 's#.*(Safari|Opera).*#\1#p' | \
sort | uniq -c | sort -rn

This counts browser families but discards versions. Safari is tested last because Chrome and Edge strings also contain a Safari token. To keep the major version as well:

awk -F\" '{print $6}' access.log | \
sed -nE -e 's#.*(Firefox|Edge?|Chrome|MSIE)[/ ]([0-9]+).*#\1 \2#p' -e t \
        -e 's#.*Trident/.*rv:([0-9]+).*#MSIE \1#p' -e t \
        -e 's#.*Version/([0-9]+).*Safari.*#Safari \1#p' | \
sort | uniq -c | sort -rn

2. Using Log Analysis Tools

For more sophisticated analysis, use GoAccess:

goaccess access.log --log-format=COMBINED --browsers-file=/path/to/browsers.list

Or AWStats with proper configuration for browser detection.

3. Python Script Solution

For maximum flexibility, here's a Python script:

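import re
from collections import defaultdict

# Group 1 is the browser token, group 2 its major version.
# Note: for Safari this captures the WebKit build (Safari/604),
# not the browser version, which lives in the Version/ token.
pattern = re.compile(
    r'(Firefox|Chrome|Safari|Opera|Edge|MSIE|Trident)[/ ]+([0-9]+)'
)

counts = defaultdict(int)

with open('access.log') as f:
    for line in f:
        # Extract User-Agent (6th quote-delimited field in combined log format)
        fields = line.split('"')
        if len(fields) < 6:
            continue  # skip lines that are not in combined format
        match = pattern.search(fields[5])
        if match:
            browser, version = match.groups()
            key = f"{browser}{version}"  # full major version, e.g. Chrome91
            counts[key] += 1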

for browser, count in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{count} {browser}")

Some User-Agents require special handling:

  • Internet Explorer 11 drops the MSIE token and identifies itself only through Trident/7.0 and rv:11.0
  • Mobile browsers often include the word "Mobile"
  • Bots and crawlers should be filtered out

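Here's an extended pattern that also recognizes the Edg token, Android, and iOS devices. Because search() returns the leftmost match, a Chromium-based Edge string still matches Chrome first, so check for Edg separately, and filter out bots before counting: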

pattern = re.compile(
    r'(?:Firefox|Chrome|Safari|Opera|Edg|IE|MSIE|Trident|Android)[/ ]+([0-9]+)|'
    r'(?:iPhone|iPod|iPad).+Version/(\d+)',
    re.IGNORECASE
)
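
The bot and mobile cases from the list above can be handled as a separate filtering step before counting. The sketch below is a heuristic under stated assumptions: the substrings in BOT_HINT are a guess at common crawler markers, not an exhaustive list, and all function names are hypothetical.

```python
import re

# Heuristic substrings; real crawlers are more varied, and a few genuine
# devices may contain "bot" in their name, so treat this as a rough filter.
BOT_HINT = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def is_probable_bot(ua):
    """True if the User-Agent looks like an automated client."""
    return bool(BOT_HINT.search(ua))

def is_mobile(ua):
    """True if the User-Agent carries the 'Mobile' token."""
    return "Mobile" in ua

def human_user_agents(user_agents):
    """Drop empty values, the '-' placeholder, and probable bots."""
    return [ua for ua in user_agents
            if ua and ua != "-" and not is_probable_bot(ua)]

samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
    "-",
]
for ua in human_user_agents(samples):
    print("mobile" if is_mobile(ua) else "desktop", ua[:40])
```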

For better presentation, pipe the results to a simple bar chart:

python analyze_browsers.py | \
awk '{printf("%-8s ", $2); for(i=0;i<$1/50;i++) {printf("#")}; print ""}'
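
The same chart can be drawn in Python, scaling bars to the largest count instead of a fixed divisor. This is a sketch: bar_chart is a hypothetical helper, and the counts are the illustrative figures used elsewhere in this article.

```python
def bar_chart(counts, width=40):
    """Return one text line per browser, longest bar first."""
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Scale relative to the largest count; always draw at least one '#'.
        bar = "#" * max(1, round(count / peak * width))
        lines.append(f"{name:<12} {bar} {count}")
    return lines

# Illustrative counts only.
for line in bar_chart({"Chrome9": 1200, "Firefox8": 900, "Safari14": 600}):
    print(line)
```

Scaling to the peak keeps the chart readable whether the log holds hundreds or millions of requests.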

Or generate a CSV for spreadsheet import:

Browser,Count
Chrome9,1200
Firefox8,900
Safari14,600
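
A file in that shape can be produced with the standard csv module. A minimal sketch, assuming counts is the dictionary built by the parsing script; write_csv is a hypothetical helper and the figures are the illustrative ones shown above.

```python
import csv
import sys

def write_csv(counts, out):
    """Write browser counts as CSV, most common first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Browser", "Count"])
    for browser, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        writer.writerow([browser, count])

# Illustrative counts only; pass an open file instead of sys.stdout to save.
write_csv({"Chrome9": 1200, "Firefox8": 900, "Safari14": 600}, sys.stdout)
```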