How to Replay Apache Access Logs for Realistic HTTP Server Load Testing



Traditional load testing tools like JMeter and ApacheBench (ab) often fall short when trying to reproduce the complex, real-world traffic patterns that trigger edge-case bugs. The advantage of replaying actual access logs is that they contain the exact sequence of requests, timings, and unique parameter combinations that occurred in production.

Here are three battle-tested solutions:


# Option 1: Using GoReplay (gor)
# Note: gor's --input-file expects traffic captured in gor's own format,
# so raw Apache logs must first be converted to .gor request payloads
# (requests.gor is a placeholder name)
gor --input-file requests.gor --output-http "http://localhost:8000" --output-http-track-response

# Option 2: Using Siege with log processing
# (ab can only hammer a single URL -- its -p flag is for POST bodies --
# so use siege's -f url-file mode for multi-URL replay instead)
awk '{print $7}' access.log > urls.txt
siege -c 100 -t 1M -f urls.txt

# Option 3: Custom Python solution with Locust
import re
from locust import HttpUser, task, between

# Parse the log once at module load instead of re-reading it on every task
with open("access.log") as f:
    REQUESTS = [m.groups() for m in
                (re.search(r'"(GET|POST) (.+?) HTTP', line) for line in f)
                if m]

class LogReplayUser(HttpUser):
    wait_time = between(0.1, 0.5)
    _index = 0

    @task
    def replay_request(self):
        # Step through the log one request per task so wait_time applies
        method, path = REQUESTS[LogReplayUser._index % len(REQUESTS)]
        LogReplayUser._index += 1
        if method == "GET":
            self.client.get(path)
        elif method == "POST":
            self.client.post(path)

When replaying logs, pay special attention to:

  • Timing simulation (real-time vs accelerated playback)
  • Session handling and cookie persistence
  • Dynamic parameter replacement
  • Header and authentication token handling
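Session handling in particular tends to break replays, since recorded production session IDs are dead in the test environment. A minimal sketch of dynamic session replacement (the `rewrite_session` helper and the `sessionid` cookie name are illustrative assumptions, not part of any specific framework):

```python
import re

def rewrite_session(headers, session_map):
    """Swap recorded production session IDs for fresh test-environment
    sessions before each replayed request (cookie name is an assumption)."""
    headers = dict(headers)  # don't mutate the caller's dict
    headers["Cookie"] = re.sub(
        r'sessionid=([^;]+)',
        lambda m: "sessionid=" + session_map.get(m.group(1), m.group(1)),
        headers.get("Cookie", ""),
    )
    return headers

# A production log line carried sessionid=abc123; map it to a test session
prod_headers = {"Cookie": "sessionid=abc123; theme=dark"}
fresh = rewrite_session(prod_headers, {"abc123": "test-sess-1"})
```

The same pattern extends to CSRF tokens and bearer tokens: build the old-to-new mapping once per replayed session, then rewrite each request just before sending it.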

For more sophisticated testing scenarios:


# Using GoReplay with rate limiting, middleware, and accelerated playback.
# Playback speed is a percentage suffix on --input-file (there is no
# --speed flag); "|10" caps output at 10 requests/second; middleware
# commands are launched with their interpreter
gor --input-file "requests.gor|200%" --output-http "http://staging-server|10" \
    --middleware "node /path/to/parse_and_modify.js"
    
// JavaScript middleware example (simplified pseudocode -- gor's actual
// middleware protocol exchanges hex-encoded payloads over stdin/stdout;
// see GoReplay's middleware documentation for the wire format)
module.exports = function(req) {
    // Replace production API keys with test keys
    if (req.headers['X-Api-Key'] === 'prod-key-123') {
        req.headers['X-Api-Key'] = 'test-key-456';
    }
    return req;
};

Compare these metrics between production and test environments:

  • Response time distributions
  • Error rates per endpoint
  • Database query patterns
  • Cache hit ratios
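For response time distributions specifically, a nearest-rank percentile comparison is a quick sanity check. A sketch with invented sample values (real response times would come from a `%D` field added to your Apache `LogFormat`):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: deterministic and good enough for a
    prod-vs-replay sanity check."""
    s = sorted(samples)
    k = math.ceil(pct / 100 * len(s))
    return s[max(0, k - 1)]

# Illustrative response times in milliseconds, not real measurements
prod_ms = [12, 15, 18, 22, 30, 45, 60, 120, 250, 900]
replay_ms = [11, 14, 19, 25, 33, 50, 70, 140, 300, 1100]

for p in (50, 95, 99):
    print(f"p{p}: prod={percentile(prod_ms, p)}ms "
          f"replay={percentile(replay_ms, p)}ms")
```

If the replay's p95/p99 drift far from production's, the test environment is not reproducing the conditions you care about, regardless of how faithful the request stream is.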

Watch out for these pitfalls:

  • Missing dependent resources in test environment
  • Hardcoded production URLs in the logs
  • Session-dependent workflows breaking
  • Differing request header expectations
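The hardcoded-URL pitfall is easy to catch up front: proxy-style log lines carry an absolute URL in the request field, and replaying those bypasses your test host entirely. A sketch that surfaces them (the sample lines and hostname are invented for illustration):

```shell
# Invented sample lines for illustration
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET http://prod.example.com/api/v1/users HTTP/1.1" 200 512
1.2.3.4 - - [10/Oct/2024:13:55:37 +0000] "GET /api/v1/orders HTTP/1.1" 200 128
EOF

# Field 7 is the request target; flag absolute URLs before replaying
awk '$7 ~ /^https?:\/\//' /tmp/sample_access.log
```

Any lines this prints need their host rewritten (or stripped to a path) before the log is fed to a replay tool.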

When debugging performance issues or race conditions, synthetic load tests often fail to replicate the exact conditions that trigger bugs in production. The nuanced patterns of real user traffic (varying request sequences, mixed HTTP methods, and organic timing) are difficult to simulate with tools like JMeter or ApacheBench.

Apache access logs contain golden data: timestamps, request URIs, HTTP methods, response codes, and even user agents. By replaying these logs:

  • Preserve the exact request sequences that triggered the bug
  • Maintain relative timing between requests (or scale it)
  • Include all edge cases from production traffic

Here's how to transform logs into replayable cURL commands using awk:

# Build a replay script of curl commands straight from the log.
# Splitting on '"' puts the request line in $2 and the user agent in $6
# (combined log format assumed)
awk -F'"' '{split($2, r, " ");
  print "curl -s -X " r[1] " \"http://yourdomain.com" r[2] "\" -H \"User-Agent: " $6 "\""}' \
  access.log > replay.sh

# Sample output line from replay.sh:
# curl -s -X GET "http://yourdomain.com/api/v1/users" -H "User-Agent: Mozilla/5.0"

For higher throughput, convert logs to Siege's URL format:

awk '{print $7}' access.log | grep -v 'static' > urls.txt
siege -c 50 -d 1 -f urls.txt -i -t 10M

For exact timing reproduction, use this Python script:

import re
import time
from datetime import datetime
import subprocess

log_pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+)'

with open('access.log') as f:
    prev_time = None
    for line in f:
        match = re.match(log_pattern, line)
        if not match:
            continue
            
        ip, timestamp, method, path, status = match.groups()
        log_time = datetime.strptime(timestamp, '%d/%b/%Y:%H:%M:%S %z')
        
        if prev_time:
            # Out-of-order log lines would yield a negative delay; clamp at zero
            delay = max(0.0, (log_time - prev_time).total_seconds())
            time.sleep(delay)
            
        subprocess.run(['curl', '-X', method, f'http://localhost{path}'])
        prev_time = log_time

When replaying at scale:

  • Replace session IDs and CSRF tokens dynamically
  • Handle rate limiting by distributing across IPs
  • Filter out static assets to focus on API endpoints
  • Consider DNS resolution overhead in timing calculations
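Filtering static assets can be a small predicate applied before each request is issued. A sketch (the extension list is an assumption; tune it for your asset pipeline):

```python
import re

# Extensions treated as static content; adjust for your site
STATIC_RE = re.compile(r'\.(css|js|png|jpe?g|gif|ico|svg|woff2?)(\?|$)',
                       re.IGNORECASE)

def is_api_request(path):
    """True for paths worth replaying; static assets mostly exercise the
    file handler or CDN rather than application code."""
    return not STATIC_RE.search(path)

paths = ["/api/v1/users", "/assets/app.js", "/img/logo.PNG?v=2", "/checkout"]
replayable = [p for p in paths if is_api_request(p)]
```

An extension allowlist like this is cruder than checking Content-Type from the log's response data, but it needs nothing beyond the request path, which every log format records.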

Ensure your replay matches production behavior:

# Compare response code distribution
awk '{print $9}' original.log | sort | uniq -c
awk '{print $9}' replay.log | sort | uniq -c

# Check for new errors
grep ' 50[0-9] ' replay.log | awk '{print $7}' | sort | uniq -c