When dealing with aggressive web scrapers, simple IP-based blocking often fails because determined attackers rotate through their ISP's IP pool. Here's what I observed in my logs:
203.0.113.45 - - [01/Jan/2023:12:34:56] "GET /products HTTP/1.1" 200
203.0.113.78 - - [01/Jan/2023:12:35:01] "GET /products HTTP/1.1" 200
203.0.113.112 - - [01/Jan/2023:12:35:07] "GET /products HTTP/1.1" 200
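Before reaching for WHOIS, it can help to confirm that the rotating addresses really cluster in one network. The following is a minimal sketch, assuming each log line starts with the client IP as in the excerpt above; the /24 grouping width and the name `group_by_network` are illustrative choices, not part of the original setup.

```python
import ipaddress
import re
from collections import Counter

# Assumes the client IP is the first token on each log line.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

def group_by_network(lines, prefix_len=24):
    """Count requests per enclosing network of the given prefix length."""
    counts = Counter()
    for line in lines:
        match = IP_AT_START.match(line)
        if not match:
            continue
        try:
            # strict=False masks the host bits: 203.0.113.45/24 -> 203.0.113.0/24
            net = ipaddress.ip_network(f"{match.group(1)}/{prefix_len}", strict=False)
        except ValueError:
            continue  # skip tokens that look like an IP but are not valid
        counts[net] += 1
    return counts

sample = [
    '203.0.113.45 - - [01/Jan/2023:12:34:56] "GET /products HTTP/1.1" 200',
    '203.0.113.78 - - [01/Jan/2023:12:35:01] "GET /products HTTP/1.1" 200',
    '203.0.113.112 - - [01/Jan/2023:12:35:07] "GET /products HTTP/1.1" 200',
]
for net, hits in group_by_network(sample).most_common():
    print(net, hits)
```

If most of the traffic lands in a handful of buckets, the ASN-level approach is probably worth pursuing.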
First, gather WHOIS data for sample IPs using the command line:
whois 203.0.113.45 | grep -Ei 'netname|origin|mnt-by'
# Output:
# netname: MALICIOUS-ISP-NET
# origin: AS12345
# mnt-by: MAINT-MALICIOUS-ISP
Or programmatically with Python:
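import re
import subprocess

def get_whois_data(ip):
    # pythonwhois is built around domain records and does not expose
    # netname/origin/mnt-by, so this wraps the same whois CLI used above.
    raw = subprocess.run(['whois', ip], capture_output=True, text=True).stdout
    def field(name):
        match = re.search(rf'^{name}:\s*(\S+)', raw, re.IGNORECASE | re.MULTILINE)
        return match.group(1) if match else None
    return {
        'netname': field('netname'),
        'asn': field('origin'),
        'mnt_by': field('mnt-by')
    }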
Using the bgp.tools API to get all prefixes for an ASN:
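import requests

def get_asn_prefixes(asn):
    # Endpoint and response shape as given in this post; check them against the current bgp.tools docs
    url = f"https://api.bgp.tools/prefixes/{asn}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.json()['prefixes']

# Example usage:
prefixes = get_asn_prefixes("AS12345")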
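An ASN often announces overlapping or adjacent prefixes, which would translate into redundant firewall rules. As a rough sketch, the standard library's `ipaddress.collapse_addresses` can merge them first; `collapse_prefixes` is a hypothetical helper name, and the sample ranges are documentation addresses rather than real data.

```python
import ipaddress

def collapse_prefixes(prefixes):
    """Merge overlapping or adjacent CIDR blocks, keeping IPv4 and IPv6 apart."""
    nets = [ipaddress.ip_network(p, strict=False) for p in prefixes]
    # collapse_addresses raises TypeError on mixed versions, so split first.
    v4 = ipaddress.collapse_addresses(n for n in nets if n.version == 4)
    v6 = ipaddress.collapse_addresses(n for n in nets if n.version == 6)
    return [str(n) for n in v4] + [str(n) for n in v6]

print(collapse_prefixes([
    "203.0.113.0/25",
    "203.0.113.128/25",   # adjacent to the line above, merges into a /24
    "198.51.100.0/24",
    "198.51.100.64/26",   # already covered by the /24
]))
# ['198.51.100.0/24', '203.0.113.0/24']
```

Fewer rules also means less per-packet work for a linear iptables chain.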
Generate iptables rules from the CIDR ranges:
def generate_iptables_rules(prefixes):
    rules = []
    for prefix in prefixes:
        rules.append(f"iptables -A INPUT -s {prefix} -j DROP")
    return rules

# Save to a shell script
with open('block_isp.sh', 'w') as f:
    f.write("#!/bin/bash\n")
    f.write("\n".join(generate_iptables_rules(prefixes)))
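Because the prefixes come from an external API and end up in a shell script, it seems prudent to validate each one before it becomes a rule. The sketch below is one way to do that, assuming IPv6 ranges should go to `ip6tables`; `build_rule` is a hypothetical helper, not something the original script defines.

```python
import ipaddress

def build_rule(prefix):
    """Return an argv list for one DROP rule, or None if prefix is not valid CIDR."""
    try:
        net = ipaddress.ip_network(prefix.strip(), strict=False)
    except ValueError:
        return None  # reject anything that is not a plain network
    tool = "iptables" if net.version == 4 else "ip6tables"
    # Use the normalized form so only a parsed network reaches the command line.
    return [tool, "-A", "INPUT", "-s", str(net), "-j", "DROP"]

for candidate in ["203.0.113.0/24", "2001:db8::/32", "203.0.113.0/24; rm -rf /"]:
    print(build_rule(candidate))
```

The third candidate is rejected because it does not parse as a network, so it never reaches the firewall command.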
For continuous monitoring, integrate this with your log analysis:
import shlex
import subprocess

def update_isp_blocks():
    # Get suspicious IPs from logs
    suspicious_ips = [...]  # Your log analysis logic
    # Get ASN from first IP
    asn = get_whois_data(suspicious_ips[0])['asn']
    # Fetch and apply new rules
    prefixes = get_asn_prefixes(asn)
    for rule in generate_iptables_rules(prefixes):
        # Split into argv instead of using shell=True, so a malformed prefix cannot inject shell syntax
        subprocess.run(shlex.split(rule), check=True)
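Taking the ASN from only the first suspicious IP can misfire if that address happens to be an outlier. A possible refinement, sketched here under assumptions, is to tally ASNs across all flagged addresses and act only when one clearly dominates. `dominant_asn`, the `lookup` callable, and the 0.8 default threshold are all illustrative; the lookup table is a stand-in for real WHOIS queries.

```python
from collections import Counter

def dominant_asn(ips, lookup, min_share=0.8):
    """Return the ASN behind at least min_share of the resolvable IPs, else None."""
    asns = [asn for asn in (lookup(ip) for ip in ips) if asn]
    if not asns:
        return None
    asn, hits = Counter(asns).most_common(1)[0]
    return asn if hits / len(asns) >= min_share else None

# Stand-in lookup table for illustration; a real lookup would query WHOIS.
fake_whois = {
    "203.0.113.45": "AS12345",
    "203.0.113.78": "AS12345",
    "203.0.113.112": "AS12345",
    "198.51.100.7": "AS64500",
}
print(dominant_asn(list(fake_whois), fake_whois.get, min_share=0.7))  # AS12345
print(dominant_asn(list(fake_whois), fake_whois.get, min_share=0.8))  # None
```

Returning None when no ASN dominates gives the caller a natural point to skip blocking and fall back to per-IP handling.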
Create a custom Fail2Ban filter and action:
# /etc/fail2ban/filter.d/isp-abuse.conf
[Definition]
failregex = ^<HOST> .*"(?:GET|POST) [^"]*" (?:200\b|.*Spider)
# /etc/fail2ban/action.d/iptables-asn.conf
[Definition]
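# Fail2Ban substitutes <ip> for the offending address; it has no built-in ASN tag,
# so block_asn.sh has to resolve the ASN from the IP itself.
actionban = /usr/local/bin/block_asn.sh <ip>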
When dealing with aggressive web scrapers using rotating IPs from the same ISP, traditional IP-based blocking becomes ineffective. Here's a technical deep dive into solving this at the ASN level.
First, gather sample IPs from your logs and run WHOIS lookups. The key fields to note are netname, origin, and mnt-by:
whois 203.0.113.45 | grep -E 'netname|origin|mnt-by'
# Example output:
# netname: MALICIOUS-ISP-NET
# origin: AS12345
# mnt-by: MAINT-MALICIOUS-ISP
Use the RIPE/ARIN REST APIs to get all prefixes for an ASN:
# Using RIPE's API
curl -s "https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS12345" | jq -r '.data.prefixes[].prefix'
# Using ARIN's REST API
curl -s "https://whois.arin.net/rest/asn/AS12345/prefixes" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}'
Here's a Python script to generate iptables rules for an entire ASN:
import requests
import subprocess

def block_asn(asn):
    url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={asn}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    for prefix in response.json()['data']['prefixes']:
        cidr = prefix['prefix']
        # The result can include IPv6 prefixes, which need ip6tables
        tool = 'ip6tables' if ':' in cidr else 'iptables'
        subprocess.run([tool, '-A', 'INPUT', '-s', cidr, '-j', 'DROP'], check=True)
        print(f"Blocked {cidr}")

block_asn("AS12345")
For real-time checking, consider using MaxMind's GeoIP or Team Cymru's IP-to-ASN service:
# Using Team Cymru's service
whois -h whois.cymru.com " -v 203.0.113.45"
# Output includes ASN and network info
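To use the Team Cymru response in a script, the pipe-delimited table has to be parsed. The sketch below assumes the verbose response begins with a header row (AS, IP, BGP Prefix, CC, Registry, Allocated, AS Name); the sample text is a placeholder in that layout, not captured output, and `parse_cymru` is a hypothetical helper.

```python
def parse_cymru(output):
    """Parse the pipe-delimited table from whois.cymru.com into a list of dicts."""
    rows = [
        [cell.strip() for cell in line.split("|")]
        for line in output.splitlines()
        if "|" in line  # ignore banners or blank lines
    ]
    if len(rows) < 2:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]

# Placeholder sample in the assumed layout; the values are illustrative only.
sample = """\
AS      | IP               | BGP Prefix          | CC | Registry | Allocated  | AS Name
12345   | 203.0.113.45     | 203.0.113.0/24      | ZZ | ripencc  | 2010-01-01 | MALICIOUS-ISP-NET
"""
for row in parse_cymru(sample):
    print(f"AS{row['AS']}", row["BGP Prefix"])
```

Keying on the header names rather than column positions keeps the parser tolerant of queries that return fewer columns.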
1. Be cautious when blocking entire ASNs - you might affect legitimate users
2. Monitor your block lists regularly
3. Combine with rate limiting for better protection
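For the rate-limiting point, a sliding-window counter keyed by network or ASN rather than by single IP keeps address rotation from resetting the count. This is a minimal in-memory sketch under that assumption; `SlidingWindowLimiter` and its parameters are illustrative, and a production setup would more likely rely on the web server's or a proxy's built-in limiter.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        timestamps = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True

# Key on the /24 (or the ASN) so rotating within the block shares one budget.
limiter = SlidingWindowLimiter(limit=2, window=10)
print([limiter.allow("203.0.113.0/24") for _ in range(3)])  # [True, True, False]
```

Throttling this way slows a scraper without cutting off every legitimate user in the same ASN, which addresses the first caution in the list.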