How to Sort Domain Names by TLD and Subdomains for SquidGuard Whitelist Cleanup

When managing web filtering whitelists in SquidGuard, you might encounter a frustrating behavior: if both example.com and www.example.com exist in your whitelist, SquidGuard will only honor the more specific www.example.com entry. This becomes problematic when you need to whitelist entire domains.

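Sorting domains from the TLD leftward places each parent domain directly above its subdomains, which makes overlapping entries easy to identify and merge. For example:

plain alphabetical order:
example.com
example.org
www.example.com

TLD-first order:
example.com
www.example.com
example.org

In the TLD-first version, www.example.com sits directly below example.com, so it is immediately obvious that example.com covers both.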

Here are several ways to achieve this sorting:

Using Python

def reverse_domain_sort(domains):
    return sorted(domains, 
                 key=lambda x: list(reversed(x.split('.'))))

domains = [
    "www.activityvillage.co.uk",
    "ajax.googleapis.com",
    # ... other domains ...
]

sorted_domains = reverse_domain_sort(domains)
print('\n'.join(sorted_domains))

Using AWK (Unix/Linux)

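awk '{
    # split() returns the number of labels; length(arr) on an array is not portable
    n = split($0, arr, ".");
    for (i = n; i >= 1; i--) {
        printf "%s", arr[i];
        if (i > 1) printf ".";
    }
    # Append the original name so it can be recovered after sorting
    printf "\t%s\n", $0;
}' domains.txt | LC_ALL=C sort | cut -f2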

Using PowerShell (Windows)

$domains = Get-Content .\domains.txt
$domains | Sort-Object { 
    $parts = $_.Split('.')
    [array]::Reverse($parts)
    $parts -join '.'
}

Consider these special cases in your implementation:

Internationalized domain names (IDNs)
Domains with trailing dots
Case sensitivity (though DNS is case-insensitive)
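
The special cases above can be handled by normalizing each name before building the sort key. The following is a minimal sketch rather than a complete validator: the helper names normalize_domain and sort_key are hypothetical, and the code relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
def normalize_domain(name):
    """Return a lowercase ASCII form of name without a trailing dot."""
    name = name.strip().rstrip(".").lower()
    # The built-in "idna" codec converts Unicode labels to punycode (xn--...)
    # and leaves plain ASCII names unchanged.
    return name.encode("idna").decode("ascii")


def sort_key(name):
    """Sort key: normalized labels from the TLD leftward."""
    return normalize_domain(name).split(".")[::-1]


domains = ["WWW.Example.com.", "example.com", "bücher.example"]
for d in sorted(domains, key=sort_key):
    print(d)
```

Only the sort key is normalized here, so the original spelling of each entry is preserved in the output. Whether to write the normalized form back to the whitelist is a separate decision.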

After sorting, you can more easily:

Identify redundant entries
Merge overlapping domains
Spot potential conflicts

For example, after sorting you might find:

domain.com
sub.domain.com

Because the parent is listed directly above its subdomain, this clearly shows that domain.com already covers sub.domain.com.
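
Once the list is in TLD-first order, the redundancy check can also be automated. The sketch below assumes that a parent entry is meant to cover all of its subdomains, as described above; the function names tld_key and drop_covered are hypothetical, and the result should be reviewed before it replaces a live whitelist.

```python
def tld_key(domain):
    """Sort key: lowercase labels from the TLD leftward, as a tuple."""
    return tuple(domain.lower().split(".")[::-1])


def drop_covered(domains):
    """Return domains in TLD-first order, minus entries covered by a listed parent."""
    kept = []
    kept_keys = set()
    for domain in sorted(domains, key=tld_key):
        key = tld_key(domain)
        # Parents sort before their subdomains, so any covering entry
        # (or an exact duplicate) is already in kept_keys at this point.
        if any(key[:i] in kept_keys for i in range(1, len(key) + 1)):
            continue
        kept.append(domain)
        kept_keys.add(key)
    return kept


whitelist = ["sub.domain.com", "domain.com", "www.example.org", "example.net"]
print(drop_covered(whitelist))
```

With this sample input, sub.domain.com is dropped because domain.com is listed, while www.example.org is kept because example.org itself is not in the list.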

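For large whitelists (10,000+ domains), all three approaches should finish quickly; the differences are mostly practical:

Python sorts the whole list in memory and is the easiest to extend with validation
AWK combined with sort streams the data, and GNU sort can spill to temporary files when the input is large
PowerShell may be slower for very large files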

When managing SquidGuard whitelists, we often encounter situations where domain entries like example.com and www.example.com coexist. Due to SquidGuard's matching behavior, this can lead to unexpected access restrictions. The solution requires sorting domains by their components in reverse order (from TLD to subdomain).

Consider these example domains:

ajax.googleapis.com
akhet.co.uk
bbc.co.uk
chrome.angrybirds.com
crl.godaddy.com
www.activityvillage.co.uk

Sorted traditionally (left to right), as above, they fall into an alphabetical order that doesn't reflect the actual domain hierarchy. What we need is:

chrome.angrybirds.com
crl.godaddy.com
ajax.googleapis.com
www.activityvillage.co.uk
akhet.co.uk
bbc.co.uk

Here are three practical solutions:

1. Using awk for Quick Sorting

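awk -F. '{
    # Print the labels in reverse order, starting with the TLD
    printf "%s", $NF;
    for (i=NF-1; i>=1; i--) {
        printf ".%s", $i
    }
    # Append the original name so cut can recover it after sorting
    printf "\t%s\n", $0
}' domains.txt | LC_ALL=C sort | cut -f2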

2. Python Implementation

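def reverse_domain(domain):
    # Return the labels as a list, TLD first. Comparing lists label by label
    # keeps a parent directly above its subdomains; a joined string would let
    # names such as example-foo.com sort in between, because '-' < '.'.
    return domain.split('.')[::-1]

with open('domains.txt') as f:
    domains = [line.strip() for line in f if line.strip()]

sorted_domains = sorted(domains, key=reverse_domain)

for domain in sorted_domains:
    print(domain)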

3. PowerShell Solution

Get-Content .\domains.txt | 
ForEach-Object {
    $parts = $_.Split('.')
    [array]::Reverse($parts)
    New-Object PSObject -Property @{
        Original = $_
        Reversed = $parts -join '.'
    }
} | 
Sort-Object -Property Reversed | 
Select-Object -ExpandProperty Original

After sorting, you can easily spot and remove redundant entries. For example:

# Before sorting
sub.www.example.com
example.com
www.example.com

# After sorting
example.com
www.example.com
sub.www.example.com

This visual grouping makes it obvious which domains might be causing conflicts in your whitelist.
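
The same grouping can be produced as a report instead of being read by eye. This is a minimal sketch under the assumption that any listed domain whose parent is also listed is a potential conflict; the function name find_overlaps and the sample whitelist are hypothetical.

```python
from collections import defaultdict


def find_overlaps(domains):
    """Map each listed domain to the listed entries that are its subdomains."""
    listed = {d.lower() for d in domains}
    overlaps = defaultdict(list)
    for domain in sorted(listed, key=lambda d: d.split(".")[::-1]):
        labels = domain.split(".")
        # Walk up the hierarchy: www.example.com -> example.com -> com
        for i in range(1, len(labels)):
            parent = ".".join(labels[i:])
            if parent in listed:
                overlaps[parent].append(domain)
    return dict(overlaps)


whitelist = ["example.com", "www.example.com", "sub.www.example.com", "akhet.co.uk"]
for parent, children in find_overlaps(whitelist).items():
    print(parent, "covers", ", ".join(children))
```

Entries such as akhet.co.uk do not appear in the report because neither co.uk nor uk is in the list.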

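For large lists (100k+ domains), consider these points:

An in-memory sort of a few hundred thousand short strings is typically fast in Python; threads do not speed up the CPU-bound key computation in standard CPython because of the global interpreter lock, and multiprocessing is likely to cost more in overhead than it saves
For extremely large files that do not fit in memory, implement external sorting
Add validation to skip malformed domains

Here's a single-pass Python version for large datasets that skips malformed lines and computes each sort key once:

def process_domain(line):
    """Return (sort_key, domain), or None for blank or malformed lines."""
    domain = line.strip()
    labels = domain.lower().rstrip('.').split('.')
    # An empty label means a blank line or a name such as "a..b" or ".com"
    if '' in labels:
        return None
    return (labels[::-1], domain)

with open('large_domains.txt') as f:
    results = [r for r in map(process_domain, f) if r is not None]

# Sort on the precomputed reversed-label lists
results.sort(key=lambda r: r[0])
sorted_domains = [domain for _, domain in results]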
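
For files too large to sort in memory, the external sorting mentioned above can be sketched with the standard library alone: sort fixed-size chunks, write each chunk to a temporary file, then merge the chunks with heapq.merge. The names tld_key and external_sort and the default chunk_size are assumptions made for illustration, not tuned values.

```python
import heapq
import os
import tempfile


def tld_key(domain):
    """Sort key: lowercase labels from the TLD leftward."""
    return domain.lower().split(".")[::-1]


def _write_chunk(chunk):
    """Sort one chunk in memory and write it to a temporary file."""
    chunk.sort(key=tld_key)
    tmp = tempfile.NamedTemporaryFile(
        "w", delete=False, suffix=".txt", encoding="utf-8"
    )
    with tmp:
        tmp.writelines(d + "\n" for d in chunk)
    return tmp.name


def external_sort(in_path, out_path, chunk_size=100_000):
    """Sort the domains in in_path into out_path using bounded memory."""
    chunk_paths, chunk = [], []
    with open(in_path, encoding="utf-8") as src:
        for line in src:
            domain = line.strip()
            if not domain:
                continue  # skip blank lines
            chunk.append(domain)
            if len(chunk) >= chunk_size:
                chunk_paths.append(_write_chunk(chunk))
                chunk = []
    if chunk:
        chunk_paths.append(_write_chunk(chunk))

    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        # Each stream yields one already-sorted chunk, line by line
        streams = [(line.rstrip("\n") for line in f) for f in files]
        with open(out_path, "w", encoding="utf-8") as out:
            for domain in heapq.merge(*streams, key=tld_key):
                out.write(domain + "\n")
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

A call such as external_sort("large_domains.txt", "sorted_domains.txt") would keep at most chunk_size names in memory at a time. On Unix-like systems, the awk and sort pipeline shown earlier is often a simpler way to get the same effect.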
