Modern web content relies heavily on CSS for layout and presentation, which makes traditional tools like htmldoc insufficient. Converting complex, CSS-styled HTML documents to PDF requires tools that properly interpret modern web standards.
Here are the most robust command-line tools currently available:
# wkhtmltopdf (WebKit-based)
sudo apt-get install wkhtmltopdf
wkhtmltopdf --enable-local-file-access input.html output.pdf
# Headless Chrome/Chromium
chrome --headless --disable-gpu --print-to-pdf=output.pdf input.html
# WeasyPrint (CSS Paged Media Module)
pip install weasyprint
weasyprint input.html output.pdf
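When the conversion is driven from a script rather than typed by hand, the three tools can be wrapped behind one function. The following is a minimal sketch, not a definitive implementation: `build_command` and `convert` are hypothetical names, and the Chrome binary name is an assumption that varies by platform (`google-chrome`, `chromium`, `chromium-browser`), so it is passed in as a parameter.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(tool, html, pdf, chrome_binary="chromium"):
    """Return the argv list for converting `html` to `pdf` with `tool`."""
    html, pdf = str(html), str(pdf)
    if tool == "wkhtmltopdf":
        return ["wkhtmltopdf", "--enable-local-file-access", html, pdf]
    if tool == "chrome":
        # Chrome takes the output path as the flag value and the input last.
        return [chrome_binary, "--headless", "--disable-gpu",
                f"--print-to-pdf={pdf}", html]
    if tool == "weasyprint":
        return ["weasyprint", html, pdf]
    raise ValueError(f"unknown tool: {tool}")


def convert(tool, html, pdf, **kwargs):
    """Run the conversion, raising if the tool is missing or exits non-zero."""
    cmd = build_command(tool, html, pdf, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    subprocess.run(cmd, check=True)
    return Path(pdf)
```

Building the command as a list and skipping the shell avoids quoting problems with file names that contain spaces.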
For professional-grade output, these tools support extensive customization:
# wkhtmltopdf with custom margins and TOC generation
wkhtmltopdf \
--margin-top 20mm \
--margin-bottom 20mm \
--margin-left 10mm \
--margin-right 10mm \
--toc \
--enable-local-file-access \
input.html output.pdf
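# Headless Chrome without the default header/footer, allowing up to 10 s for scripts
# (set margins with a CSS @page rule; older Chrome versions use --print-to-pdf-no-header)
chrome --headless --disable-gpu \
--print-to-pdf=output.pdf \
--no-pdf-header-footer \
--virtual-time-budget=10000 \
input.html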
For print-oriented CSS, WeasyPrint has the strongest support for CSS Paged Media features such as @page margin boxes:
/* Sample CSS for print-optimized output */
@media print {
  @page {
    size: A4;
    margin: 2cm;
    @bottom-center {
      content: "Page " counter(page);
    }
  }
  .no-print {
    display: none !important;
  }
}
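If the page size or margins change from job to job, the stylesheet can be generated instead of edited by hand and then passed to WeasyPrint with its `-s` option. This is a sketch under stated assumptions: `page_css` and `write_page_css` are hypothetical helpers, and the rule set mirrors the sample CSS rather than any required structure.

```python
from pathlib import Path


def page_css(size="A4", margin="2cm", footer='"Page " counter(page)'):
    """Return a print stylesheet with an @page rule and a page-number footer."""
    return (
        "@page {\n"
        f"  size: {size};\n"
        f"  margin: {margin};\n"
        "  @bottom-center {\n"
        f"    content: {footer};\n"
        "  }\n"
        "}\n"
        "@media print {\n"
        "  .no-print { display: none !important; }\n"
        "}\n"
    )


def write_page_css(path, **options):
    """Write the generated stylesheet to `path` and return the path."""
    path = Path(path)
    path.write_text(page_css(**options), encoding="utf-8")
    return path
```

For example, `write_page_css("print.css", size="Letter", margin="1in")` followed by `weasyprint -s print.css input.html output.pdf` applies the generated rules on top of the document's own styles.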
For batch processing large numbers of HTML files:
# Parallel processing with GNU Parallel
find . -name "*.html" | parallel -j 4 wkhtmltopdf {} {.}.pdf
# Using xargs where GNU Parallel is unavailable (null-delimited, so file names with spaces are safe)
find . -name "*.html" -print0 | xargs -0 -P 4 -I {} sh -c 'wkhtmltopdf "$1" "${1%.html}.pdf"' _ {}
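The same batch job can be driven from Python when per-file error reporting matters. This is a minimal sketch, assuming wkhtmltopdf is on PATH; `pdf_path_for`, `convert_one`, and `convert_tree` are hypothetical names. Threads are sufficient here because each one only waits on an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdf_path_for(html_path):
    """Map page.html to page.pdf in the same directory."""
    return Path(html_path).with_suffix(".pdf")


def convert_one(html_path):
    """Convert one file; return (path, error message or None)."""
    pdf_path = pdf_path_for(html_path)
    result = subprocess.run(
        ["wkhtmltopdf", "--enable-local-file-access",
         str(html_path), str(pdf_path)],
        capture_output=True, text=True,
    )
    return html_path, (None if result.returncode == 0 else result.stderr.strip())


def convert_tree(root=".", workers=4):
    """Convert every *.html under `root`, running `workers` conversions at once.

    Returns the list of (path, error) pairs for files that failed.
    """
    files = sorted(Path(root).rglob("*.html"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(convert_one, files))
    return [(path, err) for path, err in results if err is not None]
```

Calling `convert_tree("site", workers=4)` returns an empty list when every file converted cleanly, which makes the result easy to check in a build script.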
- For missing fonts, ensure system fonts match web fonts
- Use --javascript-delay for dynamic content
- Enable --enable-local-file-access for local resources
- Set explicit @page rules in CSS for consistent pagination
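The two wkhtmltopdf flags from these tips can be combined into a single invocation. The sketch below uses a hypothetical helper, `wkhtmltopdf_args`; the 2000 ms default delay is an arbitrary starting value, not a recommendation from the tool's documentation.

```python
def wkhtmltopdf_args(html, pdf, js_delay_ms=2000, local_files=True):
    """Build a wkhtmltopdf argv that waits for scripts and allows local assets."""
    args = ["wkhtmltopdf"]
    if js_delay_ms:
        # Wait this many milliseconds for JavaScript before rendering.
        args += ["--javascript-delay", str(js_delay_ms)]
    if local_files:
        # Allow the page to load images, CSS and fonts from the local disk.
        args.append("--enable-local-file-access")
    return args + [str(html), str(pdf)]
```

The returned list can be passed directly to `subprocess.run(..., check=True)`.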
Traditional tools like htmldoc fail to meet contemporary web standards because they lack CSS support. On today's web, where CSS (not nested tables) drives layout, this produces fundamentally broken PDF output. Let's examine modern approaches:
The most reliable method leverages actual browser engines. Here are three production-tested approaches:
// Using Puppeteer (Node.js)
const path = require('path');
const puppeteer = require('puppeteer');

async function htmlToPdf(htmlFile, outputPath) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // file:// URLs need an absolute path
    await page.goto(`file://${path.resolve(htmlFile)}`, { waitUntil: 'networkidle0' });
    await page.pdf({
      path: outputPath,
      format: 'A4',
      printBackground: true
    });
  } finally {
    // Close the browser even if navigation or rendering fails
    await browser.close();
  }
}
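wkhtmltopdf, a Qt WebKit wrapper, remains popular even though it is no longer maintained (its last release, 0.12.6, dates from 2020, and the repository has since been archived):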
# Basic conversion
wkhtmltopdf --enable-local-file-access input.html output.pdf
# Advanced options
wkhtmltopdf \
--margin-top 15mm \
--header-html header.html \
--footer-center "[page]/[topage]" \
input.html output.pdf
For current projects, consider these actively maintained tools:
# Using WeasyPrint (Python)
weasyprint input.html output.pdf
# With custom stylesheet
weasyprint -s print.css input.html output.pdf
For consistent results across environments:
# Chromium-based conversion
docker run --rm -v "$(pwd)":/files \
zenika/alpine-chrome \
--no-sandbox \
--print-to-pdf=/files/output.pdf \
/files/input.html
For batch processing 100+ files, rough timings (these vary with hardware and document complexity):
- wkhtmltopdf: ~200ms per page (single thread)
- Puppeteer: ~500ms (including Chrome startup)
- WeasyPrint: ~300ms for simple layouts
The optimal choice depends on your CSS complexity and performance requirements.
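Because such figures shift with hardware and document complexity, it is worth timing each candidate on your own files. The following is a minimal sketch: `time_command` and `BASELINE` are hypothetical names, and the median of a few runs is used only to dampen outliers such as a cold disk cache.

```python
import statistics
import subprocess
import sys
import time

# Baseline command: starts a Python interpreter and exits immediately.
# Timing it shows how much of a measurement is plain process startup.
BASELINE = [sys.executable, "-c", "pass"]


def time_command(cmd, runs=3):
    """Run `cmd` `runs` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For example, `time_command(["weasyprint", "input.html", "output.pdf"])` can be compared against `time_command(["wkhtmltopdf", "input.html", "output.pdf"])` on a representative document.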