How to Find PDF Files That Don’t Contain a Specific String Using find and grep



When working with large collections of PDF files, we often need to filter files based on their content. While finding files containing a string is straightforward, the inverse operation requires special handling in command-line tools.

The standard approach uses grep's -L flag (or --files-without-match), which outputs filenames that don't contain the pattern:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -L 'Font' '{}' ';'

For large directories, the basic -exec approach can be slow because it spawns a new grep process for each file. Here's a more efficient version using xargs:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -print0 | xargs -0 grep -L 'Font'
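A pipeline is not the only way to batch filenames. find's own `-exec ... {} +` form also passes many files to each grep call. Below is a minimal sketch wrapped in a helper; the name pdfs_without and its two arguments are illustrative assumptions, not part of any tool:

```shell
# Hypothetical helper: list PDFs under DIR that do not contain PATTERN.
# Usage: pdfs_without DIR PATTERN
pdfs_without() {
    dir=$1
    pattern=$2
    # "{} +" appends as many filenames as fit to each grep invocation,
    # so far fewer processes are started than with "{} ;".
    # "--" stops a pattern beginning with "-" from being parsed as an option.
    find "$dir" -type f -name '*.pdf' -exec grep -L -- "$pattern" {} +
}
```

It would be called as: pdfs_without /Users/me/PDFFiles/ Font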

For more complex scenarios where you might need to check multiple conditions:

find /Users/me/PDFFiles/ -type f -name "*.pdf" \
  -exec sh -c '! grep -q "Font" "$1"' sh {} \; -print
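As one illustration of checking more than one condition, the sketch below lists PDFs that contain neither of two strings. The helper name pdfs_missing_both and the idea of a second pattern are assumptions made for the example:

```shell
# Hypothetical helper: list PDFs under DIR that contain neither pattern.
# Usage: pdfs_missing_both DIR PATTERN1 PATTERN2
pdfs_missing_both() {
    # Inside the sh -c script: $1 and $2 are the two patterns and
    # $3 is the filename that find substitutes for {}.
    # -print runs only when the script exits with status 0.
    find "$1" -type f -name '*.pdf' \
        -exec sh -c '! grep -q -- "$1" "$3" && ! grep -q -- "$2" "$3"' sh "$2" "$3" {} \; \
        -print
}
```

For example, pdfs_missing_both /Users/me/PDFFiles/ Font Image would print only the files in which both strings are absent. This still starts one shell per file, so it trades speed for flexibility.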

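If ripgrep (rg) is installed, it is usually faster. Keep in mind that rg skips hidden and ignored files by default and treats binary data differently from grep, so its -a (--text) flag may be needed for PDFs: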

rg --files-without-match 'Font' -g '*.pdf' /Users/me/PDFFiles/

Some PDFs might contain binary data that could confuse grep. Adding -a treats all files as text:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -aL 'Font' '{}' ';'

You can combine this with other find conditions, like file size or modification time:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -size +1M \
  -mtime -30 -exec grep -L 'Font' '{}' ';'

When working with document processing systems or performing automated checks, we often need to identify files that don't contain certain patterns. This is particularly useful for:

  • Validating document compliance (missing required elements)
  • Finding files that haven't been properly processed
  • Quality control checks in automated workflows

The command you're currently using:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -H 'Font' '{}' ';'

This command does three things:

  1. Searches recursively in /Users/me/PDFFiles/
  2. Filters for files (-type f) with .pdf extension
  3. Executes grep to find 'Font' in each file

To find files that don't contain the string, we have several options:

Method 1: Using grep's -L flag

The most straightforward solution:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -L 'Font' '{}' ';'

Key differences:

  • -L replaces -H: it prints only the names of files that contain no match
  • -H is dropped because it only controls how matching lines are labeled, and -L prints no matching lines

Method 2: Using ! with grep Return Status

Alternative approach using grep's exit status:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec sh -c '! grep -q "Font" "$0"' '{}' ';' -print

This works because:

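  • grep -q prints nothing and exits with status 0 when the pattern is found, 1 when it isn't
  • ! inverts that status, so the command succeeds only for files without a match
  • sh -c runs the command in a shell; the filename substituted for {} arrives as $0, and -print fires only when the command succeeds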
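One caveat when inverting the status: grep also exits with a status greater than 1 on errors such as an unreadable or missing file, and ! turns that into success as well. A small sketch that keeps the three cases apart follows; the function name lacks_pattern is hypothetical:

```shell
# Hypothetical helper: succeed only if FILE could be read and lacks PATTERN.
# Usage: lacks_pattern PATTERN FILE
lacks_pattern() {
    status=0
    # Capture grep's exit status instead of blindly inverting it.
    grep -q -- "$1" "$2" 2>/dev/null || status=$?
    case $status in
        0) return 1 ;;   # pattern found
        1) return 0 ;;   # file was read, pattern absent
        *) return 2 ;;   # grep reported an error (e.g. missing file)
    esac
}
```

A call such as lacks_pattern Font report.pdf && echo "no Font" (report.pdf being a placeholder name) then reports only files that were actually searched.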

For large collections of PDFs, consider these optimizations:

# Parallel processing with xargs
find /Users/me/PDFFiles/ -type f -name "*.pdf" -print0 | xargs -0 -P 4 grep -L "Font"

# Using faster grep implementations (if available)
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec rg --files-without-match 'Font' '{}' ';'
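With xargs -P, parallelism only helps if there is more than one batch: when all filenames fit on a single command line, xargs starts a single grep. The sketch below caps the batch size with -n so the work can be spread out. The helper name pdfs_without_parallel and the values 64 and 4 are arbitrary assumptions for illustration:

```shell
# Hypothetical helper: parallel search with an explicit batch size.
# Usage: pdfs_without_parallel DIR PATTERN
pdfs_without_parallel() {
    # -print0 / -0 keep filenames with spaces or newlines intact.
    # -n 64 limits each grep call to 64 files so that -P 4 has
    # several batches to run concurrently.
    # sort makes the order stable, since parallel output order varies.
    find "$1" -type f -name '*.pdf' -print0 |
        xargs -0 -n 64 -P 4 grep -L -- "$2" |
        sort
}
```

It would be called as: pdfs_without_parallel /Users/me/PDFFiles/ Font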

Edge cases you might encounter:

  • Binary PDFs: Add -a to grep to handle binary files properly
  • Case sensitivity: Use -i for case-insensitive search
  • Large files: with -L (or -q), grep already stops reading a file at its first match, so --max-count=1 is not required

Complete example combining -a, -L and -i:

find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -aLi 'Font' '{}' ';'

Depending on your system, these might be more efficient:

# Using ripgrep (rg)
rg --files-without-match "Font" -g "*.pdf" /Users/me/PDFFiles/

# Using fd-find + rg
fd -e pdf . /Users/me/PDFFiles/ -x rg --files-without-match "Font"