When working with large collections of PDF files, we often need to filter files based on their content. While finding files containing a string is straightforward, the inverse operation requires special handling in command-line tools.
The standard approach uses grep's -L flag (or --files-without-match), which outputs the names of files that don't contain the pattern:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -L 'Font' '{}' ';'
For large directories, the basic -exec approach can be slow because it spawns a new grep process for each file. Here's a more efficient version using xargs:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -print0 | xargs -0 grep -L 'Font'
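As an alternative to xargs, find can batch filenames itself when -exec is terminated with + instead of ';'. The sketch below wraps that form in a function; the name pdfs_without and its argument order are illustrative assumptions, not part of any tool.

```shell
# Hypothetical helper: list PDFs under DIR that do NOT contain PATTERN.
# Usage: pdfs_without PATTERN DIR
pdfs_without() {
    pattern=$1
    dir=$2
    # '{} +' passes many filenames to each grep invocation,
    # so far fewer processes are started than with '{} ;'.
    # '-e' keeps patterns that begin with a dash from being read as options.
    find "$dir" -type f -name '*.pdf' -exec grep -L -e "$pattern" {} +
}
```

Calling pdfs_without 'Font' /Users/me/PDFFiles/ should print the same list of files as the xargs pipeline, without needing -print0 and -0 to cope with unusual filenames.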
For more complex scenarios where you might need to check multiple conditions:
find /Users/me/PDFFiles/ -type f -name "*.pdf" \
-exec sh -c '! grep -q "Font" "$1"' sh {} \; -print
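One way to check multiple conditions is to chain several grep calls inside the sh -c script. The following is a minimal sketch, assuming you want PDFs that lack one pattern but contain another; the function name is hypothetical and the patterns are whatever you pass in.

```shell
# Hypothetical helper: PDFs that lack MISSING but contain PRESENT.
# Usage: pdfs_missing_but_having MISSING PRESENT DIR
pdfs_missing_but_having() {
    missing=$1
    present=$2
    dir=$3
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        # $1: pattern that must be absent; $2: pattern that must be present; $3: file
        ! grep -q -e "$1" "$3" && grep -q -e "$2" "$3"
    ' sh "$missing" "$present" {} \; -print
}
```

Passing the patterns as positional arguments, rather than pasting them into the script text, avoids quoting problems when a pattern contains quotes or shell metacharacters.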
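For modern systems, rg (ripgrep) offers better performance (add its -a/--text flag if binary PDFs are misreported, since by default it stops searching a file once it detects binary data):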
rg --files-without-match 'Font' -g '*.pdf' /Users/me/PDFFiles/
Some PDFs might contain binary data that could confuse grep. Adding -a treats all files as text:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -aL 'Font' '{}' ';'
You can combine this with other find conditions, like file size or modification time:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -size +1M \
-mtime -30 -exec grep -L 'Font' '{}' ';'
When working with document processing systems or performing automated checks, we often need to identify files that don't contain certain patterns. This is particularly useful for:
- Validating document compliance (missing required elements)
- Finding files that haven't been properly processed
- Quality control checks in automated workflows
The command you're currently using:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -H 'Font' '{}' ';'
It does three things:
- Searches recursively in /Users/me/PDFFiles/
- Filters for files (-type f) with .pdf extension
- Executes grep to find 'Font' in each file
To find files that don't contain the string, we have several options:
Method 1: Using grep's -L flag
The most straightforward solution:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -L 'Font' '{}' ';'
Key differences:
- -L instead of -H: shows filenames that don't contain the pattern
- Removed the -H flag, which is redundant here
Method 2: Using ! with grep Return Status
Alternative approach using grep's exit status:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec sh -c '! grep -q "Font" "$0"' '{}' ';' -print
This works because:
- grep -q gives exit status 1 when the pattern isn't found
- ! inverts the status
- sh -c executes the shell command
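One catch with the exit-status approach is worth seeing in isolation: grep exits with a status greater than 1 on errors (for example, a file it cannot open), and ! treats that the same as "not found". A small sketch, using a hypothetical function name, that separates the three cases:

```shell
# Hypothetical helper: report what grep -q's exit status means for one file.
# Usage: grep_outcome PATTERN FILE
grep_outcome() {
    status=0
    # Capture the status without tripping 'set -e' in calling scripts.
    grep -q -e "$1" "$2" 2>/dev/null || status=$?
    case $status in
        0) echo "match" ;;      # pattern found
        1) echo "no-match" ;;   # file read, pattern absent
        *) echo "error" ;;      # e.g. missing or unreadable file
    esac
}
```

If unreadable files are possible in your tree, testing for status 1 explicitly, rather than negating with !, keeps them out of the "does not contain" list.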
For large collections of PDFs, consider these optimizations:
# Parallel processing with xargs
find /Users/me/PDFFiles/ -type f -name "*.pdf" -print0 | xargs -0 -P 4 grep -L "Font"
# Using faster grep implementations (if available)
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec rg --files-without-match 'Font' '{}' ';'
Edge cases you might encounter:
- Binary PDFs: Add -a to grep to handle binary files properly
- Case sensitivity: Use -i for case-insensitive search
- Large files: Add --max-count=1 to stop after first match
Complete example with all options:
find /Users/me/PDFFiles/ -type f -name "*.pdf" -exec grep -aLi --max-count=1 'Font' '{}' ';'
Depending on your system, these might be more efficient:
# Using ripgrep (rg)
rg --files-without-match "Font" -g "*.pdf" /Users/me/PDFFiles/
# Using fd-find + rg
fd -e pdf . /Users/me/PDFFiles/ -x rg --files-without-match "Font"