When working with large PostgreSQL databases containing tables with millions of records (like event logs), full table dumps are often impractical. The pg_dump utility by default exports entire tables, which becomes inefficient when you only need recent records or specific data subsets.
PostgreSQL provides the --data-only and --table flags for pg_dump, but note that stock pg_dump has no --where option in any released PostgreSQL version: a row-filtering flag has been proposed on pgsql-hackers but never committed, and it exists only in some forked or patched builds. Where such a build provides it, the invocation looks like:
pg_dump -h your_host -U your_user -d your_db \
  --table=your_table \
  --data-only \
  --where="created_at > '2012-05-01'" \
  -f recent_events_dump.sql
On stock PostgreSQL, the supported way to export a filtered subset is COPY, shown next.
For more control over the output format, you can use PostgreSQL's COPY command through psql:
psql -h your_host -U your_user -d your_db -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2012-05-01') \
TO STDOUT WITH CSV HEADER" > recent_events.csv
When dealing with multi-GB tables:
- Add appropriate indexes on your filter columns (created_at in this case)
- Consider the --jobs flag for parallel dumping (PostgreSQL 9.3+; it requires the directory output format, -Fd)
- For very large exports, split into smaller batches by date ranges
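The last point can be sketched as a loop over month boundaries. Table and column names (event_log, created_at) and the date window are illustrative, and the psql call is left as a comment so the sketch runs without a database:

```shell
#!/bin/bash
# Batch a large export into one file per month (GNU date syntax).
start="2023-01-01"
stop="2023-04-01"
while [ "$(date -d "$start" +%s)" -lt "$(date -d "$stop" +%s)" ]; do
  end=$(date -d "$start +1 month" +%Y-%m-%d)
  # Real export for each batch (shown, not executed here):
  #   psql -d your_db -c "COPY (SELECT * FROM event_log
  #       WHERE created_at >= '$start' AND created_at < '$end')
  #     TO STDOUT WITH CSV HEADER" > "events_$start.csv"
  echo "batch: $start .. $end"
  start="$end"
done
```

Using half-open ranges (>= start, < end) keeps rows from being double-counted at batch boundaries.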
You can control the output format with additional options: --format=custom produces a compressed, pg_restore-compatible archive, --compress=5 sets the compression level, and --blobs includes large objects. On a build patched with --where support, a filtered custom-format dump looks like:
pg_dump --format=custom --compress=5 --blobs \
  --table=event_log --where="created_at > '2023-01-01'" \
  -f events_2023.dump
To restore a plain-SQL subset dump to another database:
psql -h new_host -U new_user -d new_db -f recent_events_dump.sql
Custom-format dumps (created with --format=custom) are restored with pg_restore instead:
pg_restore -h new_host -U new_user -d new_db events_2023.dump
For recurring exports (e.g., monthly logs), create a shell script. Since stock pg_dump cannot filter rows, the script uses psql's COPY:
#!/bin/bash
# GNU date syntax; on BSD/macOS use: date -v-1m +%Y-%m-%d
DATE=$(date -d "1 month ago" +%Y-%m-%d)
psql -h localhost -U postgres -d app_db -c \
  "COPY (SELECT * FROM events WHERE created_at > '$DATE') \
   TO STDOUT WITH CSV HEADER" \
  > /backups/events_last_month.csv
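To schedule the script, a crontab entry like the following (the script path is hypothetical) runs the export at 02:00 on the first of each month:

```
# minute hour day-of-month month day-of-week  command
0 2 1 * * /usr/local/bin/export_last_month.sh
```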
When working with large PostgreSQL tables (particularly log tables that grow continuously), we often need to extract specific subsets of data for analysis or sharing. The standard pg_dump
utility exports entire tables by default, which becomes problematic when dealing with multi-GB tables where only a fraction of the data is needed.
Here are the most effective methods to export a temporal subset of your PostgreSQL data:
1. Using COPY Command with WHERE Clause
The most straightforward method for table subsets:
-- Export to CSV (COPY ... TO <file> writes on the *server*, so it
-- requires superuser or membership in the pg_write_server_files role)
COPY (
  SELECT * FROM event_log
  WHERE created_at > '2023-05-01'
) TO '/path/to/output.csv' WITH CSV HEADER;
-- For binary format (smaller files, preserves exact types)
COPY (
  SELECT * FROM event_log
  WHERE created_at > '2023-05-01'
) TO '/path/to/output.dat' WITH BINARY;
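If you lack server-side file privileges, psql's \copy meta-command performs the same filtered export but writes the file on the client machine. Connection details below are placeholders, and the final command is echoed so the snippet runs standalone:

```shell
#!/bin/bash
# \copy runs the same COPY query but streams the result through the psql
# client, writing the file client-side -- no superuser rights needed.
CMD="\\copy (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO 'recent_events.csv' WITH CSV HEADER"
echo "psql -h your_host -d your_db -c \"$CMD\""
```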
2. Staging the Subset in a Table for pg_dump
pg_dump only dumps a view's *definition*, not its rows, so to use pg_dump's archive formats on a subset, stage the rows in a real table first:
-- Materialize the subset
CREATE TABLE recent_events AS
SELECT * FROM event_log WHERE created_at > '2023-05-01';
-- Export the staging table (custom format)
pg_dump -U username -d dbname -t recent_events -Fc -f recent_events.dump
-- Remember to drop the table afterward
DROP TABLE recent_events;
3. The --where Flag (Not in Core PostgreSQL)
A --where option for pg_dump has been proposed but never committed, so no stock PostgreSQL release (including 12 and later) supports it. If you are on a forked or patched build that does, the invocation looks like:
pg_dump -U username -d dbname \
  --table='event_log' \
  --where="created_at > '2023-05-01'" \
  -Fc -f partial_export.dump
- For very large tables, add an index on the created_at column if one is not already present
- The COPY method is generally fastest for CSV exports
- Binary formats (pg_dump custom format) preserve type information but are slower to generate
- Consider ORDER BY created_at DESC LIMIT 10000 if you specifically need the most recent N records
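The first and last suggestions above, written out as SQL (names are the running examples; the statements are kept in a variable and printed so the snippet stands alone, and could be run for real with psql -d your_db -c "$SQL"):

```shell
#!/bin/bash
# SQL for the index and LIMIT tuning notes; event_log/created_at are examples.
SQL="
-- Index the filter column so WHERE/ORDER BY can use an index scan
CREATE INDEX IF NOT EXISTS event_log_created_at_idx
    ON event_log (created_at);
-- Newest 10,000 rows only
COPY (SELECT * FROM event_log
      ORDER BY created_at DESC LIMIT 10000)
  TO STDOUT WITH CSV HEADER;
"
echo "$SQL"
```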
Here's how I recently exported error logs from our production database:
-- First ensure our query is efficient
EXPLAIN ANALYZE
SELECT * FROM server_logs
WHERE log_time > '2023-06-01' AND level = 'ERROR';
-- Then export with compression (TO PROGRAM runs gzip on the *server* and
-- requires superuser or membership in pg_execute_server_program)
COPY (
  SELECT id, log_time, message
  FROM server_logs
  WHERE log_time > '2023-06-01'
    AND level = 'ERROR'
  ORDER BY log_time DESC
) TO PROGRAM 'gzip > /exports/error_logs_202306.csv.gz'
WITH CSV HEADER;
For very large exports, you can pipe directly to compression utilities (adding WITH CSV HEADER so the output actually matches the .csv naming):
-- Using gzip compression
psql -U username -d dbname -c \
  "COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT WITH CSV HEADER" \
  | gzip > events_after_202305.csv.gz
-- Using zstd for better compression ratios
psql -U username -d dbname -c \
  "COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT WITH CSV HEADER" \
  | zstd -o events_after_202305.csv.zst
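Restoring one of these compressed exports is the reverse pipe. The psql call is shown as a comment, and the compression round trip is demonstrated with a stand-in file so the snippet runs without a database (all names are illustrative):

```shell
#!/bin/bash
# Restore path: stream the compressed CSV back through \copy ... FROM STDIN:
#   zcat events_after_202305.csv.gz | psql -U username -d dbname \
#     -c "\copy event_log FROM STDIN WITH CSV HEADER"
# Stand-in round trip showing the pipe mechanics:
tmp=$(mktemp -d)
printf 'id,message\n1,boot\n2,shutdown\n' > "$tmp/events.csv"
gzip "$tmp/events.csv"                  # produces events.csv.gz
zcat "$tmp/events.csv.gz" | head -n 1   # prints the CSV header line
```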