How to Export a Subset of PostgreSQL Table Data Using Conditional Dumping



When working with large PostgreSQL databases containing tables with millions of records (like event logs), full table dumps are often impractical. The pg_dump utility by default exports entire tables, which becomes inefficient when you only need recent records or specific data subsets.

pg_dump provides --data-only and --table flags, but no released version of core PostgreSQL ships a row-filtering option such as --where for pg_dump (that flag exists only in third-party patches and forks). The portable way to dump just the matching rows is to stage them in a table and dump that table:

psql -h your_host -U your_user -d your_db -c \
"CREATE TABLE recent_events AS SELECT * FROM your_table WHERE created_at > '2012-05-01'"

pg_dump -h your_host -U your_user -d your_db \
--table=recent_events \
--data-only \
-f recent_events_dump.sql

psql -h your_host -U your_user -d your_db -c "DROP TABLE recent_events"

(The resulting COPY statements target recent_events; adjust the table name in the dump if you want the rows loaded back into the original table.)

For more control over the output format, you can use PostgreSQL's COPY command through psql:

psql -h your_host -U your_user -d your_db -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2012-05-01') \
TO STDOUT WITH CSV HEADER" > recent_events.csv

When dealing with multi-GB tables:

  • Add appropriate indexes on your filter columns (created_at in this case)
  • Consider the --jobs flag for parallel dumping (PostgreSQL 9.3+); it requires the directory output format (-Fd) and parallelizes across tables, so it helps most when several tables are dumped
  • For very large exports, split into smaller batches by date ranges (see the sketch after this list)
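
A minimal batching sketch, assuming a monthly split on created_at and the table-staging approach above (host, table, and column names are placeholders):

#!/bin/bash
# Dump one month at a time into separate files
for month in 2012-05 2012-06 2012-07; do
  psql -h your_host -U your_user -d your_db -c \
    "CREATE TABLE batch_export AS
     SELECT * FROM your_table
     WHERE created_at >= '${month}-01'
       AND created_at < ('${month}-01'::date + interval '1 month')"
  pg_dump -h your_host -U your_user -d your_db \
    --table=batch_export --data-only -f "events_${month}.sql"
  psql -h your_host -U your_user -d your_db -c "DROP TABLE batch_export"
done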

You can control the output format with additional options (again dumping the staged subset table, since pg_dump itself cannot filter rows):

pg_dump --format=custom --compress=5 --blobs \
--table=recent_events \
-f recent_events.dump

To restore the subset data to another database:

psql -h new_host -U new_user -d new_db -f recent_events_dump.sql
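
If you used the custom-format dump from the earlier example (recent_events.dump), restore it with pg_restore instead of psql:

pg_restore -h new_host -U new_user -d new_db recent_events.dump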

For recurring exports (e.g., monthly logs), create a shell script:

#!/bin/bash
set -euo pipefail
DATE=$(date -d "1 month ago" +%Y-%m-%d)
# Stage the last month's rows, dump them, then clean up
psql -h localhost -U postgres -d app_db -c \
"CREATE TABLE events_last_month AS SELECT * FROM events WHERE created_at > '$DATE'"
pg_dump -h localhost -U postgres -d app_db \
--table=events_last_month \
--data-only \
-f /backups/events_last_month.sql
psql -h localhost -U postgres -d app_db -c "DROP TABLE events_last_month"
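
To run it on a schedule, a crontab entry such as the following works (the script path is just an example):

# m h dom mon dow   command
0 2 1 * * /usr/local/bin/export_events_last_month.sh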

When working with large PostgreSQL tables (particularly log tables that grow continuously), we often need to extract specific subsets of data for analysis or sharing. The standard pg_dump utility exports entire tables by default, which becomes problematic when dealing with multi-GB tables where only a fraction of the data is needed.

Here are the most effective methods to export a temporal subset of your PostgreSQL data:

1. Using COPY Command with WHERE Clause

The most straightforward method for table subsets. Note that COPY ... TO '/path' writes the file on the database server and needs superuser or pg_write_server_files privileges; a client-side alternative follows the examples below:

-- Export to CSV
COPY (
    SELECT * FROM event_log 
    WHERE created_at > '2023-05-01'
) TO '/path/to/output.csv' WITH CSV HEADER;

-- For binary format (not human-readable; size and load speed depend on the data types)
COPY (
    SELECT * FROM event_log 
    WHERE created_at > '2023-05-01'
) TO '/path/to/output.dat' WITH BINARY;
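
If you don't have server-side file access, psql's client-side \copy accepts the same query and writes the file on your machine:

\copy (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO 'output.csv' WITH CSV HEADER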

2. Creating a Staging Table for pg_dump

When you need the full power of pg_dump's formats, stage the subset in a real table first. A plain view will not work here, because pg_dump exports only a view's definition, not its rows:

-- Create a staging table containing just the subset
CREATE TABLE recent_events AS 
SELECT * FROM event_log WHERE created_at > '2023-05-01';

-- Export the table
pg_dump -U username -d dbname -t recent_events -Fc -f recent_events.dump

-- Remember to drop the table afterward
DROP TABLE recent_events;

3. A Note on pg_dump and Row Filtering

Despite what some guides claim, no released core PostgreSQL version (including 12+) gives pg_dump a --where or custom-query option; row filtering has to happen in SQL. If you want pg_dump's archive formats for a subset, dump the staging table from method 2:

pg_dump -U username -d dbname \
  --table='recent_events' \
  -Fc -f partial_export.dump

A few practical notes:

  • For very large tables, add an index on the created_at column if not already present (see the sketch after this list)
  • The COPY method is generally fastest for CSV exports
  • Binary formats (pg_dump custom format) preserve type information but are slower to generate
  • Consider using ORDER BY created_at DESC LIMIT 10000 if you specifically need the most recent N records
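
A quick sketch of the first and last tips, assuming the event_log table from the examples above:

-- Build the filter index without blocking writes (run outside a transaction)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_event_log_created_at
    ON event_log (created_at);

-- Export only the 10,000 most recent rows
COPY (
    SELECT * FROM event_log
    ORDER BY created_at DESC
    LIMIT 10000
) TO '/path/to/recent_events.csv' WITH CSV HEADER;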

Here's how I recently exported error logs from our production database:

-- First ensure our query is efficient
EXPLAIN ANALYZE 
SELECT * FROM server_logs 
WHERE log_time > '2023-06-01' AND level = 'ERROR';

-- Then export with compression (TO PROGRAM runs on the database server
-- and needs superuser or pg_execute_server_program privileges)
COPY (
    SELECT id, log_time, message 
    FROM server_logs 
    WHERE log_time > '2023-06-01' 
    AND level = 'ERROR'
    ORDER BY log_time DESC
) TO PROGRAM 'gzip > /exports/error_logs_202306.csv.gz' 
WITH CSV HEADER;
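
To load that compressed export into another database, COPY ... FROM PROGRAM can decompress on the fly (this assumes a server_logs table with matching columns already exists on the target, and the same server-side privileges):

COPY server_logs (id, log_time, message)
FROM PROGRAM 'gunzip -c /exports/error_logs_202306.csv.gz'
WITH CSV HEADER;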

For very large exports, you can pipe directly to compression utilities:

# Using gzip compression
psql -U username -d dbname -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT WITH CSV HEADER" \
| gzip > events_after_202305.csv.gz

# Using zstd for better compression ratios
psql -U username -d dbname -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT WITH CSV HEADER" \
| zstd -o events_after_202305.csv.zst