How to Export a Subset of PostgreSQL Table Data Using Conditional Dumping


1 views

When working with large PostgreSQL databases containing tables with millions of records (like event logs), full table dumps are often impractical. The pg_dump utility by default exports entire tables, which becomes inefficient when you only need recent records or specific data subsets.

PostgreSQL provides the --data-only and --table flags for pg_dump, but the real power comes with the --where option:

pg_dump -h your_host -U your_user -d your_db \
--table=your_table \
--data-only \
--where="created_at > '2012-05-01'" \
-f recent_events_dump.sql

For more control over the output format, you can use PostgreSQL's COPY command through psql:

psql -h your_host -U your_user -d your_db -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2012-05-01') \
TO STDOUT WITH CSV HEADER" > recent_events.csv

When dealing with multi-GB tables:

  • Add appropriate indexes on your filter columns (created_at in this case)
  • Consider using --jobs flag for parallel dumping (PostgreSQL 9.3+)
  • For very large exports, split into smaller batches by date ranges

You can control the output format with additional options:

pg_dump --format=custom --compress=5 --blobs \
--table=event_log --where="created_at > '2023-01-01'" \
-f events_2023.dump

To restore the subset data to another database:

psql -h new_host -U new_user -d new_db -f recent_events_dump.sql

For recurring exports (e.g., monthly logs), create a shell script:

#!/bin/bash
DATE=$(date -d "1 month ago" +%Y-%m-%d)
pg_dump -h localhost -U postgres -d app_db \
--table=events \
--where="created_at > '$DATE'" \
-f /backups/events_last_month.sql

When working with large PostgreSQL tables (particularly log tables that grow continuously), we often need to extract specific subsets of data for analysis or sharing. The standard pg_dump utility exports entire tables by default, which becomes problematic when dealing with multi-GB tables where only a fraction of the data is needed.

Here are the most effective methods to export a temporal subset of your PostgreSQL data:

1. Using COPY Command with WHERE Clause

The most straightforward method for table subsets:

-- Export to CSV
COPY (
    SELECT * FROM event_log 
    WHERE created_at > '2023-05-01'
) TO '/path/to/output.csv' WITH CSV HEADER;

-- For binary format (smaller files)
COPY (
    SELECT * FROM event_log 
    WHERE created_at > '2023-05-01'
) TO '/path/to/output.dat' WITH BINARY;

2. Creating a Temporary View for pg_dump

When you need the full power of pg_dump's format:

-- Create a temporary view
CREATE VIEW recent_events AS 
SELECT * FROM event_log WHERE created_at > '2023-05-01';

-- Export the view
pg_dump -U username -d dbname -t recent_events -Fc -f recent_events.dump

-- Remember to drop the view afterward
DROP VIEW recent_events;

3. Using pg_dump with Custom Query (PostgreSQL 12+)

Modern PostgreSQL versions offer direct query support in pg_dump:

pg_dump -U username -d dbname \
  --table='event_log' \
  --where="created_at > '2023-05-01'" \
  -Fc -f partial_export.dump
  • For very large tables, add an index on the created_at column if not already present
  • The COPY method is generally fastest for CSV exports
  • Binary formats (pg_dump custom format) preserve type information but are slower to generate
  • Consider using ORDER BY created_at DESC LIMIT 10000 if you specifically need the most recent N records

Here's how I recently exported error logs from our production database:

-- First ensure our query is efficient
EXPLAIN ANALYZE 
SELECT * FROM server_logs 
WHERE log_time > '2023-06-01' AND level = 'ERROR';

-- Then export with compression
COPY (
    SELECT id, log_time, message 
    FROM server_logs 
    WHERE log_time > '2023-06-01' 
    AND level = 'ERROR'
    ORDER BY log_time DESC
) TO PROGRAM 'gzip > /exports/error_logs_202306.csv.gz' 
WITH CSV HEADER;

For very large exports, you can pipe directly to compression utilities:

-- Using gzip compression
psql -U username -d dbname -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT" \
| gzip > events_after_202305.csv.gz

-- Using zstd for better compression ratios
psql -U username -d dbname -c \
"COPY (SELECT * FROM event_log WHERE created_at > '2023-05-01') TO STDOUT" \
| zstd -o events_after_202305.csv.zst