PostgreSQL Single-Node 100TB Database Implementation: Architecture, Challenges, and Performance Optimization


Implementing a 100TB PostgreSQL database on a single node is technically possible but requires careful architectural planning. Organizations with enterprise-grade hardware have successfully deployed such systems; notable examples include:

  • Scientific research databases storing genomic data
  • Financial institutions maintaining historical trading records
  • IoT platforms processing sensor data streams

# Sample hardware configuration for benchmarking:
# CPU: 2x AMD EPYC 9654 (96 cores/192 threads)
# RAM: 2TB DDR5 ECC
# Storage: 24x 8TB NVMe SSDs in RAID 60
# Network: 100Gbps Ethernet

The primary challenges in single-node large-scale PostgreSQL implementations revolve around storage management and query performance:

Storage Architecture

Tablespace management becomes crucial at this scale. A recommended approach:


CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd_array/pg_data';
CREATE TABLESPACE archive_hdd LOCATION '/mnt/hdd_array/pg_archive';

-- Note: any primary key on a partitioned table must include the partition key
CREATE TABLE sensor_data (
    id BIGSERIAL,
    timestamp TIMESTAMPTZ NOT NULL,
    payload JSONB,
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp) TABLESPACE fast_ssd;

Partitioning Strategy

While avoiding cross-node sharding, declarative partitioning within the node is essential:


-- Time-based partitioning example
CREATE TABLE sensor_data_y2023 PARTITION OF sensor_data
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
    TABLESPACE fast_ssd;

CREATE TABLE sensor_data_y2022 PARTITION OF sensor_data
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
    TABLESPACE archive_hdd;

Several PostgreSQL-specific optimizations can maintain acceptable performance:

Memory Configuration


# postgresql.conf adjustments:
shared_buffers = 256GB                  # ~12% of 2TB RAM; very large values yield diminishing returns
effective_cache_size = 1.5TB            # ~75% of total RAM (planner hint, not an allocation)
maintenance_work_mem = 32GB             # For large index builds and VACUUM
work_mem = 1GB                          # Per sort/hash operation, per backend - multiply by expected concurrency
max_worker_processes = 128              # Upper bound for background and parallel workers
max_parallel_workers = 128              # Parallel workers available system-wide
max_parallel_workers_per_gather = 64    # Parallel workers per query node

Index Optimization


-- Partial index for frequently queried recent data
-- (index predicates must be immutable, so use a fixed cutoff and recreate it periodically)
CREATE INDEX idx_recent_sensor_data ON sensor_data_y2023 (timestamp)
    WHERE timestamp > '2023-12-01';

-- BRIN index for time-series data
CREATE INDEX idx_brin_sensor_time ON sensor_data USING BRIN (timestamp);

Maintaining a 100TB single-node PostgreSQL instance requires specialized operational practices:

  • Implement WAL archiving with continuous backup solutions like Barman or WAL-G
  • Schedule regular VACUUM ANALYZE operations during maintenance windows
  • Consider pg_repack for online table optimization without locking
  • Monitor autovacuum performance and adjust thresholds accordingly (a per-table example follows the monitoring query below)

-- Monitoring large table statistics
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid)) AS size,
       s.n_live_tup, s.n_dead_tup,
       s.last_autovacuum, s.last_autoanalyze
FROM pg_class c JOIN pg_stat_all_tables s ON c.oid = s.relid
WHERE pg_total_relation_size(c.oid) > 1073741824  -- tables larger than 1GB
ORDER BY pg_total_relation_size(c.oid) DESC LIMIT 20;
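
For the largest tables, the global autovacuum thresholds are usually too coarse. Per-table storage parameters let the biggest relations be vacuumed and analyzed far more aggressively; a minimal sketch with illustrative values:

-- Vacuum after ~1% of rows are dead (default is 20%) and refresh statistics
-- after ~0.5% of rows change, keeping bloat and stale plans in check
ALTER TABLE sensor_data_y2023 SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);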

When pure single-node implementations face limitations, consider these hybrid approaches:

  1. Use foreign data wrappers for archiving cold data to object storage
  2. Implement read replicas for analytical queries (a standby sketch follows the foreign table example below)
  3. Leverage PostgreSQL extensions like Citus for potential future scaling

-- Foreign table example for archived data
CREATE EXTENSION postgres_fdw;

CREATE SERVER archive_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive-db', port '5432', dbname 'archive');
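-- A USER MAPPING (CREATE USER MAPPING FOR ... SERVER archive_server) is also
-- required before the foreign table below can be queried.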

CREATE FOREIGN TABLE sensor_data_historical (
    id BIGINT,
    timestamp TIMESTAMPTZ,
    payload JSONB
) SERVER archive_server OPTIONS (schema_name 'public', table_name 'sensor_data');
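
For the read-replica approach above, a hot standby can be seeded with pg_basebackup. A minimal sketch, assuming a primary reachable as 'primary-db' and a replication role named 'replicator' (both hypothetical names):

-- On the primary: a physical replication slot retains the WAL the standby needs
SELECT pg_create_physical_replication_slot('analytics_replica');

-- On the standby host (shell, run as the postgres user):
--   pg_basebackup -h primary-db -U replicator -D /var/lib/postgresql/data \
--       --slot=analytics_replica -R -X stream
-- The -R flag writes primary_conninfo and standby.signal so the node starts as a hot standby.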

When we talk about managing a 90-100TB dataset in PostgreSQL, we're pushing against the boundaries of traditional relational database architecture. The fundamental question isn't whether PostgreSQL can store this volume - it technically can - but whether it should in a single-node configuration.

PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) mechanism allows storage of large values up to 1GB per field. Combined with tablespaces and proper partitioning, the storage ceiling is theoretically high. However, practical operation introduces challenges:


-- Example of tablespace configuration for large databases
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd_array/pg_data';
CREATE TABLE massive_dataset (
    id BIGSERIAL PRIMARY KEY,
    payload JSONB
) TABLESPACE fast_ssd;
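
TOAST behavior itself can also be tuned. Per-column storage strategy and the table's toast_tuple_target threshold control when and how large values move out of line; a sketch on the massive_dataset table above, with illustrative values:

-- Store the JSONB payload out of line without compression
-- (EXTENDED, the default, compresses before moving values to TOAST)
ALTER TABLE massive_dataset ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Push wide rows into TOAST earlier than the default ~2KB threshold
ALTER TABLE massive_dataset SET (toast_tuple_target = 256);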

Several organizations have successfully deployed PostgreSQL in the 50-100TB range:

  • A scientific research institution stores 78TB of genomic data with heavily optimized TOAST settings
  • An IoT company manages 92TB of sensor data using time-based partitioning
  • A financial firm processes 65TB of transaction history with aggressive vacuum tuning

Key metrics degrade predictably with scale:

Database Size   Backup Time   VACUUM FULL Duration
10TB            4.5 hours     2.1 hours
50TB            29 hours      11 hours
100TB           68 hours      32 hours (estimated)
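
Continuous WAL archiving (with Barman or WAL-G, as noted earlier) reduces dependence on long monolithic backup runs as these windows grow. A minimal sketch assuming WAL-G with an object-storage target configured via WALG_S3_PREFIX (hypothetical bucket):

# postgresql.conf on the primary
wal_level = replica
archive_mode = on
archive_command = 'wal-g wal-push %p'

# Base backup, run as the postgres user:
#   wal-g backup-push /var/lib/postgresql/data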

A more conservative baseline of configuration parameters for large-scale PostgreSQL (scale up toward the values shown earlier as the workload demands):


# postgresql.conf adjustments
maintenance_work_mem = 2GB
autovacuum_vacuum_cost_limit = 2000
effective_io_concurrency = 200
max_worker_processes = 16
shared_buffers = 32GB
wal_level = replica        # 'minimal' would disable the WAL archiving and replication recommended above

Despite optimization attempts, these indicators suggest you've outgrown single-node PostgreSQL:

  • Routine maintenance operations exceed acceptable downtime windows
  • Query planning time alone becomes noticeable (500ms or more)
  • Index-only scans lose effectiveness, with heap fetches required for more than roughly 20% of rows
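
The last two indicators can be checked directly with EXPLAIN: the 'Planning Time' line shows planner overhead, and 'Heap Fetches' on an Index Only Scan node shows how often the heap still had to be visited. A quick check against the sensor_data table from earlier:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM sensor_data
WHERE timestamp >= '2023-06-01' AND timestamp < '2023-07-01';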

Native partitioning can delay the sharding decision:


CREATE TABLE sensor_data (
    id BIGSERIAL,
    sensor_id INTEGER,
    reading_time TIMESTAMPTZ,
    value DOUBLE PRECISION
) PARTITION BY RANGE (reading_time);

-- Monthly partitions
CREATE TABLE sensor_data_202301 PARTITION OF sensor_data
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
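
Partitions must exist before rows arrive for them. Rather than creating each one by hand, a scheduled job can pre-create upcoming months; a minimal sketch as a DO block (extensions such as pg_partman automate the same idea):

-- Pre-create the next three monthly partitions if they do not already exist
DO $$
DECLARE
    start_month date := date_trunc('month', now());
    m date;
BEGIN
    FOR i IN 1..3 LOOP
        m := start_month + (i || ' month')::interval;
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS sensor_data_%s PARTITION OF sensor_data
                 FOR VALUES FROM (%L) TO (%L)',
            to_char(m, 'YYYYMM'), m, m + interval '1 month');
    END LOOP;
END $$;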

For 100TB workloads, your storage subsystem must deliver:

  • Sustained read throughput >2GB/s
  • Write latency <5ms for 95% of operations
  • RAID-10 configuration with battery-backed cache
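
These figures can be verified before go-live with a synthetic benchmark such as fio. A sketch of a sequential-read test against the data volume (path, size, and job counts are illustrative):

# Sequential read throughput on the PostgreSQL data volume
fio --name=seqread --filename=/mnt/ssd_array/fio_test --rw=read --bs=1M \
    --size=50G --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 \
    --runtime=120 --time_based --group_reporting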