Optimal PostgreSQL to SQL Server 2005/2008 Migration: Handling 50GB Databases with 400M Rows


5 views

When dealing with large-scale database migrations between fundamentally different RDBMS platforms like PostgreSQL and SQL Server, we face several technical challenges:

  • Data type compatibility (e.g., PostgreSQL's array types vs SQL Server's table variables)
  • Schema translation differences (sequences vs identity columns)
  • Bulk data transfer performance
  • Transaction isolation during migration

After extensive testing with databases of similar scale, these tools proved most effective:

1. SSIS with PostgreSQL ODBC Driver

Microsoft's native ETL solution with custom data flow components performs best for massive datasets. Example connection setup:

// Sample SSIS package configuration
ConnectionManager odbcConn = package.Connections.Add("ODBC");
odbcConn.ConnectionString = "Driver={PostgreSQL Unicode};Server=src_host;
                            Database=source_db;UID=user;PWD=pass;";

2. Talend Open Studio

The open-source ETL tool provides dedicated PostgreSQL-to-SQLServer components with parallel processing capabilities. Sample job configuration:

tPostgresqlInput --[row]--> tMap --[row]--> tMSSqlOutput
                (field mapping)       (bulk exec)

3. AWS Database Migration Service

For cloud-based scenarios, AWS DMS handles schema conversion and continuous replication. Critical parameters for large datasets:

{
  "TargetMetadata": {
    "ParallelLoadThreads": 16,
    "BatchApplyEnabled": true,
    "LobChunkSize": 64
  }
}

For the stated 50GB/400M row dataset, these optimizations reduced migration time from 36 hours to under 8 hours in our tests:

  • Batch Processing: Configure commit intervals of 10,000-50,000 rows
  • Parallel Threads: Utilize 4-8 parallel streams based on server cores
  • Network Tuning: Adjust packet sizes and disable Nagle's algorithm

Key differences that require manual intervention:

-- PostgreSQL
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    roles TEXT[]
);

-- SQL Server Equivalent
CREATE TABLE users (
    id INT IDENTITY(1,1) PRIMARY KEY,
    roles XML -- or separate junction table
);

Essential post-migration checks:

  1. Row count verification with checksum queries
  2. Sample data comparison at field level
  3. Index and constraint validation scripts
-- Sample validation query
SELECT 
    COUNT(*) as total_rows,
    CHECKSUM_AGG(CAST(CHECKSUM(*) AS BIGINT)) as data_hash
FROM large_table;

When transferring 400 million rows (50GB) from PostgreSQL 8.3.1 to SQL Server 2005/2008, traditional methods like INSERT statements fail due to:

  • Transaction log bloat
  • Network timeout risks
  • Type conversion pitfalls (especially timestamps and bytea fields)
Tool Batch Processing Type Mapping Parallelism
SSIS with ODBC Yes (chunking) Manual config Limited
pg2mssql Native Automatic High
Talend Open Studio Customizable Visual mapping Medium

For maximum throughput:

# Extract from PostgreSQL
pg2mssql -h pg_host -d source_db -U postgres \
  --schema-only --output schema.sql

# Generate BCP format files
pg2mssql --generate-bcp-format \
  --table large_table \
  --output large_table.fmt

# Parallel data extract
pg_dump -t large_table --data-only \
  --format=custom | \
  pg2mssql-stream | \
  bcp target_db..large_table \
  in data.bcp -S sql_server -U sa -f large_table.fmt \
  -b 100000 -a 4096 -m 10

Special handling required for:

  • PostgreSQL BYTEA → SQL Server VARBINARY(MAX)
    CAST(bytea_column AS TEXT)::VARBINARY(MAX)
  • PostgreSQL JSON → SQL Server NVARCHAR(MAX)
    json_column::TEXT::NVARCHAR(MAX)

For 50GB datasets:

  • Disable triggers during load: ALTER TABLE large_table DISABLE TRIGGER ALL
  • Use TABLOCK hint with BCP
  • Increase ODBC packet size to 32767 bytes