How to Identify and Troubleshoot High Context Switch Rates in Linux (PostgreSQL Case Study)


When your Linux system shows high context switch rates (around 20k/s in vmstat) but low CPU utilization and load averages, it's time to dig deeper. In the vmstat output below, the cs column shows roughly 20k context switches per second:

# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0   7292 251876  82344 2291968   0    0     0    73   12 20116  1  0 99  0

The initial pidstat output reveals several PostgreSQL processes with significant context switching (the two numeric columns are cswch/s, voluntary switches, and nvcswch/s, involuntary switches, per second):

# pidstat -w 10 1 | grep postgres
12:39:23        25190     12.19     35.86  postgres
12:39:23        31247      4.10     23.58  postgres
12:39:23        31249     82.92     34.77  postgres

But these don't account for the full 20k switches. We need more precise tools.

For detailed per-process context switch metrics:

# perf stat -e context-switches -p $(pidof postgres | tr ' ' ',') sleep 10

 Performance counter stats for process id '2534,2536,12061...':

         203,452      context-switches

      10.001274094 seconds time elapsed

This gives us the exact count instead of averages.
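
If you also want to see where those switches originate, perf can record the scheduler tracepoint with call stacks; a minimal sketch, reusing the same comma-joined PID list as above:

# perf record -e sched:sched_switch -g -p $(pidof postgres | tr ' ' ',') -- sleep 10
# perf report --stdio | head -50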

For the most detailed view, use ftrace:

# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# cat /sys/kernel/debug/tracing/trace_pipe | grep postgres

This will show every single context switch involving PostgreSQL processes.
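
The firehose can be overwhelming, so in practice it helps to capture a fixed window, count the matches, and remember to switch the tracepoint off again; a rough sketch:

# timeout 10 cat /sys/kernel/debug/tracing/trace_pipe > /tmp/sched_switch.log
# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# grep -c postgres /tmp/sched_switch.log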

If PostgreSQL is the main culprit, consider these configuration tweaks:

# postgresql.conf optimizations
max_connections = 50                  # Reduce from default 100
shared_buffers = 4GB                  # 25% of RAM
work_mem = 16MB                       # Reduce on-disk sorts (default is 4MB)
maintenance_work_mem = 256MB
random_page_cost = 1.1                # For SSD storage
effective_io_concurrency = 200        # For SSD storage
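
Keep in mind that shared_buffers and max_connections only take effect after a full restart, while the remaining settings can be applied with a reload (the service name below varies by distro and version):

# sudo -u postgres psql -c "SELECT pg_reload_conf();"
# systemctl restart postgresql        # required for shared_buffers / max_connections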

General Linux optimizations:

# sysctl optimizations
sysctl -w kernel.sched_migration_cost_ns=5000000
sysctl -w kernel.sched_autogroup_enabled=1
sysctl -w kernel.sched_min_granularity_ns=10000000
sysctl -w kernel.sched_wakeup_granularity_ns=15000000
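
To make these survive a reboot, drop them into a sysctl.d file (the filename here is just an example); note that on kernels 5.13 and later most of these sched_* knobs have moved from sysctl to /sys/kernel/debug/sched/:

# cat > /etc/sysctl.d/99-sched-tuning.conf <<'EOF'
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 1
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
EOF
# sysctl --system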

For continuous monitoring, the sysstat suite can sample both system-wide and per-process rates:

# Continuous sampling with sysstat
sar -w 10            # system-wide context switches per second (cswch/s)
pidstat -w 10        # per-process breakdown every 10 seconds

This provides ongoing visibility into context switch patterns.


When vmstat shows consistently high context switch rates (20k+/sec) but system load appears normal, we need specialized tools to pinpoint the exact culprits. The key indicators in your case:

# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0   7292 249472  82340 2291972   0    0     0     0    0     0  7 13 79  0
 0  0   7292 251808  82344 2291968   0    0     0   184  24 20090  1  1 99  0

While pidstat provides per-process averages, perf can confirm the exact system-wide total:

# perf stat -e context-switches -a sleep 10
 Performance counter stats for 'system wide':

         200,123      context-switches

      10.001281559 seconds time elapsed

To identify specific threads causing switches:

# perf top -e context-switches -s comm
Samples: 1M of event 'context-switches'
Event count (approx.): 1000000

Overhead  Command
  45.12%  postgres: writer process   
  32.78%  ksoftirqd/0
  12.45%  daemon1
   5.67%  jbd2/dm-0-8
   3.98%  sshd: user@pts/0

For kernel-level context switch analysis:

# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# cat /sys/kernel/debug/tracing/trace_pipe > /tmp/context_switch.log
# Wait 30 seconds, press Ctrl+C, then disable the tracepoint again:
# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# sed -n 's/.*next_comm=\([^ ]*\).*/\1/p' /tmp/context_switch.log | sort | uniq -c | sort -nr

The high postgres context switches suggest:

  • Possible lock contention in database
  • Too many client connections (or a lack of effective connection pooling)
  • Improperly tuned shared_buffers

Check with:

# sudo -u postgres psql -c "SELECT pid,query_start,wait_event_type,wait_event 
FROM pg_stat_activity WHERE wait_event IS NOT NULL;"
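
To see which wait events dominate rather than listing individual backends, the same view can simply be grouped:

# sudo -u postgres psql -c "SELECT wait_event_type, wait_event, count(*) 
FROM pg_stat_activity WHERE wait_event IS NOT NULL GROUP BY 1, 2 ORDER BY 3 DESC;"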

For scheduler statistics:

# cat /proc/schedstat
cpu0 0 0 0 0 0 0 1209056870 278356 1458
cpu1 0 0 0 0 0 0 1190234567 265489 1390
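
Per-process counters are also exposed under /proc; the status file splits switches into voluntary (the process blocked or slept) and non-voluntary (the scheduler preempted it). Here pgrep -o just picks the oldest postgres PID, typically the postmaster:

# grep ctxt_switches /proc/$(pgrep -o postgres)/status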

To see which system calls a suspect process spends its time in (futex and poll waits translate directly into voluntary context switches; note that strace adds noticeable overhead):

# strace -c -p PID     # let it run ~10 seconds, then Ctrl+C to print the summary
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.34    0.123456        1234       100           futex
 12.45    0.045678         456       100           poll
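
A large futex share usually means the backends are waiting on in-database locks rather than doing work. A quick cross-check from the database side (again assuming psql access as the postgres user) is to look for ungranted locks:

# sudo -u postgres psql -c "SELECT locktype, relation::regclass, mode, pid 
FROM pg_locks WHERE NOT granted;"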

Based on these findings, two remediation steps:

  1. Tune PostgreSQL parameters:
    shared_buffers = 4GB
    effective_cache_size = 12GB
    work_mem = 16MB
    maintenance_work_mem = 256MB
    
  2. Adjust kernel scheduler:
    # echo 1000000 > /proc/sys/kernel/sched_min_granularity_ns
    # echo 10000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
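
After applying these changes, re-run the original measurements to confirm the context switch rate has actually dropped:

# vmstat 3
# pidstat -w 10 1 | grep postgres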