When your Linux system shows high context switch rates (around 20k/s in vmstat) but low CPU utilization and load averages, it's time to dig deeper. The cs column in the vmstat output shows roughly 20,000 context switches per second:
# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 7292 251876 82344 2291968 0 0 0 73 12 20116 1 0 99 0
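To watch just that column over time, a quick awk filter over vmstat is enough (cs is column 12 in this layout; adjust the field number if your vmstat version prints different columns):
# vmstat 3 | awk 'NR>2 {print $12}'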
The initial pidstat output (the two numeric columns are voluntary cswch/s and involuntary nvcswch/s) reveals several PostgreSQL processes with significant context switching:
# pidstat -w 10 1 | grep postgres
12:39:23 25190 12.19 35.86 postgres
12:39:23 31247 4.10 23.58 postgres
12:39:23 31249 82.92 34.77 postgres
But these don't account for the full 20k switches. We need more precise tools.
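One common reason the per-process numbers don't add up is that pidstat -w reports at the process level; if your sysstat version supports it, adding -t breaks the counts down per thread:
# pidstat -wt 10 1 | grep postgres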
For detailed per-process context switch metrics:
# perf stat -e context-switches -p $(pidof postgres | tr ' ' ',') sleep 10
Performance counter stats for process id '2534,2536,12061...':
203,452 context-switches
10.001274094 seconds time elapsed
This gives us the exact count instead of averages.
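To see where those switches originate rather than just how many there are, one option (a sketch, assuming the sched:sched_switch tracepoint is available on your kernel) is to record with call graphs and inspect the report:
# perf record -e sched:sched_switch -g -p $(pidof postgres | tr ' ' ',') -- sleep 10
# perf report --sort comm,symbol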
For the most detailed view, use ftrace:
# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# cat /sys/kernel/debug/tracing/trace_pipe | grep postgres
This will show every single context switch involving PostgreSQL processes.
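Because trace_pipe is extremely verbose, a bounded capture with explicit cleanup is safer; this sketch assumes the default debugfs mount point and an arbitrary log path:
# timeout 30 cat /sys/kernel/debug/tracing/trace_pipe | grep postgres > /tmp/pg_switches.log
# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable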
If PostgreSQL is the main culprit, consider these configuration tweaks:
# postgresql.conf optimizations
max_connections = 50 # Reduce from default 100
shared_buffers = 4GB # 25% of RAM
work_mem = 16MB # Reduce disk sorts
maintenance_work_mem = 256MB
random_page_cost = 1.1 # For SSD storage
effective_io_concurrency = 200 # For SSD storage
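Before and after editing postgresql.conf, it's worth confirming what the server is actually running with; one way (assuming local peer authentication for the postgres user) is to query pg_settings:
# sudo -u postgres psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('max_connections','shared_buffers','work_mem','random_page_cost');"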
General Linux optimizations:
# sysctl optimizations
sysctl -w kernel.sched_migration_cost_ns=5000000
sysctl -w kernel.sched_autogroup_enabled=1
sysctl -w kernel.sched_min_granularity_ns=10000000
sysctl -w kernel.sched_wakeup_granularity_ns=15000000
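Note that sysctl -w changes do not survive a reboot, and on newer kernels (roughly 5.13 and later) several of these sched_* knobs moved from /proc/sys/kernel to debugfs. Where they are still exposed as sysctls, one way to persist them is a drop-in file (the filename below is arbitrary):
# cat > /etc/sysctl.d/90-context-switch.conf <<'EOF'
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 1
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
EOF
# sysctl --system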
For continuous monitoring:
# Install and run csstats
wget https://github.com/brendangregg/perf-tools/archive/master.zip
unzip master.zip
cd perf-tools-master
./csstats 10 # Sample every 10 seconds
This provides ongoing visibility into context switch patterns.
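If you prefer a stock tool, sar from the sysstat package also tracks context switch rates over time (assuming sysstat is installed; the historical data file path varies by distribution):
# sar -w 10 6                           # live cswch/s, six 10-second samples
# sar -w -f /var/log/sa/sa$(date +%d)   # today's historical data, if collection is enabled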
When vmstat shows consistently high context switch rates (20k+/sec) but system load appears normal, we need specialized tools to pinpoint the exact culprits. The key indicators in your case:
# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 7292 249472 82340 2291972 0 0 0 0 0 0 7 13 79 0
0 0 7292 251808 82344 2291968 0 0 0 184 24 20090 1 1 99 0
While pidstat provides per-process context switch counts, we need more granular data:
# perf stat -e context-switches -a sleep 10
Performance counter stats for 'system wide':
200,123 context-switches
10.001281559 seconds time elapsed
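To check whether the switches are concentrated on particular CPUs, perf stat can also report per-CPU counts instead of the system-wide aggregate (-A disables aggregation):
# perf stat -e context-switches -a -A sleep 10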
To identify specific threads causing switches:
# perf top -e context-switches -s comm
Samples: 1M of event 'context-switches'
Event count (approx.): 1000000
Overhead Command
45.12% postgres: writer process
32.78% ksoftirqd/0
12.45% daemon1
5.67% jbd2/dm-0-8
3.98% sshd: user@pts/0
For kernel-level context switch analysis:
# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
# cat /sys/kernel/debug/tracing/trace_pipe > /tmp/context_switch.log
# Wait 30 seconds, then Ctrl+C
# grep -v "0.000" /tmp/context_switch.log | awk '{print $5}' | sort | uniq -c | sort -nr
The high postgres context switches suggest:
- Possible lock contention in database
- Too many concurrent connections (ineffective connection pooling)
- Improperly tuned shared_buffers
Check with:
# sudo -u postgres psql -c "SELECT pid,query_start,wait_event_type,wait_event FROM pg_stat_activity WHERE wait_event IS NOT NULL;"
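If lock contention is the suspicion, it can also help to look for lock requests that have not yet been granted (a generic sketch, not tied to any particular workload):
# sudo -u postgres psql -c "SELECT locktype, relation::regclass, mode, pid FROM pg_locks WHERE NOT granted;"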
For scheduler statistics:
# cat /proc/schedstat
cpu0 0 0 0 0 0 0 1209056870 278356 1458
cpu1 0 0 0 0 0 0 1190234567 265489 1390
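There is also a per-process view: /proc/<PID>/schedstat reports time spent on the CPU, time spent waiting on a runqueue (both in nanoseconds), and the number of timeslices run. For example, for the oldest postgres process (typically the postmaster):
# cat /proc/$(pgrep -o postgres)/schedstat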
To monitor specific processes:
# strace -c -p PID    # attach briefly, then Ctrl+C to print the syscall summary
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
72.34 0.123456 1234 100 futex
12.45 0.045678 456 100 poll
- Tune PostgreSQL parameters:
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 16MB
maintenance_work_mem = 256MB
- Adjust kernel scheduler:
# echo 1000000 > /proc/sys/kernel/sched_min_granularity_ns
# echo 10000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
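Whether these files exist at those paths depends on the kernel version; on roughly 5.13 and later the CFS tunables moved to debugfs, so check both locations before relying on them:
# grep . /proc/sys/kernel/sched_*granularity_ns 2>/dev/null
# grep . /sys/kernel/debug/sched/*granularity_ns 2>/dev/null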