When working with Ubuntu 10.04 LTS VMs on VMware vSphere, you might encounter this particularly stubborn kernel message indicating a soft lockup condition. The reported duration of 17163091988s
(approximately 544 years) immediately suggests a counter overflow or timestamp corruption issue.
kernel: [18446744060.007150] BUG: soft lockup - CPU#0 stuck for 17163091988s! [jed:26674]
After examining multiple occurrences, several patterns emerge:
- The binary representation 1111111111000000000000000000010100 reveals potential counter corruption
- VMware Tools shows an "OK" status despite the lockup
- Affected processes vary (jed in this case, but others have been observed)
- Related to 120-second blocked task messages
For production systems experiencing this issue:
# Temporary workaround (until reboot):
echo 1 > /proc/sys/kernel/softlockup_panic
echo 30 > /proc/sys/kernel/watchdog_thresh
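To confirm the new runtime values took effect, a quick check against the same /proc paths written above:
# Both reads should reflect the values just written
cat /proc/sys/kernel/softlockup_panic
cat /proc/sys/kernel/watchdog_thresh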
For Ubuntu 10.04 LTS VMs:
- Update to latest kernel:
sudo apt-get update
sudo apt-get install linux-image-2.6.32-35-server
- VMware-specific tuning:
# Add to /etc/sysctl.conf
kernel.watchdog_thresh = 30
kernel.softlockup_panic = 1
kernel.nmi_watchdog = 0
- Disable problematic kernel modules:
sudo rmmod floppy ppdev lp parport
echo "blacklist floppy" | sudo tee -a /etc/modprobe.d/blacklist.conf
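The blacklist entry only affects future module loads; a short follow-up sketch (assuming Ubuntu's standard update-initramfs tool) to confirm the modules are gone and make the blacklist stick across reboots:
# Verify the modules are no longer loaded (no output means none are present)
lsmod | grep -E "floppy|ppdev|lp|parport"
# Rebuild the initramfs so the blacklist is honored at early boot
sudo update-initramfs -u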
Create a watchdog script to detect early signs:
#!/bin/bash
# Log every soft lockup message from dmesg and reboot once a threshold is reached
LOG=/var/log/softlockup_monitor.log
THRESHOLD=10

dmesg | grep "soft lockup" | while read -r line; do
    timestamp=$(date +%s)
    echo "$timestamp - $line" >> "$LOG"
    count=$(grep -c "soft lockup" "$LOG")
    if [ "$count" -ge "$THRESHOLD" ]; then
        /usr/bin/vmware-toolbox-cmd stat reset
        reboot
    fi
done
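One way to run it periodically is from a cron entry; a sketch, with /usr/local/bin/softlockup_monitor.sh as an assumed install path:
# Install the script and schedule it every five minutes as root
sudo cp softlockup_monitor.sh /usr/local/bin/softlockup_monitor.sh
sudo chmod +x /usr/local/bin/softlockup_monitor.sh
echo "*/5 * * * * root /usr/local/bin/softlockup_monitor.sh" | sudo tee /etc/cron.d/softlockup_monitor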
For developers needing deeper analysis, configure kdump:
sudo apt-get install linux-crashdump
echo 1 > /proc/sys/kernel/sysrq
# Add the parameters to the kernel command line in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... nmi_watchdog=0 softlockup_panic=1"
sudo update-grub
When the issue occurs, trigger a crash dump with:
echo c > /proc/sysrq-trigger
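Once kdump has captured a vmcore, it can be inspected with the crash utility. A minimal sketch, assuming the dump landed under /var/crash (the default for linux-crashdump) and that the matching debug vmlinux from the dbgsym package is installed:
sudo apt-get install crash
# Point crash at the debug kernel image and the captured dump
sudo crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/*/vmcore*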
When analyzing kernel soft lockups, the reported duration of 17163091988s immediately stands out as suspicious. This value (0x3FF000014 in hex) suggests a potential counter overflow or timestamp calculation error in the kernel's watchdog mechanism. The binary pattern 1111111111000000000000000000010100 hints at an overflow in the upper bits of the counter, with 20 seconds (10100 in binary) being the actual stuck duration.
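The arithmetic is easy to reproduce from a shell; a small sketch (plain bash and bc, nothing VM-specific) that converts the reported value and masks off the suspect high bits:
printf 'hex: %x\n' 17163091988       # -> 3ff000014
echo "obase=2; 17163091988" | bc     # -> 1111111111000000000000000000010100
echo $(( 17163091988 & 0xFFFFFF ))   # low 24 bits -> 20, the plausible real stuck duration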
Several consistent patterns emerge from analyzing these lockups in VMware environments:
// Typical error pattern observed
kernel: [18446744060.007150] BUG: soft lockup - CPU#0 stuck for 17163091988s! [process:pid]
kernel: [18446744060.026854] Modules linked in: [...] vmw_pvscsi [...]
Key VMware-related factors:
- VMware Tools shows "OK" status but might not prevent timing issues
- Common with paravirtual SCSI controllers (vmw_pvscsi)
- Occurs during storage operations in many cases
When encountering this issue, collect the following diagnostic information:
# Check current watchdog settings
cat /proc/sys/kernel/watchdog_thresh
cat /proc/sys/kernel/nmi_watchdog
# Check VM configuration
sudo dmidecode -t system | grep -i vmware
vmware-toolbox-cmd stat uptime
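To tie this back to the storage controller factor above, it is also worth confirming whether the guest is actually using the paravirtual SCSI driver (a sketch; the lspci wording may vary between vSphere versions):
# Check whether the pvscsi driver is loaded and which controller the VM presents
lsmod | grep vmw_pvscsi
lspci | grep -i scsi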
Based on multiple production deployments, these solutions have shown effectiveness:
Kernel Parameter Tuning
# Add to /etc/sysctl.conf
kernel.watchdog_thresh = 30
kernel.nmi_watchdog = 0
kernel.softlockup_panic = 0
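These settings are only read at boot; to apply them immediately without a reboot:
sudo sysctl -p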
VM Configuration Adjustments
- Switch from paravirtual to LSI Logic SAS controller
- Ensure the VM uses a VMXNET3 network adapter instead of E1000 (see the driver check after this list)
- Disable CPU power-saving features in the host BIOS
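A quick way to confirm which virtual NIC driver the guest is actually using (a sketch, assuming the interface is eth0):
# "driver: vmxnet3" indicates the VMXNET3 adapter; "driver: e1000" the emulated one
sudo ethtool -i eth0 | grep driver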
Monitoring Workaround
Create a monitoring script to detect potential lockups:
#!/bin/bash
# Periodically scan dmesg for the suspicious lockup signature and log any hits
LOCKUP_LOG="/var/log/lockup_monitor.log"

check_lockup() {
    if dmesg | grep -q "soft lockup.*171630919"; then
        echo "$(date) - Potential lockup detected" >> "$LOCKUP_LOG"
        # Optional auto-recovery
        # reboot
    fi
}

# Run every 5 minutes
while true; do
    check_lockup
    sleep 300
done
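Because the loop never exits, it needs to be started in the background at boot; a simple option on 10.04 (a sketch, using /usr/local/bin/lockup_monitor.sh as an assumed install path) is to launch it from /etc/rc.local before the final exit 0 line:
nohup /usr/local/bin/lockup_monitor.sh >/dev/null 2>&1 &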
The root causes typically involve:
- VMware timer synchronization issues
- CPU starvation during heavy I/O
- Kernel bugs in specific 2.6.32 versions
- Interrupt masking during storage operations
Particularly problematic versions:
2.6.32-30-server #59-Ubuntu (confirmed issues)
2.6.32-35-server #72-Ubuntu (reported improvements)
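To see which of these builds a guest is actually running, the kernel release and Ubuntu build number are visible via uname:
uname -r    # e.g. 2.6.32-35-server
uname -v    # the #NN-Ubuntu build string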
Upgrading to newer kernel versions or applying specific patches often resolves the issue. The following patch has shown effectiveness in some environments:
# Example patch for timer handling
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index abc1234..def5678 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -123,6 +123,9 @@ static void watchdog_overflow_callback(struct perf_event *event,
if (duration > softlockup_thresh) {
printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
smp_processor_id(), duration, current->comm, task_pid_nr(current));
+ if (duration > 0x3FFFFFFF)
+ printk(KERN_INFO "Possible timer overflow detected\n");
+
print_modules();
print_irqtrace_events(current);
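To experiment with a change like this, the Lucid kernel source can be fetched, patched, and rebuilt through the Ubuntu packaging; a sketch, where watchdog-overflow.patch is just a placeholder name for the diff above and the flavour target may differ per setup:
apt-get source linux-image-$(uname -r)
cd linux-2.6.32*
patch -p1 < ../watchdog-overflow.patch   # hypothetical file containing the diff above
# Rebuild the kernel packages (slow); the Lucid packaging provides per-flavour targets
fakeroot debian/rules clean
fakeroot debian/rules binary-server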