Debugging EXT3 Journal Commit I/O Errors on Dell PowerEdge 1950 with RHEL 4.6 and Oracle Database


During a recent RHEL 4.6 deployment on a Dell PowerEdge 1950 server, I encountered intermittent "kernel: journal commit I/O error" messages accompanied by the more dire "EXT3-fs error (device sda5) in start_transaction: Journal has aborted". These errors appeared randomly during both OS installation and Oracle database operations.

The failures manifested in several scenarios:

  • During RHEL package installation phase
  • While running Oracle database imports
  • Across multiple physical disks (ruling out single disk failure)

My first round of diagnostics covered the disk, the filesystem, the kernel log, and the RAID controller:
# Check disk SMART status
smartctl -a /dev/sda

# Verify filesystem integrity (run from rescue media; the filesystem must be unmounted)
fsck -fy /dev/sda5

# Monitor the kernel ring buffer for related messages
dmesg | grep -i "journal\|ext3\|sda"

# RAID controller health check (for PERC controllers)
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll
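
Because the standard SMART attribute dump kept coming back clean (even though, as it turned out, the drive was still bad), it can also be worth running a long offline self-test, which exercises far more of the disk surface; if the drives sit behind the PERC, smartctl's megaraid device type may be needed to reach the physical disks. A minimal sketch, assuming smartmontools is installed and /dev/sda is the suspect member:

# Kick off an extended (long) self-test; it runs on the drive in the background
smartctl -t long /dev/sda

# After the estimated completion time has passed, review the self-test log
smartctl -l selftest /dev/sda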

After multiple disk replacements failed to resolve the issue, I focused on these potential culprits:

  • Dell PERC RAID controller firmware version (6.x vs 5.x differences; see the check after this list)
  • Backplane connectivity issues in the 1950 chassis
  • Memory corruption affecting disk I/O (ruled out via memtest86+)
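
To rule the firmware in or out, the adapter information dump can be filtered for its firmware lines and compared against Dell's current release. A rough sketch, assuming MegaCli64 sits in its default location (the exact field names, such as "FW Package Build", vary between MegaCli releases):

# Dump adapter info and pull out the firmware-related fields
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i "fw\|firmware"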

For critical systems where immediate hardware replacement isn't possible:

# Mount with the data=writeback journal mode (note: ext3 usually refuses to change
# the data= mode on a live remount, so in practice this belongs in /etc/fstab)
mount -o remount,data=writeback /

# Disable the journal completely (filesystem must be unmounted first; not recommended for production)
tune2fs -O ^has_journal /dev/sda5

# Alternative: reformat as ext2 and lose journaling entirely (destroys existing data)
mkfs.ext2 /dev/sda5
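
For the root filesystem in particular, the journal data mode has to be in effect when the filesystem is first mounted, so in practice the change is made persistent rather than applied live. A sketch of the persistent form, assuming /dev/sda5 is the root filesystem (the rootflags= addition to the kernel line is only needed because this is the root device, and the kernel version shown is just a placeholder):

# /etc/fstab entry carrying the writeback journal mode
/dev/sda5  /  ext3  defaults,data=writeback  1 1

# /boot/grub/grub.conf kernel line for the root filesystem (adjust the kernel version)
kernel /vmlinuz-2.6.9-67.EL ro root=/dev/sda5 rootflags=data=writeback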

When these errors occur during database operations:

-- Locate the trace directory holding the alert log (v$diag_info is 11g+;
-- on 10g check the background_dump_dest parameter instead)
SELECT value FROM v$diag_info WHERE name = 'Diag Trace';

-- Force checkpoint before critical operations
ALTER SYSTEM CHECKPOINT;

-- Consider using ASM instead of filesystem
CREATE DISKGROUP DATA NORMAL REDUNDANCY
FAILGROUP fg1 DISK '/dev/sdb1'
FAILGROUP fg2 DISK '/dev/sdc1';
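
To tie the database-side symptoms back to the filesystem errors, the alert log can also be scanned from the shell for I/O-related ORA- errors. A quick sketch, assuming a 10g-style bdump location and a SID of ORCL (both hypothetical; adjust the path for your environment):

# Scan the alert log for write failures and corruption-related errors
grep -i "ORA-01115\|ORA-01110\|ORA-27072\|corrupt" $ORACLE_BASE/admin/ORCL/bdump/alert_ORCL.log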

The ultimate fix was replacing the suspect disk, even though earlier SMART tests had come back clean. The key insight was that some disk failures only manifest under specific I/O patterns that standard diagnostics never exercise.


Here is the full troubleshooting write-up. During the deployment of RHEL 4.6 with Oracle Database on the Dell PowerEdge 1950 server, the persistent filesystem errors manifested in two ways:

  • SSH sessions displayed "kernel: journal commit I/O error"
  • Local console showed "EXT3-fs error (device sda5) in start_transaction: Journal has aborted"

The errors appeared randomly during different phases of the deployment; a typical sequence in the logs looked like this:

# Sample dmesg output showing the error pattern
[ 1423.456789] EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
[ 1423.567890] EXT3-fs (sda5): Remounting filesystem read-only
[ 1423.678901] kernel: journal commit I/O error

Key characteristics of the issue:

  • Non-deterministic occurrence during installation and operation
  • Persisted across multiple hard drive replacements
  • Most frequent during database import operations

To diagnose the problem properly, I ran the following checks:

# Checking filesystem integrity
fsck -y /dev/sda5

# Monitoring disk health
smartctl -a /dev/sda

# Verifying RAID controller status
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aAll
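
The useful parts of the -LDInfo output are the virtual disk state and cache policy lines; a degraded array or a cache that has silently dropped to write-through changes the I/O behaviour the filesystem sees. A quick filter (field names such as "State" and "Current Cache Policy" may differ slightly between MegaCli releases):

# Pull out the state and cache policy lines for every virtual disk
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aAll | grep -i "state\|cache policy"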

The Dell 1950's PERC controller required special attention:

# Checking battery backup unit status
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL

# Verifying write cache policy
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aAll
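
If the BBU reports as failed or stuck in a learn cycle, the controller may be caching writes it cannot protect through a power loss, which is one way journal commits get lost. While the hardware is suspect, it can help to pin the policy to write-through explicitly; a sketch using the standard MegaCli property switches:

# Force write-through while the battery or cabling is in doubt
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WT -LAll -aAll

# Restore write-back once the hardware checks out
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -LAll -aAll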

RHEL 4.6's default ext3 configuration needed review:

# Checking mount options
grep sda5 /proc/mounts

# Sample /etc/fstab entry for reference:
/dev/sda5 / ext3 defaults,data=ordered 1 1

After extensive testing, the solution involved:

  1. Replacing the suspect hard drive (despite previous replacements)
  2. Updating the RAID controller firmware
  3. Adding explicit barriers in fstab:
# Updated fstab entry
/dev/sda5 / ext3 defaults,data=ordered,barrier=1 1 1
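
Barriers only help if the controller actually honors cache flushes; when it does not, the journal layer logs that it is disabling them again. A quick post-reboot check (the exact JBD message wording varies by kernel build):

# Confirm the mount option took effect
grep sda5 /proc/mounts

# Look for the journal layer refusing barriers on this device
dmesg | grep -i barrier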

For Oracle database operations, additional tuning proved beneficial:

# Adding to /etc/sysctl.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
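
These two settings cap how much dirty page cache can accumulate before writeback starts, which smooths out the large bursty flushes an Oracle import generates. They can be loaded without a reboot:

# Apply the new values and confirm them
sysctl -p /etc/sysctl.conf
sysctl vm.dirty_ratio vm.dirty_background_ratio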