During a recent RHEL 4.6 deployment on a Dell PowerEdge 1950 server, I encountered intermittent "kernel: journal commit I/O error" messages, accompanied by the more dire "EXT3-fs error (device sda5) in start_transaction: Journal has aborted". These errors appeared randomly during both OS installation and Oracle database operations.
The failures manifested in several scenarios:
- During the RHEL package installation phase
- While running Oracle database imports
- Across multiple physical disks (ruling out single disk failure)
My initial diagnostics covered the disk, the filesystem, the kernel log, and the RAID controller:
# Check disk SMART status
smartctl -a /dev/sda
# Verify filesystem integrity
fsck -fy /dev/sda5
# Monitor the kernel ring buffer (RHEL 4's dmesg has no -T timestamp flag)
dmesg | grep -i "journal\|ext3\|sda"
# RAID controller health check (for PERC controllers)
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll
After multiple disk replacements failed to clear the errors, I focused on these potential culprits:
- Dell PERC RAID controller firmware version (6.x vs 5.x differences; see the version check below)
- Backplane connectivity issues in the 1950 chassis
- Memory corruption affecting disk I/O (ruled out via memtest86+)
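To rule the firmware question in or out without rebooting into the PERC BIOS, the controller's adapter info dump can be filtered for its version strings. The grep pattern below is an assumption about the field labels MegaCli prints; adjust it to whatever your MegaCli release outputs.
# Show PERC firmware package and BIOS versions (field labels vary between MegaCli releases)
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -iE "firmware|fw package|bios"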
For critical systems where immediate hardware replacement isn't possible:
# Relax journaling to data=writeback (ext3 usually refuses to change the data mode on a live
# remount, so for the root filesystem set it in /etc/fstab and reboot instead)
mount -o remount,data=writeback /
# Disable the journal completely (not recommended for production; the filesystem must be
# unmounted or mounted read-only first)
tune2fs -O ^has_journal /dev/sda5
# Alternative: reformat as ext2 and lose journaling (destroys all data on the partition)
mkfs.ext2 /dev/sda5
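Once trustworthy hardware is back in place, journaling should be restored. This is a minimal sketch, assuming /dev/sda5 is still the affected partition and the filesystem survived intact; if it was reformatted as ext2, the fstab type must also be switched back to ext3.
# Recreate the ext3 journal (safest on an unmounted or read-only filesystem)
tune2fs -j /dev/sda5
# Confirm the has_journal feature is present again
tune2fs -l /dev/sda5 | grep -i features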
When these errors occur during database operations:
-- Locate the alert log / trace directory (v$diag_info exists on 11g and later;
-- on 10g and earlier, query v$parameter for background_dump_dest instead)
SELECT value FROM v$diag_info WHERE name = 'Diag Trace';
-- Force checkpoint before critical operations
ALTER SYSTEM CHECKPOINT;
-- Consider using ASM instead of filesystem
CREATE DISKGROUP DATA NORMAL REDUNDANCY
FAILGROUP fg1 DISK '/dev/sdb1'
FAILGROUP fg2 DISK '/dev/sdc1';
The ultimate fix was replacing the suspect disk, even though its earlier SMART tests had come back clean. The key insight was that some disk failures only manifest under specific I/O patterns that standard diagnostics never exercise.
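One way to surface that kind of latent failure before trusting a drive is a destructive write-pattern scan, which hammers the whole surface far harder than a SMART self-test. This is a sketch only, assuming /dev/sdb is a spare disk with no data on it; the -w mode overwrites everything.
# Destructive four-pass pattern test (wipes the disk!), with progress and verbose output
badblocks -wsv /dev/sdb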
During my recent deployment of RHEL 4.6 with Oracle Database on a Dell PowerEdge 1950 server, I encountered persistent filesystem errors that manifested in two ways:
- SSH sessions displayed "kernel: journal commit I/O error"
- Local console showed "EXT3-fs error (device sda5) in start_transaction: Journal has aborted"
The errors appeared at random across different phases; a typical dmesg trace looked like this:
# Sample dmesg output showing the error pattern
[ 1423.456789] EXT3-fs error (device sda5): ext3_journal_start_sb: Detected aborted journal
[ 1423.567890] EXT3-fs (sda5): Remounting filesystem read-only
[ 1423.678901] journal commit I/O error
Key characteristics of the issue:
- Non-deterministic occurrence during installation and operation
- Persisted across multiple hard drive replacements
- Most frequent during database import operations
To diagnose the problem properly, I ran the following checks:
# Checking filesystem integrity (run from rescue media or with the filesystem unmounted)
fsck -y /dev/sda5
# Monitoring disk health
smartctl -a /dev/sda
# Verifying RAID controller logical drive status
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aAll
The Dell 1950's PERC controller required special attention:
# Checking battery backup unit status
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
# Verifying write cache policy
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -aAll
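If the BBU reports a failed or discharged state, a reasonable stopgap is forcing the logical drives to write-through so no dirty cache can be lost on a power event, then switching back once the battery is healthy:
# Force write-through while the BBU is suspect
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WT -LAll -aAll
# Return to write-back once the BBU reports healthy
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WB -LAll -aAll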
RHEL 4.6's default ext3 configuration needed review:
# Checking mount options
cat /proc/mounts | grep sda5
# Sample /etc/fstab entry for reference:
/dev/sda5 / ext3 defaults,data=ordered 1 1
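The kernel also logs which data mode it actually mounted with, which makes a quick cross-check against the fstab entry; the exact message wording varies between kernel versions.
# Confirm the active journaling data mode reported at mount time
dmesg | grep -i "mounted filesystem"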
After extensive testing, the solution involved:
- Replacing the suspect hard drive (despite previous replacements)
- Updating the RAID controller firmware
- Adding explicit barriers in fstab:
# Updated fstab entry
/dev/sda5 / ext3 defaults,data=ordered,barrier=1 1 1
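Barriers only help if the controller actually honors them; when it doesn't, the kernel logs a JBD warning and quietly disables them again, so it's worth checking after remounting. The grep below is an approximation, since the message text differs between kernel versions.
# Look for any sign that barriers were rejected or disabled
dmesg | grep -i barrier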
For Oracle database operations, additional tuning proved beneficial:
# Adding to /etc/sysctl.conf
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
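These two settings shrink the pool of dirty pages the kernel is allowed to accumulate, so large imports flush in smaller, steadier bursts rather than sudden writeback storms. A quick sketch of applying and verifying them without a reboot:
# Load the new values from /etc/sysctl.conf
sysctl -p
# Confirm the running values
sysctl vm.dirty_ratio vm.dirty_background_ratio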