When debugging production systems, we often need to monitor error rates in real-time. The classic approach of tail -f | grep error
shows us the errors, but doesn't give us quantitative metrics about the error frequency. Here's how to solve this monitoring gap.
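For reference, the classic approach itself looks like this (adding GNU grep's --line-buffered so matches appear promptly when piped):

tail -f /var/log/my_process/*.log | grep --line-buffered error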
The simplest approach wraps a three-tool pipeline (tail, grep, wc) in watch:
watch -n 1 "tail -n 100 /var/log/my_process/*.log | grep error | wc -l"
This gives you a line-count update every second. The -n 100 restricts the count to the last 100 lines of each file, so the figure reflects recent activity rather than the entire log history.
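Since grep can count matches itself, the grep | wc -l pair collapses to grep -c, which later examples also use:

watch -n 1 "tail -n 100 /var/log/my_process/*.log | grep -c error"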
For long-running monitoring, consider this awk-based solution, which maintains a running total instead of recomputing it on every refresh:
tail -f /var/log/my_process/*.log | awk '/error/{count++; print count}'
To see error counts per time interval (e.g., per minute):
tail -f /var/log/my_process/*.log | \
awk '/error/{
    curr_min = strftime("%H:%M")        # strftime() requires GNU awk
    if (curr_min != last_min) {
        if (last_min != "")             # skip the spurious first report
            print last_min ": " count   # report the minute that just ended
        count = 0
        last_min = curr_min
    }
    count++
}'

Note that a minute with no matching lines prints nothing, since awk only runs when an error line arrives.
When dealing with multiple log files, we also need to handle log rotation:
find /var/log/my_process/ -name "*.log" -type f -print0 | \
    xargs -0 tail -F | \
    grep --line-buffered error | \
    pv --line-mode --interval 1 --rate > /dev/null
The -F flag handles rotated logs, and pv reports the line rate on stderr; stdout goes to /dev/null because we only care about the rate display, not the matching lines themselves.
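If pv isn't installed, GNU awk can approximate the same per-second readout. This is a sketch reusing the per-minute pattern above, just at one-second resolution (systime() and strftime() are gawk extensions):

tail -F /var/log/my_process/*.log | \
grep --line-buffered error | \
gawk '{
    t = systime()                       # current epoch second
    if (t != last && last)
        print strftime("%H:%M:%S", last) " - " n " errors/s"
    if (t != last) { n = 0; last = t }
    n++
}'

Like the per-minute version, it only reports when a fresh error arrives.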
For serious monitoring, consider this robust implementation:
# Persistent error counter with timestamps
count=0
last_print=$(date +%s)
tail -F /var/log/my_process/*.log | \
    stdbuf -oL grep error | \
    while read -r line; do
        ((count++))
        now=$(date +%s)
        if (( now - last_print >= 1 )); then
            echo "$(date '+%Y-%m-%d %H:%M:%S') - $count errors"
            count=0
            last_print=$now
        fi
    done
The stdbuf -oL prefix forces grep to flush each line as it is matched, avoiding stalls from pipe buffering, and the timestamps make the output easy to analyze later.
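One caveat: the loop only wakes when an error line arrives, so quiet periods produce no output at all. A minimal variant using bash's read timeout (the one-second interval is an assumption) emits a count every second even when it is zero:

count=0
last_print=$(date +%s)
tail -F /var/log/my_process/*.log | \
    stdbuf -oL grep error | \
    while true; do
        if read -r -t 1 line; then      # wait at most 1s for the next error
            ((count++))
        elif (( $? <= 128 )); then      # EOF rather than timeout: tail exited
            break
        fi
        now=$(date +%s)
        if (( now - last_print >= 1 )); then
            echo "$(date '+%Y-%m-%d %H:%M:%S') - $count errors"
            count=0
            last_print=$now
        fi
    done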
If you prefer polling to a streaming pipeline, a simple loop can stamp each sample with the time:
while true; do
    echo -n "$(date '+%H:%M:%S') - "
    tail -n 100 /var/log/my_process/*.log | grep error | wc -l
    sleep 1
done
To measure the rate of new errors rather than just totals, compare cumulative counts between iterations (subtracting the windowed tail -n 100 counts would be meaningless, since the window slides):

prev_count=$(cat /var/log/my_process/*.log | grep -c error)
while true; do
    current_count=$(cat /var/log/my_process/*.log | grep -c error)
    new_errors=$((current_count - prev_count))
    (( new_errors < 0 )) && new_errors=0    # counts drop when logs rotate
    echo "$(date '+%H:%M:%S') - New errors: $new_errors - Total: $current_count"
    prev_count=$current_count
    sleep 1
done

Rescanning whole files every second is fine for modest logs; for large ones, prefer the tail -F pipelines above.
For more complex filtering before counting:
watch -n 1 "tail -n 200 /var/log/my_process/*.log | \
awk '/error/ && !/known_warning/ {print}' | wc -l"
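The same idea extends to tallying several patterns in one pass (a sketch; the pattern names are illustrative):

watch -n 1 "tail -n 200 /var/log/my_process/*.log | \
    awk '/error/{e++} /warn/{w++} END{print \"errors:\", e+0, \"warnings:\", w+0}'"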
To both monitor and save results:

while true; do
    echo "$(date -Is) - $(tail -n 50 /var/log/my_process/*.log | \
        grep -c error)" | tee -a error_count.log
    sleep 1
done
When dealing with many log files, restrict the search to recently modified ones and guard against odd filenames with -print0/xargs -0 (-mtime -1 keeps files changed within the last 24 hours):

watch -n 1 "find /var/log/my_process/ -name '*.log' -mtime -1 -print0 | \
    xargs -0 tail -q -n 50 | grep -c error"

The -q flag suppresses tail's ==> file <== headers, which would otherwise inflate the count whenever a filename itself contains the word error.
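Putting the pieces together, the sampling, logging, and multi-file handling can live in one small script (a sketch; the name watch_errors.sh and its defaults are illustrative):

#!/usr/bin/env bash
# watch_errors.sh - timestamped error-count samples across recent logs,
# printed to the terminal and appended to error_count.log.
log_dir=${1:-/var/log/my_process}
pattern=${2:-error}
interval=${3:-1}

while true; do
    count=$(find "$log_dir" -name '*.log' -mtime -1 -print0 | \
            xargs -0 -r tail -q -n 50 | \
            grep -c "$pattern")         # xargs -r: skip tail if no files (GNU)
    echo "$(date -Is) - $count" | tee -a error_count.log
    sleep "$interval"
done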