MySQL Replication: Troubleshooting Slave Stuck in “System Lock” State with High Seconds_Behind_Master


When your MySQL slave shows Slave_SQL_Running_State: System lock together with a high Seconds_Behind_Master (402263 seconds, roughly 4.6 days, in this case), it means the SQL thread is blocked while applying relay log events. The processlist shows two critical threads:

1. IO Thread: "Waiting for master to send event" (healthy state)
2. SQL Thread: "System lock" (problem state)
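
For reference, this is roughly how the two threads appear in the processlist while the problem is occurring (the output below is an illustrative excerpt, not actual output from this case):

-- On the slave
SHOW PROCESSLIST;
-- Id  User         Command  Time    State
-- 12  system user  Connect  1195    Waiting for master to send event   <-- IO thread (healthy)
-- 13  system user  Connect  402263  System lock                        <-- SQL thread (stuck)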

In MySQL 5.6 with InnoDB tables, this typically occurs due to:

  • Disk I/O bottlenecks: especially on EBS volumes with insufficient provisioned IOPS or exhausted gp2 burst credits
  • Single-threaded replication: in 5.6 the SQL thread applies all events sequentially by default
  • Long-running transactions: large transactions from the master are replayed as a single unit on the slave
  • Metadata locking: DDL operations causing contention

Run these while the issue is occurring:

-- Check for locking issues
SHOW ENGINE INNODB STATUS\G

-- Examine replication applier details (this table exists in MySQL 5.7+; on 5.6 rely on SHOW SLAVE STATUS)
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- Check for long-running transactions
SELECT * FROM information_schema.innodb_trx ORDER BY trx_started ASC LIMIT 10;

1. Optimize EBS Performance

For Amazon EC2 with EBS:

# Check current disk stats
iostat -dxm 5

# If %util is consistently above 80%, consider:
#   - upgrading to the gp3 volume type
#   - increasing provisioned IOPS
#   - striping RAID 0 across multiple EBS volumes
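
On gp2 volumes, a slave that slows down after running fine for hours often points to exhausted burst credits. One way to confirm this is to check the volume's BurstBalance metric in CloudWatch (volume ID and time range below are placeholders):

# Check remaining burst credits for the data volume (gp2/st1/sc1 only)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2016-01-01T00:00:00Z --end-time 2016-01-01T06:00:00Z \
  --period 300 --statistics Average

A BurstBalance heading toward 0% means the volume has dropped back to its baseline IOPS.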

2. Tune MySQL Parameters

Add these to my.cnf:

[mysqld]
# Reduce disk I/O pressure (note: both settings trade crash-safety on the slave for throughput)
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

# Enable the multi-threaded applier (slave_parallel_type and
# slave_preserve_commit_order require MySQL 5.7+; 5.6 only parallelizes per database)
slave_parallel_workers = 4
slave_parallel_type = LOGICAL_CLOCK
slave_preserve_commit_order = 1
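
If the slave is already on MySQL 5.7 or later, the parallel-applier settings can also be applied at runtime instead of through my.cnf; a minimal sketch, assuming the SQL thread can be stopped briefly:

-- Stop only the applier thread, change the settings, then restart it
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 4;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';   -- 5.7+ only
SET GLOBAL slave_preserve_commit_order = ON;        -- 5.7+, also needs log_slave_updates
START SLAVE SQL_THREAD;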

3. Handle Large Transactions

For batch operations on the master:

-- On the master, break the work into smaller chunks
START TRANSACTION;
DELETE FROM large_table WHERE id BETWEEN 1 AND 10000;
COMMIT;

-- Rather than a single huge statement such as
DELETE FROM large_table WHERE id < 1000000;
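
Where the number of rows is large, the chunking itself can be scripted. A rough sketch (database name, table, and predicate are illustrative; connection options are omitted):

# Delete in 10k-row chunks, pausing between chunks so the slave can keep up
while true; do
  rows=$(mysql -N -e "DELETE FROM mydb.large_table WHERE created_at < '2014-01-01' ORDER BY id LIMIT 10000; SELECT ROW_COUNT();")
  [ "$rows" -eq 0 ] && break
  sleep 1
done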

If the issue persists, examine the specific event causing the lock:

-- Find the current relay log position
SHOW SLAVE STATUS\G

# Then, from the shell, examine the corresponding relay log events
mysqlbinlog --start-position=345413702 --stop-position=345413863 relay-log.000197

Look for DDL statements (ALTER TABLE) or large DML operations that might be blocking.
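
A quick way to scan that region of the relay log for suspicious statements (same file and positions as above; --verbose decodes row events into commented pseudo-SQL):

mysqlbinlog --verbose --start-position=345413702 --stop-position=345413863 relay-log.000197 \
  | grep -iE 'ALTER TABLE|DROP TABLE|DELETE FROM|UPDATE ' | head -20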

  • Monitor Seconds_Behind_Master with CloudWatch/PMM (a push sketch follows this list)
  • Set up alerts when delay exceeds thresholds
  • Consider upgrading to MySQL 5.7+ for better parallel replication
  • Schedule heavy writes during low-traffic periods
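
Since a self-managed EC2 instance has no built-in replica-lag metric, one option is to push Seconds_Behind_Master into CloudWatch yourself and alarm on it there. A minimal cron-able sketch (the namespace and metric name are arbitrary choices):

# Push the current replication lag as a custom CloudWatch metric
LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ "$LAG" != "NULL" ] && [ -n "$LAG" ]; then
  aws cloudwatch put-metric-data --namespace Custom/MySQL \
    --metric-name SecondsBehindMaster --value "$LAG" --unit Seconds
fi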

When monitoring MySQL replication, seeing Slave_SQL_Running_State: System lock with significant delay (Seconds_Behind_Master: 402263 in this case) indicates a serious performance bottleneck. The slave is processing relay logs but spending excessive time waiting for internal locks.

From your SHOW SLAVE STATUS output, several critical points emerge:

- All tables are InnoDB (good for transactional consistency)
- EBS storage on AWS EC2 (potential I/O bottleneck)
- Significant replication lag (402263 seconds)
- No immediate SQL errors reported
- Relay log space consumption is high (19834085375 bytes, roughly 18.5 GB)

The "System lock" state typically occurs when:

  1. Single-threaded SQL applier gets blocked by large transactions
  2. Storage subsystem cannot keep up with write I/O
  3. Schema changes or DDL operations are being replicated
  4. InnoDB internal lock contention occurs

Immediate Actions

-- Check for long-running transactions:
SELECT * FROM information_schema.innodb_trx 
WHERE trx_started < NOW() - INTERVAL 1 HOUR;

-- Monitor InnoDB status:
SHOW ENGINE INNODB STATUS\G

-- Consider temporarily stopping replication to investigate:
STOP SLAVE;
START SLAVE; -- After investigation

Configuration Optimizations

Add these to your my.cnf (adjust based on instance size):

[mysqld]
slave_parallel_workers = 4      # In 5.6 this parallelizes per database only
slave_parallel_type = DATABASE  # Variable exists from MySQL 5.7; DATABASE matches 5.6 behaviour
innodb_flush_log_at_trx_commit = 2  # For better I/O on slaves
innodb_buffer_pool_size = 12G  # For r4.xlarge or larger
innodb_io_capacity = 2000      # For EBS optimized instances
innodb_read_io_threads = 16
innodb_write_io_threads = 16
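
After restarting with the new settings, it is worth confirming they actually took effect and that the worker threads are running:

-- Confirm the applied values
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

-- With a multi-threaded applier, the worker threads appear as extra
-- "system user" rows in the processlist
SHOW PROCESSLIST;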

Using Performance Schema (MySQL 5.6+)

-- Check lock waits:
SELECT * FROM performance_schema.events_waits_current 
WHERE EVENT_NAME LIKE '%lock%';

-- Identify slow transactions:
SELECT * FROM performance_schema.events_statements_history_long 
WHERE SQL_TEXT IS NOT NULL 
ORDER BY TIMER_WAIT DESC LIMIT 10;
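
Both queries only return rows if the relevant instruments and consumers are enabled, and several of them are off by default in 5.6. A sketch of enabling them at runtime (this adds some instrumentation overhead):

-- Enable lock-wait instrumentation and long statement history collection
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'wait/%lock%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE 'events_waits%' OR NAME = 'events_statements_history_long';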

EBS-Specific Optimizations

For Amazon EBS volumes:

  • Use Provisioned IOPS (io1) volumes for database storage (a conversion sketch follows this list)
  • Ensure EBS optimization is enabled for the EC2 instance
  • Consider RAID 0 across multiple EBS volumes for better throughput
  • Monitor EBS burst balance and IOPS usage
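
On current-generation instances, an attached gp2 volume can usually be converted to io1 (or have its IOPS raised) in place without downtime; a sketch using the AWS CLI (volume ID and IOPS value are placeholders):

# Convert the data volume to io1 with 4000 provisioned IOPS (runs online)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 \
  --volume-type io1 --iops 4000

# Track the modification progress
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0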

One production case showed similar symptoms where a slave was stuck in "System lock" for hours. The solution involved:

1. Identifying a batch DELETE operation in the relay log
2. Increasing innodb_buffer_pool_size from 4G to 16G
3. Setting slave_parallel_workers = 8
4. Migrating to gp2 volumes with 3000 baseline IOPS

This reduced replication lag from 6+ hours to under 5 minutes.

Implement these checks:

-- Create a monitoring query for the age of the oldest open transaction:
SELECT
  NOW() AS check_time,
  SEC_TO_TIME(TIMESTAMPDIFF(SECOND,
    MIN(trx_started), NOW())) AS oldest_trx_age
FROM information_schema.innodb_trx;

-- Set up alerts for:
-- Seconds_Behind_Master > 300
-- Slave_SQL_Running_State = "System lock" for > 5 minutes
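
A cron-able check that implements both thresholds might look like the following (it checks the instantaneous state rather than a 5-minute window, and the mail command and recipient are placeholders for whatever alerting you use):

# Alert if the slave is badly behind or currently stuck in "System lock"
STATUS=$(mysql -e "SHOW SLAVE STATUS\G")
LAG=$(echo "$STATUS" | awk '/Seconds_Behind_Master/ {print $2}')
STATE=$(echo "$STATUS" | awk -F': ' '/Slave_SQL_Running_State/ {print $2}')

if [ "$LAG" = "NULL" ] || [ "${LAG:-0}" -gt 300 ] || [ "$STATE" = "System lock" ]; then
  echo "Replica lag: ${LAG}s, SQL state: ${STATE}" \
    | mail -s "MySQL replication alert" ops@example.com
fi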

For persistent issues:

  1. Consider rebuilding the slave from a fresh snapshot
  2. Evaluate moving to RDS which handles replication more efficiently
  3. Implement delayed replication for critical systems
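
For point 3, delayed replication is built into MySQL 5.6 via CHANGE MASTER TO; a minimal sketch that keeps a replica deliberately one hour behind:

-- Configure a deliberately delayed replica (value is in seconds)
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE;

-- The configured and remaining delay show up as SQL_Delay / SQL_Remaining_Delay
SHOW SLAVE STATUS\G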