Recover from GTID replication relay log error in MySQL 5.6 using GTID_PURGED18 May 2014
Recently, we had an error in our production system. Replication on MySQL Slave failed.
It was not caused by a bug, but by an unexpected Slave server restart of the whole machine.
SHOW SLAVE STATUS\G we got scared, that Master's binlog is also corrupt,
which happened not so long ago - MySQL Bug #7022.
Fortunately, Master's binlog was intact, so we moved back to Slave, to solve the problem (removed MySQL noise for brevity):
mysql> SHOW SLAVE STATUS\G *************************** 1. row *************************** ... Slave_IO_Running: Yes Slave_SQL_Running: No ... Last_Errno: 1594 Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE ... Executed_Gtid_Set: 3aabcfda-591d-11e3-81f1-0050569a4585:1-14351185 Auto_Position: 1 1 row in set (0.00 sec)
You can see from the output, that we are using GTID replication,
which is available since MySQL 5.6. Transactions
1-14351185 were executed successfully, but the
14351186 got corrupt.
We tried to skip it by
committing an empty GTID transaction,
but that did not help.
What to do next? At this point I wanted to start crying, because multiple sites and tutorials suggest to
perform a clean
mysqldump of Master, which will be copied to Slave.
That is "the correct" way of solving replication problems.
However, shutting down a site, just to perform a ~700GB database dump is not fun at all.
MySQL 5.6 replication variables
Browsing through the net reveals a small number of pages, which mention this secret
I didn't comprehend the meaning of this variable at first (straight from dev.mysql.com):
The set of all transactions that have been purged from the binary log.
But after some digging, it has definitely its place inside MySQL.
By changing the
GTID_PURGED on Slave, we can inform the server about missing binary logs on Master, so Slave
starts replication from the next available transaction identifier. This way, we don't have to perform a full
Master Slave backup. Just make sure, that binlogs, which you want to synchronize
are still present on Master.
Work your magic GTID_PURGED!
Reset available replication information on Slave, so Slave forgets its own binlogs and Master's relay logs.
mysql> RESET SLAVE; mysql> RESET MASTER;
Issue a command to reconfigure replication by pointing Slave to Master (which was cleared in previous step):
mysql> CHANGE MASTER TO MASTER_HOST = '...', MASTER_PORT = ..., MASTER_USER = '...', MASTER_PASSWORD = '...', MASTER_AUTO_POSITION = 1;
We are ready to instruct Slave with information about deleted binlogs on Master, so Slave will not replicate them and they will be skipped.
The reason, why we want to skip some transactions, is because they were already replicated to Slave before.
mysql> SET global GTID_PURGED="3aabcfda-591d-11e3-81f1-0050569a4585:1-14351185";
We are done with preparations. All, that is left to be done is to start the replication:
mysql> START SLAVE;
And check if both Slave replication threads are doing their work without errors:
*************************** 1. row *************************** Slave_IO_State: Waiting for master to send event ... Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Seconds_Behind_Master: 157717 Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: ... Retrieved_Gtid_Set: 3aabcfda-591d-11e3-81f1-0050569a4585:14351185-14805026 Executed_Gtid_Set: 3aabcfda-591d-11e3-81f1-0050569a4585:1-14351185 Auto_Position: 1 1 row in set (0.00 sec)
MySQL is a great database, which solves problems it is meant to solve. New replication functionality (GTID) however has it quirks when compared to the "old" replication by master log coordinates. Hopefully, this post will help others when solving similar problems with bringing replication back to life after Slave failure.