Failover for MySQL Replication (and others) – Should it be Automated?

Automatic failover for MySQL Replication has been subject to debate for many years.

Is it a good thing or a bad thing?

For those with long memory in the MySQL world, they might remember the GitHub outage in 2012 which was mainly caused by software taking the wrong decisions.

GitHub had then just migrated to a combo of MySQL Replication, Corosync, Pacemaker and Percona Replication Manager. PRM decided to do a failover after failing health checks on the master, which was overloaded during a schema migration. A new master was selected, but it performed poorly because of cold caches. The high query load from the busy site caused PRM heartbeats to fail again on the cold master, and PRM then triggered another failover to the original master. And the problems just continued, as summarized below.

Fast forward a couple of years and GitHub is back with a pretty sophisticated framework for managing MySQL Replication and automated failover! As Shlomi Noach puts it:

“To that effect, we employ automated master failovers. The time it would take a human to wake up & fix a failed master is beyond our expectancy of availability, and operating such a failover is sometimes non-trivial. We expect master failures to be automatically detected and recovered within 30 seconds or less, and we expect failover to result in minimal loss of available hosts.”

Most companies are not GitHub, but one could argue that no company likes outages. Outages are disruptive to any business, and they also cost money. My guess is that most companies out there probably wished they had some sort of automated failover, and the reasons not to implement it are probably the complexity of the existing solutions, lack of competence in implementing such solutions, or lack of trust in software to take such an important decision.

There are a number of automated failover solutions out there, including (and not limited to) MHA, MMM, MRM, mysqlfailover, Orchestrator and ClusterControl. Some of them have been on the market for a number of years, others are more recent. That is a good sign, multiple solutions mean that the market is there and people are trying to address the problem.

When we designed automatic failover within ClusterControl, we used a few guiding principles:

  • Make sure the master is really dead before you failover

    In case of a network partition, where the failover software loses contact with the master, it will stop seeing it. But the master might be working well and can be seen by the rest of the replication topology.

    ClusterControl gathers information from all the database nodes as well as any database proxies/load balancers used, and then builds a representation of the topology. It will not attempt a failover if the slaves can see the master, nor if ClusterControl is not 100% sure about the state of the master.

    ClusterControl also makes it easy to visualize the topology of the setup, as well as the status of the different nodes (this is ClusterControl’s understanding of the state of the system, based on the information it gathers).

  • Failover only once

    Much has been written about flapping. It can get very messy if the availability tool decides to do multiple failovers. That’s a dangerous situation. Each master elected, however brief the period it held the master role, might have their own sets of changes that were never replicated to any server. So you may end up with inconsistency across all the elected masters.

  • Do not failover to an inconsistent slave

    When selecting a slave to promote as master, we ensure the slave does not have inconsistencies, e.g. errant transactions, as this may very well break replication.

  • Only write to the master

    Replication goes from the master to the slave(s). Writing directly to a slave would create a diverging dataset, and that can be a potential source of problem. We set the slaves to read_only, and super_read_only in more recent versions of MySQL or MariaDB. We also advise the use of a load balancer, e.g., ProxySQL or MaxScale, to shield the application layer from the underlying database topology and any changes to it. The load balancer also enforces writes on the current master.

  • Do not automatically recover the failed master

    If the master has failed and a new master has been elected, ClusterControl will not try to recover the failed master. Why? That server might have data that has not yet been replicated, and the administrator would need to do some investigation into the failure. Ok, you can still configure ClusterControl to wipe out the data on the failed master and have it join as a slave to the new master – if you are ok with losing some data. But by default, ClusterControl will let the failed master be, until someone looks at it and decides to re-introduce it into the topology.

On PostgreSQL

When it comes to PostgreSQL streaming replication, ClusterControl uses similar principles to automate failover. For PostgreSQL, ClusterControl supports both asynchronous and synchronous replication models between the master and the slaves. In both cases and in the event of failure, the slave with the most up-to-date data is elected as the new master. Failed masters are not automatically recovered/fixed to rejoin the replication setup.

There are a few protective measures taken to make sure the failed master is down and stays down, e.g. it is removed from the load balancing set in the proxy and it is killed if e.g. the user would restart it manually. It is a bit more challenging there to detect network splits between ClusterControl and the master, since the slaves do not provide any information about the status of the master they are replicating from. So a proxy in front of the database setup is important as it can provide another path to the master.

On MongoDB

MongoDB replication within a replicaset via the oplog is very similar to binlog replication, so how come MongoDB automatically recovers a failed master? The problem is still there, and MongoDB addresses that by rolling back any changes that were not replicated to the slaves at the time of failure. That data is removed and placed in a ‘rollback’ folder, so it is up to the administrator to restore it.

To find out more, check out ClusterControl; and feel free to comment or ask questions below.

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *