Testing Your Cluster

2017-11-03 09:05:03

When your cluster is working and you understand how to restart nodes, it is time to ensure that it is indeed highly available. What you are now going to do is kill some nodes to make sure that the cluster keeps working.

First of all, you open the MySQL client on one of the SQL nodes and issue a SELECT query. Next, you go to the management node and either issue kill -9 to the ndb_mgmd process or, if the management node is alone on the server (in other words, if there is no SQL or storage node on the same server), unplug the network or power cable. Then you return to the SQL node (that is, the MySQL client) and issue the SELECT query again, and you should find that it still works. If it does, you have just verified that your cluster can survive a failure of the management node. If you have more than one SQL node, you can try the query on all of them.
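If you have several SQL nodes, repeating the query by hand on each one gets tedious. The following is a minimal sketch, not a definitive tool: it assumes the mysql command-line client is installed where the script runs and that credentials come from an option file. The function names, host names, and table name are hypothetical and exist only for illustration.

```python
import subprocess


def default_runner(host, query):
    """Run `query` against `host` with the mysql command-line client.

    Returns True if the client exits with status 0, False otherwise
    (including when the mysql binary cannot be started at all).
    Assumption: credentials are supplied by an option file such as ~/.my.cnf.
    """
    try:
        result = subprocess.run(
            ["mysql", "--host", host, "--execute", query],
            capture_output=True,
            text=True,
        )
    except OSError:
        return False
    return result.returncode == 0


def check_sql_nodes(hosts, query, runner=default_runner):
    """Return a dict mapping each SQL node host to True (query worked) or False."""
    return {host: runner(host, query) for host in hosts}


# Example call (hypothetical host names and table; substitute your own):
#   check_sql_nodes(["sql1.example.com", "sql2.example.com"],
#                   "SELECT * FROM test.ctest")
```

Running the check once before killing the management node and once afterward gives you a before-and-after comparison for every SQL node. The runner parameter is there so that the query mechanism can be swapped out, for example for a database driver.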

Next, you need to restart your management node. The procedure differs depending on how you killed the node:

If you unplugged the power cable, you plug it back in and boot the machine and then follow the process for issuing kill -9.

If you issued kill -9, all you need to do is repeat the process described earlier: change directory to /var/lib/mysql-cluster and start ndb_mgmd, or use one of the other methods mentioned earlier.

If you unplugged the network cable, all you need to do is plug it back in. Unlike storage nodes, management nodes do not kill themselves after a certain period of time without communication (heartbeats) from other nodes.

After you have restarted the management daemon by whatever means, you should run the management client, ndb_mgm, and issue a SHOW command to check that all the storage and SQL nodes have reconnected to the management server and that your cluster is back up and running. When you have established that all is well, you are ready to continue the testing.
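The SHOW check can also be scripted. The sketch below assumes that ndb_mgm is on the PATH and can reach the management server with its default connection settings, and it assumes that SHOW lists each node on a line beginning with id=N and marks a missing node with the words "not connected". Check both assumptions against the output of your own version before relying on it. The function names are hypothetical.

```python
import subprocess


def fetch_show_output():
    """Run the management client non-interactively and return what it prints."""
    result = subprocess.run(
        ["ndb_mgm", "-e", "show"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ndb_mgm exits with an error
    )
    return result.stdout


def disconnected_nodes(show_output):
    """Return the IDs of nodes that the SHOW output reports as not connected.

    Assumption: node lines start with "id=N" and a missing node's line
    contains the text "not connected".
    """
    missing = []
    for line in show_output.splitlines():
        line = line.strip()
        if line.startswith("id=") and "not connected" in line:
            first_field = line.split()[0]          # for example "id=3"
            missing.append(int(first_field[len("id="):]))
    return missing


# Example call on the machine that runs the management client:
#   disconnected_nodes(fetch_show_output())
```

An empty list means that every node listed by SHOW has reconnected; any IDs returned are the nodes you still need to look at.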

You now want to establish whether your cluster is able to survive a failure of one of the storage nodes. In theory, your cluster can survive as long as one storage node in each node group remains alive. In the example we have been using so far in this chapter, there is only one node group, with two nodes in it, so you can survive the failure of only one storage node. If you had, for example, NumberOfReplicas set to 3 and had six storage nodes, you would have two node groups of three nodes each, and you would be able to survive four nodes failing (two in each node group), although the cluster could potentially fail if only three nodes failed (if all three were in the same node group).
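The node group rule is easy to check with a few lines of code. This is a sketch of the rule as stated above only: it assumes that storage nodes are assigned to node groups in ascending node ID order, and it ignores arbitration and network partitioning. The function names and node IDs are made up for the example.

```python
def node_groups(node_ids, replicas):
    """Split storage node IDs into node groups of `replicas` nodes each.

    Assumption: groups are filled in ascending node ID order.
    """
    if replicas < 1 or len(node_ids) % replicas != 0:
        raise ValueError("node count must be a multiple of the replica count")
    ordered = sorted(node_ids)
    return [ordered[i:i + replicas] for i in range(0, len(ordered), replicas)]


def cluster_survives(node_ids, replicas, failed):
    """Return True if every node group still has at least one live node."""
    failed = set(failed)
    return all(
        any(node not in failed for node in group)
        for group in node_groups(node_ids, replicas)
    )


# Six storage nodes, three replicas: groups are [1, 2, 3] and [4, 5, 6].
six = [1, 2, 3, 4, 5, 6]
print(cluster_survives(six, 3, failed=[1, 2, 4, 5]))  # True: one node left in each group
print(cluster_survives(six, 3, failed=[1, 2, 3]))     # False: a whole group is gone
print(cluster_survives([1, 2], 2, failed=[1]))        # True: the two-node example
```

The second call shows why the count of failed nodes alone is not enough: three failures in one group bring the cluster down, while four failures spread across both groups do not.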

To check that your cluster is indeed highly available, you log in to a SQL node and run the MySQL client. You should then issue the SELECT query as before and verify that it works. Then you move over to one of the storage nodes and either issue a kill -9 command for the ndbd process (there are actually two ndbd processes per node, so you need to kill both at the same time; otherwise, one can restart the other) or remove the power or network cable from the back of the server. (Again, this works only if the server is running nothing but a storage node; if it is also running a management node or the only SQL node, you should not do this!) You then repeat the SQL query against the part of the cluster that is still running (for example, from a SQL node on the surviving storage node's server), and you should find that it continues to work. If you issued a kill -9 command to kill the storage node on a server that also has a SQL node on it, the SQL node on that server should also continue to work, so you should test that as well.

Now you need to restart the storage node that you killed. If you reset the machine or killed the power and then powered the machine back on, you log in, su to root, and run ndbd. If you simply killed the ndbd process, you should be able just to start the ndbd process again.

If you removed the network cable, you should plug it back in and watch to see if the storage node attempts to connect to the management node (by looking at the output of SHOW in the ndb_mgm client). It almost certainly will not, because ndbd will already have exited: the storage node will not have been able to contact the arbitrator (the management daemon) and will have killed itself. You should be aware that it is possible to configure the storage node not to shut down completely in the event of a network outage or similar situation, but instead to attempt to reconnect periodically. For more information, you should read about the option StopOnError, which is covered in Chapter 2. In summary, if the ndbd (storage) node process has exited, you just start it again; if you were very quick, it might not have exited, and you might be able to just plug the cable back in again.
