Restarting a Cluster
To shut down a cluster cleanly, you issue the SHUTDOWN command in the management client and wait for all nodes to shut down cleanly. This shuts down only your storage and management nodes, not your SQL nodes. However, all tables you have converted to the NDB engine will no longer be available, because you have shut down all the storage nodes.
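For example, you might run the management client on the management host itself like this (add a connect string such as -c mgmhost:1186 if the client runs elsewhere; the host name is only an illustration):
shell> ndb_mgm
ndb_mgm> SHUTDOWN
ndb_mgm> EXIT
You can also issue the same command non-interactively with ndb_mgm -e SHUTDOWN.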
When you restart the storage nodes in your cluster (for example, by using the ndbd daemon), you should not normally use the --initial flag when you start ndbd. --initial simply means "I am running for the first time; please take DataDirectory, delete everything in it, and format it for my use." You use it in only three situations:
- When starting the cluster for the first time
- When starting the cluster after making certain changes to config.ini (changes that affect the disk storage of the nodes, as discussed in Chapter 2)
- When upgrading the cluster to a new version
When you run ndbd with --initial, MySQL Cluster will clear the cluster file system. (This can be considered Stage 0 in the startup process.)
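To make the difference concrete, here is a sketch of the two ways of starting a storage node from a shell on that node's host (assuming ndbd can find its connect string in the usual way):
shell> ndbd            # normal restart: reuses the data already in DataDirectory
shell> ndbd --initial  # initial start: wipes DataDirectory and rebuilds it from scratch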
Note that starting all the nodes in any one node group with the --initial flag at the same time after a shutdown will completely destroy all your data. The only time you should start all nodes with --initial is when you are installing the cluster for the first time or when you are upgrading between major versions and have very good backups.
Restarting on Failure
You will inevitably have to restart your cluster from time to time. We cover upgrading a cluster later (see the section "Upgrading MySQL Cluster," later in this chapter), but here we describe how nodes resynchronize their data and how to recover from a shutdown of the complete cluster or of just a single node.
How a Storage Node Stores Data to Disk
When a transaction (that is, a query) is committed, it is committed to the RAM of all nodes on which the data is mirrored. Transaction log records are not flushed to disk as part of the commit. This means that as long as one of those nodes remains working, the data is safe. It also means that no disk reads or writes take place during a transaction, which naturally removes that bottleneck.
However, this of course means that if all the nodes suffer a simultaneous failure that clears their RAM, you lose your data. Therefore, MySQL Cluster is designed to handle a complete cluster crash, that is, all nodes in any one node group (or all nodes) being killed (for example, if the power is cut to all servers and then any UPS system fails to work). It does this in several ways, all of which involve storing the data on the hard drives of the individual storage nodes in a process known as checkpointing.
The first type of checkpoint, a global checkpoint, stores the recent transactions in a log file format. The data node flushes the most recent REDO log (which contains all the recent transactions) to disk, which allows the cluster to reapply those transactions in the event of a total failure of all nodes in a node group. The frequency with which this copy is updated is controlled by the parameter TimeBetweenGlobalCheckpoints in config.ini and defaults to 2 seconds. A shorter interval increases durability but costs performance; a longer interval improves performance but reduces durability.
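If you want to tune this, the parameter belongs in the [NDBD DEFAULT] section of config.ini and is given in milliseconds; the following simply restates the default of 2 seconds:
[NDBD DEFAULT]
# Flush the REDO log to disk every 2,000ms; lower for more durability,
# higher for more performance
TimeBetweenGlobalCheckpoints=2000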
The second type of checkpoint, a local checkpoint (LCP), takes place on each storage node more or less concurrently. During an LCP, all the cluster's data is written to disk. In most clusters with high update rates, a new LCP is likely to start immediately after the previous one completes; by default, a new checkpoint starts after 4MB of write operations have accumulated since the last checkpoint was started. The LCP mechanism uses an UNDO log so that it can create a completely consistent copy of the data without locking anything while doing so. An LCP is essentially the same process that occurs when you take an online backup with MySQL Cluster. The purpose of the LCP is to allow the data node to remove old REDO log records so that disk usage does not grow indefinitely.
The cluster stores on disk the three most recent LCPs, along with the REDO logs covering the time in between.
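The threshold that triggers a new LCP can be adjusted in the [NDBD DEFAULT] section of config.ini. Despite its name, TimeBetweenLocalCheckpoints is not a time: it is the base-2 logarithm of the number of 4-byte words of write activity that must accumulate before a new LCP starts, so the default of 20 corresponds to the 4MB mentioned above (4 x 2^20 bytes):
[NDBD DEFAULT]
# 20 = start a new LCP after roughly 4MB of writes (4 x 2^20 bytes)
TimeBetweenLocalCheckpoints=20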
Single-Node Restart
If one node in each node group remains working, you can start the other node(s) in the node group simply by running ndbd on the servers where the dead node(s) reside; they should connect and start working.
In some situations, the data on a node's disk can become corrupted; if this is the case and the node fails to start properly, you simply start it with the --initial flag so that it clears its file system and copies its data from the other node(s) in its node group.
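Putting this together, recovery of a single failed storage node might look like the following sketch. SHOW, run in ndb_mgm on the management host, tells you which node is down; the ndbd commands are run on that node's own host:
ndb_mgm> SHOW
shell> ndbd            # normal restart; reuses the node's on-disk data
shell> ndbd --initial  # only if the normal start fails because of corrupted files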
Doing an Entire Cluster Restart (System Restart)
If your entire cluster fails for some reason, the recovery can be more complicated than the recovery for a single node.
You should first bring up your management node and then start your storage nodes so they can connect. Each storage node copies the last complete LCP it has on its disk back into RAM and then applies the latest complete global checkpoint (from the REDO log).
If none of these files are corrupted on any nodes, you should find that the startup is fairly quick and everything continues from where it was when it died.
However, if some nodes do not come up, you are still okay as long as one node in each node group has come up. You can start other nodes with ndbd --initial as long as there is another node in that node group that has started and has a complete set of data stored on it.
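As a rough sketch (the -f path below is only an example; ndb_mgmd may not need it at all, depending on where your config.ini lives), the system restart sequence looks like this:
shell> ndb_mgmd -f /path/to/config.ini   # on the management host, first
shell> ndbd                              # on each storage host, without --initial
ndb_mgm> ALL STATUS                      # watch the nodes progress through the start phases
You would run ndbd --initial only afterward, and only on a node that refuses to come up while the rest of its node group is already running.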
Note that a cluster normally does not want to start unless all the data nodes are connected, so during a restart it waits for the missing data nodes to connect. The length of this wait is specified by the setting StartPartialTimeout, which defaults to 30 seconds. If, at the end of those 30 seconds, a cluster is possible (that is, it has at least one node from each node group) and it cannot be in a network-partitioned situation (that is, it has all the nodes of at least one node group), the cluster performs a partial cluster restart, in which it starts up even though some data nodes are missing.
If the cluster is in a potentially partitioned setup, where it does not have all the nodes of any single node group, it waits even longer, governed by the setting StartPartitionedTimeout, which defaults to 60 seconds; starting in that situation would be dangerous because network partitioning can lead to data integrity issues. The reason for waiting at all rather than starting immediately is that a system restart with all nodes present is normally much faster than starting partially and then performing node restarts for the latecomers, because node restarts involve copying data over the network.
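Both timeouts are set in the [NDBD DEFAULT] section of config.ini and are given in milliseconds; the values below simply restate the defaults described above:
[NDBD DEFAULT]
# Wait up to 30 seconds for all data nodes before considering a partial start
StartPartialTimeout=30000
# Keep waiting up to 60 seconds when starting now could mean starting partitioned
StartPartitionedTimeout=60000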