Storage Nodes

All the following options go under the [NDBD_DEFAULT] or [NDBD] headings in config.ini.

For each parameter that defines a memory size, it is possible to use k, M, or G as a suffix to indicate units of 1024, 1024^2, or 1024^3.
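For example, the following two lines request the same amount of DataMemory (an illustrative fragment):

[NDBD DEFAULT]
# 512M = 512 x 1024 x 1024 = 536870912 bytes
DataMemory=512M
# Equivalent, written out in bytes:
# DataMemory=536870912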

Parameters to Define for All Storage Nodes (NDBD_DEFAULT)

There are some parameters that you must define for all nodes. For example, setting NoOfReplicas to different values for different nodes will not work. Also, it does not make sense to set different values for DataMemory and IndexMemory because if you do, you will find yourself limited to the lowest value (storage nodes must have roughly the same amount of data). It therefore makes sense to put these parameters in the DEFAULT section.

NoOfReplicas

The global parameter NoOfReplicas, which can be set only in the [NDBD DEFAULT] section, defines the number of replicas (copies) of each fragment of data stored in the cluster. This parameter also specifies the size of node groups. (Remember that a node group is a set of nodes that all store the same information.)

Node groups are formed implicitly. The first node group is formed by the set of data nodes with the lowest node IDs, the next node group by the set of the next-lowest node identities, and so on.
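For example (the node IDs here are hypothetical), with NoOfReplicas set to 2 and four storage nodes, the node groups are formed as follows:

# NoOfReplicas=2 and four storage nodes with IDs 2, 3, 4, and 5:
#   Node group 0: nodes 2 and 3 (both store the same fragments)
#   Node group 1: nodes 4 and 5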

There is no default value for NoOfReplicas (it must be specified); the maximum possible value is 4. You can set it to 1, but of course that gives you no redundancy, so we recommend that you set it to 2 or more.

You should bear in mind that your total number of storage nodes should be a multiple of NoOfReplicas and, if possible, a power of 2 (clearly impossible if NoOfReplicas is set to 3). So if you have NoOfReplicas set to 2, you should have 2, 4, 8, or 16 (and so on) nodes. This is due to the way that MySQL Cluster currently partitions the data; it is expected to become less important in version 5.1, which will allow you to control the partitioning mechanism.

Changing the setting for NoOfReplicas for an active cluster requires that you take down the entire cluster and re-import the data, so generally you should avoid changing this parameter after you have your cluster set up.

DataDir

The DataDir parameter specifies the directory where local checkpoints, trace files, log files, process ID files, and error logs are placed. This can be specified for each individual storage node, but we recommend that you define it in the DEFAULT section to avoid confusion, and we recommend that you set it to /var/lib/mysql-cluster, which is the conventional location. There is no default value; you must specify this either in the [NDBD_DEFAULT] section or in each separate [NDBD] section.

DataMemory and IndexMemory

DataMemory and IndexMemory are parameters that specify the size of memory segments used to store the actual records and their indexes. Although you do not have to set these, the defaults are very low, and you are unlikely to be able to import anything larger than a token table with them.

When setting these parameters, it may be helpful to look up the syntax of the ALL DUMP 1000 command in Appendix B, "Management Commands," and also the detailed example of its usage in Chapter 2, "Configuration."

We strongly recommend that you set DataMemory and IndexMemory to the same values for all nodes, which is why we also suggest that you define these parameters in the [NDBD_DEFAULT] section. Because data is distributed evenly over all nodes in the cluster, your storage nodes will be able to use only as much as the smallest setting, so there is really no point in setting different values for different nodes.

DataMemory and IndexMemory can be changed, but decreasing either of them can be risky; doing so can easily lead to a node or even an entire cluster being unable to restart due to insufficient memory space. Increasing these values should be fine, but it is important that such upgrades be performed in the same manner as software upgrades, following the procedure outlined at the beginning of this chapter.

DataMemory defines the amount of RAM that is available for the storage node to keep its data in. Data in this context means the actual data in the table and all indexes apart from the primary (hashed) key.

If you set this parameter too low, you get "table is full" errors when you try to add more data. If you set it too high, the system may start to swap (that is, use the hard drive as RAM) excessively and become unstable. We suggest that when you are importing your initial data, you set this to 70% of your system RAM size. After you have your tables imported, you can use the ALL DUMP 1000 command (see Appendix B for more details) to see how much RAM is actually being used, and you can lower this value to be that plus a sensible amount for growth and temporary usage.
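A minimal sketch of that check in the management client (the log lines shown are illustrative; the exact wording varies between versions):

ndb_mgm> ALL DUMP 1000
# The storage nodes then report memory usage to the cluster log
# on the management node, in lines similar to:
#   Node 2: Data usage is 40%(...)
#   Node 2: Index usage is 12%(...)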

You set the DataMemory parameter as a number plus the unit (for example, DataMemory=1024MB). We suggest that you never set this to 100% of your system RAM; if you need that much DataMemory, it is time to add more RAM to your system.

Something that has not been previously discussed that you should be aware of is that DataMemory is also used for UNDO information: For each update, a copy of the unaltered record is allocated in DataMemory. It is therefore necessary to allocate enough memory to handle the largest transactions performed by applications using the cluster. This is one of several reasons you should avoid large transactions.

The default value for DataMemory is 80MB, and the minimum value is 1MB. There is no maximum size, but in practice you should keep the value low enough that the node does not start swapping when the limit is reached.

IndexMemory defines the amount of RAM that is available for the storage node to keep its hash indexes (that is, primary keys) in. Similar problems occur with IndexMemory as with DataMemory if you set the value too high or low. We suggest that you set IndexMemory to 15% of your system RAM size to start with and adjust it as for DataMemory after your data is imported. It is set identically to DataMemory, with a number followed by units (for example, IndexMemory=128MB).
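As a rough sizing sketch combining these two suggestions (the numbers are illustrative starting points, not tuned values), a storage node with 4GB of RAM might begin like this and be reduced after the import:

[NDBD DEFAULT]
# Initial import settings for a 4GB host; tune down afterward
# based on the output of ALL DUMP 1000:
DataMemory=2800M
IndexMemory=600M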

The default value for IndexMemory is 18MB, and the minimum value is 1MB.

FileSystemPath

The FileSystemPath parameter specifies the directory where all files created for metadata, REDO logs, UNDO logs, and data files are placed on the storage node. The default is the directory specified by DataDir. Note that this directory must exist before the ndbd process is initiated.

The storage daemon creates a subdirectory for the node's file system in the directory specified by FileSystemPath; the name of this subdirectory includes the node ID. For example, if the node ID is 2, this subdirectory is named ndb_2_fs. If you leave this at the default and set DataDir to /var/lib/mysql-cluster, the node's file system will end up in /var/lib/mysql-cluster/ndb_X_fs/, where X is the node ID of the storage daemon.

You can set FileSystemPath to a different value for each storage node, but we suggest that you not do this. Similarly, we cannot think of many instances in which it would be particularly beneficial to keep the FileSystemPath outside the default location (although if you have multiple disks, it may be a good idea to put it on a different physical disk to spread out the write operations).

BackupDataDir

It is possible to specify the directory in which backups are placed. By default, this directory is FileSystemPath/BACKUP, which in turn defaults to DataDir/BACKUP unless you have defined FileSystemPath. There should be no reason to change this unless you have multiple hard drives and want to insure against drive failure, in which case you could set BackupDataDir to a folder on a separate hard drive.
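For example, to place backups on a second physical drive (the mount point here is hypothetical):

[NDBD DEFAULT]
DataDir=/var/lib/mysql-cluster
# Backups go to a folder on a separate disk:
BackupDataDir=/mnt/disk2/mysql-cluster-backup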

Transaction Parameters

The next three parameters are important because they affect the number of parallel transactions and the sizes of transactions that can be handled by the system.

MaxNoOfConcurrentTransactions sets the number of parallel transactions possible in a node, and MaxNoOfConcurrentOperations (and its partner parameter, MaxNoOfLocalOperations) sets the number of records that can be in an update phase or locked simultaneously.

These parameters (especially MaxNoOfConcurrentOperations) are likely targets for users setting specific values and not using the default value. The default value is set for small clusters using small transactions, in order to ensure that they do not use excessive memory. If you make extensive use of transactions in an application, it is very likely that you will have some large transactions and will therefore have to increase these parameters.

MaxNoOfConcurrentTransactions

Transaction records are allocated to individual MySQL servers, and generally there is at least one transaction record allocated per connection that is using any table in the cluster. For this reason, you should ensure that there are more transaction records in the cluster than there are concurrent connections to all MySQL servers in the cluster (and a lot more if you have individual transactions using multiple tables).

The default setting for the MaxNoOfConcurrentTransactions parameter is 4096, which should be sufficient for all but the most heavily laden clusters. This parameter must be set to the same value for all cluster nodes.

Changing this parameter is never safe and can cause a cluster to crash. When a node crashes, one of the nodes (the oldest surviving node) builds up the transaction state of all transactions ongoing in the crashed node at the time of the crash. It is thus important that this node have as many transaction records as the failed nodes; in other words, it is a good idea to have this number set significantly higher than you anticipate you will ever use.

MaxNoOfConcurrentOperations and MaxNoOfLocalOperations

When performing transactions of only a few operations each and not involving a great many records, there is no need to set this parameter very high. When performing large transactions involving many records, you should set this parameter higher.

Records are kept for each transaction updating cluster data, both in the transaction coordinator (a randomly chosen storage node) and in the nodes where the actual updates are performed. These records contain state information needed in order to find UNDO records for rollbacks, lock queues, and other purposes.

The MaxNoOfConcurrentOperations parameter should therefore be set to the number of records to be updated simultaneously in transactions, divided by the number of cluster data nodes. For example, in a cluster that has 4 data nodes and that is expected to handle 1,000,000 concurrent updates using transactions, you should set this value to 1,000,000 / 4 = 250,000. Of course, you want to build in a safety margin, so you should set the value higher than this.
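Continuing that example as a configuration fragment (the 20% safety margin is an illustrative choice):

# 1,000,000 simultaneous updates / 4 data nodes = 250,000,
# plus roughly 20% headroom:
MaxNoOfConcurrentOperations=300000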

Read queries that set locks also cause operation records to be created. Some extra operations should be allocated to accommodate cases where the distribution is not perfect over the nodes.

When queries make use of the unique hash index (which most complex queries can be assumed to do at some stage), there are actually two operation records used per record in the transaction: The first record represents the read in the index table (which is a completely separate and hidden table), and the second handles the operation on the base table.

The default value for MaxNoOfConcurrentOperations is 32768. If it is exceeded, you get the error message "Out of operation records in transaction coordinator."

MaxNoOfConcurrentOperations works together with a second parameter that can be configured separately. The second parameter, MaxNoOfLocalOperations, specifies how many operation records are to be local to the node and by default is calculated as 1.1 x MaxNoOfConcurrentOperations (in other words, each node can handle up to 10% more than the average for all nodes).

A very large transaction performed on an eight-node cluster requires as many operation records as there are reads, updates, and deletions involved in the transaction. Therefore, the chances are that you will want to set MaxNoOfLocalOperations higher than the default (that is, 10% greater than MaxNoOfConcurrentOperations) if you are dealing with large transactions. If it is necessary to configure the system for one very large transaction, it is a good idea to configure the two parameters separately, with MaxNoOfLocalOperations set much higher than MaxNoOfConcurrentOperations, because a single large transaction cannot be "spread out" over many nodes.
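A sketch of such a configuration (the values are illustrative):

# Normal load is spread over the cluster...
MaxNoOfConcurrentOperations=100000
# ...but one very large transaction lands on a single node, so give
# each node much more than the default 1.1x headroom:
MaxNoOfLocalOperations=250000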

Transaction Temporary Storage

The next set of parameters is used to define the temporary storage available to the storage nodes when executing a query that is part of a cluster transaction. All records are released when the query is completed and the cluster is waiting for the commit or rollback.

The default values for these parameters are adequate in most situations. However, users who need to support transactions involving large numbers of rows or operations may need to increase these values to enable better parallelism over the cluster. On the other hand, users whose applications require relatively small transactions can decrease the values in order to save memory.

MaxNoOfConcurrentIndexOperations

For queries that use a unique hash index, a temporary set of operation records is used during a query's execution phase. The MaxNoOfConcurrentIndexOperations parameter sets the size of the pool of records available for such use. Thus these records are allocated only while a part of a query is executing, and as soon as that part has been executed, they are released. (The states needed to handle the other phases in a transaction's life, aborts and commits, are handled by the normal operation records, where the pool size is set by the MaxNoOfConcurrentOperations parameter.)

The default value of MaxNoOfConcurrentIndexOperations is 8192. It should be pointed out that only in rare cases of extremely high parallelism using unique hash indexes should it be necessary to increase this value. Using a smaller value is possible and can save memory if you are certain that a high degree of parallelism is not required for your cluster.

MaxNoOfFiredTriggers

A record is created when an operation is performed that affects a unique hash index. Inserting or deleting a record in a table with unique hash indexes or updating a column that is part of a unique hash index "fires" an insert or a deletion into the index table. The resulting record is used to represent this index table operation while waiting for the original operation that fired it to complete. This operation is short lived but can still require a large number of records in its pool for situations with many parallel write operations on a base table containing a set of unique hash indexes.

The default value of MaxNoOfFiredTriggers is 4000, which is sufficient in most situations. In some cases, you can even decrease this value if you feel that the need for parallelism in the cluster is not high.

TransactionBufferMemory

The memory affected by the TransactionBufferMemory parameter is used for tracking operations fired when updating index tables and reading unique indexes. This memory is used to store the key and column information for these operations. It is only very rarely that the value for this parameter needs to be altered from the default.

It is possible to change a similar buffer used for read and write operations by changing a compile-time setting. See "TransactionBufferMemory" at the online documentation pages (http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-db-definition.html) for more details on this parameter, which you are unlikely to need to change.

The default value for TransactionBufferMemory is 1MB.

Scans and Buffering

Table and range scans are used during transactions, and there are certain maximum values you need to set. Several buffers are also used. With all these parameters, the defaults will probably work for most clusters, but you may need to increase them. If you are able to reduce them, you will typically save RAM.

MaxNoOfConcurrentScans

The MaxNoOfConcurrentScans parameter is used to control the number of parallel table or range scans that can be performed in the cluster. Each transaction coordinator (a randomly picked storage node) can handle the number of parallel scans defined for this parameter.

Each scan query is performed by scanning all partitions in parallel (for performance reasons). Each partition scan uses a "scan record" in the node where the partition is located, the number of "scan records" being the value of this parameter multiplied by the number of nodes.

Scans are typically performed in one of two cases. The first of these cases occurs when no hash or ordered index exists to handle the query, in which case the query is executed by performing a full table scan (which can be very slow). The second case is encountered when there is no hash index to support the query but there is an ordered index. Using the ordered index means executing a parallel range scan. Because the order is kept on the local partitions only, it is necessary to perform the index scan on all partitions, which may involve communicating with other storage nodes that hold other partitions (or fragments) of the data.

The default value of MaxNoOfConcurrentScans is 256. The maximum value is 500.

MaxNoOfLocalScans

The MaxNoOfLocalScans parameter specifies the number of local scan records that can take place at any one time on each storage node. You might want to increase this value if many of your table scans are not fully parallelized (that is, able to be executed on many nodes at the same time).

The default value for this parameter is the value for MaxNoOfConcurrentScans multiplied by the total number of storage nodes in the cluster.
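For example, assuming the default for MaxNoOfConcurrentScans in a four-node cluster:

# MaxNoOfConcurrentScans=256 (the default), 4 storage nodes:
# default MaxNoOfLocalScans = 256 x 4 = 1024 local scan records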

BatchSizePerLocalScan

The BatchSizePerLocalScan parameter is used to calculate the number of lock records that needs to exist to handle the concurrent scan operations. If you increase MaxScanBatchSize (defined for SQL nodes later in this chapter, in the section "Parameters to Define for All SQL Nodes (MYSQLD_DEFAULT)"), you should also increase this parameter. The default is 64.

LongMessageBuffer

The LongMessageBuffer parameter defines the size of the internal buffer that is used for passing messages within individual nodes and between nodes. Although it is highly unlikely that this would need to be changed, it is configurable. By default this parameter is set to 1MB.

Logging and Checkpointing

Storage nodes do a large amount of logging to various local log files as part of their operation. This is mostly not logging in the traditional sense (error logs and so on) but rather logging of recent transactions. These logs are used only in the event of a full cluster shutdown, but they are important: Without them, if you suffer a full cluster shutdown, you will not be able to restart your cluster.

NoOfFragmentLogFiles

The NoOfFragmentLogFiles parameter sets the number of REDO log files that the storage node will keep. REDO log files are organized in a ring: The node creates a new file when the current one reaches 16MB and continues creating files until it has NoOfFragmentLogFiles sets of four files, at which point it goes back to the beginning and starts overwriting the oldest files.

It is extremely important that the first and last log files (sometimes referred to as the head and tail log files, respectively) do not meet; when they approach one another too closely, the node begins aborting all transactions encompassing updates due to not having enough room for new log records.

A REDO log record is not removed until three local checkpoints have been completed since that log record was created. Checkpointing frequency is determined by the TimeBetweenLocalCheckpoints and TimeBetweenGlobalCheckpoints parameters, discussed later in this chapter. Clearly, if you increase the time between checkpoints such that the head and tail of the REDO logs come close to each other, you should increase the value of NoOfFragmentLogFiles.

The default NoOfFragmentLogFiles parameter value is 8, which means eight sets of four 16MB files, for a total of 512MB. (Note that REDO log space must be allocated in blocks of 64MB, or four files of 16MB.) In scenarios that require a great many updates, the value for NoOfFragmentLogFiles may need to be set as high as 300 or even higher in order to provide sufficient space for REDO logs. If you have this sort of requirement, you are probably going to need a very fast disk subsystem (for example, RAID or SCSI) to make sure that the cluster is not held up writing REDO logs to disk.
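The arithmetic for an update-heavy cluster might look like this (300 is the figure mentioned above, not a universal recommendation):

# Each unit of NoOfFragmentLogFiles = 4 x 16MB files = 64MB of REDO log
# Default:      8   ->   8 x 64MB = 512MB
# Update-heavy: 300 -> 300 x 64MB = ~19GB of REDO log on disk
NoOfFragmentLogFiles=300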

If the checkpointing is slow and there are so many writes to the database that the log files are full and the log tail cannot be cut without jeopardizing recovery, all updating transactions are aborted with internal error code 410, or "Out of log file space temporarily." This condition prevails until a checkpoint has completed and the log tail can be moved forward.

MaxNoOfSavedMessages

The MaxNoOfSavedMessages parameter sets the maximum number of trace files that will be kept before old ones are overwritten. Trace files are generated when, for whatever reason, the node crashes and are very useful for debugging. You need them if you are trying to report a bug to a developer or if someone else on the mailing list asks you for them, as they can help determine what the cluster was doing at the exact moment it crashed.

The default is 25 trace files. We suggest that you do not change this.

Metadata Objects

Metadata objects are pretty much everything within your cluster, including database tables, system tables, indexes, and so on. You will have to increase some of these parameters if you are attempting to import anything larger than a token database because the defaults are often very low. This is especially true for MaxNoOfAttributes.

MaxNoOfAttributes

The MaxNoOfAttributes parameter sets the number of attributes (including fields, indexes, and primary keys) that can be defined in the cluster.

The default value for this parameter is 1000, with the minimum possible value being 32. There is no maximum. Each attribute consumes around 200 bytes of storage per node (even if it is not used) due to the fact that all metadata is fully replicated on the servers.

This is quite a common limit to hit, and you are likely to be greeted with error 708: "No more attribute metadata records (increase MaxNoOfAttributes)" if you try importing a big table without first increasing this parameter.

MaxNoOfTables

The MaxNoOfTables parameter might appear fairly self-explanatory, but remember that a table is allocated for each table, unique hash index, and ordered index, even though in the case of indexes, it is hidden from view. This parameter sets the maximum number of table objects for the cluster as a whole.

For each attribute that has a BLOB data type, an extra table is also used to store most of the BLOB data. These tables must also be taken into account when defining the total number of tables.

The default value of this parameter is 128, the minimum is 8, and the maximum is 1600. Each table object consumes approximately 20KB per node, regardless of whether it is used.

MaxNoOfOrderedIndexes

For each ordered index in the cluster, an object is allocated to describe what is being indexed and in which storage segments it resides. Each unique index and primary key has both an ordered index and a hash index.

The default value of the MaxNoOfOrderedIndexes parameter is 128. Each object consumes approximately 10KB of data per node, regardless of whether it is used.

MaxNoOfUniqueHashIndexes

For each unique index that is not a primary key, a special table is allocated that maps the unique key to the primary key of the indexed table. By default, an ordered index is also defined for each unique index. To prevent this, you must use the USING HASH option when defining the unique index. (See Chapter 1 for more detail on the USING HASH option.)

The default value of the MaxNoOfUniqueHashIndexes parameter is 64. Each index consumes approximately 15KB per node.

MaxNoOfTriggers

Internal update, insert, and delete triggers are allocated for each unique hash index. (This means that three triggers are created for each unique hash index.) However, an ordered index requires only a single trigger object. Backups also use three trigger objects for each normal table in the cluster.

The MaxNoOfTriggers parameter sets the maximum number of trigger objects in the cluster. The default value for this parameter is 768.

Boolean Parameters

The behavior of data nodes is affected by a set of parameters that take on Boolean values. You can specify these parameters as TRUE by setting them equal to 1 or Y and as FALSE by setting them equal to 0 or N.
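For example, the two lines in each pair below are equivalent:

LockPagesInMainMemory=1
LockPagesInMainMemory=Y

Diskless=0
Diskless=N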

LockPagesInMainMemory

For a number of operating systems, including Solaris and Linux, it is possible to lock a process into memory and so avoid any swapping to disk. You can use the LockPagesInMainMemory parameter to help guarantee the cluster's real-time characteristics.

By default, this feature is disabled. Of course, if you enable it and then don't have enough RAM, you will have a problem (typically a node crash). You will also have no protection from a complete cluster shutdown; you will lose all your data and will have to restore from your last full backup. We don't recommend that you use this feature.

StopOnError

The StopOnError parameter specifies whether an ndbd process is to exit or to perform an automatic restart when it encounters an error condition. This parameter is enabled by default, meaning that the process exits on error.

You will notice that when you start ndbd, you get two processes. If StopOnError is disabled, the "angel" process restarts the "child" process when the child meets an error condition. Otherwise, both processes exit (the default behavior).

Quite commonly, you will want to turn off this parameter, which makes the ndbd process attempt to restart automatically if it encounters an error. This is particularly useful in the event of a short-term network failure that is fixed shortly afterward.

The drawback of setting this parameter is that, in theory, the data node may keep failing and restarting over and over, which could lead to overall cluster performance issues.

Diskless

It is possible to specify cluster tables as diskless, meaning that tables are not checkpointed to disk and that no logging occurs. Such tables exist only in main memory. A consequence of using diskless tables is that neither the tables nor the records in those tables will be preserved after a crash. However, when operating in diskless mode, it is possible to run ndbd on a diskless computer.

The ability to employ diskless mode on a per-table basis is planned for the 5.1 release of MySQL Cluster but is not currently supported.

Currently, when the Diskless parameter is enabled, backups are performed, but backup data is not actually stored (which makes the backups completely pointless).

Diskless is disabled by default, and we recommend that you not use it unless you are using diskless machines. Using it is very risky, and if your entire cluster shuts down, you will lose all your data (except for any traditional SQL backups you may have).

Controlling Timeouts, Intervals, and Disk Paging

A number of parameters specify timeouts and intervals between various actions in cluster data nodes. Most of the timeout values are specified in milliseconds.

TimeBetweenWatchDogCheck

To prevent the main ndbd thread from getting stuck in an endless loop, a "watchdog" thread checks the main thread. The TimeBetweenWatchDogCheck parameter specifies the number of milliseconds between checks. If the process remains in the same state after three checks, it is terminated by the watchdog thread.

The TimeBetweenWatchDogCheck parameter can easily be changed for purposes of experimentation or to adapt to local conditions. It can be specified on a per-node basis, although there seems to be little reason for doing so.

The default timeout is 4000 milliseconds (that is, 4 seconds).

StartPartialTimeout

The StartPartialTimeout parameter specifies the time that the cluster will wait for all storage nodes to come up before the cluster initialization routine is invoked. This timeout is used to avoid a partial cluster startup whenever possible.

The default value is 30000 milliseconds (that is, 30 seconds). A value of 0 means an eternal timeout; in other words, the cluster may start only if all nodes are available.

StartPartitionedTimeout

If the cluster is ready to start after waiting for StartPartialTimeout milliseconds but is still in a possibly partitioned state (in other words, there is at least one node in each node group not yet connected), the cluster will wait this further period of time before starting. If it is not in a potentially partitioned state (for example, if all the nodes in one node group have connected), it will start immediately.

The default timeout is 60000 milliseconds (that is, 60 seconds).

StartFailureTimeout

If a storage node has not completed its startup sequence within the time specified by the StartFailureTimeout parameter, the node startup fails. If you set this parameter to 0, no data node timeout is applied, and nodes will continue to attempt to start eternally.

The default value is 60000 milliseconds (that is, 60 seconds). For data nodes that contain extremely large amounts of data, this parameter should be increased. For example, in the case of a storage node containing several gigabytes of data, a period as long as 10 to 15 minutes (that is, 600,000 to 1,000,000 milliseconds) might be required in order to perform a node start.

HeartbeatIntervalDbDb

One of the primary methods of discovering failed storage nodes is heartbeats. The HeartbeatIntervalDbDb parameter states how often heartbeat signals are sent and how often to expect to receive them. After missing three heartbeat intervals in a row, the node is declared dead. Thus the maximum time for discovering a failure through the heartbeat mechanism is four times the heartbeat interval.

The default heartbeat interval is 1500 milliseconds (that is, 1.5 seconds). This parameter must not be changed drastically and should not vary widely between nodes. If one node uses 5,000 milliseconds and the node watching it uses 1,000 milliseconds, obviously, the node will be declared dead very quickly. This parameter can be changed during an online software upgrade but only in small increments. (If half the nodes in the cluster are expecting to receive heartbeats every 1.5 seconds but you have changed the other half to send them every 15 seconds, you will get nodes thinking that other nodes are dead.)

HeartbeatIntervalDbApi

Each storage node sends heartbeat signals to each MySQL server (SQL node) to ensure that it remains in contact. If a MySQL server fails to send a heartbeat in time, it is declared dead, in which case all ongoing transactions are completed and all resources are released. The SQL node cannot reconnect until all activities started by the previous MySQL instance have been completed. The three-heartbeat criterion for the HeartbeatIntervalDbApi parameter is the same as that described for HeartbeatIntervalDbDb.

The default interval is again 1500 milliseconds (that is, 1.5 seconds). This interval can vary between individual storage nodes because each storage node watches the MySQL servers connected to it independently of all other storage nodes. However, again, there seems limited point in varying this, so we suggest that you define it in the DEFAULT section.

TimeBetweenLocalCheckpoints

The sizes of all write operations executed since the start of the previous local checkpoint are added together to decide when nodes should start the next local checkpoint. The TimeBetweenLocalCheckpoints parameter is therefore not really a time at all: It sets the minimum amount of write activity between local checkpoints, which prevents storage nodes in clusters with few updates from starting the next local checkpoint immediately after the previous one has finished. In clusters with high update rates, the next local checkpoint typically starts immediately after the previous one completes regardless of this setting, and you can force local checkpoints to follow each other with no gap (independent of the cluster's workload) by setting this parameter to 6 or less.

It should be noted that you are not actually setting a time in milliseconds: You are specifying the base-2 logarithm of the number of 4-byte words of write operations that have built up since the last local checkpoint started, so the default value of 20 means 4MB (4 x 2^20 bytes) of write operations, 21 means 8MB, and so on, up to a maximum value of 31, which equates to 8GB of write operations.
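A worked example of this encoding (remember that you are setting an exponent, not a time):

# value 20 (default): 2^20 four-byte words = 4MB of writes between starts
# value 24:           2^24 four-byte words = 64MB
# value 6 or less:    threshold so small that checkpoints run back to back
TimeBetweenLocalCheckpoints=24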

Note

If you are wondering why it was decided to set the TimeBetweenLocalCheckpoints parameter this way, you are not alone! You should get used to it, however; defining a value in milliseconds would break your configuration because it would be far too large (2^2000, for example, if you were trying to mean 2 seconds).

TimeBetweenGlobalCheckpoints

When a transaction is committed, it is committed into the main memory (RAM) of all nodes on which the data is mirrored. However, transaction log records are not flushed to disk (hard drive) as part of the commit. The reasoning behind this behavior is that having the transaction safely committed on at least two autonomous host machines (in RAM) should meet reasonable standards for durability and is essential for good write/update performance. It means that during a complete cluster shutdown, you can technically lose a small number of fully committed transactions (the number depends on the value of this parameter).

It is also important to ensure that even the worst of cases, a complete crash of the cluster, is handled sensibly. To guarantee that this happens, all transactions taking place within a given interval are put into a global checkpoint, which can be thought of as a set of committed transactions that have been flushed to disk.

In other words, as part of the commit process, a transaction is placed in a global checkpoint group. Later on, this group's log records are flushed to disk, and then the entire group of transactions is safely committed to disk on all computers in the cluster as well.

The TimeBetweenGlobalCheckpoints parameter defines the interval between global checkpoints. The default is 2000 milliseconds (that is, 2 seconds). This means that in a worst-case scenario, you could lose up to 2 seconds of committed transactions from your cluster in the event of a total cluster shutdown the instant after the previous global checkpoint starts. If this is unacceptable to you, you should decrease the value of this parameter, but you need to realize that this will impose an extra load on your storage nodes.
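For example, to halve the worst-case loss window at the cost of more frequent disk flushes:

# Flush each global checkpoint group every second instead of every 2 seconds:
TimeBetweenGlobalCheckpoints=1000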

TimeBetweenInactiveTransactionAbortCheck

Timeout handling is performed by checking a timer on each transaction once during every interval specified by the TimeBetweenInactiveTransactionAbortCheck parameter. Thus, if this parameter is set to 1000 milliseconds, every transaction will be checked for timing out once per second.

The default value for this parameter is 1000 milliseconds (that is, 1 second).

TransactionInactiveTimeout

If a transaction is currently not performing any queries but is waiting for further user input, the TransactionInactiveTimeout parameter states the maximum time that the user can wait before the transaction is aborted.

The default for this parameter is 0 (no timeout). For a real-time database that needs to ensure that no transaction keeps locks for too long a time, this parameter should be set to a small value, but not 0. The unit is milliseconds.

TransactionDeadlockDetectionTimeout

When a node executes a query involving an update or a write (in other words, creating a transaction), the node waits for the other nodes in the cluster to respond before continuing. A failure to respond can occur because the other node is dead, because the operation has entered a lock queue, or because the node asked to perform the action is heavily overloaded.

The TransactionDeadlockDetectionTimeout parameter states how long the transaction coordinator will wait for query execution by another node before aborting the transaction; this is important for both node failure handling and deadlock detection. Setting it too high can cause undesirable behavior in situations involving deadlocks and node failure.

The default timeout value is 1200 milliseconds (that is, 1.2 seconds).

NoOfDiskPagesToDiskAfterRestartTUP

When executing a local checkpoint, the storage nodes flush all data pages to disk. Merely doing this as quickly as possible without any moderation is likely to impose excessive loads on processors, networks, and disks. To control the write speed, the NoOfDiskPagesToDiskAfterRestartTUP parameter specifies how many pages per 100 milliseconds are to be written. In this context, a page is defined as 8KB; thus, this parameter is specified in units of 80KBps. Therefore, setting NoOfDiskPagesToDiskAfterRestartTUP to a value of 20 means writing 1.6MB of data pages to disk each second during a local checkpoint. Put simply, you should increase it if you want your local checkpoints to take less time but use more resources.

This value controls the writing of all data stored in DataMemory, so it includes the UNDO log records for data pages.

The default value is 40 (that is, 3.2MB of data pages per second).
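As an illustration of the units (doubling the default here is an arbitrary example):

# 1 page = 8KB, written in batches per 100 milliseconds:
# default 40 -> 40 x 8KB x 10 per second = 3.2MBps
# doubled 80 -> 6.4MBps: faster local checkpoints, higher I/O load
NoOfDiskPagesToDiskAfterRestartTUP=80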

NoOfDiskPagesToDiskAfterRestartACC

The NoOfDiskPagesToDiskAfterRestartACC parameter uses the same units as NoOfDiskPagesToDiskAfterRestartTUP (units of 80KBps) and acts in a similar fashion, but it limits the speed of writing index pages from index memory.

The default value for this parameter is 20 index memory pages per second (that is, 1.6MBps).

NoOfDiskPagesToDiskDuringRestartTUP

The NoOfDiskPagesToDiskDuringRestartTUP parameter is similar to NoOfDiskPagesToDiskAfterRestartTUP and NoOfDiskPagesToDiskAfterRestartACC, but it is concerned with local checkpoints executed on the node while the node is restarting. As part of all node restarts, a local checkpoint is always performed. During a node restart, it is possible to write to disk at a higher speed than at other times because fewer activities are being performed in the node.

The default value is still 40 (that is, 3.2MBps), but we recommend that you increase this, depending on the speed of your nodes.

NoOfDiskPagesToDiskDuringRestartACC

The NoOfDiskPagesToDiskDuringRestartACC parameter controls the number of index memory pages that can be written to disk during the local checkpoint phase of a node restart.

As with NoOfDiskPagesToDiskAfterRestartTUP and NoOfDiskPagesToDiskAfterRestartACC, values for this parameter are expressed in terms of 8KB pages written per 100 milliseconds (that is, 80KBps).

The default value is 20 (that is, 1.6MBps).

ArbitrationTimeout

The ArbitrationTimeout parameter specifies the time that the data node will wait for a response from the arbitrator to an arbitration message. If this time is exceeded, it is assumed that the network has split (and the node will then carry out the relevant logic discussed in Chapter 1).

The default value is 1000 milliseconds (that is, 1 second).

Buffering

Buffers are used as front ends to the file system when writing log records to disk, to improve performance (the storage nodes do not have to wait for the disk to catch up). If the node is running in diskless mode, these parameters can be set to their minimum values without penalty due to the fact that disk writes are "faked" (that is, they do not actually take place) by the NDB storage engine's file system abstraction layer.

UndoIndexBuffer

The UNDO buffer is used during local checkpoints. The NDB storage engine uses a recovery scheme based on checkpoint consistency in conjunction with an operational REDO log. Put simply, it regularly takes a snapshot of the data in the cluster (local checkpoints), which it does by applying the UNDO log to remove half-complete transactions [1] and then flushing the resulting consistent copy of the database to disk. It also regularly stores the REDO log (which contains the more recent transactions) to disk as part of the global checkpoint process. In the event of a failure, it applies the last known full commit to disk (local checkpoint) and then applies the REDO log (global checkpoint).

[1] By doing this, it does not have to prevent any writes during this process. The UNDO log is activated on one table fragment at a time.

The UNDO index buffer is used to store updates on the primary key hash index. Inserts and deletions rearrange the hash index; the NDB storage engine writes UNDO log records that map all physical changes to an index page so that they can be undone at system restart. It also logs all active insert operations for each fragment at the start of a local checkpoint.

Reads and updates set lock bits and update a header in the hash index entry. These changes are handled by the page-writing algorithm to ensure that these operations need no UNDO logging.

The UndoIndexBuffer parameter is 2MB by default. The minimum value is 1MB, and for most applications, the minimum is sufficient. If your applications use extremely large and/or numerous insertions and deletions together with large transactions and large primary keys, it may be necessary to increase the size of this buffer. If this buffer is too small, the NDB storage engine issues internal error code 677: "Index UNDO buffers overloaded."

UndoDataBuffer

The UNDO data buffer plays the same role as the UNDO index buffer, except that it is used with regard to data memory rather than index memory. This buffer is used during the local checkpoint phase of a fragment for insertions, deletions, and updates.

Because UNDO log entries tend to grow larger as more operations are logged, the UndoDataBuffer parameter is typically set larger than its index memory counterpart, with a default value of 16MB.

For some applications, this amount of memory may be unnecessarily large. In such cases, it is possible to decrease this size down to a minimum of 1MB.

It is rarely necessary to increase the size of this buffer. If there is such a need, it is a good idea to check whether the disks can actually handle the load caused by database update activity. A lack of sufficient disk space or I/O capacity cannot be overcome by increasing the size of this buffer.

If this buffer is too small and gets congested, the NDB storage engine issues internal error code 891: "Data UNDO buffers overloaded."

RedoBuffer

All update activities need to be logged. The REDO log makes it possible to replay these updates whenever the system is restarted. The NDB recovery algorithm uses a "fuzzy" checkpoint of the data together with the UNDO log, and then it applies the REDO log to play back all changes up to the restoration point.

The RedoBuffer parameter is 8MB by default. The minimum value is 1MB.

If this buffer is too small, the NDB storage engine issues error code 1221: "REDO log buffers overloaded."

Backup Parameters

The parameters discussed in the following sections define memory buffers set aside for execution of online backups.

BackupDataBufferSize

In creating a backup, there are two buffers used for sending data to the disk. The backup data buffer is used to fill in data recorded by scanning a node's tables. When this buffer has been filled to the level specified as BackupWriteSize (see the "BackupWriteSize" section), the pages are sent to disk. While flushing data to disk, the backup process can continue filling this buffer until it runs out of space. When this happens, the backup process pauses the scan and waits until some disk writes have completed and have thus freed up memory so that scanning may continue.

The default value for the BackupDataBufferSize parameter is 2MB.

BackupLogBufferSize

The backup log buffer fulfils a role similar to that played by the backup data buffer, except that it is used for generating a log of all table writes made during execution of the backup. The same principles apply for writing these pages as with the backup data buffer, except that when there is no more space in the backup log buffer, the backup fails. For that reason, the backup log buffer must be big enough to handle the load caused by write activities while the backup is being made.

The default value for the BackupLogBufferSize parameter should be sufficient for most applications. In fact, it is more likely for a backup failure to be caused by insufficient disk write speed than it is for the backup log buffer to become full. If the disk subsystem is not configured for the write load caused by applications, the cluster is unlikely to be able to perform the desired operations.

It is preferable to configure cluster nodes in such a manner that the processor, rather than the disks or the network connections, becomes the bottleneck.

The default value is 2MB.

BackupWriteSize

The BackupWriteSize parameter specifies the size of messages written to disk by the backup log and backup data buffers.

The default value is 32KB.

Logging

Some parameters are set as a level from 0 to 15, which determines how much information is sent to stdout (typically printed onscreen). The default is to send some (but not much) information about startup events to stdout and nothing else. Note that this is completely separate from the logging information covered earlier, in the section "LogDestination." This information affects only each individual storage node, although it makes sense to set the same values for all storage nodes (in other words, to set the values in the DEFAULT section).
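For example, to report more detail about checkpoints and node restarts on all storage nodes (the levels chosen here are illustrative):

[NDBD DEFAULT]
# Keep the default startup reporting:
LogLevelStartup=1
# Report more detail about local/global checkpoints and node restarts:
LogLevelCheckpoint=3
LogLevelNodeRestart=3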

LogLevelStartup

The LogLevelStartup parameter specifies the reporting level for events generated during startup of the process.

The default level is 1.

LogLevelShutdown

The LogLevelShutdown parameter specifies the reporting level for events generated as part of graceful shutdown of a node.

The default level is 0.

LogLevelStatistic

The LogLevelStatistic parameter specifies the reporting level for statistical events such as number of primary key reads, number of updates, number of inserts, information relating to buffer usage, and so on.

The default level is 0.

LogLevelCheckpoint

The LogLevelCheckpoint parameter specifies the reporting level for events generated by local and global checkpoints.

The default level is 0.

LogLevelNodeRestart

The LogLevelNodeRestart parameter specifies the reporting level for events generated during node restart.

The default level is 0.

LogLevelConnection

The LogLevelConnection parameter specifies the reporting level for events generated by connections between cluster nodes.

The default level is 0.

LogLevelError

The LogLevelError parameter specifies the reporting level for events generated by errors and warnings by the cluster as a whole. These errors do not cause any node failure but are still considered worth reporting.

The default level is 0.

LogLevelInfo

The LogLevelInfo parameter specifies the reporting level for events generated for information about the general state of the cluster.

The default level is 0.

Parameters to Define for Each Storage Node (NDBD)

Although you could set all the parameters we have looked at so far on a per-node basis, it generally makes sense to have the same values for all nodes in the cluster. Therefore, unless there is a specific reason you want one node to have a different setting, we suggest that you define all parameters in the [NDBD_DEFAULT] section.

This leaves the Id and Hostname parameters to set in each [NDBD] section.

An Example of a Storage Section of a config.ini File

Here is an example of a configuration with two storage nodes. It defines the parameters that most people will have to define, and this would make a good first configuration file to work from:

[NDBD DEFAULT]
# 2 replicas (min. for redundancy; requires 2 storage nodes min.)
NoOfReplicas=2
# Standard data directory
DataDir=/var/lib/mysql-cluster
# 512MB storage for data, 64MB for primary keys
DataMemory=512MB
IndexMemory=64MB
# Define MaxNoOfConcurrentOperations; leave
# MaxNoOfLocalOperations as default (1.1 *
# MaxNoOfConcurrentOperations)
MaxNoOfConcurrentOperations=10000
# Slightly increase the number of ordered indexes, unique hash
# indexes, tables, and attributes allowed from defaults.
# When importing tables, set these as high as you can,
# then reduce based on output from the ALL DUMP 1000 command.
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfTables=256
MaxNoOfAttributes=1500

# 2 storage nodes
[NDBD]
# Allow auto allocation of ID, so just set hostname:
Hostname = 10.0.0.5

[NDBD]
Hostname = 10.0.0.6

Hopefully, you can see from this example how you would define the storage node section of your config.ini file, which is by far the largest and most important section.

Remember that if you change anything to do with the file system, you must restart the storage nodes by using the --initial parameter; the procedure for a cluster restart like this without any downtime is given at the beginning of this chapter, under the procedure for a rolling online configuration change.
