Indexes
Indexes are extremely important in MySQL Cluster, just as they are with other database engines. MySQL Cluster has three different types of indexes: PRIMARY KEY indexes, UNIQUE hash indexes, and ordered indexes (T-tree).
PRIMARY KEY Indexes
The data in MySQL Cluster is partitioned based on a hash of the PRIMARY KEY column(s). This means the primary key is implemented through a hash index type. A hash index can be used to resolve queries that are used with equals but cannot be used for range scans. In addition, you need to use all of a PRIMARY KEY column and not just part of it.
The following is an example of queries that could use the PRIMARY KEY column:
SELECT * FROM tbl WHERE pk = 5; SELECT * FROM tbl WHERE pk = 'ABC' OR pk = 'XYZ';
The following are examples of queries that will not use the PRIMARY KEY index:
SELECT * FROM tbl WHERE pk < 100; SELECT * FROM tbl WHERE pk_part1 = 'ABC';
The first of these examples will not use the PRIMARY KEY index because it involves a range with less than and not a straight equality. The second example assumes that you have a two-part primary key, such as PRIMARY KEY (pk_part1, pk_part2). Because the query is using only the first half, it cannot be used.
Using the primary key is the fastest way to access a single row in MySQL due to the fact that your tables are partitioned on it.
Remember that when you declare a primary key in MySQL Cluster, by default you get both the PRIMARY KEY hash index and an ordered index, as described in the following section. It is possible to force the primary key to have only a hash index if you aren't going to be doing range lookups. Here is how you implement that:
PRIMARY KEY (col) USING HASH
UNIQUE Hash Indexes
UNIQUE hash indexes occur whenever you declare a UNIQUE constraint in MySQL Cluster. They are implemented through a secondary table that has the column you declared UNIQUE as the primary key in the table along with the primary key of the base table:
CREATE TABLE tbl ( Id int auto_increment, Name char(50), PRIMARY KEY (id), UNIQUE (Name) );
This gives you a hidden table that looks like the following:
CREATE TABLE hidden ( Name_hash char(8), Orig_pk int, PRIMARY KEY (name_hash) );
These tables are independently partitioned, so the index may not reside on the same node as the data does. This causes UNIQUE indexes to be slightly slower than the PRIMARY KEY, but they will still be extremely fast for most lookups.
Ordered Indexes
An ordered index in MySQL Cluster is implemented through a structure called a T-tree. A T-tree works the same as a B-tree, but it is designed for use with in-memory databases. The two primary differences are that a T-tree is generally much deeper than a B-tree (because it doesn't have to worry about disk seek times) and that a T-tree doesn't contain the actual data itself, but just a pointer to the data, which makes a T-tree take up less memory than asimilar B-tree.
All the queries that could be resolved with a B-tree index can also be resolved by using a T-tree index. This includes all the normal range scans, such as less than, between, and greater than. So if you are switching an application from MyISAM or InnoDB to MySQL Cluster, it should continue to work the same for query plans.
A T-tree itself is partitioned across the data nodes. Each data node contains a part of the T-tree that corresponds to the data that it has locally.
Index Statistics
The MySQL optimizer uses various statistics in order to decide which is the most optimal method for resolving a query. MySQL particularly makes use of two different values: the number of rows in the table and an estimate of how many rows will match a particular condition.
The first value is approximately the number of rows in a table. This is important for deciding whether MySQL should do a full table scan or use indexes. MySQL Cluster can provide this value to the MySQL optimizer, but it doesn't necessarily have to do so. The problem with retrieving this data from the data nodes for each query being executed is that it adds a bit of extra overhead and slightly degrades the response time of your queries. There is therefore a way to turn off the fetching of the exact number of rows. The MySQL server setting ndb_use_exact_count controls whether to do the retrieval. If the MySQL server doesn't retrieve an exact value, it will always say there are 100 rows in the table.
Due to the distributed and in-memory natures of the MySQL storage engine, it is almost always best to use an index, if possible. In most cases, you want to turn off the ndb_use_exact_count variable to gain performance because it favors index reads.
As a side note, the ndb_use_exact_count variable has an impact on the following simple statement:
SELECT count(*) FROM tbl;
If you are familiar with the other storage engines, you know that MyISAM has an optimization to speed this up, whereas InnoDB doesn't. The ndb_use_exact_count setting affects NDB in a similar manner: If it is on, it works like MyISAM, but if it is off, it works like InnoDB. The nice thing is that you can set this on a per-connection basis, which allows you to do something such as the following:
mysql> set ndb_use_exact_count = 1; Query OK, 0 rows affected (0.00 sec) root@world~> SELECT count(*) FROM Country; +----------+ | count(*) | +----------+ | 239 | +----------+ 1 row in set (0.01 sec) mysql> set ndb_use_exact_count = 0; Query OK, 0 rows affected (0.01 sec)
The other value that MySQL uses is an estimation of how many rows will match against a particular index lookup. MySQL Cluster currently isn't able to calculate this value. It always estimates that 10 rows will match. If you combine this with the preceding rule of 100 rows, you see that MySQL will always use indexes, which is good with MySQL Cluster. Where the problem comes with this is where MySQL has choices between multiple indexes. The following query shows the problem:
SELECT * FROM tbl WHERE idx1 = 5 AND idx2 = 10;
MySQL Cluster tells MySQL that both idx1 and idx2 will retrieve the same number of rows (10). However, normally this might not be true; one of the indexes is typically more selective than the other. In this case, because MySQL Cluster doesn't have the information available, you have to tell MySQL which one is better. To tell it which one, you should use a USING INDEX clause in the SELECT statement. So if the second index is more selective, you write a statement similar to the following:
SELECT * FROM tbl USING INDEX (idx2) WHERE idx1 = 5 AND idx2 = 10;
MySQL 5.1 will have the ability for NDB to estimate the rows that will match when using an index. This will prevent you from having to tell MySQL which index is more selective in most cases.