Friday, August 03, 2012

Re-balancing MySQL Cluster

Recently, I had to investigate whether MySQL Cluster (aka mysql-ndb: network database) uses fixed modulo hashing for partitioning.

Using fixed modulo hashing for partitioning would be pretty bad when a new data node is added, as every row would need to be rehashed and most of them re-distributed among the available nodes (assuming the modulus is the number of nodes). This would be both computationally costly and time-consuming. A common-sense alternative would be consistent hashing.
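
To make that concrete, here is a rough Python sketch (not MySQL Cluster's actual hash function; the md5 choice and key count are just for illustration) of how many keys would relocate if a fixed hash-mod-N scheme grows from 4 to 5 nodes:

# A rough illustration, not MySQL Cluster's actual hash function:
# with a fixed hash(key) % N scheme, growing N from 4 to 5 nodes
# relocates the majority of keys.
import hashlib

def modulo_partition(key, num_nodes):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = range(10000)
moved = sum(1 for k in keys
            if modulo_partition(k, 4) != modulo_partition(k, 5))
print("keys relocated: %d of %d" % (moved, len(keys)))

Roughly 80% of the keys end up on a different node, whereas a consistent-hashing (or lookup-table) scheme only needs to move about 1/5 of them.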

When I asked around on #mysql-ndb (freenode), MongtoDB pointed me to two great posts by Frazer (who works at MySQL).

1) MySQL Cluster Online Scaling
This is the blog post that mentions the HashMap distribution scheme and how data migration works. In this scheme, a lookup table is added as another layer of indirection on top of standard md5 hashing with a constant modulus. When k new nodes are added to n existing ones, only the lookup table needs to be modified, and only a minimal fraction of rows (k/(n+k)) is moved from each original node to the new nodes. The missing piece of information in this post is what the lookup table actually looks like (a rough sketch of my reading of the scheme follows after these two links).

2) Data Distribution in MySQL Cluster
This is the preceding blog post, which explains sharding in mysql-cluster terminology.
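
Here is how I read the HashMap scheme from post 1, as a Python sketch (my own reconstruction, not NDB source code; the 240-bucket lookup table size is an assumption made purely for illustration):

# Sketch of a HashMap-style distribution: md5 mod a fixed bucket count
# picks a bucket, and a lookup table maps buckets to partitions.
# Growing the cluster only rewrites a minimal set of lookup entries.
import hashlib

NUM_BUCKETS = 240  # assumed lookup table size, for illustration only

def bucket_of(key):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_BUCKETS

def extend_map(old_map, old_parts, new_parts):
    # Donate just enough buckets from each existing partition to the new
    # partitions; buckets that keep their entry keep their rows in place.
    new_map = list(old_map)
    target = NUM_BUCKETS // new_parts
    surplus = []
    for p in range(old_parts):
        owned = [b for b in range(NUM_BUCKETS) if old_map[b] == p]
        surplus.extend(owned[target:])   # keep `target` buckets, donate the rest
    for i, b in enumerate(surplus):
        new_map[b] = old_parts + i % (new_parts - old_parts)
    return new_map

old_map = [b % 4 for b in range(NUM_BUCKETS)]  # 4 partitions, buckets round-robin
new_map = extend_map(old_map, 4, 6)            # grow to 6 partitions

keys = range(10000)
moved = sum(1 for k in keys if old_map[bucket_of(k)] != new_map[bucket_of(k)])
print("rows relocated: %d of %d" % (moved, len(keys)))  # roughly 2/6 of the rows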

Sadly, the official documentation does not describe the partitioning scheme in great detail. However, it does provide a detailed example of the procedure for adding new data nodes.

Not entirely convinced that the HashMap distribution scheme works as described, I decided to verify it myself.

I started with the following setup, using 2 replicas for each node group.


(Note: I reserved node 50 for running ndb tools).

and the following schema in the database `awesome`.


I had 100 records with unique user_id from 0 to 99.
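
Neither the schema nor the data load is reproduced here, so the following is a hypothetical stand-in for the kind of table and seed rows I used (the prefs column and the connection details are invented; only the database name, table name and user_id key come from the actual setup):

# Hypothetical stand-in for the test schema and data; connection
# details and the prefs column are placeholders.
import mysql.connector  # MySQL Connector/Python

conn = mysql.connector.connect(host="192.168.0.1", user="root", database="awesome")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_prefs (
        user_id INT NOT NULL PRIMARY KEY,
        prefs   VARCHAR(255)
    ) ENGINE=NDBCLUSTER
""")
for user_id in range(100):  # 100 rows with user_id 0 to 99
    cur.execute("INSERT INTO user_prefs (user_id, prefs) VALUES (%s, %s)",
                (user_id, "pref-%d" % user_id))
conn.commit()
cur.close()
conn.close()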

Running
ndb_select_all -c 192.168.0.1 user_prefs -d awesome --rowid -D "|"

gives the fragment/shard location of each row in the table. E.g.:

Running
ndb_desc -c 192.168.0.1 user_prefs -d awesome -p -n

gives the node id (primary and secondary) for each partition.

For instance, partition 0 exists on node 11 as the primary and node 12 as the secondary. Nodes 11 and 12 form a node group.

A rolling restart of all services (i.e. mgm, ndb and mysqld) is needed to bring the two new nodes online. This could have been avoided if the setup had been pre-configured with the future nodes, leaving them offline until needed. After the restart, I ended up with the following cluster configuration (the new nodes 31 and 32 form node group 2), bringing the number of data nodes from 4 to 6.


I executed the partition re-organization, which took 8.69 sec.
Re-running ndb_desc gives the following:
which shows that data re-distribution occurred.

In order to verify that rows were not unnecessarily re-distributed, I ran the ndb_select_all program again and wrote a python script to compare the location of each row before and after the partition reorganization. The results of this comparison can be found here; they show that rows were moved only to the newly added fragments and never between existing ones.
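
The script itself is not included here, but the comparison looks roughly like the sketch below, assuming the ndb_select_all output was captured to before.txt and after.txt, with user_id as the first pipe-delimited field and the fragment id as the last one (the exact column positions are an assumption):

# Sketch of the before/after comparison (a reconstruction, not the
# original script). Assumes two captures of
#   ndb_select_all -c 192.168.0.1 user_prefs -d awesome --rowid -D "|"
# saved to before.txt and after.txt.

def load_fragments(path):
    rows = {}
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.split("|")]
            if len(fields) < 2 or not fields[0].isdigit():
                continue  # skip header and summary lines
            user_id, fragment = fields[0], fields[-1]
            rows[user_id] = fragment
    return rows

before = load_fragments("before.txt")
after = load_fragments("after.txt")

moved = {uid: (before[uid], after[uid])
         for uid in before if uid in after and before[uid] != after[uid]}
print("%d of %d rows moved" % (len(moved), len(before)))
for uid, (src, dst) in sorted(moved.items(), key=lambda kv: int(kv[0])):
    print("user_id %s: fragment %s -> %s" % (uid, src, dst))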

Statistically speaking, I started with the following number of rows in fragments (or shards) 0 to 3:
[25, 26, 25, 24]

The reorganization moved the following number of rows from each of these fragments to a new fragment (either fragment 4 or 5):
[8, 8, 10, 10]

These counts are indeed in the ballpark of 2/6 (i.e. k/(n+k), with k = 2 new fragments and n = 4 original ones) of the original number of rows in each fragment.
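
A quick sanity check of that figure:

# Expected vs. observed rows moved out of each original fragment.
before = [25, 26, 25, 24]
moved = [8, 8, 10, 10]
expected = [round(n * 2 / 6.0, 1) for n in before]  # k/(n+k) = 2/6 of each
print(expected)                                     # [8.3, 8.7, 8.3, 8.0]
print(sum(moved), "of", sum(before), "rows moved")  # 36 of 100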

Unlike MongoDB, where online migration can happen at any time and the locations of the records are non-deterministic, partition reorganization is manually triggered in mysql-ndb via the "alter online table [tbl] reorganize partition" DDL. An "optimize table" statement then needs to be issued to reclaim the free space left behind by the reorganization.
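
For completeness, the sequence as applied to the test table looks roughly like this (a sketch; the connection details are placeholders):

# Trigger the re-distribution and then reclaim the freed space.
import mysql.connector

conn = mysql.connector.connect(host="192.168.0.1", user="root", database="awesome")
cur = conn.cursor()
cur.execute("ALTER ONLINE TABLE user_prefs REORGANIZE PARTITION")  # move rows to the new fragments
cur.execute("OPTIMIZE TABLE user_prefs")  # reclaim space on the old fragments
cur.fetchall()                            # OPTIMIZE TABLE returns a status result set
cur.close()
conn.close()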

Lastly, it should be noted that mysql-ndb currently supports a maximum of 48 data nodes, although this limit might be raised in the future.
