I'm experiencing an issue with a MariaDB cluster when dumping or optimizing databases.
When I run "mysqlcheck" or "mysqldump" on my MariaDB 10.1 database server (which runs in a Galera cluster with two other servers), the task stops after a short time and shows no progress. The entire cluster seems to stall.
For example, mysqldump stops after creating a 14.0 MB (14,760,912 bytes) dump file and doesn't proceed; even after several minutes nothing happens.
The mysqlcheck run to repair and optimize tables also hangs after checking a few tables.
In both situations the cluster starts to have issues, and the only way to get it working normally again is to take the server that executed the job offline, along with one other server. I then bring them back online one by one, and the cluster works normally again.
I'm not sure what is causing these problems. I haven't found any errors in the syslog, although during the shutdown of the servers I noticed the following:
Jan 10 20:43:46 france mysqld[1015]: 2016-01-10 20:43:46 140096330258176 [Warning] WSREP: TO isolation failed for: 3, schema: mysql, sql: OPTIMIZE TABLE proc. Check wsrep connection state and retry the query.
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691511322368 [Warning] WSREP: TO isolation failed for: 3, schema: smf, sql: OPTIMIZE TABLE smf_categories. Check wsrep connection state and retry the query.
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691511322368 [Warning] Aborted connection 24 to db: 'smf' user: 'maintenance' host: 'localhost' (Unknown error)
Jan 10 21:58:47 france mysqld[1034]: 2016-01-10 21:58:47 139691509827328 [Warning] WSREP: TO isolation failed for: 3, schema: (null), sql: SELECT 1 FROM mysql.user LIMIT 1. Check wsrep connection state and retry the query.
Any idea what could be the cause of this?
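To see at a glance which statements failed total order (TO) isolation, warnings like the ones quoted above can be pulled out of the syslog with a small script. This is only a rough sketch: the regular expression is written against the three warning lines shown in this post, the function name is made up for illustration, and other MariaDB versions may word the message differently.

```python
import re

# Matches the "TO isolation failed" warnings in the format quoted above.
TOI_FAILED = re.compile(
    r"WSREP: TO isolation failed for: \d+, "
    r"schema: (?P<schema>\S+), sql: (?P<sql>.*?)\. Check wsrep connection state"
)


def failed_toi_statements(log_text):
    """Return (schema, sql) pairs for every TOI failure found in log_text."""
    return [(m.group("schema"), m.group("sql")) for m in TOI_FAILED.finditer(log_text)]


# The three WSREP warnings from the post, with the syslog prefix shortened.
sample = (
    "[Warning] WSREP: TO isolation failed for: 3, schema: mysql, "
    "sql: OPTIMIZE TABLE proc. Check wsrep connection state and retry the query.\n"
    "[Warning] WSREP: TO isolation failed for: 3, schema: smf, "
    "sql: OPTIMIZE TABLE smf_categories. Check wsrep connection state and retry the query.\n"
    "[Warning] WSREP: TO isolation failed for: 3, schema: (null), "
    "sql: SELECT 1 FROM mysql.user LIMIT 1. Check wsrep connection state and retry the query.\n"
)

for schema, sql in failed_toi_statements(sample):
    print(f"{schema}: {sql}")
```

Grouping the result by schema or statement type makes it easier to tell whether only the OPTIMIZE statements are affected or ordinary queries as well.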
MariaDB galera cluster fail during dump or optimize
- Atomicorp Staff - Site Admin
Re: MariaDB galera cluster fail during dump or optimize
I haven't seen that one before. A few things to look into: is that node in a failed state for any other reason? For example, are any tables crashed, or something along those lines?
Another possibility is some big blocking job (like an optimize) running from another application talking to the DB, or even a filesystem error on that node blocking a read.
The last thing to try would be a different SST method.
Re: MariaDB galera cluster fail during dump or optimize
Thank you for the quick reply.
The node works fine in production simulations up until I run the dump or optimize. So I don't think there are any crashed tables; otherwise they would show up in the log files and the server wouldn't handle normal requests correctly.
There aren't any applications running an optimize job.
I'm thinking along the lines of a performance issue or a deadlock bug.
My Galera config looks like this (I've removed the IPs and names for security reasons):
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="cluster_name"
wsrep_cluster_address="gcomm://IP's"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
#
# Allow server to accept connections on all interfaces.
#
bind-address=0.0.0.0
#
# Optional setting
wsrep_slave_threads=16
wsrep_restart_slave=1
wsrep_replicate_myisam=1
wsrep_gtid_domain_id=1
wsrep_gtid_mode=1
innodb_flush_log_at_trx_commit=1
innodb_flush_neighbors=1
wsrep_provider_options = gcache.size = 32G;evs.keepalive_period = PT3S;evs.suspect_timeout = PT30S;evs.inactive_timeout = PT1M;evs.install_timeout = PT1M;evs.send_window = 512;evs.user_send_window = 512
# Galera Synchronization Configuration
wsrep_sst_method=rsync
#wsrep_sst_auth=user:pass
# Galera Node Configuration
wsrep_node_address="IP"
wsrep_node_name="A name"
As soon as I change wsrep_on=ON to wsrep_on=OFF and restart the server, the optimize and dump jobs run fine.
I'll try with the mysqldump as SST now and I'll edit my post once I have the results.
Re: MariaDB galera cluster fail during dump or optimize
gijs007 wrote: wsrep_replicate_myisam=1
Are you sure?
gijs007 wrote: As soon as I change wsrep_on=ON to wsrep_on=OFF and restart the server, the optimize and dump jobs run fine.
That's most likely because OPTIMIZE statements will also get replicated over the cluster, thus causing potential significant performance degradation, in your case resulting in a stalling cluster. Additionally, mysqldump will lock tables, which can cause all kinds of issues when the node is actively used by your application. Check your process list for table names when this happens, perhaps you see a pattern there.
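One way to look for such a pattern is to count which tables show up in the statements of the process list while the cluster is stalled. The following is a minimal sketch only: it assumes the rows of SHOW FULL PROCESSLIST have already been fetched into dictionaries with an "Info" key, the sample rows are made up for illustration, and the regular expression is a crude heuristic rather than a SQL parser.

```python
import re
from collections import Counter

# Crude heuristic: take the identifier that follows common table keywords.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN|INTO|UPDATE|TABLE)\s+`?([\w.]+)`?", re.IGNORECASE
)


def tables_in_processlist(rows):
    """Count how often each table name appears in the 'Info' column of rows."""
    counts = Counter()
    for row in rows:
        info = row.get("Info") or ""  # Info is NULL for idle connections
        for table in TABLE_REF.findall(info):
            counts[table.lower()] += 1
    return counts


# Hypothetical rows, shaped like SHOW FULL PROCESSLIST output.
rows = [
    {"Id": 24, "Info": "OPTIMIZE TABLE smf_categories"},
    {"Id": 25, "Info": "SELECT COUNT(*) FROM smf_categories"},
    {"Id": 26, "Info": "SELECT 1 FROM mysql.user LIMIT 1"},
    {"Id": 27, "Info": None},
]

for table, hits in tables_in_processlist(rows).most_common():
    print(f"{table}: {hits}")
```

If one table dominates the count during every stall, that table (and whatever job touches it) is a reasonable first suspect.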
gijs007 wrote: I'll try with the mysqldump as SST now and I'll edit my post once I have the results.
Why not use xtrabackup for state transfers? Rsync and mysqldump are slow and blocking.
Lemonbit Internet Dedicated Server Management
Re: MariaDB galera cluster fail during dump or optimize
gijs007 wrote: wsrep_replicate_myisam=1
prupert wrote: Are you sure?
Yup, most of my database is XtraDB/InnoDB. I have a few MyISAM tables, which aren't edited very often.
gijs007 wrote: As soon as I change wsrep_on=ON to wsrep_on=OFF and restart the server, the optimize and dump jobs run fine.
prupert wrote: That's most likely because OPTIMIZE statements will also get replicated over the cluster, thus causing potential significant performance degradation, in your case resulting in a stalling cluster. Additionally, mysqldump will lock tables, which can cause all kinds of issues when the node is actively used by your application. Check your process list for table names when this happens, perhaps you see a pattern there.
I was pointed in the right direction by Tom on the MariaDB KB: https://mariadb.com/kb/en/mariadb/galer ... mment_1911
The issue appears to be caused by flow control. I've resolved it by tuning the flow control settings.
I did this by adding the following wsrep_provider_options: gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0
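Since wsrep_provider_options is a single semicolon-separated string, the flow control settings have to be merged into the existing value rather than put on a second line. The sketch below shows one way to build the combined value from the options posted earlier in this thread. The helper names are invented for illustration, and the exact line that ended up in the poster's config file is not shown in the thread, so treat the output as an assumption.

```python
def parse_provider_options(value):
    """Split a 'key = value; key = value' string into an ordered dict."""
    options = {}
    for part in value.split(";"):
        if not part.strip():
            continue  # skip empty pieces, e.g. after a trailing semicolon
        key, _, val = part.partition("=")
        options[key.strip()] = val.strip()
    return options


def format_provider_options(options):
    """Join the options back into a single semicolon-separated string."""
    return "; ".join(f"{key}={val}" for key, val in options.items())


# Existing value from the config posted earlier in the thread.
existing = (
    "gcache.size = 32G;evs.keepalive_period = PT3S;evs.suspect_timeout = PT30S;"
    "evs.inactive_timeout = PT1M;evs.install_timeout = PT1M;"
    "evs.send_window = 512;evs.user_send_window = 512"
)
# Flow control settings from this post.
flow_control = "gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"

merged = parse_provider_options(existing)
merged.update(parse_provider_options(flow_control))  # later keys win on conflict

print(f'wsrep_provider_options="{format_provider_options(merged)}"')
```

Keeping everything in one quoted value avoids a later wsrep_provider_options line silently replacing an earlier one.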
gijs007 wrote: I'll try with the mysqldump as SST now and I'll edit my post once I have the results.
prupert wrote: Why not use xtrabackup for state transfers? Rsync and mysqldump are slow and blocking.
Xtrabackup doesn't support "wsrep_gtid_mode=1" according to the MariaDB blog regarding Galera SST modes.
Re: MariaDB galera cluster fail during dump or optimize
I've noticed that the cluster still goes down after completing the backup with the new flow control settings.