WordPress Multisite – MySQL Percona ExtraDB Big Problem
-
We are experiencing severe problems on WordPress Multisite installation backed by Multi-Master SeveralNines Percona XtraDB Cluster running on the specifications described below.
innodb_version: 5.6.15-rel63.0
protocol_version: 10
version: 5.6.15-56
Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061
RedHat 6.5 x86_64 LinuxWe have 1 webserver (apache), 1 virtual IP backed by 2 HAProxy and Keepalived nodes and 3 DB nodes (master-master replication similar to this).
We wanted to use 3 DB nodes simultaneously for high traffic website hosting (427 domains in wp_blogs table, each node sends approximately 4GB per hour).
We ran into severe problems with something called transients. In a master-master replication we suddenly received errors where replication complained about and error called “HA_ERR_FOUND_DUPP_KEY”. Googled the error but could not found a viable solution. Wrote to Percona and codership with no luck in turn. Asked to severalnines and they couldnt find an answer either.
Take a Look at the logs:
2014-05-28 11:57:43 6523 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 5156662 2014-05-28 11:57:43 6523 [Warning] WSREP: Failed to apply app buffer: seqno: 5156662, status: 1 2014-05-28 11:57:43 6523 [ERROR] Slave SQL: Could not execute Write_rows event on table venus_db.wp_31_options; Duplicate entry '_transient_timeout_56a977a6d ec898d54b8fdec77f87729b' for key 'option_name', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 250, Error_c ode: 1062 2014-05-28 11:57:43 6523 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 5156662 2014-05-28 11:57:43 6523 [ERROR] WSREP: Failed to apply trx: source: 9819adf0-e260-11e3-924e-97c07e86bdc1 version: 3 local: 0 state: APPLYING flags: 1 conn_i d: 1060609 trx_id: 159160981 seqnos (l: 802435, g: 5156662, s: 5156661, d: 5156660, ts: 2492618038795915) 2014-05-28 11:57:43 6523 [ERROR] WSREP: Failed to apply trx 5156662 4 times 2014-05-28 11:57:43 6523 [ERROR] WSREP: Node consistency compromized, aborting... 2014-05-28 11:57:43 6523 [Note] WSREP: Closing send monitor... 2014-05-28 11:57:43 6523 [Note] WSREP: Closed send monitor. 2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: terminating thread 2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: joining thread 2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: closing backend 2014-05-28 11:57:43 6523 [Note] WSREP: view(view_id(NON_PRIM,6558397e-e260-11e3-bb1b-2e2582d08b55,19) memb { 2014-05-28 11:57:43 6523 [Note] WSREP: view((empty)) 2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: closed
The guilty tables have a CREATE TABLE like:
CREATE TABLE IF NOT EXISTS 'wp_options' ( 'option_id' bigint(20) unsigned NOT NULL AUTO_INCREMENT, 'option_name' varchar(64) NOT NULL DEFAULT '', 'option_value' longtext NOT NULL, 'autoload' varchar(20) NOT NULL DEFAULT 'yes', PRIMARY KEY ('option_id'), UNIQUE KEY 'option_name' ('option_name') ) ENGINE=InnoDB
What happens is two requests write the same transient option_name value to different nodes at the same time (consider high traffic). Since there are 427 domains we have 427 wp_*_options tables and it is inevitable to have HA_ERR_FOUND_DUPP_KEY error. There is no running away from this.
So replication stopped and since node consistency was compromized all nodes shut down one by one and game over. Had a database crash.
The culprit seems to be transients
2014-05-28 11:57:43 6523 [ERROR] Slave SQL: Could not execute Write_rows event on table venus_db.wp_31_options; Duplicate entry '_transient_timeout_56a977a6d ec898d54b8fdec77f87729b' for key 'option_name', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 250, Error_c
option_name is supposed to be UNIQUE and you are writing the same value to that column as transient_timeout_ which is followed by a string that shows how long transient is going to live. In addition to this there is no garbage collector mechanism. Just plugins. have a look at this for more insights about transient
This is just plain idiocity !!!
Searched some more and saw nextgen gallery using a lot of transients. Even they put slider code in transient. Why ? Dunno. For speed I guess but since there is memcache and other software there is no need for this type of stuff to put your application in a very strict setup.
Sailed for an adventure and wrote a script for deleting expired transients every minute with an bash script like
#!/bin/bash dbuser='root' dbpass='************' curdir='pwd' echo "SELECT CONCAT('SELECT option_id, option_name, option_value, \"\'',TABLE_SCHEMA,'\'.\'',TABLE_NAME,'\'\" AS \"tableschema\" FROM \'',TABLE_SCHEMA,'\'.\'',TABLE_NAME,'\ WHERE option_name LIKE \"%_transient_timeout%\" AND option_value < now();' ) AS "COMMAND" FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA NOT LIKE 'INFORMATION_SCHEMA' AND TABLE_NAME LIKE 'wp%options'" > /tmp/expired1.sql mysql -Ns -u$dbuser -p$dbpass < /tmp/expired1.sql > /tmp/expired2.sql mysql -Ns -u$dbuser -p$dbpass < /tmp/expired2.sql > /tmp/expired3.sql rm -f /tmp/delete_expired.sql while read s; do parsed_line=$(echo $s | sed "s/ /|/g") option_id=$(echo $parsed_line | awk -F '|' '{print $1}') option_name=$(echo $parsed_line | awk -F '|' '{print $2}') option_value=$(echo $parsed_line | awk -F '|' '{print $3}') schema_table=$(echo $parsed_line | awk -F '|' '{print $4}') delete_sql=$(echo DELETE FROM $schema_table WHERE option_id=$option_id AND option_name=\"$option_name\" AND option_value=\"$option_value\"";") echo $delete_sql >> /tmp/delete_expired.sql done < /tmp/expired3.sql mysql -Ns -u$dbuser -p$dbpass < /tmp/delete_expired.sql # cleanup rm -f /tmp/expired1.sql /tmp/expired2.sql /tmp/expired3.sql /tmp/delete_expired.sql
And put that in cron like
*/1 * * * * /root/scripts/transient/delete_expired.sh >/dev/null 2>&1
That helped the site to run faster but still there is no good solution for multi-master environments.
Found a server variable supposed to suppress HA_ERR_FOUND_DUPP_KEY error called
–slave_exec_mode=IDEMPOTENT
but that caused another error called BF-BF lock conflict.
So we ran out of luck and WordPress is really getting on my nerves.
Is there a way to turn this transient thing off forever ?
- The topic ‘WordPress Multisite – MySQL Percona ExtraDB Big Problem’ is closed to new replies.