• Resolved Cihan Ispanoglu

    (@cihan-ispanoglu)


    We are experiencing severe problems on WordPress Multisite installation backed by Multi-Master SeveralNines Percona XtraDB Cluster running on the specifications described below.

    innodb_version: 5.6.15-rel63.0
    protocol_version: 10
    version: 5.6.15-56
    Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061
    RedHat 6.5 x86_64 Linux

    We have 1 webserver (apache), 1 virtual IP backed by 2 HAProxy and Keepalived nodes and 3 DB nodes (master-master replication similar to this).

    We wanted to use 3 DB nodes simultaneously for high traffic website hosting (427 domains in wp_blogs table, each node sends approximately 4GB per hour).

    We ran into severe problems with something called transients. In a master-master replication we suddenly received errors where replication complained about and error called “HA_ERR_FOUND_DUPP_KEY”. Googled the error but could not found a viable solution. Wrote to Percona and codership with no luck in turn. Asked to severalnines and they couldnt find an answer either.

    Take a Look at the logs:

    2014-05-28 11:57:43 6523 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 5156662
    2014-05-28 11:57:43 6523 [Warning] WSREP: Failed to apply app buffer: seqno: 5156662, status: 1
    2014-05-28 11:57:43 6523 [ERROR] Slave SQL: Could not execute Write_rows event on table venus_db.wp_31_options; Duplicate entry '_transient_timeout_56a977a6d
    ec898d54b8fdec77f87729b' for key 'option_name', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 250, Error_c
    ode: 1062
    2014-05-28 11:57:43 6523 [Warning] WSREP: RBR event 3 Write_rows apply warning: 121, 5156662
    2014-05-28 11:57:43 6523 [ERROR] WSREP: Failed to apply trx: source: 9819adf0-e260-11e3-924e-97c07e86bdc1 version: 3 local: 0 state: APPLYING flags: 1 conn_i
    d: 1060609 trx_id: 159160981 seqnos (l: 802435, g: 5156662, s: 5156661, d: 5156660, ts: 2492618038795915)
    2014-05-28 11:57:43 6523 [ERROR] WSREP: Failed to apply trx 5156662 4 times
    2014-05-28 11:57:43 6523 [ERROR] WSREP: Node consistency compromized, aborting...
    2014-05-28 11:57:43 6523 [Note] WSREP: Closing send monitor...
    2014-05-28 11:57:43 6523 [Note] WSREP: Closed send monitor.
    2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: terminating thread
    2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: joining thread
    2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: closing backend
    2014-05-28 11:57:43 6523 [Note] WSREP: view(view_id(NON_PRIM,6558397e-e260-11e3-bb1b-2e2582d08b55,19) memb {
    2014-05-28 11:57:43 6523 [Note] WSREP: view((empty))
    2014-05-28 11:57:43 6523 [Note] WSREP: gcomm: closed

    The guilty tables have a CREATE TABLE like:

    CREATE TABLE IF NOT EXISTS 'wp_options' (
      'option_id' bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      'option_name' varchar(64) NOT NULL DEFAULT '',
      'option_value' longtext NOT NULL,
      'autoload' varchar(20) NOT NULL DEFAULT 'yes',
      PRIMARY KEY ('option_id'),
      UNIQUE KEY 'option_name' ('option_name')
    ) ENGINE=InnoDB

    What happens is two requests write the same transient option_name value to different nodes at the same time (consider high traffic). Since there are 427 domains we have 427 wp_*_options tables and it is inevitable to have HA_ERR_FOUND_DUPP_KEY error. There is no running away from this.

    So replication stopped and since node consistency was compromized all nodes shut down one by one and game over. Had a database crash.

    The culprit seems to be transients

    2014-05-28 11:57:43 6523 [ERROR] Slave SQL: Could not execute Write_rows event on table venus_db.wp_31_options; Duplicate entry '_transient_timeout_56a977a6d
    ec898d54b8fdec77f87729b' for key 'option_name', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 250, Error_c

    option_name is supposed to be UNIQUE and you are writing the same value to that column as transient_timeout_ which is followed by a string that shows how long transient is going to live. In addition to this there is no garbage collector mechanism. Just plugins. have a look at this for more insights about transient

    This is just plain idiocity !!!

    Searched some more and saw nextgen gallery using a lot of transients. Even they put slider code in transient. Why ? Dunno. For speed I guess but since there is memcache and other software there is no need for this type of stuff to put your application in a very strict setup.

    Sailed for an adventure and wrote a script for deleting expired transients every minute with an bash script like

    #!/bin/bash
    dbuser='root'
    dbpass='************'
    curdir='pwd'
    
    echo "SELECT CONCAT('SELECT option_id, option_name, option_value, \"\'',TABLE_SCHEMA,'\'.\'',TABLE_NAME,'\'\" AS \"tableschema\" FROM \'',TABLE_SCHEMA,'\'.\'',TABLE_NAME,'\ WHERE option_name LIKE \"%_transient_timeout%\" AND option_value < now();' ) AS "COMMAND" FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA NOT LIKE 'INFORMATION_SCHEMA' AND TABLE_NAME LIKE 'wp%options'" > /tmp/expired1.sql
    
    mysql -Ns -u$dbuser -p$dbpass < /tmp/expired1.sql > /tmp/expired2.sql
    mysql -Ns -u$dbuser -p$dbpass < /tmp/expired2.sql > /tmp/expired3.sql
    
    rm -f /tmp/delete_expired.sql
    
    while read s; do
     parsed_line=$(echo $s | sed "s/ /|/g")
     option_id=$(echo $parsed_line | awk -F '|' '{print $1}')
     option_name=$(echo $parsed_line | awk -F '|' '{print $2}')
     option_value=$(echo $parsed_line | awk -F '|' '{print $3}')
     schema_table=$(echo $parsed_line | awk -F '|' '{print $4}')
     delete_sql=$(echo DELETE FROM $schema_table WHERE option_id=$option_id AND option_name=\"$option_name\" AND option_value=\"$option_value\"";")
     echo $delete_sql >> /tmp/delete_expired.sql
    done < /tmp/expired3.sql
    
    mysql -Ns -u$dbuser -p$dbpass < /tmp/delete_expired.sql
    
    # cleanup
    rm -f /tmp/expired1.sql /tmp/expired2.sql /tmp/expired3.sql /tmp/delete_expired.sql

    And put that in cron like
    */1 * * * * /root/scripts/transient/delete_expired.sh >/dev/null 2>&1

    That helped the site to run faster but still there is no good solution for multi-master environments.

    Found a server variable supposed to suppress HA_ERR_FOUND_DUPP_KEY error called

    –slave_exec_mode=IDEMPOTENT

    but that caused another error called BF-BF lock conflict.

    So we ran out of luck and WordPress is really getting on my nerves.
    Is there a way to turn this transient thing off forever ?

Viewing 2 replies - 1 through 2 (of 2 total)
Viewing 2 replies - 1 through 2 (of 2 total)
  • The topic ‘WordPress Multisite – MySQL Percona ExtraDB Big Problem’ is closed to new replies.