• Resolved Southwest

    (@southwest)


    Summary:

    I have a WordPress database that is filled with special characters encoded in different ways, and this is interfering with my attempts to convert it all to UTF-8.

    Technical Specs:

    WordPress version: 4.1.1

    Hosted at NearlyFreeSpeech

    Backstory:

    I have a WordPress blog, currently running WordPress 4.1.1. I’ve had the same blog for close to ten years. A while ago–unfortunately, I can’t remember when–I noticed that old posts were filled with weird character strings wherever I had a special character. For instance, I saw na?ˉve instead of na?ve. I corrected these glitches when I saw them, opening the post in the editor and replacing ?ˉ with ?.

    When I saved the post, the new special character was displaying instead of the original garbled mess. I was happy.

    I had some old posts that contained Japanese text, and the Japanese text was goofy here, too. Instead of 日本語, it would display ?—¥???èa?. Only, when I tried to correct that in the post editor, the Japanese characters I entered would be replaced entirely with question marks upon hitting “save” or “publish”.

    I didn’t replace all of those garbled strings, though; I still have some posts that show na?ˉve instead of na?ve (for example). This means that my blog currently contains:

    • mis-encoded special characters (na?ˉve)
    • “proper” special characters (na?ve)
    • mis-encoded Japanese (?—¥???èa?)
    • question marks where I attempted to correct the mis-encoded Japanese (???)

    What I want:

    I want to make changes to my current database so that:

    1. Japanese (and other non-Latin characters) can be saved and displayed
    2. There are no strings of mis-encoded special characters
    3. All of the characters I “corrected” in the past still display

    Steps taken:

    Since I’m having problems on a WordPress installation that I’ve used for nearly a decade, I identified the problem as one with character encoding. I recognize that old versions of WordPress, until v2.2, used the latin1 character set and the latin1_swedish_ci collation. When I look at my database in phpMyAdmin, I see that I have numerous tables that are still collated as latin1_swedish_ci.

    I’ve tried, in numerous ways, to convert my database to UTF-8. Here’s what I’ve done:

    1. Use the UTF-8 Database Converter plugin
    2. Follow this guide to export the database and replace all instances of “latin1” with “UTF8”
    3. As described in the Codex, use a SQL script to convert tables and columns to blob, and then to UTF-8 text./li>
    4. As described in the Codex, use a SQL script to recast tables and columns to latin1 text, then to blob, and then to UTF-8 text

    Each time I’ve done this, I’ve had the same result.

    Actual results:

    Upon creating the new database in mySQL, the following things happen:

    • “Mis-encoded” special characters (na?ˉve) and Japanese text (?—¥???èa?) convert into legible strings (na?ve and 日本語)
    • “Corrected” special characters cause data loss. An entry in the database that originally read, “Today, it is 37° and sunny” will lose all of its content from the first special character onward, becoming “Today, it is 37”. As far as I can tell, this affects all tables in the database, including wp_options.
    • Question marks where I attempted to correct the mis-encoded Japanese remain question marks.
    • When I create a new post containing special characters and Japanese, those characters are saved correctly to the database, and display without issue.

    Help?

    I know nothing about databases, but I really really want to believe there’s a way for me to correct my database so that Japanese saves and displays correctly… without also screwing up hundreds of entries that I enthusiastically “corrected” earlier, or configuration options.

    Any advice at all that can point me in the right direction would be greatly appreciated.

    Thank you.

Viewing 1 replies (of 1 total)
  • Thread Starter Southwest

    (@southwest)

    I’ve fixed this, and want to share what I did.

    The issue was one of mixed encoding. Some fields contained data encoded properly as UTF-8; others contained data encoded as something else, probably ISO-8859-1. When imported to a new UTF-8 database, this was causing truncation.

    My steps to solve this:

    1. Copy the original database, wordpress, to a new database, wordpress2. Make sure the collation of wordpress2 is set to UTF-8.
    2. Follow any one of the steps above to convert the tables and columns of wordpress2 to UTF-8. This will cause truncation.
    3. For each row on wordpress containing entries where the data when converted to UTF-8 was not the same as the data when converted to ASCII, update the corresponding row in wordpress2 with the wordpress data, converted to UTF-8. A sample script is below.

    The script:

    UPDATE wordpress3.wp_options wp3
    INNER JOIN wordpress.wp_options wp ON (wp.option_id = wp3.option_id
    AND convert(wp.option_value using utf8) != convert(wp.option_value using ascii))
    SET wp3.option_value = convert(wp.option_value using utf8);

    A more knowledgeable friend wrote a series of scripts for me that queried information_schema to locate all the columns containing entries where convert(value using utf8) != convert(value using ascii), and to then generate versions of the script above for them.

    Results:
    It worked! On my new database, I can save Japanese without it turning into question marks (because the character set has successfully been set to UTF-8), and all the mis-encoded fields that were causing data truncation have been fixed.

    There are some posts that still contain misencoded characters, but since I can find almost all of those strings by searching for a, ?, ?, or ?, I can just go in and manually replace them. I’m not troubled by that.

Viewing 1 replies (of 1 total)
  • The topic ‘Trouble converting database containing special characters to UTF-8’ is closed to new replies.