Database Charset and Collation Fix in MySQL

By Rex Resurreccion Dec 12, 2019

Proper Character set and Collation is important

Storing user input into the Database could be a challenge, especially if it contains different types of characters. For instance, a Chinese and Russian alphabets, special symbols like an arrow, copyright and emoji, can all have different unicode encoding. This may cause a different output during rendering than the original format after saving it.

Improper Database Charset and collation can make user input unreadable or worst your system could crash due to a character encoding and decoding error.

Luckily, popular database management system supports different types of Collation and Character set.

What is Character set and Collation?

A character set is a set of symbols and encoding that contains one or more collation. A collation is a set of rules that define how data characters are sorted and compared in a database.

Choosing a character set to use

If you are to choose what Character set to use for configuring your database, the most common is UTF8. This unicode encoding supports a wide range of characters.

MySQL UTF8, UTF8MB3 and UTF8MB4

Fix Charset and Collation in MySQL

In MySQL you should use UTF8MB4, that can store a 4-Byte character and also support newly introduced symbols and emoji.

The original UTF8MB3 which is an alias for UTF8 can only support 3-Byte character, you should not use this anymore because it has already been deprecated and expected to be removed in a future release.If you are still using UTF8MB3 then you must update your database to use UTF8MB4 instead.

Changing the Character set of Table Columns in MySQL

Firstly, we will show how to update this in column level. Let’s get the Database Charset and Collation in the table’s columns to know which one needs an update. All the information are stored in “COLUMNS” table of “information_schema” database of MySQL.

Run the select query below by changing the condition on “TABLE_SCHEMA” column to be the name of your own database (change website_db to the name of your database).

SELECT `table_name`, `column_name`, `DATA_TYPE`, `IS_NULLABLE`,  `CHARACTER_SET_NAME`, `COLLATION_NAME` FROM information_schema.COLUMNS WHERE `TABLE_SCHEMA` = 'website_db' AND (DATA_TYPE = 'varchar' OR DATA_TYPE LIKE '%text');

The SELECT query will return all the columns in your tables and its current Database Charset and Collation.

Now we can use the ALTER command to update the columns to have UTF8MB4. The comments will be the actual table in your database and name is the column in that table.

ALTER TABLE `comments` MODIFY `name` varchar(255) CHARACTER SET UTF8MB4 NOT NULL;

We can update the tables one at a time, however, it would be cumbersome if you have a large database schema. Run this command to generate the ALTER command that you will need to update the entire columns in your table (change the comments to the name of your table) .

SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', DATA_TYPE, '(', CHARACTER_MAXIMUM_LENGTH, ') CHARACTER SET UTF8MB4', (CASE WHEN IS_NULLABLE = 'NO' THEN ' NOT NULL' ELSE '' END), ';') FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = 'website_db' AND table_name = 'comments' AND (DATA_TYPE = 'varchar' OR DATA_TYPE LIKE '%text') AND (COLLATION_NAME <> 'utf8mb4_general_ci');

You will get a result similar to this…

| ALTER TABLE `comments` MODIFY `name` varchar(255) CHARACTER SET UTF8MB4 NOT NULL;|
| ALTER TABLE `comments` MODIFY `message` text(65535) CHARACTER SET UTF8MB4 NOT NULL; 

Changing the Character set of Table in MySQL

Secondly, we will update the default Character set and Collation in the table level. Get the current Character set and Collation of the database tables stored in “TABLES” and “COLLATION_CHARACTER_SET_APPLICABILITY” tables of “information_schema” database.

Run the query below by changing website_db to whatever name of your database.

SELECT table_name, C.character_set_name, C.collation_name FROM information_schema.TABLES AS T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` AS C WHERE C.collation_name = T.table_collation AND T.table_schema = 'website_db';

The SELECT query will return all the tables and its current Character set and Collation.

Update your table using the ALTER command like this…

ALTER TABLE comments CHARACTER SET UTF8MB4;

To generate the ALTER command for all your tables run this query (change the website_db and comments to be your actual database and table names).

SELECT CONCAT('ALTER TABLE ',  table_name, ' CHARACTER SET utf8 COLLATE utf8_bin;') FROM information_schema.TABLES AS T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` AS C
WHERE C.collation_name = T.table_collation AND T.table_schema = 'website_db' AND ( C.CHARACTER_SET_NAME <> 'utf8mb4' OR C.COLLATION_NAME <> 'utf8mb4_general_ci');

When you add columns to the table it will inherit the default Character set and Collation.

Changing the Character set of Database in MySQL

Finally, Applying this in database level. The command is almost identical to what we used for updating the table, we just have to change the key word “TABLE” to “DATABASE”.

ALTER DATABASE `website_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Likewise, we can create a dump of all the tables, recreate the database and insert all the tables back. The command in Linux for dumping the database table is “mysqldump”

mysqldump website_db > website_db.sq

After creating a dump, login again to MySQL and drop the existing database.

DROP DATABASE website_db;

And then we create our database with the default Character set and Collation. If you are about to start a fresh database then this is the only command you need.

CREATE DATABASE `website_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; 

Exit from MySQL. To insert back all your tables, run this command.

mysql website_db < website_db.sql

When you add a table it will now inherit the default Character set and Collation from the database.

Apply Character set and Collation during Connection

It is awesome that we now know how to apply all these Database changes. Although it is also a good idea to make this as a default setting already when your application connects to the database.

The command to do that is this…

SET default_storage_engine=INNODB, character_set_connection=utf8, collation_connection=utf8_bin;

If you have a Django application for example, the setting will look similar to this…

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'wordpress_db',
        'USER': 'wordpress',
        'PASSWORD': '[email protected]',
        'HOST': 'localhost',
        'PORT': '3306',
        'OPTIONS': {
            'init_command': 'SET default_storage_engine=INNODB, character_set_connection=utf8mb4, collation_connection=utf8mb4_bin',
        },
    }
}

Conclusion

To sum up, the configuration we discussed may not apply to all. That is to say, it is important to know the geographic location of the users and how they interact with your system. But in general this is a good start, using UTF8MB4 that can store up to 4-Byte characters should cover a wide range of characters and symbols.

Checkout Example Commands for MySQL

Leave a Reply

Your email address will not be published. Required fields are marked *

© YippeeCode.com 2020