Database Charset and Collation Fix in MySQL

By Rex Resurreccion Dec 12, 2019
Database Charset and Collation Fix in MySQL

Proper Character set and Collation is important

Storing user input into the Database could become challenging, especially if the text contains different types of characters. For instance, a Chinese and Russian alphabets, special symbols like an arrow, copyright and an emoji. All of these can have different unicode encoding and different output during rendering than the original format while saving it.

Improper Database Charset and collation can make user input unreadable or worst your system could crash due to a character encoding and decoding error.

Luckily, popular database management system supports different types of Collation and Character set.

What is Character set and Collation?

A character set is a set of symbols and encoding that contains one or more collation. A collation is a set of rules that define how data characters are sorted and compared in a database.

Choosing a character set to use

If you are to choose what Character set to use for configuring your database, the most common is UTF8. This unicode encoding supports a wide range of characters.

MySQL UTF8, UTF8MB3 and UTF8MB4

fix character set and collation in mysql

In MySQL you should use UTF8MB4, that can store a 4-Byte character and also support newly introduced symbols and emoji.

The original UTF8MB3 which is an alias for UTF8 can only support 3-Byte character, you should not use this anymore because it has already been deprecated and expected to be removed in a future release.

Changing the Character set of Table Columns in MySQL

Firstly, I will show how to update this in column level. Let’s get the Database Charset and Collation in the table columns for us to know which one needs an update. All the information are stored in COLUMNS table of information_schema database.

Run the select query below by changing the condition on TABLE_SCHEMA to be the name of your own database (change website_db to the name of your database).

SELECT `table_name`, `column_name`, `DATA_TYPE`, `IS_NULLABLE`,  `CHARACTER_SET_NAME`, `COLLATION_NAME` FROM information_schema.COLUMNS WHERE `TABLE_SCHEMA` = 'website_db' AND (DATA_TYPE = 'varchar' OR DATA_TYPE LIKE '%text');

The SELECT query will return all the columns in your tables and its current Database Charset and Collation.

Now we can use the ALTER command to update the columns to have UTF8MB4. The comments will be the actual table in your database and name is the column in that table.

ALTER TABLE `comments` MODIFY `name` varchar(255) CHARACTER SET UTF8MB4 NOT NULL;

We can update the tables one at a time, however, it would be cumbersome if you have a large database schema. Run this query to generate the ALTER statement that you will need to update the entire columns in your table (change the comments to the name of your table) .

SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', DATA_TYPE, '(', CHARACTER_MAXIMUM_LENGTH, ') CHARACTER SET UTF8MB4', (CASE WHEN IS_NULLABLE = 'NO' THEN ' NOT NULL' ELSE '' END), ';') FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = 'website_db' AND table_name = 'comments' AND (DATA_TYPE = 'varchar' OR DATA_TYPE LIKE '%text') AND (COLLATION_NAME <> 'utf8mb4_general_ci');

You will get a result similar to this…

| ALTER TABLE `comments` MODIFY `name` varchar(255) CHARACTER SET UTF8MB4 NOT NULL;|
| ALTER TABLE `comments` MODIFY `message` text(65535) CHARACTER SET UTF8MB4 NOT NULL; 

Changing the Character set of Table in MySQL

Secondly, we can update the default Character set and Collation in table level. Get the current Character set and Collation of the tables stored in TABLES and COLLATION_CHARACTER_SET_APPLICABILITY of information_schema database.

Run the query below by changing website_db to whatever name of your database.

SELECT table_name, C.character_set_name, C.collation_name FROM information_schema.TABLES AS T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` AS C WHERE C.collation_name = T.table_collation AND T.table_schema = 'website_db';

The SELECT query will return all the tables and its current Character set and Collation.

Update your table using the ALTER command like this…

ALTER TABLE comments CHARACTER SET UTF8MB4;

To generate the ALTER command for all your tables run this query (change the website_db and comments to be your actual database and table names).

SELECT CONCAT('ALTER TABLE ',  table_name, ' CHARACTER SET utf8 COLLATE utf8_bin;') FROM information_schema.TABLES AS T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` AS C 
WHERE C.collation_name = T.table_collation AND T.table_schema = 'website_db' AND ( C.CHARACTER_SET_NAME <> 'utf8mb4' OR C.COLLATION_NAME <> 'utf8mb4_general_ci');

When you add new table columns, it will now inherit the default Character set and Collation.

Changing the Character set of Database in MySQL

Finally, applying the setup in database level. The command is almost identical to what we used for updating a table, we just have to change the key word TABLE to DATABASE.

ALTER DATABASE `website_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Likewise, we can create a dump of all the tables, recreate the database and insert all the tables back. The command in Linux for dumping the database table is mysqldump.

mysqldump website_db > website_db.sq

After creating a dump, login again to MySQL and drop the existing database.

DROP DATABASE website_db;

And then we create our database using the default Character set and Collation. If you are about to start a fresh database then this is the only command you need.

CREATE DATABASE `website_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; 

Exit from MySQL. To insert back all your tables, run this command.

mysql website_db < website_db.sql

When you add a table it will now inherit the default Character set and Collation from the database.

Apply Character set and Collation during Connection

It is nice that we now know how to apply all these Database changes. Although it is also a good practice to make this as a default setting already when your application connects to the database.

The command to do that is simple.

SET default_storage_engine=INNODB, character_set_connection=utf8, collation_connection=utf8_bin;

If you have a Django application for example, the setting will look similar to this…

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'website_db',
        'USER': 'dev1',
        'PASSWORD': '15391@Sxz9',
        'HOST': 'localhost',
        'PORT': '3306',
        'OPTIONS': {
            'init_command': 'SET default_storage_engine=INNODB, character_set_connection=utf8mb4, collation_connection=utf8mb4_bin',
        },
    }
}

Conclusion

The configuration we discussed may not apply to all. That is to say, it is important to know the geographic location of the users and how they interact with your system. But in general this is a good start, using UTF8MB4 that can store up to 4-Byte characters should cover a wide range of characters and symbols.

Checkout Example Commands for MySQL

© YippeeCode.com 2020