Character set and collation basics

from the Artful MySQL Tips List


There are many places where the character set needs to be carefully managed:
  • the source or target MySQL table
  • the connection
  • the client
  • the result set
  • the software
  • data sources
If you are using the mysql client program, you can prevent many charset problems by issuing, right off the top ...

SET NAMES 'charset_name';

... which is shorthand for ...

SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = charset_name;

For most purposes, charset_name should be utf8 or utf8mb4.

If you are using PHP, instead of issuing a SET NAMES query at the top, call mysqli_set_charset(), which also sets the charset for PHP functions like mysqli_escape_string().

Note that these practices do not prevent problems arising from charset mismatches, for example between table or column charset settings and the client querying the table, or between table or column and the data coming in.

For webapps, there's another potential problem: some browsers, when they detect a character set mismatch with what's been specified, do a silent conversion, so looking OK in a browser doesn't ensure that the character encoding is sound, so forms accepting user data should include ...

<form accept "charset_name">

... for example ...

<form accept "UTF-8">

Character set mismatches can go unnoticed. To diagnose a suspicious character, compare hex values under relevant charsets, and compare LENGTH() of the character against CHAR_LENGTH():

set names utf8;
set @utf8q = "’";
select 
  @utf8q, 
  hex(@utf8q) as hexOfQ,
  convert(@utf8q using utf8) as convToUtf8, 
  convert(@utf8q using latin1) as convToLatin1, 
  hex( convert(@utf8q using latin1) ) as HexOflatin1,
  length(@utf8q) as len,
  char_length(@utf8q) as charLen;
+--------+--------+------------+--------------+-------------+------+---------+
| @utf8q | hexOfQ | convToUtf8 | convToLatin1 | HexOflatin1 | len  | charLen |
+--------+--------+------------+--------------+-------------+------+---------+
| ’      | E28099 | ’          | ’            | 92          |    3 |       1 |
+--------+--------+------------+--------------+-------------+------+---------+

Inserting the character into a table introduces more possibilities for problems. We make a small test table to store our char as latin1 or utf8, either as is, or specified as latin1, or specified as utf8:

drop table if exists c;
create table c( 
  id smallint unsigned primary key auto_increment,
  slatin char(8) character set latin1, 
  sutf8  char(8) character set utf8, 
  comment varchar(64)
);
insert into c set slatin="’", sutf8="’", comment="as is";
insert into c set slatin=_latin1"’", sutf8=_utf8"’", comment="slatin forced latin1, sutf8 forced utf8";
insert into c set slatin=_utf8"’", sutf8=_latin1"’", comment="slatin forced utf8, sutf8 forced latin1";
insert into c set slatin=_latin1"’", sutf8=_latin1"’", comment="both forced latin1";
insert into c set slatin=_utf8"’", sutf8=_utf8"’", comment="both forced utf8";

select id, slatin,hex(slatin),sutf8,hex(sutf8), comment from c;
+----+----------+-------------+----------+------------------+-----------------------------------------+
| id | slatin   | hex(slatin) | sutf8    | hex(sutf8)       | comment                                 |
+----+----------+-------------+----------+------------------+-----------------------------------------+
|  1 | ’        | 92          | ’        | E28099           | as is                                   |
|  2 | â€™      | E28099      | ’        | E28099           | slatin forced latin1, sutf8 forced utf8 |
|  3 | ’        | 92          | â€™      | C3A2E282ACE284A2 | slatin forced utf8, sutf8 forced latin1 |
|  4 | â€™      | E28099      | â€™      | C3A2E282ACE284A2 | both forced latin1                      |
|  5 | ’        | 92          | ’        | E28099           | both forced utf8                        |
+----+----------+-------------+----------+------------------+-----------------------------------------+

Stored as is, or stored with a specification matching the target column, everything's fine. But when the charset of what goes into the table fails to match the column charset, as in rows 2,3 and 4, we see charset mangling.

Here's a more elaborate example inserting the same Greek text into latin1, latin7 (which is "Baltic Rim", which includes Greek), and utf8 columns of a table, under each of those three Set Names settings, then Selecting each of the nine results under those three Set Names settings, giving 27 readings. Blank cells indicate unreadable results:

             Inserting and Selecting Greek Text in MySQL
+--------------+------------+-----------------------------------------+ 
|                           |          Set Names for Insert           | 
+--------------+------------+------------+--------------+-------------+
|  Set Names   | Column     |            |              |             | 
|  for Select  | Charset    |  latin1    |   latin7     |   utf8      | 
+--------------+------------+------------+--------------+-------------+ 
|              | latin1     |  Correct   |              |             | 
| latin1       | latin7     |  Correct   |              |             | 
|              | utf8       |  Correct   |              |             | 
+--------------+------------+------------+--------------+-------------+ 
|              | latin1     |            |              |             | 
| latin7       | latin7     |            |   Correct    |             | 
|              | utf8       |            |              |             | 
+--------------+------------+------------+--------------+-------------+ 
|              | latin1     |            |              |  Correct    | 
| utf8         | latin7     |            |              |  Correct    | 
|              | utf8       |            |              |  Correct    | 
+--------------+------------+------------+--------------+-------------+ 

When the charset of incoming data matches the column charset, and the charset of the client doing the Select matches both, the result is correct. In 20 of the remaining 24 combinations, the result is mangled. Latin1 and utf8 are a little forgiving of each other, but all remaining mismatches produce mangling!

The moral of the story: take care to match incoming data with column specifications and current client, server and table settings.

Must-read character set and collation resources ...

Character set tutorial

Prevent charset problems

Collations

Analyse charset problems

Fix xharset problems.

Return to the Artful MySQL Tips page