Monday, February 6, 2012

UTF-8 encoding

On php UI, both fields (location, zone) support utf-8 characters. The user could type and save. What surprised me is that in mysql db, zone is defined as a varchar (latin1) column, and location is defined as a varchar (utf8). How can this work?

I set both zone and location to ‘测试’ and save. Below is the values stored in the DB. It is interesting that the zone field actually store the encoded utf8 string (each character takes 3 bytes) in a latin-1 column, whereas location field is a double encoded utf8 string which is unnecessary (it is because the connection setting is latin-1).

This is because the UI page uses utf-8 characters. It could be store blindly into a latin1 column.
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

[Explanation for location]
The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.

[Explanation for zone]
The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column.

Reference: MySQL Charset/Collate