MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
ucs2 (UCS-2 Unicode) collations:
ucs2_bin
ucs2_czech_ci
ucs2_danish_ci
ucs2_estonian_ci
ucs2_general_ci (default)
ucs2_icelandic_ci
ucs2_latvian_ci
ucs2_lithuanian_ci
ucs2_persian_ci
ucs2_polish_ci
ucs2_roman_ci
ucs2_romanian_ci
ucs2_slovak_ci
ucs2_slovenian_ci
ucs2_spanish2_ci
ucs2_spanish_ci
ucs2_swedish_ci
ucs2_turkish_ci
ucs2_unicode_ci
utf8 (UTF-8 Unicode) collations:
utf8_bin
utf8_czech_ci
utf8_danish_ci
utf8_estonian_ci
utf8_general_ci (default)
utf8_icelandic_ci
utf8_latvian_ci
utf8_lithuanian_ci
utf8_persian_ci
utf8_polish_ci
utf8_roman_ci
utf8_romanian_ci
utf8_slovak_ci
utf8_slovenian_ci
utf8_spanish2_ci
utf8_spanish_ci
utf8_swedish_ci
utf8_turkish_ci
utf8_unicode_ci
The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-ending byte order or a BOM, in which case conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.
Note that in the ucs2_roman_ci and
utf8_roman_ci collations,
I and J are given the same
precedence, and U and V
are given the same precedence.
There is a limitation in MySQL 4.1 that results in two
characters not being correctly handled when a user tries to
change their case using LOWER() or
UPPER():
LATIN SMALL LETTER DOTLESS i
LATIN CAPITAL LETTER I WITH DOT ABOVE
Here are two workarounds for MySQL 4.1:
Use UCS2 if you have Turkish data.
Use these function calls:
CONVERT(LOWER(CONVERT(col USING ucs2)) USING utf8)
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The following discussion uses
utf8_unicode_ci, but it is also true for
ucs2_unicode_ci.
Currently, the utf8_unicode_ci collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in
utf8_unicode_ci is that it supports
expansions; that is, when one character compares as equal to
combinations of other characters. For example, in German and
some other languages ‘ß’ is
equal to ‘ss’.
utf8_general_ci is a legacy collation that
does not support expansions. It can make only one-to-one
comparisons between characters. This means that comparisons for
the utf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci.
For example, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci:
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci:
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering with
utf8_unicode_ci does not work well for a
language. For example, utf8_unicode_ci works
fine for German and French, so there is no need to create
special utf8 collations for these two
languages.
utf8_general_ci also is satisfactory for both
German and French, except that
‘ß’ is equal to
‘s’, and not to
‘ss’. If this is acceptable for
your application, then you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it is
more accurate.
utf8_swedish_ci, like other
utf8 language-specific collations, is derived
from utf8_unicode_ci with additional language
rules. For example, in Swedish, the following relationship
holds, which is not something expected by a German or French
speaker:
Ü = Y < Ö
The utf8_spanish_ci and
utf8_spanish2_ci collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, ‘ñ’ (n-tilde) is a
separate letter between ‘n’ and
‘o’. In addition, for traditional
Spanish, ‘ch’ is a separate
letter between ‘c’ and
‘d’, and
‘ll’ is a separate letter between
‘l’ and
‘m’

User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.