Collation and character set are popular in SQL but not all developers mention to them. They can impact to querying data and performance. In this article, I will show you how to use them and how they impact to your database.
As you can see, we define the collation and character set for the database, table, and column. I'm sure developers define this for thousands of times. But do you know what they are? And how they impact to your database? Let's come up with theory first.
Character defines how characters are represented as binary data and is used to interpret and store textual information in a database. Common character in MYSQL: utf8mb4(UTF-8 Unicode), latin1, ascii, etc.... We can show all character sets by running the following query.
SHOW CHARACTER SET;
Collation refers to the set of rules that govern how string comparison and sorting operations are performed. If collation is not specified, the default collation for the database is used.
SELECT * FROM information_schema.COLLATIONS;
Specially, collations come in different types, such as case-sensitive (CS), case-insensitive (CI), accent-sensitive (AS), and accent-insensitive (AI).
Look at this, utf8mb4_vi_0900_ai_ci (accent-insensitive and case-insensitive). In here, utf8mb4 is character set, vi is language, 0900 is version, ai is accent-insensitive, ci is case-insensitive.
The results return 2 rows instead of 1 row. This collation makes wrong results when querying data.
Main point is when collation is accent-insensitive, it will ignore accents. For example: "a" and "รก" are the same. When collation is case-insensitive, it will ignore case. For example: "steve" and "Steve" are the same.
To fix that, we need to use utf8mb4_vi_0900_as_cs