Encodings and character sets

Thursday 22nd of November 2012 10:23:15 AM

Concepts and definitions

Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A and 66 stands for B and so forth". How these code points are actually encoded into bits is a different topic.

UTF-32 is one such encoding: it encodes every Unicode code point using 32 bits, that is, four bytes per character. It's very simple, but often wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be represented using a single byte (because its code point is a very small number), UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes, and so on, up to a maximum of four bytes. UTF-8 uses the highest bits of each byte to signal how many bytes a character consists of. This can save space, but may also waste space if these signal bits need to be used often. UTF-16 is in the middle: it uses two bytes for most characters and four bytes where necessary.
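The size differences described above can be checked with a minimal Python sketch. The sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the original text, and the big-endian codecs are used only so that Python does not prepend a byte order mark to the output.

```python
# Sample characters chosen for illustration (written as escapes to stay ASCII-safe).
SAMPLES = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001F600",  # U+1F600 GRINNING FACE
]

def encoded_lengths(ch):
    """Return the number of bytes ch occupies in each UTF encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" codecs fix the byte order and emit no byte order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

For these four samples, UTF-8 needs 1, 2, 3 and 4 bytes respectively, UTF-16 needs 2, 2, 2 and 4, and UTF-32 always needs 4.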

So, Unicode is a large table mapping characters to numbers, and the different UTF encodings specify how these numbers are encoded as bits. Taken together, the Unicode table and one of its UTF encodings form yet another encoding scheme.

Code points

Characters are referred to by their "Unicode code point". Unicode code points are written in hexadecimal (to keep the numbers shorter) and prefixed with "U+". The prefix is purely a convention; it has no meaning other than "this is a Unicode code point".
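As a rough illustration of the notation, the following Python sketch converts between a character and its "U+" form. The helper names `to_notation` and `from_notation` are hypothetical; the sketch assumes the usual convention of padding the hexadecimal number to at least four digits.

```python
def to_notation(ch):
    """Format a single character as a U+XXXX code point string."""
    # ord() gives the code point as an integer; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"

def from_notation(text):
    """Parse a U+XXXX string back into the character it names."""
    # Drop the "U+" prefix and read the remainder as a hexadecimal number.
    return chr(int(text[2:], 16))

print(to_notation("A"))         # U+0041
print(from_notation("U+0042"))  # B
```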

Example encodings of the letter "A" (LATIN CAPITAL LETTER A)

Character  Code point  Encoding  Bits
A          U+0041      UTF-8     01000001
A          U+0041      UTF-16    00000000 01000001
A          U+0041      UTF-32    00000000 00000000 00000000 01000001
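The bit patterns listed for "A" can be reproduced with a short Python sketch. The helper name `bits` is hypothetical, and the sketch assumes big-endian byte order without a byte order mark, which is what the UTF-16 and UTF-32 rows appear to show.

```python
def bits(ch, encoding):
    """Encode ch and render each resulting byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode(encoding))

# Python's plain "utf-16" and "utf-32" codecs prepend a byte order mark,
# so the explicit big-endian variants are used here instead.
for name, codec in [("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be")]:
    print(name, bits("A", codec))
```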