UTF8, UTF16, and UTF32

Unicode defines a single huge character set, assigning one unique integer value (a code point) to every graphical symbol (that is a major simplification, but it is close enough for this discussion). Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16, and UTF-32.


  • UTF-8 is variable-width: 1 to 4 bytes per code point.
UTF-8 is "size optimized". It is best suited for Latin-based (or ASCII) text, where it takes only 1 byte per character, but the size grows with the variety of symbols used (the original design allowed up to 6 bytes per code point, but UTF-8 has been restricted to 4 bytes since RFC 3629).

UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, and code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
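
A minimal sketch of these byte counts, assuming Python (the language and the sample characters are illustrative, not part of the original post):

    # UTF-8 byte counts grow with the code point value.
    samples = ["A", "é", "€", "😀"]  # U+0041, U+00E9, U+20AC, U+1F600
    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
    # Prints 1, 2, 3 and 4 bytes respectively.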

  • UTF-16 is variable-width: 2 or 4 bytes per code point.
UTF-16 is a kind of "balance". It takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages, giving most of them a fixed size that eases character handling (but the size is still variable and can grow to 4 bytes per code point).

UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, and code points U+10000 to U+10FFFF take 4 bytes (a surrogate pair). Bad for English text, good for Asian text.
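
A short sketch of this (again assuming Python; the characters are arbitrary examples), showing that BMP characters take one 2-byte code unit while higher code points need a 4-byte surrogate pair:

    # UTF-16 code units: 2 bytes for U+0000..U+FFFF, 4 bytes above that.
    for ch in ["A", "中", "😀"]:  # U+0041, U+4E2D, U+1F600
        encoded = ch.encode("utf-16-be")  # big-endian, no byte order mark
        print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex()}")
    # Prints 2, 2 and 4 bytes; the last is the surrogate pair d83d de00.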

  • UTF-32 is fixed-width: 4 bytes per code point.
UTF-32 is about "performance". It allows simple algorithms because every character has a fixed size (4 bytes), but at the cost of memory.

UTF-32: Fixed-width encoding. All code points take 4 bytes. An enormous memory hog, but fast to operate on. Rarely used.
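
A sketch of the fixed width (Python assumed, as above):

    # UTF-32: every code point occupies exactly 4 bytes, so the n-th code
    # point starts at byte offset 4*n.
    text = "A中😀"
    encoded = text.encode("utf-32-be")  # big-endian, no byte order mark
    print(len(encoded))          # 12 bytes: 3 code points * 4 bytes each
    print(encoded[4:8].hex())    # second code point: 00004e2d (U+4E2D)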
