Introduction
Unicode characters are identified by integers called code points. A code
point value is in the range 0 to 0x10FFFF (1,114,112 possible values).
A character (integer) is encoded as one or more bytes, allowing storage on
byte devices such as disk or memory. UTF-8 is the most popular encoding
and is Python 3's default for source files and for str.encode() and
bytes.decode(). (Note: There are other encodings such as UTF-16.)
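A minimal sketch of the points above, using only the built-ins ord(), chr() and str.encode(). The sample characters and variable names are chosen here for illustration.

```python
# Each character maps to an integer code point and back again.
samples = ['a', 'Ω', '♞', '😀']

for ch in samples:
    cp = ord(ch)                     # character -> code point (int)
    assert chr(cp) == ch             # code point -> character
    utf8 = ch.encode('utf-8')        # 1 to 4 bytes per code point
    utf16 = ch.encode('utf-16-be')   # 2 or 4 bytes per code point
    print(f'U+{cp:04X}  utf-8: {len(utf8)} bytes  utf-16: {len(utf16)} bytes')

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
utf16_lengths = [len(ch.encode('utf-16-be')) for ch in samples]

# 0x10FFFF is the largest code point; chr() rejects anything above it.
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```

The same code point takes a different number of bytes depending on the encoding, which is why the encoding has to be known when bytes are read back.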
Examples of Unicode Characters
                      html       Python String
-- ord value --       special    unicode
hex      dec     chr  character  character   description
-------  ------  ---  ---------  ----------  ---------------------
0x61     97      'a'  a          \u0061      LATIN SMALL LETTER A
0x62     98      'b'  b          \u0062      LATIN SMALL LETTER B
0x63     99      'c'  c          \u0063      LATIN SMALL LETTER C
...
0x7b     123     '{'  {          \u007b      LEFT CURLY BRACKET
...
0x2167   8551    'Ⅷ'  Ⅷ          \u2167      ROMAN NUMERAL EIGHT
0x2168   8552    'Ⅸ'  Ⅸ          \u2168      ROMAN NUMERAL NINE
...
0x265E   9822    '♞'  ♞          \u265e      BLACK CHESS KNIGHT
0x265F   9823    '♟'  ♟          \u265f      BLACK CHESS PAWN
...
0x1F600  128512  '😀'  😀          \U0001f600  GRINNING FACE
0x1F609  128521  '😉'  😉          \U0001f609  WINKING FACE
\u 16-bit Unicode escape sequence (exactly 4 hex digits)
\U 32-bit Unicode escape sequence (exactly 8 hex digits)
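The table columns can be reproduced from Python itself. The sketch below assumes the standard unicodedata module for the description column; describe() is a hypothetical helper written for this example, not part of any library.

```python
import unicodedata

# \u takes exactly 4 hex digits, \U takes exactly 8.
knight = '\u265e'
grin = '\U0001f600'

def describe(ch):
    """Return (hex, dec, escape, description) for a single character."""
    cp = ord(ch)
    # Code points above 0xFFFF do not fit in \u and need the \U form.
    escape = f'\\u{cp:04x}' if cp <= 0xFFFF else f'\\U{cp:08x}'
    return hex(cp), cp, escape, unicodedata.name(ch)

for ch in ('a', '{', knight, grin):
    print(*describe(ch))
```

Note that unicodedata.name() raises ValueError for characters without a name (control characters, for example) unless a default is passed as the second argument.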
Shown are the UTF-8 byte formats and how many bits (x) each makes
available for the code point value that defines a character.
0xxx_xxxx                                 7 bits  (U+0000 to U+007F)
110x_xxxx 10xx_xxxx                      11 bits  (U+0080 to U+07FF)
1110_xxxx 10xx_xxxx 10xx_xxxx            16 bits  (U+0800 to U+FFFF)
1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx  21 bits  (U+10000 to U+10FFFF)
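The byte formats above can be sketched as a hand-written encoder. This is an illustration only: utf8_encode() is a hypothetical helper, and real code should simply call str.encode('utf-8').

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 bytes by hand (illustrative helper)."""
    # Surrogates (0xD800 to 0xDFFF) are reserved for UTF-16 and not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f'not an encodable code point: {cp:#x}')
    if cp < 0x80:         # 7 bits:  0xxx_xxxx
        return bytes([cp])
    if cp < 0x800:        # 11 bits: 110x_xxxx 10xx_xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:      # 16 bits: 1110_xxxx 10xx_xxxx 10xx_xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# The hand-written result should match Python's built-in encoder.
for ch in 'aΩ♞😀':
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
```

Each continuation byte carries 6 bits of the code point (mask 0x3F) behind the fixed 10 prefix, and the lead byte carries whatever bits remain.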
Notes
- Unicode is a character set. UTF-8 is an encoding.
- In Python 3 a str is a sequence of Unicode code points; UTF-8 is the
default encoding used to convert a str to and from bytes
- The ord(x) function returns an integer representing the
Unicode code point of the character x
- UTF-8 encode/decode examples, where byte_bits(b) is shorthand for
' '.join(f'{x:08b}' for x in b) and shows each byte as 8 binary digits:
encode
- 'a'.encode('utf-8') = b'a'
byte_bits(b'a') = 01100001
- 'Ω'.encode('utf-8') = b'\xce\xa9'
byte_bits(b'\xce\xa9') = 11001110 10101001
- '♞'.encode('utf-8') = b'\xe2\x99\x9e'
byte_bits(b'\xe2\x99\x9e') = 11100010 10011001 10011110
- '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
byte_bits(b'\xf0\x9f\x98\x80') =
11110000 10011111 10011000 10000000
decode
- '😉'.encode('utf-8') = b'\xf0\x9f\x98\x89'
b'\xf0\x9f\x98\x89'.decode('utf-8') = '😉'
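The encode/decode examples can be checked with a short script. The sketch below assumes a small helper, byte_bits(), written for this example to print each byte as 8 binary digits; it is not a built-in.

```python
def byte_bits(data):
    """Format each byte as 8 binary digits (illustrative helper)."""
    return ' '.join(f'{b:08b}' for b in data)

for ch in ('a', 'Ω', '♞', '😀', '😉'):
    encoded = ch.encode('utf-8')
    assert encoded.decode('utf-8') == ch    # the round trip is lossless
    print(ch, encoded, byte_bits(encoded))

# A truncated multi-byte sequence is not valid UTF-8 and fails to decode.
try:
    b'\xf0\x9f\x98'.decode('utf-8')
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Formatting byte by byte keeps the leading zeros that bin() drops, so the prefix bits of each UTF-8 byte stay visible.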
Links
Python 3 Unicode HOWTO
Unicode Home
HTML Unicode (UTF-8) Reference