Introduction
Unicode characters are identified by integers called code points. A code point value is
in the range 0 to 0x10FFFF (1,114,112 possible values).
A character (integer) is encoded as one or more bytes, allowing storage on byte-oriented
devices such as disk or memory. UTF-8 is the most popular encoding
and is the default encoding in Python 3. (Note: There are other encodings such as UTF-16.)
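To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch using only Python built-ins (ord, chr, str.encode). The sample character and the variable names are arbitrary choices for illustration.

```python
# A code point is an integer; an encoding turns it into bytes.
ch = 'Ω'                       # GREEK CAPITAL LETTER OMEGA
cp = ord(ch)                   # code point as an integer
print(hex(cp), cp)             # 0x3a9 937
print(chr(cp) == ch)           # True

# The same code point produces different bytes under different encodings.
utf8_bytes = ch.encode('utf-8')
utf16_bytes = ch.encode('utf-16-be')   # big-endian, no byte order mark
print(utf8_bytes)              # b'\xce\xa9'
print(utf16_bytes)             # b'\x03\xa9'

# Code points run from 0 through 0x10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT + 1)      # 1114112
```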
Examples of Unicode Characters
--- ord value ---            html       Python string
hex       dec      chr       character  unicode escape  description
--------  -------  --------  ---------  --------------  ---------------------
0x61      97       'a'       a          \u0061          LATIN SMALL LETTER A
0x62      98       'b'       b          \u0062          LATIN SMALL LETTER B
0x63      99       'c'       c          \u0063          LATIN SMALL LETTER C
...
0x7b      123      '{'       {          \u007b          LEFT CURLY BRACKET
...
0x2167    8551     'Ⅷ'       Ⅷ          \u2167          ROMAN NUMERAL EIGHT
0x2168    8552     'Ⅸ'       Ⅸ          \u2168          ROMAN NUMERAL NINE
...
0x265E    9822     '♞'       ♞          \u265e          BLACK CHESS KNIGHT
0x265F    9823     '♟'       ♟          \u265f          BLACK CHESS PAWN
...
0x1F600   128512   '😀'      😀         \U0001f600      GRINNING FACE
0x1F609   128521   '😉'      😉         \U0001f609      WINKING FACE
\uXXXX      16-bit Unicode escape sequence (exactly 4 hex digits)
\UXXXXXXXX  32-bit Unicode escape sequence (exactly 8 hex digits)
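The two escape forms can be checked directly in Python. This short sketch uses characters from the table above and the standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

# \uXXXX takes exactly 4 hex digits; \UXXXXXXXX takes exactly 8.
knight = '\u265e'
grin = '\U0001f600'

print(knight == '♞')                 # True
print(grin == '😀')                  # True

# Code points above 0xFFFF do not fit in \u and need the 8-digit \U form.
print(ord(grin) > 0xFFFF)            # True

# unicodedata.name() returns the official character name.
print(unicodedata.name(knight))      # BLACK CHESS KNIGHT
print(unicodedata.name(grin))        # GRINNING FACE

# \N{...} looks a character up by that name.
print('\N{BLACK CHESS KNIGHT}' == knight)   # True
```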
Shown are the UTF-8 byte formats and how many bits each makes
available for the code point value that defines a character.
0xxx_xxxx                                  7 bits
110x_xxxx 10xx_xxxx                       11 bits
1110_xxxx 10xx_xxxx 10xx_xxxx             16 bits
1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx   21 bits
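The four byte formats can be turned into a hand-written encoder. The sketch below is for illustration only: utf8_encode_code_point is a hypothetical helper name, it does not reject surrogate code points, and real code should simply call str.encode('utf-8'). The loop compares the manual result with Python's own encoder.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand using the four UTF-8 byte formats.

    Hypothetical helper for illustration; surrogates (0xD800-0xDFFF)
    are not rejected in this sketch.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError('code point out of range')
    if cp < 0x80:                       # fits in 7 bits  -> 1 byte
        return bytes([cp])
    if cp < 0x800:                      # fits in 11 bits -> 2 bytes
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp < 0x10000:                    # fits in 16 bits -> 3 bytes
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    return bytes([0b1111_0000 | (cp >> 18),          # up to 21 bits -> 4 bytes
                  0b1000_0000 | ((cp >> 12) & 0x3F),
                  0b1000_0000 | ((cp >> 6) & 0x3F),
                  0b1000_0000 | (cp & 0x3F)])


# Compare the manual encoder with the built-in one.
for ch in 'aΩ♞😀':
    manual = utf8_encode_code_point(ord(ch))
    print(ch, manual, manual == ch.encode('utf-8'))
```

Each branch picks the shortest format whose payload bits can hold the value, which is why the thresholds are 0x80 (7 bits), 0x800 (11 bits), and 0x10000 (16 bits).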
Notes
- Unicode is a character set. UTF-8 is an encoding.
- In Python 3 a str is a sequence of Unicode code points; UTF-8 is the
  default encoding for source files and for str.encode() and bytes.decode()
- The ord(x) function returns an integer representing the
Unicode code point of the character x
- UTF-8 encode/decode examples (bits are shown padded to whole bytes and
  grouped by byte; bin() itself returns an unpadded, ungrouped string
  such as '0b1100001'):
  encode
  - 'a'.encode('utf-8') = b'a'
    bin(int.from_bytes(b'a','big')) = 01100001
  - 'Ω'.encode('utf-8') = b'\xce\xa9'
    bin(int.from_bytes(b'\xce\xa9','big')) = 11001110 10101001
  - '♞'.encode('utf-8') = b'\xe2\x99\x9e'
    bin(int.from_bytes(b'\xe2\x99\x9e','big')) = 11100010 10011001 10011110
  - '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
    bin(int.from_bytes(b'\xf0\x9f\x98\x80','big')) =
    11110000 10011111 10011000 10000000
  decode
  - '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
    b'\xf0\x9f\x98\x80'.decode('utf-8') = '😀'
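The decode direction can also be traced by hand: the lead byte says how many bytes the character occupies, and each continuation byte contributes 6 payload bits. The sketch below assumes well-formed input; utf8_decode_first is a hypothetical helper that skips the validation a real decoder performs (continuation-byte checks, overlong forms, truncated input).

```python
from typing import Tuple


def utf8_decode_first(data: bytes) -> Tuple[int, int]:
    """Return (code_point, byte_count) for the first character in data.

    Hypothetical helper for illustration; assumes valid UTF-8 input.
    """
    lead = data[0]
    if lead < 0x80:                 # 0xxx_xxxx: single byte
        return lead, 1
    if lead >> 5 == 0b110:          # 110x_xxxx: 2 bytes, 5 payload bits
        count, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:       # 1110_xxxx: 3 bytes, 4 payload bits
        count, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:      # 1111_0xxx: 4 bytes, 3 payload bits
        count, cp = 4, lead & 0x07
    else:
        raise ValueError('invalid lead byte')
    for byt in data[1:count]:
        cp = (cp << 6) | (byt & 0x3F)   # append 6 payload bits per byte
    return cp, count


raw = '😀'.encode('utf-8')
cp, count = utf8_decode_first(raw)
print(hex(cp), count, chr(cp) == raw.decode('utf-8'))   # 0x1f600 4 True
```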
Links
Python 3 Unicode HOWTO
Unicode Home
HTML Unicode (UTF-8) Reference
Why Nobody Knows What This One Unicode Character Means (YouTube)
Characters, Bytes, and Bits
#!/usr/bin/python3
# ====================================================================
# demonstrate the number of bytes and bits in characters
# ====================================================================
# --------------------------------------------------------------------
# ---- string length in bytes
# --------------------------------------------------------------------
def utf8len(s: str) -> int:
    return len(s.encode('utf-8'))
# --------------------------------------------------------------------
# ---- convert each byte in a string into a string of bits
# --------------------------------------------------------------------
def bit_string(s: str) -> str:
    # ---- convert string to a bytes object (UTF-8 encoded)
    byts = s.encode('utf-8')
    # ---- convert each byte to an 8-character bit string
    bin_strs = []
    for byt in byts:
        bin_strs.append(f'{byt:08b}')
    # ---- combine bit strings into a single string
    return ' '.join(bin_strs)
# --------------------------------------------------------------------
# ---- display a string's bytes and bits
# --------------------------------------------------------------------
def display_a_string_bytes_and_bits(s: str) -> None:
    print()
    print(f'str="{s}" len={len(s)} (char) sizeof={utf8len(s)} (bytes)')
    print()
    print(f'bit string is {bit_string(s)}')
# --------------------------------------------------------------------
# ---- main
# --------------------------------------------------------------------
print()
print('---------- single character ASCII')
display_a_string_bytes_and_bits('A')
print()
print('---------- single character non-ASCII (multi-byte in UTF-8)')
display_a_string_bytes_and_bits('\u16A0')
print()
print('---------- multiple characters')
display_a_string_bytes_and_bits('A\u16A0B')
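For a mixed string such as the last example, it can help to see which bytes belong to which character. The variation below is a sketch, not part of the original script; per_character_bits is a hypothetical helper that encodes one character at a time.

```python
def per_character_bits(s: str) -> list:
    """Return (character, byte_count, bit_string) for each character in s."""
    rows = []
    for ch in s:
        byts = ch.encode('utf-8')                       # bytes for this one character
        bits = ' '.join(f'{byt:08b}' for byt in byts)   # 8 bits per byte
        rows.append((ch, len(byts), bits))
    return rows


for ch, nbytes, bits in per_character_bits('A\u16A0B'):
    print(f'U+{ord(ch):04X} bytes={nbytes} bits={bits}')
```

With this input the two ASCII letters take one byte each and the middle character (U+16A0) takes three.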