Some Unicode Notes and Stuff

Introduction

Unicode characters are identified by integers called code points. A code point is in the range 0 to 0x10FFFF (1,114,112 possible values). To store a character on byte-oriented devices such as disk or memory, its code point is encoded as one or more bytes. UTF-8 is the most popular encoding; it is the default source-file encoding in Python 3 and the default for str.encode() and bytes.decode(). (Note: There are other encodings such as UTF-16.)
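As a rough illustration of the difference between a code point and its encoded bytes, the sketch below uses only Python built-ins. The variable names are chosen here for illustration and are not part of the original notes.

```python
# A code point is just an integer; an encoding turns it into bytes.
ch = 'Ω'                               # GREEK CAPITAL LETTER OMEGA
code_point = ord(ch)                   # the integer 0x3A9

utf8_bytes = ch.encode('utf-8')        # two bytes in UTF-8
utf16_bytes = ch.encode('utf-16-be')   # two bytes in big-endian UTF-16 (no BOM)

print(hex(code_point), utf8_bytes, utf16_bytes)

# The largest code point, 0x10FFFF, needs four bytes in UTF-8.
largest = chr(0x10FFFF).encode('utf-8')
print(len(largest))
```

The same code point yields different byte sequences under different encodings, which is why the encoding must be known before bytes can be read back as text.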

Examples of Unicode Characters

                    Python string   html        unicode
--- ord value --                    special     escape
hex        dec      chr             character   character     description
-------    -------  ---             ---------   ----------    ---------------------
0x61       97       'a'             a           \u0061        LATIN SMALL LETTER A
0x62       98       'b'             b           \u0062        LATIN SMALL LETTER B
0x63       99       'c'             c           \u0063        LATIN SMALL LETTER C
...
0x7b       123      '{'             {           \u007b        LEFT CURLY BRACKET
...
0x2167     8551     'Ⅷ'             Ⅷ           \u2167        ROMAN NUMERAL EIGHT
0x2168     8552     'Ⅸ'             Ⅸ           \u2168        ROMAN NUMERAL NINE
...
0x265E     9822     '♞'             ♞           \u265e        BLACK CHESS KNIGHT
0x265F     9823     '♟'             ♟           \u265f        BLACK CHESS PAWN
...
0x1F600    128512   '😀'            😀          \U0001f600    GRINNING FACE
0x1F609    128521   '😉'            😉          \U0001f609    WINKING FACE

\u   16 bit unicode escape sequence (4 hex digits)
\U   32 bit unicode escape sequence (8 hex digits)
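The rows of such a table can be reproduced from Python itself. The following is a minimal sketch using the standard unicodedata and html modules; the variable names are illustrative only.

```python
import html
import unicodedata

# \u takes exactly 4 hex digits; \U takes exactly 8 hex digits.
knight = '\u265e'
grin = '\U0001f600'

# Print hex value, decimal value, the character, and its official name.
for ch in ('a', '{', '\u2167', knight, grin):
    print(f'{ord(ch):#x}  {ord(ch):6d}  {ch!r}  {unicodedata.name(ch)}')

# An HTML numeric character reference names the same code point.
from_html = html.unescape('&#9822;')
print(from_html == knight)
```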

Shown are the UTF-8 byte formats and how many bits each leaves for the code point value that defines a character. Each x is a bit of the code point; the other bits are fixed markers.

0xxx_xxxx                                   7 bits
110x_xxxx 10xx_xxxx                        11 bits
1110_xxxx 10xx_xxxx 10xx_xxxx              16 bits
1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx    21 bits
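To make the byte formats concrete, here is a hand-written encoder sketch that fills the marker bits and payload bits for one code point. The function name utf8_encode_code_point is hypothetical and exists only for this illustration; real code should simply call str.encode('utf-8').

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand (illustrative helper, not a library API)."""
    # Surrogates (0xD800-0xDFFF) and out-of-range values are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f'not an encodable code point: {cp:#x}')
    if cp < 0x80:          # fits in 7 bits: one byte, leading 0
        return bytes([cp])
    if cp < 0x800:         # fits in 11 bits: lead 110, one continuation byte
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # fits in 16 bits: lead 1110, two continuation bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: lead 11110, three continuation bytes
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-built bytes with Python's own encoder.
for ch in 'aΩ♞😀':
    assert utf8_encode_code_point(ord(ch)) == ch.encode('utf-8')
```

Every continuation byte starts with 10 and carries six payload bits, which is where the masks 0x80 and 0x3F come from.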

Notes

  1. Unicode is a character set. UTF-8 is an encoding.

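  2. In Python 3 a string (str) is a sequence of Unicode code points. UTF-8 is one way of turning those code points into bytes; it is the default for source files and for str.encode().

  3. The ord(x) function returns an integer representing the Unicode code point of the character x. The chr(i) function is its inverse.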

  4. UTF-8 encode/decode examples:

      encode

    • 'a'.encode('utf-8') = b'a'
      ' '.join(f'{b:08b}' for b in b'a') = 01100001

    • 'Ω'.encode('utf-8') = b'\xce\xa9'
      ' '.join(f'{b:08b}' for b in b'\xce\xa9') = 11001110 10101001

    • '♞'.encode('utf-8') = b'\xe2\x99\x9e'
      ' '.join(f'{b:08b}' for b in b'\xe2\x99\x9e') = 11100010 10011001 10011110

    • '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
      ' '.join(f'{b:08b}' for b in b'\xf0\x9f\x98\x80') = 11110000 10011111 10011000 10000000

      decode

    • '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
      b'\xf0\x9f\x98\x80'.decode('utf-8') = '😀'
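The notes above can be checked with a short round-trip sketch. It also shows what happens when a multi-byte sequence is cut short, a case the notes do not cover; the variable names are illustrative.

```python
s = '♞'

# ord() and chr() are inverses of each other.
assert chr(ord(s)) == s

# encode() and decode() are inverses when the same encoding is used.
data = s.encode('utf-8')
assert data.decode('utf-8') == s

# One character, but three bytes once encoded.
print(len(s), len(data))

# Dropping the last byte leaves an incomplete sequence that cannot be decoded.
try:
    data[:2].decode('utf-8')
    truncated_ok = True
except UnicodeDecodeError as err:
    truncated_ok = False
    print('decode failed:', err.reason)

# errors='replace' substitutes U+FFFD (REPLACEMENT CHARACTER) instead of raising.
replaced = data[:2].decode('utf-8', errors='replace')
print('\ufffd' in replaced)
```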

Links

Python 3 Unicode HOWTO

Unicode Home

HTML Unicode (UTF-8) Reference

Why Nobody Knows What This One Unicode Character Means (YouTube)

Characters, Bytes, and Bits

#!/usr/bin/python3
# ====================================================================
# demonstrate the number of bytes and bits in characters
# ====================================================================

# --------------------------------------------------------------------
# ---- string length in bytes
# --------------------------------------------------------------------

def utf8len(s:str) -> int:
    return len(s.encode('utf-8'))

# --------------------------------------------------------------------
# ---- convert each byte in a string into a string of bits
# --------------------------------------------------------------------

def bit_string(s:str) -> str:

    # ---- convert string to a list of bytes
    byts = s.encode('utf-8')

    # ---- convert bytes to a list of bit strings
    bin_strs = []
    for byt in byts:
        bin_strs.append(f'{byt:08b}')

    # ---- combine bit strings into a single string
    return ' '.join(bin_strs)

# --------------------------------------------------------------------
# ---- display a string's bytes and bits
# --------------------------------------------------------------------

def display_a_string_bytes_and_bits(s:str) -> None:
    print()
    print(f'str="{s}" len={len(s)} (char) sizeof={utf8len(s)} (bytes)')
    print()
    print(f'bit string is {bit_string(s)}')

# --------------------------------------------------------------------
# ---- main
# --------------------------------------------------------------------

print()
print('---------- single character ASCII')
display_a_string_bytes_and_bits('A')

print()
print('---------- single character UTF-8')
display_a_string_bytes_and_bits('\u16A0')

print()
print('---------- multiple characters')
display_a_string_bytes_and_bits('A\u16A0B')
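As a possible companion to the program above, this sketch breaks a string down character by character so it is visible which characters account for the extra bytes. The names text and per_char are illustrative and not part of the original program.

```python
text = 'A\u16A0B'   # 'A', RUNIC LETTER FEHU FEOH FE F, 'B'

# Pair each character with the number of bytes it needs in UTF-8.
per_char = [(ch, len(ch.encode('utf-8'))) for ch in text]

for ch, nbytes in per_char:
    print(f'U+{ord(ch):04X}  {nbytes} byte(s)')

# The per-character counts add up to the encoded length of the whole string.
total = sum(nbytes for _, nbytes in per_char)
print(len(text), 'characters,', total, 'bytes')
```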