Some Unicode Notes and Stuff

Introduction

Unicode characters are identified by integers called code points. A code point is in the range 0 to 0x10FFFF (1,114,112 possible values). To store a character on byte-oriented media such as disk or memory, its code point is encoded as one or more bytes. UTF-8 is the most popular encoding and is the default encoding in Python 3 for source files and for str.encode() and bytes.decode(). (Note: There are other encodings, such as UTF-16.)
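To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. The sample characters and the helper name describe are chosen here purely for illustration.

```python
# Code point vs. encoded bytes: an illustrative sketch.
import sys

samples = ['a', 'Ω', '♞', '😀']  # sample characters picked for illustration


def describe(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-be')  # big-endian variant, so no byte-order mark is added
    return ord(ch), len(utf8), len(utf16)


for ch in samples:
    cp, n8, n16 = describe(ch)
    print(f"{ch!r}: U+{cp:04X}  utf-8: {n8} byte(s)  utf-16: {n16} byte(s)")

# The interpreter exposes the largest code point it supports.
print(hex(sys.maxunicode))
```

The same code point takes a different number of bytes depending on the encoding, which is why a character set and an encoding are separate ideas.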

Examples of Unicode Characters

                                html        Python string
    --- ord value ---           special     unicode
    hex       dec       chr     character   character    description
    -------   -------   ---     ---------   ----------   ---------------------
    0x61      97        'a'     a           \u0061       LATIN SMALL LETTER A
    0x62      98        'b'     b           \u0062       LATIN SMALL LETTER B
    0x63      99        'c'     c           \u0063       LATIN SMALL LETTER C
    ...
    0x7b      123       '{'     {           \u007b       LEFT CURLY BRACKET
    ...
    0x2167    8551      'Ⅷ'     Ⅷ           \u2167       ROMAN NUMERAL EIGHT
    0x2168    8552      'Ⅸ'     Ⅸ           \u2168       ROMAN NUMERAL NINE
    ...
    0x265E    9822      '♞'     ♞           \u265e       BLACK CHESS KNIGHT
    0x265F    9823      '♟'     ♟           \u265f       BLACK CHESS PAWN
    ...
    0x1F600   128512    '😀'    😀          \U0001f600   GRINNING FACE
    0x1F609   128521    '😉'    😉          \U0001f609   WINKING FACE

    \u   16-bit unicode escape sequence (4 hex digits)
    \U   32-bit unicode escape sequence (8 hex digits)
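The columns of a table like this can be regenerated from Python itself. The sketch below is one possible way to do it; the helper name table_row is hypothetical, and the HTML column is assumed here to be a decimal numeric character reference (&#N;).

```python
import unicodedata


def table_row(ch):
    """Build one row: hex, decimal, character, HTML reference, Python escape, name."""
    cp = ord(ch)
    # \u takes exactly 4 hex digits, \U takes exactly 8.
    escape = f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"
    html = f"&#{cp};"  # assumption: decimal numeric character reference
    return (hex(cp), cp, ch, html, escape, unicodedata.name(ch))


for ch in ['a', '{', 'Ⅷ', '♞', '😀']:
    print(*table_row(ch), sep='  ')

# The escape sequences denote the same characters as the literals.
assert '\u265e' == '♞'
assert '\U0001f600' == '😀'
```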

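Shown are the UTF-8 byte formats and how many bits (x) each makes available for the code point value. The leading bits of the first byte give the length of the sequence; every continuation byte starts with 10.

    byte format                                 bits available
    -----------------------------------------   --------------
    0xxx_xxxx                                    7 bits
    110x_xxxx 10xx_xxxx                         11 bits
    1110_xxxx 10xx_xxxx 10xx_xxxx               16 bits
    1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx     21 bits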
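The byte formats can be turned into a small hand-written encoder. This is only a sketch for study, not a replacement for str.encode(); the function name utf8_encode is hypothetical, and the result is checked against Python's built-in encoder.

```python
def utf8_encode(code_point):
    """Encode a single code point to UTF-8 by hand, following the bit layouts."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        # Out-of-range values and surrogates cannot be encoded as UTF-8.
        raise ValueError("not an encodable code point")
    if code_point < 0x80:        # 7 bits:  0xxx_xxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110x_xxxx 10xx_xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110_xxxx 10xx_xxxx 10xx_xxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written encoder with the built-in one.
for ch in 'aΩ♞😀':
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
    print(repr(ch), utf8_encode(ord(ch)))
```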

Notes

  1. Unicode is a character set. UTF-8 is an encoding.

  2. In Python 3, every string (str) is a sequence of Unicode code points. UTF-8 is the default encoding for source files and for str.encode() and bytes.decode().

  3. The ord(x) function returns an integer representing the Unicode code point of the character x. The chr(i) function is its inverse.

  4. UTF-8 encode/decode examples:

      encode (bin() output is shown without its 0b prefix, zero-padded and split into bytes)

    • 'a'.encode('utf-8') = b'a'
      bin(int.from_bytes(b'a','big')) = 01100001

    • 'Ω'.encode('utf-8') = b'\xce\xa9'
      bin(int.from_bytes(b'\xce\xa9','big')) = 11001110 10101001

    • '♞'.encode('utf-8') = b'\xe2\x99\x9e'
      bin(int.from_bytes(b'\xe2\x99\x9e','big')) = 11100010 10011001 10011110

    • '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
      bin(int.from_bytes(b'\xf0\x9f\x98\x80','big')) = 11110000 10011111 10011000 10000000

      decode

    • '😉'.encode('utf-8') = b'\xf0\x9f\x98\x89'
      b'\xf0\x9f\x98\x89'.decode('utf-8') = '😉'
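
The notes above can be tried out with a short script. This is a minimal sketch; the helper name utf8_bits is hypothetical and formats each encoded byte as an 8-bit group, which avoids the 0b prefix and missing leading zeros of bin().

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit groups."""
    return ' '.join(f"{byte:08b}" for byte in ch.encode('utf-8'))


# ord() and chr() are inverses of each other.
assert ord('Ω') == 0x3A9
assert chr(ord('Ω')) == 'Ω'

# Encode each character, show its bits, and confirm that decoding round-trips.
for ch in 'aΩ♞😀':
    encoded = ch.encode('utf-8')
    round_trips = encoded.decode('utf-8') == ch
    print(repr(ch), encoded, utf8_bits(ch), round_trips)
```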

Links

Python 3 Unicode HOWTO

Unicode Home

HTML Unicode (UTF-8) Reference