Hex Dump UTF-8

Introduction

Unicode characters are identified by integers called code points. A code point is in the range 0 to 0x10FFFF (1,114,112 possible values). To store a character on byte-oriented media such as disk or memory, its code point is encoded as one or more bytes. UTF-8 is the most popular encoding and is the default encoding for Python 3 source files and for str.encode(). (Note: There are other encodings such as UTF-16.)

Examples of Unicode Characters

                          html       Python String
--- ord value ---         special    unicode
hex       dec      chr    character  character      description
-------   -------  ---    ---------  ----------     ---------------------
0x61      97       'a'    a          \u0061         LATIN SMALL LETTER A
0x62      98       'b'    b          \u0062         LATIN SMALL LETTER B
0x63      99       'c'    c          \u0063         LATIN SMALL LETTER C
...
0x7b      123      '{'    {          \u007b         LEFT CURLY BRACKET
...
0x2167    8551     'Ⅷ'    Ⅷ          \u2167         ROMAN NUMERAL EIGHT
0x2168    8552     'Ⅸ'    Ⅸ          \u2168         ROMAN NUMERAL NINE
...
0x265E    9822     '♞'    ♞          \u265e         BLACK CHESS KNIGHT
0x265F    9823     '♟'    ♟          \u265f         BLACK CHESS PAWN
...
0x1F600   128512   '😀'   😀         \U0001f600     GRINNING FACE
0x1F609   128521   '😉'   😉         \U0001f609     WINKING FACE

\u   16 bit unicode escape sequence (4 hex digits)
\U   32 bit unicode escape sequence (8 hex digits)
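Rows like the ones in the table can be produced from Python itself. The following is a minimal sketch using the standard unicodedata module; the helper name describe is made up for this illustration and is not part of any library.

```python
import unicodedata

def describe(ch):
    """Return (hex, decimal, escape, name) for a single character.

    'describe' is a hypothetical helper written for this example.
    """
    cp = ord(ch)
    # \u takes exactly 4 hex digits, \U takes exactly 8
    escape = f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"
    # Fall back to a placeholder for code points without a name
    name = unicodedata.name(ch, "<unnamed>")
    return hex(cp), cp, escape, name

for ch in "a{Ⅷ♞😀":
    print(describe(ch))
```

Characters above 0xFFFF need the 8-digit \U form because the 4-digit \u form cannot represent them.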

Shown are the UTF-8 byte formats and how many bits (marked x) each one makes available for the code point value.

1 byte    0xxx_xxxx                                    7 bits
2 bytes   110x_xxxx 10xx_xxxx                         11 bits
3 bytes   1110_xxxx 10xx_xxxx 10xx_xxxx               16 bits
4 bytes   1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx     21 bits
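To make the bit formats concrete, here is a rough sketch of an encoder that packs a code point into those formats by hand. The function name utf8_encode is an assumption for this example; in real code, str.encode('utf-8') does this for you.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes (illustrative helper, not a library function)."""
    # Surrogates (0xD800-0xDFFF) are not valid in UTF-8
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:          # fits in 7 bits: 0xxx_xxxx
        return bytes([cp])
    if cp < 0x800:         # fits in 11 bits: 110x_xxxx 10xx_xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # fits in 16 bits: 1110_xxxx 10xx_xxxx 10xx_xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 1111_0xxx 10xx_xxxx 10xx_xxxx 10xx_xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's built-in encoder
for ch in "aΩ♞😀":
    print(ch, utf8_encode(ord(ch)) == ch.encode("utf-8"))
```

Each continuation byte carries 6 bits of the code point (mask 0x3F) behind the fixed 10 prefix (0x80); the lead byte carries whatever bits remain.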

Notes

Unicode is a character set. UTF-8 is an encoding.

In Python 3, a str is a sequence of Unicode code points. UTF-8 comes into play only when a str is encoded to bytes, where it is the default encoding.

The ord(x) function returns the integer Unicode code point of the single character x. The chr(i) function is its inverse.
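A short illustration of these notes, assuming nothing beyond the built-in functions: one character is one code point in a str, but may be several bytes once encoded.

```python
ch = '😀'
cp = ord(ch)

print(cp, hex(cp))                      # 128512 0x1f600
print(chr(cp) == ch)                    # True: chr() reverses ord()
print(len(ch), len(ch.encode('utf-8'))) # 1 4: one code point, four UTF-8 bytes
```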

UTF-8 encode/decode examples (binary shown as one zero-padded 8-bit group per byte):
- 'a'.encode('utf-8') = b'a'
  ' '.join(f'{b:08b}' for b in b'a') = '01100001'
- 'Ω'.encode('utf-8') = b'\xce\xa9'
  ' '.join(f'{b:08b}' for b in b'\xce\xa9') = '11001110 10101001'
- '♞'.encode('utf-8') = b'\xe2\x99\x9e'
  ' '.join(f'{b:08b}' for b in b'\xe2\x99\x9e') = '11100010 10011001 10011110'
- '😀'.encode('utf-8') = b'\xf0\x9f\x98\x80'
  ' '.join(f'{b:08b}' for b in b'\xf0\x9f\x98\x80') = '11110000 10011111 10011000 10000000'
- '😉'.encode('utf-8') = b'\xf0\x9f\x98\x89'
  b'\xf0\x9f\x98\x89'.decode('utf-8') = '😉'
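The examples above can be gathered into a small hex dump routine. This is only a sketch: the function name hex_dump_utf8 and the output layout are choices made for this illustration.

```python
def hex_dump_utf8(text):
    """Return one line per character: code point, UTF-8 bytes in hex, then in binary.

    Hypothetical helper for illustration; the column layout is arbitrary.
    """
    lines = []
    for ch in text:
        encoded = ch.encode('utf-8')
        hex_part = ' '.join(f'{b:02x}' for b in encoded)
        bin_part = ' '.join(f'{b:08b}' for b in encoded)
        # Pad the hex column to 11 characters, the width of four hex bytes
        lines.append(f'U+{ord(ch):04X}  {hex_part:<11}  {bin_part}')
    return lines

for line in hex_dump_utf8('aΩ♞😀'):
    print(line)
```

Reading the binary column against the byte formats shows the prefix bits (0, 110, 1110, 11110, and 10 for continuation bytes) that mark how long each sequence is.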

Links

Python 3 Unicode HOWTO

Unicode Home

HTML Unicode (UTF-8) Reference