Introduction
Sometimes it is necessary to see the raw bytes that make up data.
A hex dump capability is very useful in these cases.
I have used hex dumps in the past, for example when examining Internet packet data.
To see the ASCII codes (Octal, Decimal, Hexadecimal) click HERE.
To see data and byte examples in Python click HERE.
Project #1
Note: Code this project yourself. Do not use any existing modules or packages.
Write a Python program to do a hex dump of the first N (100?) bytes of any file.
(Do this for a text file and an image file.)
The format should look something like...
file name: ./faces/abc.png
-offset- ----------------------hex---------------------- -----ascii------
00000000 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 .PNG........IHDR
00000016 00 00 04 38 00 00 02 7b 08 02 00 00 00 89 6e b6 ...8...{......n.
00000032 20 00 00 00 01 73 52 47 42 00 ae ce 1c e9 00 00 .....sRGB.......
00000048 00 04 67 41 4d 41 00 00 b1 8f 0b fc 61 05 00 00 ..gAMA......a...
Note: non-printable characters are replaced with '.'
in the ASCII section.
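One way to build a single row of the first format is sketched below. It is only an illustration, not a reference solution: the helper name format_row is hypothetical, offsets are printed in decimal because the sample output above appears to count that way, and bytes 32 through 126 are treated as printable (so a space byte prints as a space).

```python
def format_row(offset, chunk, width=16):
    """Return one hex dump row: offset, hex bytes, ASCII column."""
    hex_part = ' '.join(f'{b:02x}' for b in chunk)
    # pad a short final row so the ASCII column stays aligned
    hex_part = hex_part.ljust(width * 3 - 1)
    # assumption: only bytes 32..126 are shown as characters
    ascii_part = ''.join(chr(b) if 32 <= b <= 126 else '.' for b in chunk)
    return f'{offset:08d} {hex_part} {ascii_part}'

print(format_row(0, bytes.fromhex('89504e470d0a1a0a0000000d49484452')))
```

Calling it once per 16-byte chunk, with the offset advanced by the chunk length each time, should produce rows like the ones shown above.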
Another format might be
file name: ./faces/abc.png
-offset- ----------------------hex----------------------
00000000 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52
          .  P  N  G  .  .  .  .  .  .  .  .  I  H  D  R
00000016 00 00 04 38 00 00 02 7b 08 02 00 00 00 89 6e b6
          .  .  .  8  .  .  .  {  .  .  .  .  .  .  n  .
00000032 20 00 00 00 01 73 52 47 42 00 ae ce 1c e9 00 00
          .  .  .  .  .  s  R  G  B  .  .  .  .  .  .  .
00000048 00 04 67 41 4d 41 00 00 b1 8f 0b fc 61 05 00 00
          .  .  g  A  M  A  .  .  .  .  .  .  a  .  .  .
Note: I find the first format more useful than the second one.
Project #2
Note: Code this project yourself. Do not use any existing modules or packages.
"HEX DUMP" the last N (100?) bytes of a file.
For example:
import os
with open(file_name, mode='rb') as bfile:
    # start somewhere near the end of the file (end_of_file - 24)
    bfile.seek(-24,os.SEEK_END)
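Seeking backwards from the end can fail when the file is shorter than the requested number of bytes. The sketch below is one possible way to guard against that; the function name read_tail is hypothetical and the approach (measure the file, then clamp the start position at zero) is an assumption, not the only option.

```python
import os

def read_tail(file_name, n):
    """Return (start_offset, data) for the last n bytes of a file."""
    with open(file_name, mode='rb') as bfile:
        # seek() returns the new absolute position, i.e. the file size
        size = bfile.seek(0, os.SEEK_END)
        # never start before the beginning of the file
        start = max(size - n, 0)
        bfile.seek(start)
        return start, bfile.read()
```

The returned start offset can be used as the first value of the offset column, so the dump shows where in the file the tail begins.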
Project #3
Create a text file containing a mixture of one-byte (ASCII) and multi-byte Unicode characters.
Run the hex dump on this file. Can you locate the multi-byte characters
in the bytes?
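If you would rather generate the test file than type it, something like the following should work. The function name write_sample and the sample text are hypothetical. The 'utf-8-sig' codec is chosen only so the file starts with a byte order mark (ef bb bf); plain 'utf-8' omits it.

```python
def write_sample(path):
    """Write a small UTF-8 file mixing ASCII and multi-byte characters."""
    text = 'a ∀\n∞∑,'
    # newline='\n' keeps the line ending as a single 0a byte on every platform
    with open(path, 'w', encoding='utf-8-sig', newline='\n') as tfile:
        tfile.write(text)
    # read the file back in binary mode to see the raw bytes
    with open(path, 'rb') as bfile:
        return bfile.read()
```

Printing write_sample('some_file.txt').hex(' ') shows the ASCII characters as single bytes and each mathematical symbol as three bytes beginning with e2.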
Project #4
Create a dump program that recognizes
UTF-8 characters and displays the character and the bytes that make up the character.
file name: x_utf.txt
-offset- ------------------------------------hex--------------------------------------
00000000 --efbbbf ------61 ------20 --e28880 ------0a --e2889e --e28891 ------2c
              (.)      (a)      ( )      (∀)      (.)      (∞)      (∑)      (,)
00000008 ... Next eight UTF-8 Characters ...
00000016 ... Next eight UTF-8 Characters ...
Note: UTF-8 is capable of encoding all 1,112,064 valid character
code points in Unicode using one to four (8-bit) bytes.
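The first byte of each UTF-8 character announces how many bytes the character occupies, which is what makes grouping possible. Below is a minimal sketch of that idea. The names utf8_length and split_utf8 are hypothetical, and the code does not verify that continuation bytes are well formed; stray bytes are simply emitted one at a time.

```python
def utf8_length(lead):
    """Return the sequence length announced by a lead byte (0 if not a lead byte)."""
    if lead < 0x80:            # 0xxxxxxx: one byte (ASCII)
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two bytes
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three bytes
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four bytes
        return 4
    return 0                   # 10xxxxxx continuation byte or invalid lead

def split_utf8(data):
    """Yield the group of bytes that makes up each character in data."""
    pos = 0
    while pos < len(data):
        n = utf8_length(data[pos]) or 1   # step over stray bytes one at a time
        yield data[pos:pos + n]
        pos += n

sample = bytes.fromhex('efbbbf6120e288800a')
for group in split_utf8(sample):
    # pad the hex with '-' to eight columns, as in the sample dump
    print(f'{group.hex():->8}', repr(group.decode('utf-8', errors='replace')))
```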
For a UTF-8 code example click HERE
Input UTF-8 Characters Using the Console
Useful Code Snippets
# --------------------------------------------------------------------
# ---- read chunks of bytes from the file
# --------------------------------------------------------------------
import os
READ_BYTE_CHUNK = 16
with open(file_name, mode='rb') as bfile:
    ## ---- start 24 bytes before the end-of-file
    ##bfile.seek(-24,os.SEEK_END)
    while True:
        byts = bfile.read(READ_BYTE_CHUNK)
        if len(byts) == 0: break
        print(f'chunk length = {len(byts)}')
# --------------------------------------------------------------------
# ---- print each byte in a chunk
# --------------------------------------------------------------------
for byt in byts:
    if byt < 32 or byt > 126:
        bchr = '.'
    else:
        bchr = chr(byt)
    print(f'{byt:02x} {bchr}')
#!/usr/bin/python3
# ====================================================================
# switch back and forth between integers and characters
# switch using ord, chr, bin, and bytes
# --------------------------------------------------------------------
# Note: The range of the return value of the ord() function
# is from 0 to 1,114,111 (0x10FFFF).
# ====================================================================
## test characters
## chr ord
## --- ------
## ( ) 0x20
## (:) 0x3a
## (a) 0x61
## (b) 0x62
## (c) 0x63
## (→) 0x2192
## (∞) 0x221e
## (∑) 0x2211
## (∏) 0x220f
## (∀) 0x2200
## (∈) 0x2208
c = '€' # chr
x = 0x20ac # ord
print(f'ord = {ord(c)} 0x{ord(c):0x}')
print(f'chr = {chr(x)}')
print(f'bin = {bin(x)}')
## FYI:
##print(f'bin = {bin(c)}')
##TypeError: 'str' object cannot be interpreted as an integer
print(f'bin = {bin(ord(c))}')
# ---- convert a character to a bit string
# ---- skip the first two characters in the bit string
b = bin(ord(c))[2:]
print(f'bits = {b} type={type(b)}')
print()
print('-- UTF-8 encoded -------------------------------')
# ---- The bytes() method returns an immutable bytes object
arr = bytes(c,"utf-8")
print(f'bytes = {arr} type={type(arr)}')
print(f'bytes = ',end='')
for b in arr:
    print(f'{b:02x}({bin(b)}) ',end='')
print()
What is the relationship between the ord of a character and its
utf-8 encoding? For example:
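00100010 00000000             bin of ord('∀') = 0x2200
11100010 10001000 10000000    bin of utf-8 encoded '∀'
    0010   001000   000000    payload bits, which together give 0x2200
(UTF-8 adds marker bits when encoding a character: here the first byte
starts with 1110 to announce a three-byte sequence, and each following
byte starts with 10.)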
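The bit layout above can be reproduced with shifts and masks. The following is a rough sketch for experimenting, with a hypothetical function name encode_utf8. It assumes the code point is in the range 0 to 0x10FFFF and does not reject surrogates, so Python's own str.encode remains the authority.

```python
def encode_utf8(cp):
    """Encode one code point by hand (assumes 0 <= cp <= 0x10FFFF)."""
    if cp < 0x80:                              # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                             # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                           # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),           # 11110xxx + three 10xxxxxx bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in 'a∀€':
    manual = encode_utf8(ord(ch))
    # compare the hand-built bytes with Python's built-in encoder
    print(ch, manual.hex(), manual == ch.encode('utf-8'))
```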
Create Hex Dump Test Files
#!/usr/bin/python3
# ====================================================================
# create a test file for hex dump
# ====================================================================
import random
import user_interface as ui
# --------------------------------------------------------------------
# ---- create test file
# --------------------------------------------------------------------
def create_file(file_name,size):
    with open(file_name,'wb') as bfile:
        for _ in range(size):
            i = random.randint(0,255)
            b = i.to_bytes(1,'big')
            bfile.write(b)
    print(f'{size} random bytes written to file')
# --------------------------------------------------------------------
# ---- main
# --------------------------------------------------------------------
while True:
    print()
    file_name = ui.get_user_input('Enter a file name: ')
    if not file_name: break
    print()
    file_size_str = ui.get_user_input('Enter a file size: ')
    if not file_size_str: break
    tf,file_size = ui.is_int(file_size_str)
    if tf and file_size > 0:
        create_file(file_name,file_size)
    else:
        print()
        print(f'error: bad file size entered ({file_size_str})')
        break
Miscellaneous
Network byte order is big-endian: the most significant byte of a multi-byte value is
transmitted first (and stored at the lowest memory address). Agreeing on a single order lets
computers with different native byte orders exchange data without ambiguity.
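As a small illustration of the difference, the snippet below lays out the same 32-bit value in both byte orders using int.to_bytes and int.from_bytes. The value 0x12345678 is just an arbitrary example.

```python
value = 0x12345678

big = value.to_bytes(4, 'big')        # network byte order
little = value.to_bytes(4, 'little')  # the reverse layout

print(big.hex(' '))      # 12 34 56 78
print(little.hex(' '))   # 78 56 34 12

# reading the bytes back requires knowing which order was used
print(int.from_bytes(big, 'big') == value)
print(int.from_bytes(big, 'little') == value)
```

The last line prints False, which is the kind of mismatch a hex dump makes easy to spot.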
Endianness (Wikipedia)
Byte and Bit Order Dissection (Linux Journal)
Convert a Byte to Bits Demo
Convert UTF-8 Character Encoding to Character Code Point (ORD) Value
Convert UTF-8 Character Encoding to Character Code Point (ORD) Value (A better way?)
Some Unicode Notes and Stuff