Hodgepodge

Introduction

Sometimes it is necessary to see the raw bytes that make up data. A hex dump capability is very useful in these cases.

I have used hex dumps in the past, for example when examining Internet packet data.

To see the ASCII codes (Octal, Decimal, Hexadecimal) click HERE.

To see data and byte examples in Python click HERE

Project #1

Note: Code this project yourself. Do not use any existing modules or packages.

Write a Python program to do a hex dump of the first N (100?) bytes of any file. (Do this for a text file and an image file.)

The format should look something like...

file name: ./faces/abc.png
-offset- ----------------------hex---------------------- -----ascii------
00000000 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 .PNG........IHDR
00000016 00 00 04 38 00 00 02 7b 08 02 00 00 00 89 6e b6 ...8...{......n.
00000032 20 00 00 00 01 73 52 47 42 00 ae ce 1c e9 00 00 .....sRGB.......
00000048 00 04 67 41 4d 41 00 00 b1 8f 0b fc 61 05 00 00 ..gAMA......a...

Note: non-printable characters are replaced with '.' in the ASCII section.
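As a starting point only, here is a minimal sketch of how one row of the first format could be built. The function name `format_row` and the constant `BYTES_PER_ROW` are made up for this illustration. The printable range (32 to 126) is the one used in the code snippets later on this page, and the offset is printed in decimal because the sample offsets above step by 16.

```python
BYTES_PER_ROW = 16  # assumed row width, taken from the sample output

def format_row(offset, chunk):
    """Return one dump line: decimal offset, hex bytes, ASCII column."""
    hex_part = ' '.join(f'{b:02x}' for b in chunk)
    # pad a short final row so the ASCII column stays aligned
    hex_part = hex_part.ljust(BYTES_PER_ROW * 3 - 1)
    # bytes outside 32..126 are shown as '.'
    ascii_part = ''.join(chr(b) if 32 <= b <= 126 else '.' for b in chunk)
    return f'{offset:08d} {hex_part} {ascii_part}'

# the first 16 bytes of the sample PNG file shown above
png_header = bytes.fromhex('89504e470d0a1a0a0000000d49484452')
print(format_row(0, png_header))
```

The sketch uses no imports, in keeping with the project rule. Reading the file in chunks and printing the header lines is left to you.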

Another format might be

file name: ./faces/abc.png
-offset- ----------------------hex----------------------
00000000 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52
         .  P  N  G  .  .  .  .  .  .  .  .  I  H  D  R
00000016 00 00 04 38 00 00 02 7b 08 02 00 00 00 89 6e b6
         .  .  .  8  .  .  .  {  .  .  .  .  .  .  n  .
00000032 20 00 00 00 01 73 52 47 42 00 ae ce 1c e9 00 00
         .  .  .  .  .  s  R  G  B  .  .  .  .  .  .  .
00000048 00 04 67 41 4d 41 00 00 b1 8f 0b fc 61 05 00 00
         .  .  g  A  M  A  .  .  .  .  .  .  a  .  .  .

Note: I find the first format more useful than the second one.

Project #2

Note: Code this project yourself. Do not use any existing modules or packages.

"HEX DUMP" the last N (100?) bytes of a file. For example:

import os
with open(file_name, mode='rb') as bfile:
    # start somewhere near the end of the file (end_of_file - 24)
    bfile.seek(-24,os.SEEK_END)
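One thing to watch for: seeking backwards from the end can fail when the file is shorter than the requested number of bytes. The sketch below is one possible way to guard against that. It finds the file size first and clamps the start position to zero. The function name `read_tail` is hypothetical, and returning the start offset is a choice made here so the dump can print real file offsets.

```python
import os

def read_tail(file_name, n):
    """Return (start_offset, data) for up to the last n bytes of a file."""
    with open(file_name, mode='rb') as bfile:
        bfile.seek(0, os.SEEK_END)   # jump to end-of-file
        size = bfile.tell()          # position at the end == file size
        start = max(0, size - n)     # never seek before the first byte
        bfile.seek(start)            # absolute seek from the start
        return start, bfile.read(n)
```

For example, `start, data = read_tail(file_name, 100)` gives the bytes to dump and the offset to print for the first row.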

Project #3

Create a text file containing a mixture of one byte Unicode (ASCII) and multi-byte Unicode characters. Run the hex dump on this file. Can you locate the Unicode in the bytes?
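A small sketch of one way to build such a test file and get a first look at its bytes. The function names are invented for this example. It relies on one fact about UTF-8: every byte that belongs to a multi-byte character has its high bit set (0x80 or above), while ASCII bytes do not.

```python
def write_text_file(file_name, text):
    """Write text to a file encoded as UTF-8."""
    # newline='\n' stops Windows from turning each 0x0a into 0x0d 0x0a
    with open(file_name, mode='w', encoding='utf-8', newline='\n') as tfile:
        tfile.write(text)

def hex_with_markers(file_name):
    """Return the file's bytes in hex; '*' marks bytes >= 0x80 (non-ASCII)."""
    with open(file_name, mode='rb') as bfile:
        raw = bfile.read()
    return ' '.join(f'{b:02x}' + ('*' if b >= 0x80 else '') for b in raw)
```

Writing a short string such as 'a ∀ €' followed by a newline, then printing `hex_with_markers(...)` for that file, should show two runs of three marked bytes, one run for each of the two non-ASCII characters.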

Project #4

Create a dump program that recognizes UTF-8 characters and displays the character and the bytes that make up the character.

file name: x_utf.txt
-offset- ------------------------------------hex--------------------------------------
00000000 --efbbbf ------61 ------20 --e28880 ------0a --e2889e --e28891 ------2c
              (.)      (a)      ( )      (∀)      (.)      (∞)      (∑)      (,)
00000008 ... Next eight UTF-8 Characters ...
00000016 ... Next eight UTF-8 Characters ...

Note: UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four (8-bit) bytes.
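The first byte of each UTF-8 sequence tells you how many bytes the character uses, which is what a UTF-8 aware dump needs in order to group bytes. Below is a rough sketch of that idea. The names `utf8_length` and `split_utf8` are made up here, and the code assumes well-formed input (it does not reject overlong or out-of-range sequences).

```python
def utf8_length(lead):
    """Return the sequence length implied by a UTF-8 lead byte (0 if none)."""
    if lead < 0x80:            # 0xxxxxxx: one byte (ASCII)
        return 1
    if 0xc0 <= lead < 0xe0:    # 110xxxxx: two bytes
        return 2
    if 0xe0 <= lead < 0xf0:    # 1110xxxx: three bytes
        return 3
    if 0xf0 <= lead < 0xf8:    # 11110xxx: four bytes
        return 4
    return 0                   # 10xxxxxx continuation byte, or not a lead byte

def split_utf8(data):
    """Split bytes into one bytes object per character."""
    groups = []
    i = 0
    while i < len(data):
        n = utf8_length(data[i]) or 1   # an unexpected byte stands alone
        groups.append(data[i:i + n])
        i += n
    return groups

# the eight characters from the sample row above
row = bytes.fromhex('efbbbf6120e288800ae2889ee288912c')
for group in split_utf8(row):
    print(group.hex().rjust(8, '-'))
```

The first group, ef bb bf, is the UTF-8 byte order mark that some editors write at the start of a file.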

For a UTF-8 code example click HERE

Input UTF-8 Characters Using the Console

Useful Code Snippets

# --------------------------------------------------------------------
# ---- read chunks of bytes from the file
# --------------------------------------------------------------------
import os
READ_BYTE_CHUNK = 16
with open(file_name, mode='rb') as bfile:
    ## ---- start 24 bytes before the end-of-file
    ##bfile.seek(-24,os.SEEK_END)
    while True:
        byts = bfile.read(READ_BYTE_CHUNK)
        if len(byts) == 0:
            break
        print(f'chunk length = {len(byts)}')

# --------------------------------------------------------------------
# ---- print each byte in a chunk
# --------------------------------------------------------------------
for byt in byts:
    # ---- bytes outside the printable ASCII range are shown as '.'
    if byt < 32 or byt > 126:
        bchr = '.'
    else:
        bchr = chr(byt)
    print(f'{byt:02x} {bchr}')

#!/usr/bin/python3
# ====================================================================
# switch back and forth between integers and characters
# switch using ord, chr, bin, and bytes
# --------------------------------------------------------------------
# Note: The range of the return value of the ord() function
#       is from 0 to 1,114,111 (0x10FFFF).
# ====================================================================

## test characters
## chr   ord
## ---   ------
## ( )   0x20
## (:)   0x3a
## (a)   0x61
## (b)   0x62
## (c)   0x63
## (→)   0x2192
## (∞)   0x221e
## (∑)   0x2211
## (∏)   0x220f
## (∀)   0x2200
## (∈)   0x2208

c = '€'       # chr
x = 0x20ac    # ord

print(f'ord = {ord(c)} 0x{ord(c):0x}')
print(f'chr = {chr(x)}')
print(f'bin = {bin(x)}')

## FYI:
##print(f'bin = {bin(c)}')
##TypeError: 'str' object cannot be interpreted as an integer

print(f'bin = {bin(ord(c))}')

# ---- convert a character to a bit string
# ---- skip the first two characters in the bit string

b = bin(ord(c))[2:]
print(f'bits = {b} type={type(b)}')

print()
print('-- UTF-8 encoded -------------------------------')

# ---- The bytes() method returns an immutable bytes object

arr = bytes(c,"utf-8")
print(f'bytes = {arr} type={type(arr)}')

print(f'bytes = ',end='')
for b in arr:
    print(f'{b:02x}({bin(b)}) ',end='')
print()

What is the relationship between the ord of a character and its utf-8 encoding? For example:

00100010 00000000            bin of ord('∀') = 0x2200
11100010 10001000 10000000   bin of utf-8 encoded '∀'
    0010   001000   000000   0x2200 (see comment below)

(UTF-8 adds extra bits when encoding a character)
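The same bit-stripping can be written as code. This is a sketch rather than a full decoder: the function name `utf8_to_code_point` is invented for the example, and it assumes it is handed exactly one well-formed sequence of one to four bytes. It masks off the marker bits in the first byte, then appends the low six bits of each following byte.

```python
def utf8_to_code_point(seq):
    """Rebuild the code point (ord value) from one UTF-8 byte sequence."""
    lead = seq[0]
    if lead < 0x80:
        return lead                    # 0xxxxxxx: the byte is the code point
    if lead >= 0xf0:
        cp = lead & 0x07               # 11110xxx keeps 3 payload bits
    elif lead >= 0xe0:
        cp = lead & 0x0f               # 1110xxxx keeps 4 payload bits
    else:
        cp = lead & 0x1f               # 110xxxxx keeps 5 payload bits
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3f)    # 10xxxxxx adds 6 payload bits
    return cp

encoded = bytes.fromhex('e28880')      # utf-8 encoding of '∀'
print(hex(utf8_to_code_point(encoded)))
```

For the three bytes e2 88 80 this keeps 0010, 001000, and 000000, which joined together are the 16 bits of 0x2200.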

Create Hex Dump Test Files

#!/usr/bin/python3
# ====================================================================
# create a test file for hex dump
# ====================================================================

import random
import user_interface as ui

# --------------------------------------------------------------------
# ---- create test file
# --------------------------------------------------------------------

def create_file(file_name,size):
    with open(file_name,'wb') as bfile:
        for _ in range(size):
            i = random.randint(0,255)
            b = i.to_bytes(1,'big')
            bfile.write(b)
    print(f'{size} random bytes written to file')

# --------------------------------------------------------------------
# ---- main
# --------------------------------------------------------------------

while True:

    print()
    file_name = ui.get_user_input('Enter a file name: ')
    if not file_name:
        break

    print()
    file_size_str = ui.get_user_input('Enter a file size: ')
    if not file_size_str:
        break

    tf,file_size = ui.is_int(file_size_str)

    if tf and file_size > 0:
        create_file(file_name,file_size)
    else:
        print()
        print(f'error: bad file size entered ({file_size_str})')
        break

Miscellaneous

Network byte order is big-endian: the most significant byte of a multi-byte value is sent first (and stored at the lowest memory address). Agreeing on one byte order lets computers with different native byte orders exchange data and interpret it the same way.
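A short illustration of the difference, using the same `to_bytes` method that appears in the test file script above. The value 0x12345678 is just an arbitrary example.

```python
value = 0x12345678

big = value.to_bytes(4, 'big')         # network byte order
little = value.to_bytes(4, 'little')   # least significant byte first

print(' '.join(f'{b:02x}' for b in big))      # 12 34 56 78
print(' '.join(f'{b:02x}' for b in little))   # 78 56 34 12

# reading the bytes back only works if the same order is named
print(hex(int.from_bytes(big, 'big')))        # 0x12345678
print(hex(int.from_bytes(big, 'little')))     # 0x78563412
```

In a hex dump of packet data, multi-byte fields such as lengths and port numbers therefore read left to right in the same order you would write the number.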

Endianness (wikipedia)

Byte and Bit Order Dissection (LINUX Journal)

Convert a Byte to Bits Demo

Convert UTF-8 Character Encoding to Character Code Point (ORD) Value

Convert UTF-8 Character Encoding to Character Code Point (ORD) Value (A better way?)

Some Unicode Notes and Stuff