Recognize Postal (Snail Mail) Addresses

Introduction

Let's assume you are a programmer working for the IT department of the California Department of Transportation (CalTrans). After surveying drivers, you collected 10,000 home-address records and entered them into a computer file exactly as the drivers wrote them. The addresses come in many different formats.

Also assume that CalTrans has a program that can analyze the addresses and extract trip information, but it requires addresses to be in a canonical form. (Trip information is used to plan future transportation needs and improvements.)

Definitions

canonical form - when a value can be described or represented in multiple ways, the one way chosen as the favored (preferred) form
grammar - a set of rules for writing valid statements
parse - to divide into grammatical parts and identify the parts and their relations to each other
semantics - the meanings of words and phrases
syntax - the spelling and grammatical structure

Project #1

Your task is to write a program to parse the driver addresses and output them in canonical form.

You get to define the output address (canonical form) and the input test addresses (many forms).

Use the re (regular expression) module?

Step 1 - Define the Canonical Address

The canonical form of the address might look like the following with each piece on its own line. Use this or design your own.

LastName:
FirstName:
MiddleInitial:
AddressNumber:
StreetName:
StreetDirection: <-- S, E, NW
StreetType: <-- ST, PL, AVE, ...
StreetNumber:
Apartment:
City:
State: <-- CA, NY, ...
Zip Code: <-- 99999 or 99999-9999
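
One way to hold these fields inside a program is a small record type. The sketch below assumes Python's dataclasses module. The class name CanonicalAddress, the snake_case attribute names, and the to_text method are illustrative choices, not project requirements. Labels are derived from the attribute names, so "Zip Code" prints as "ZipCode".

```python
from dataclasses import dataclass, fields


@dataclass
class CanonicalAddress:
    """Hypothetical record type: one attribute per canonical field."""
    last_name: str = ''
    first_name: str = ''
    middle_initial: str = ''
    address_number: str = ''
    street_name: str = ''
    street_direction: str = ''   # S, E, NW, ...
    street_type: str = ''        # ST, PL, AVE, ...
    street_number: str = ''
    apartment: str = ''
    city: str = ''
    state: str = ''              # CA, NY, ...
    zip_code: str = ''           # 99999 or 99999-9999

    def to_text(self):
        """Return the address with each piece on its own line."""
        lines = []
        for f in fields(self):
            # convert an attribute name such as last_name to LastName
            label = ''.join(part.capitalize() for part in f.name.split('_'))
            lines.append(f'{label}: {getattr(self, f.name)}')
        return '\n'.join(lines)


# ---- made-up example address (not real survey data)
addr = CanonicalAddress(last_name='SMITH', first_name='PAT',
                        address_number='123', street_name='MAIN',
                        street_type='ST', city='SACRAMENTO',
                        state='CA', zip_code='95814')
print(addr.to_text())
```

Fields that were not found in the input stay as empty strings, so every output record has the same set of lines.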

Step 2 - Collect as Many Forms of Driver Addresses as You Can

This will be your test data.

Note: They should follow the guidelines of the United States Postal Service (USPS). Look them up.

name street address city state zip code
name street address apartment city state zip code
name street address apartment city state, zip code
name street address, apartment city, state, zip code

USPS Postal Addressing Standards

Step 3 - Write the Program

To simplify things, write the program to accept only a few of the driver address formats.
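
As a starting point, here is a minimal sketch that recognizes a single input format with the re module. The format it accepts (comma-separated name, street, and city, followed by state and zip code), the short lists of street types and directions, and the names ADDRESS_RE and parse_address are all assumptions made for illustration. Your own test data will need different or additional patterns.

```python
import re

# Assumed input format (hypothetical; adjust to your own test data):
#   First [M] Last, Number [Dir] Street Type [Apt N], City, ST 99999[-9999]
ADDRESS_RE = re.compile(r"""
    ^\s*
    (?P<first>[A-Za-z]+)\s+
    (?:(?P<middle>[A-Za-z])\.?\s+)?          # optional middle initial
    (?P<last>[A-Za-z'-]+)\s*,\s*
    (?P<number>\d+)\s+
    (?:(?P<direction>N|S|E|W|NE|NW|SE|SW)\s+)?
    (?P<street>[A-Za-z0-9 ]+?)\s+
    (?P<type>ST|AVE|PL|RD|BLVD|DR|LN|CT|WAY)\.?
    (?:\s*,?\s*(?:APT|UNIT|\#)\s*(?P<apartment>\w+))?
    \s*,\s*
    (?P<city>[A-Za-z ]+?)\s*,?\s+
    (?P<state>[A-Za-z]{2})\s*,?\s*
    (?P<zip>\d{5}(?:-\d{4})?)
    \s*$
""", re.VERBOSE | re.IGNORECASE)


def parse_address(text):
    """Return a dict of upper-cased address pieces, or None if not recognized."""
    m = ADDRESS_RE.match(text)
    if not m:
        return None
    # optional pieces that were not present become empty strings
    return {name: (value or '').upper().strip()
            for name, value in m.groupdict().items()}


# ---- made-up example addresses (not real survey data)
result = parse_address('Pat Q Smith, 123 N Main St, Sacramento, CA 95814')
print(result)
print(parse_address('somewhere near the river'))
```

Keeping each accepted format in its own compiled pattern, and trying the patterns one after another, lets you add formats gradually without rewriting the ones that already work.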

Output your canonical addresses. (to a file?)
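
If you choose file output, one possible approach is the csv module, with one address per row. In the sketch below the column names in FIELDS and the function name write_addresses are hypothetical. The demonstration writes to an in-memory buffer; for a real file you would pass the object returned by open with mode 'w' and newline=''.

```python
import csv
import io

# Hypothetical column names; use the fields of your own canonical form.
FIELDS = ['last', 'first', 'number', 'street', 'type', 'city', 'state', 'zip']


def write_addresses(records, stream):
    """Write a header row, then one row per canonical address dict."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# ---- made-up example record (not real survey data)
records = [
    {'last': 'SMITH', 'first': 'PAT', 'number': '123', 'street': 'MAIN',
     'type': 'ST', 'city': 'SACRAMENTO', 'state': 'CA', 'zip': '95814'},
]

buffer = io.StringIO()          # stands in for an open output file
write_addresses(records, buffer)
output_text = buffer.getvalue()
print(output_text)
```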

Collect and output statistics on how many addresses your program was able to recognize and convert to your canonical form. (Also, how many you could not.)
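
The statistics can be gathered while the addresses are being converted. The sketch below is one possible arrangement, not a required design. The function convert_all takes any parser that returns None for an unrecognized address. The recognize function is a deliberately simple stand-in (it only checks for a "state zip" ending) so the example runs on its own.

```python
import re

# Stand-in recognizer (hypothetical): accepts only text ending in ", ST 99999".
ZIP_END_RE = re.compile(r',\s*[A-Za-z]{2}\s+\d{5}(?:-\d{4})?\s*$')


def recognize(text):
    """Return a cleaned-up string, or None if the ending is not recognized."""
    return text.strip().upper() if ZIP_END_RE.search(text) else None


def convert_all(raw_addresses, parser):
    """Split the input into converted results and rejected originals."""
    converted, rejected = [], []
    for raw in raw_addresses:
        result = parser(raw)
        if result is None:
            rejected.append(raw)
        else:
            converted.append(result)
    return converted, rejected


# ---- made-up example addresses (not real survey data)
sample = [
    'Pat Smith, 123 Main St, Sacramento, CA 95814',
    'Lee Wong, 45 Oak Ave, Los Angeles, CA 90001-1234',
    'somewhere near the river',
]

converted, rejected = convert_all(sample, recognize)
total = len(sample)
print(f'Recognized:     {len(converted)} of {total}')
print(f'Not recognized: {len(rejected)} of {total}')
```

Keeping the rejected addresses, rather than only counting them, makes it easy to review them later and decide which format to support next.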

Note: I participated in a project that was similar to this one. I used a program that had its own special language to describe addresses. These addresses were converted to a canonical form and used by another program to geocode them.

Links

Is there a library for parsing US addresses?
How to Parse Addresses using Python and Google GeoCoding API
address-parser 1.0.0
deepparse
U.S. address parser
Data Science Tools - How to Parse Addresses in Python (YouTube)

Code Example (parse phone numbers)

#!/usr/bin/python3
# parse phone numbers; convert to canonical form

import re
import user_interface as ui

# ---- compile phone number regular expressions -----------

rex = []
rex.append(re.compile(r'^\s*(\d\d\d)-(\d\d\d\d)\s*$'))
rex.append(re.compile(r'^\s*(\d\d\d)(\d\d\d\d)\s*$'))

# ---- parse a phone number string ------------------------

def parse_phone_number(s):

    # ---- is there a string to parse?
    if len(s) < 1:
        return ''

    # ---- match any of the phone number regex patterns?
    for p in rex:
        m = p.match(s)
        if m:
            return f'{m.group(1)}-{m.group(2)}'

    return ''

# ---- main -----------------------------------------------

while True:  # loop
    ui.clear_screen()
    print()
    s = ui.get_user_input('Enter phone number: ')
    if not s:  # empty string?
        break
    pn = parse_phone_number(s)
    print()
    if not pn:  # empty string?
        print('Not a recognized phone number')
    else:
        print(f'Phone number is {pn}')
    ui.pause()