Let's assume you are a programmer working for the IT department of the California Department of Transportation (CalTrans). After surveying drivers, you collected 10,000 records of home addresses and entered them into a computer file just as the drivers wrote them, so they appear in many different formats.
Also assume that CalTrans has a program that can analyze the addresses and extract trip information, but it requires addresses to be in a canonical form. (Trip information is used to plan future transportation needs and improvements.)
Definitions
canonical form
When a value can be described or represented in multiple ways,
the canonical form is the one way chosen as the favored (preferred) form.
grammar
a set of rules about how to write statements that are valid
parse
to divide into grammatical parts and identify the parts and their
relations to each other
semantics
the meanings of words and phrases
syntax
the spelling and grammatical structure
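To make the idea of a canonical form concrete, here is a minimal sketch in Python. It assumes that street suffixes are one of the pieces you want to standardize. The table `SUFFIXES` and the function `canonical_suffix` are hypothetical names made up for this illustration, and the table covers only a handful of spellings.

```python
# Hypothetical lookup table: several spellings of a street suffix map to
# one preferred (canonical) spelling. Only a few suffixes are listed.
SUFFIXES = {
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE", "av": "AVE",
    "boulevard": "BLVD", "blvd": "BLVD",
    "road": "RD", "rd": "RD",
}


def canonical_suffix(word):
    """Return the canonical suffix, or the upper-cased word if it is unknown."""
    cleaned = word.strip().rstrip(".")   # drop spaces and a trailing period
    return SUFFIXES.get(cleaned.lower(), cleaned.upper())
```

With this table, "Street", "street", and "St." are three representations of the same value, and "ST" is the one chosen as the preferred form.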
Your task is to write a program to parse the driver addresses and output them in canonical form.
You get to define the output address (canonical form) and the input test addresses (many forms).
Use the re (regular expression) module?
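As a starting point, here is a rough sketch of what parsing with the `re` module could look like. It assumes a single input format, "number street, city, state zip", which is only one of the many forms drivers might write. The names `ADDRESS` and `parse_address` are made up for this example.

```python
import re

# Assumed input format: "<number> <street>, <city>, <state> <zip>"
# re.VERBOSE lets the pattern span several lines with comments.
ADDRESS = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+                  # house number
    (?P<street>[A-Za-z0-9 .]+?)\s*,\s*  # street name, up to the first comma
    (?P<city>[A-Za-z .]+?)\s*,\s*       # city, up to the second comma
    (?P<state>[A-Za-z]{2})\s+           # two-letter state code
    (?P<zip>\d{5}(?:-\d{4})?)           # ZIP or ZIP+4
    \s*$""",
    re.VERBOSE,
)


def parse_address(text):
    """Return a dict of address parts, or None if the text does not match."""
    match = ADDRESS.match(text)
    return match.groupdict() if match else None
```

The named groups identify the grammatical parts of the address, which is the "parse" step from the definitions above. An address in any other format simply returns `None`.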
The canonical form of the address might look like the following with each piece on its own line. Use this or design your own.
This will be your test data.
Note: The canonical addresses should follow the guidelines published by the United States Postal Service (USPS). Look them up.
USPS Postal Addressing Standards
To simplify things, write the program to accept only a few of the driver address formats.
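One possible way to accept "only a few" formats is to keep a short list of patterns and try each in turn. The sketch below assumes two input formats and a canonical form of five upper-cased pieces, each on its own line. Both choices are assumptions for illustration, and `PATTERNS` and `to_canonical` are hypothetical names.

```python
import re

# Two assumed input formats; real survey data would need more patterns.
PATTERNS = [
    # "1120 N Street, Sacramento, CA 95814"
    re.compile(
        r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*"
        r"(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\s*$"
    ),
    # "1120 N Street, Sacramento CA 95814" (no comma before the state)
    re.compile(
        r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,\s*"
        r"(?P<city>[^,]+?)\s+(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\s*$"
    ),
]


def to_canonical(text):
    """Return the address in the assumed canonical form, or None."""
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            pieces = [
                match["number"],
                match["street"].upper(),
                match["city"].upper(),
                match["state"].upper(),
                match["zip"],
            ]
            return "\n".join(pieces)   # each piece on its own line
    return None                        # no pattern recognized the address
```

Supporting another format then means adding one more pattern to the list, without touching the rest of the program.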
Output your canonical addresses (to a file?).
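If you do write to a file, a small helper like the one below may be enough. It assumes each canonical address is a multi-line string and separates addresses with a blank line. That layout is a design choice for this sketch, and `write_addresses` is a hypothetical name.

```python
def write_addresses(addresses, out):
    """Write each canonical address to the open file object `out`,
    followed by a blank line so multi-line addresses stay separate."""
    for address in addresses:
        out.write(address)
        out.write("\n\n")
```

A typical call would be `with open("canonical.txt", "w", encoding="utf-8") as out: write_addresses(results, out)`, where the file name and `results` are placeholders. Passing in the file object also lets you write to `sys.stdout` while testing.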
Collect and output statistics on how many addresses your program was able to recognize and convert to your canonical form. (Also, how many you could not.)
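The statistics can be gathered while converting. Here is a minimal sketch that assumes you already have a conversion function returning `None` for addresses it cannot recognize. The names `convert_all` and `convert` are hypothetical.

```python
from collections import Counter


def convert_all(addresses, convert):
    """Apply convert() to each raw address and tally the outcomes.

    `convert` is assumed to return the canonical address, or None
    when the address is not recognized.
    """
    stats = Counter()
    converted = []
    for raw in addresses:
        result = convert(raw)
        if result is None:
            stats["unrecognized"] += 1
        else:
            stats["recognized"] += 1
            converted.append(result)
    return converted, stats
```

After the run, `stats["recognized"]` and `stats["unrecognized"]` give the two counts the assignment asks for, and their sum should equal the number of input records.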
Note: I participated in a project that was similar to this one. I used a program that had its own special language to describe addresses. These addresses were converted to a canonical form and used by another program to geocode them.
Is there a library for parsing US addresses?
How to Parse Addresses using Python and Google GeoCoding API
address-parser 1.0.0
deepparse
U.S. address parser
Data Science Tools - How to Parse Addresses in Python (YouTube)