My Doodles/Thoughts on Parsing MD files

Introduction

Here is my doodling/thinking about an MD parser and HTML generator. Consider the MD file text as a series of tokens. The tokens consist of Markdown (tag) tokens and plain text tokens. (Plain text tokens contain the text between and around Markdown tokens.)
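To make the token idea concrete, here is a small sketch that splits one line into tag tokens and plain text tokens. The names `TAGS` and `split_tokens` are made up for this illustration, and the pattern covers only three of the tags; it is not the lexical analyzer described later.

```python
import re

# hypothetical pattern covering only the bold, italic and underline tags
TAGS = re.compile(r"(\*\*|//|__)")

def split_tokens(line):
    """Return a list of (kind, text) pairs where kind is 'tag' or 'text'."""
    tokens = []
    # re.split keeps the delimiters because the pattern has a capture group
    for piece in TAGS.split(line):
        if not piece:
            continue  # skip the empty strings found between adjacent tags
        kind = 'tag' if TAGS.fullmatch(piece) else 'text'
        tokens.append((kind, piece))
    return tokens

print(split_tokens("some **bold** text"))
```

Here the line "some **bold** text" comes back as five tokens: plain text, a tag, plain text, a tag, and plain text.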

For the sake of simplicity I am assuming Markdown tags must be completed within a single line. Therefore I parse lines individually and not paragraphs.

The two main parts of the program are the Lexical Analyzer and the HTML Code Generator. Because the test files are small the Lexical Analyzer saves the tokens in a list. After processing the file the list is passed to the HTML Code Generator and HTML is written to an output file. If an error is detected, an error message is displayed and the program halts.

List entries (tokens) contain the parser state and the associated text. For example, if the lexical analyzer finds a bold (**) Markdown tag it "remembers" the starting tag using a LIFO stack. When it finds another bold (**) Markdown tag, it looks in the LIFO stack to see if the end tag matches the start tag. (Remember, tags must be nested. Out-of-order tags are an error.)
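As a rough sketch of the "remembering", the stack below borrows the names `my_lifo_queue`, `push`, `pop`, `state` and `length` from the code fragments in this document. The method bodies are guesses, and returning -1 from `state()` for an empty stack is an assumption. The helper `tags_are_nested` is hypothetical and exists only for this demonstration.

```python
class my_lifo_queue:
    """Minimal LIFO stack of (state, text) tokens (a sketch only)."""

    def __init__(self):
        self._items = []

    def push(self, tok):
        self._items.append(tok)

    def pop(self):
        return self._items.pop()

    def state(self):
        # assumption: -1 means "no tag is currently open"
        return self._items[-1][0] if self._items else -1

    def length(self):
        return len(self._items)


def tags_are_nested(states):
    """Return True if every tag state in the sequence closes in nested order."""
    que = my_lifo_queue()
    for st in states:
        if que.state() == st:
            que.pop()             # matching end tag: forget the start tag
        else:
            que.push((st, ''))    # start tag: remember it
    return que.length() == 0      # leftovers mean a tag never matched


print(tags_are_nested([4, 5, 5, 4]))   # like **//text//**
print(tags_are_nested([4, 5, 4, 5]))   # like **//text**//
```

The first call reports nested tags and the second does not: an out-of-order end tag never matches the top of the stack, so entries are left over at the end.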


Parser
state   Markdown             HTML              Description
0                                              plain text
1       blank line           <p> ... </p>      start/end paragraph
2       #                    <h1> ... </h1>    header 1
3       ##                   <h2> ... </h2>    header 2
4       **bold**             <b> ... </b>      start/end bold text
5       //italic//           <i> ... </i>      start/end italic text
6       __underlined__       <u> ... </u>      start/end underline text
7       \\ + [' ' or EOL]    <br>              line break

You can **__//combine//__** all of these.
Matching must be nested. If not, it is an error.

Note:

  1. EOL is end-of-line
  2. EOF is end-of-file
  3. The parser state is used to indicate searching for a matching Markdown token
  4. Paragraphs are started by a blank line and end with a blank line or EOF
  5. The markdown tags are 1, 2, or in one case 3 characters in length
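The state table could also be held in code as a lookup table. The sketch below is one possible layout; the names `STATE_HTML` and `html_for` are hypothetical and do not appear in the program fragments.

```python
# hypothetical lookup table: parser state -> (start tag, end tag)
STATE_HTML = {
    1: ('<p>',  '</p>'),
    2: ('<h1>', '</h1>'),
    3: ('<h2>', '</h2>'),
    4: ('<b>',  '</b>'),
    5: ('<i>',  '</i>'),
    6: ('<u>',  '</u>'),
    7: ('<br>', None),      # a line break has no end tag
}

def html_for(state, closing=False):
    """Return the start tag (or the end tag if closing) for a parser state."""
    start, end = STATE_HTML[state]
    if closing and end is None:
        raise ValueError(f'state {state} has no end tag')
    return end if closing else start

print(html_for(4), html_for(4, closing=True))
```

With a table like this, the HTML generator would need one branch for "start or end?" instead of one branch per tag.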

Bits and Bobs of Code

Please note that the code is obviously incomplete. It is here to give you hints.

# -----------------------------------------------------------------
# ---- compile regx
# -----------------------------------------------------------------

import re

# ---- regx to find md tags __, //, **, \\, ##, #

REGX01 = re.compile(r"(__"
                    r"|//"
                    r"|\*\*"
                    r"|\\\\$"
                    r"|\\\\\s"
                    r"|##"
                    r"|#)")

Question: Is the order of the regex patterns important? If so, why?
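One way to explore the question is a small experiment. Python's `re` module tries the alternatives of a pattern from left to right and takes the first one that matches at a given position. The two patterns below are cut-down, made-up versions of the tag pattern that differ only in the order of `#` and `##`.

```python
import re

line = "## Section title"

short_first = re.compile(r"(#|##)")   # shorter alternative listed first
long_first  = re.compile(r"(##|#)")   # longer alternative listed first

print(short_first.search(line).group(1))   # prints '#'
print(long_first.search(line).group(1))    # prints '##'
```

With the shorter alternative first, an H2 line is reported as an H1 tag followed by a stray `#`.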

# -----------------------------------------------------------------
# ---- main
# ---- tf = true/false flag
# -----------------------------------------------------------------

lst = process_md_file(infile)

##print_list(lst)

tf = generate_html_code(lst,outfile)

if tf:
    print(f'output HTML file {outfile} created')
else:
    print(f'output HTML file {outfile} not created')

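# -----------------------------------------------------------------
# ---- process MD file
# ---- (infile is assumed to be an open file or a list of lines)
# -----------------------------------------------------------------

import sys

def process_md_file(infile):

    lst = []

    for line in infile:

        # ---- remove the EOL so a blank line is an empty string
        tf = process_line(line.rstrip('\n'),lst)

        if not tf:
            sys.exit()

    # ---- return the token list to the caller
    return lst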

# -----------------------------------------------------------------
# ---- process a line
# ---- list (lst) contains tuples
# ---- a tuple contains a parse state and text
# -----------------------------------------------------------------

VERBOSE = False

def process_line(line,lst):

    # ---- blank line (start/end of a paragraph)
    # ---- tested once, before the loop, so a line that merely
    # ---- ends with a md tag is not mistaken for a blank line
    if not line:
        lst.append((1,''))
        return True

    # ---- loop until the whole line has been consumed
    while line:

        if VERBOSE:
            print()
            print(f'processing line "{line}"')

        # ---- search for a md tag
        res = re.search(REGX01,line)
        if not res:
            # ---- no (more) md tags, the rest is plain text
            lst.append((0,line))
            return True

        # ---- extract info from search results
        ln    = len(line)
        end   = res.end()
        start = res.start()
        tag   = res.groups()[0]

        if VERBOSE:
            print(f'ln={ln},start={start},end={end},tag="{tag}"')

        # --- bold (save any plain text in front of the tag)
        if tag == r'**':
            if start > 0:
                lst.append((0,line[0:start]))
            lst.append((4,'bold'))

        # --- italic
        elif tag == r'//':
            if start > 0:
                lst.append((0,line[0:start]))
            lst.append((5,'italic'))

        # --- h2 (the rest of the line is the header text)
        elif tag == r'##':
            lst.append((3,'H2'))
            lst.append((0,line[end:].strip()))
            lst.append((3,'H2'))
            return True

        ...
        ...
        ...

        # --- line break
        elif tag == r'\\ ':
            lst.append((0,line[0:start]))
            lst.append((7,'br'))

        # --- unknown tag
        else:
            print(f'internal error - unknown tag "{tag}"')
            print_list(lst)
            return False

        # ---- make the line shorter skipping
        # ---- the stuff we have already seen
        line = line[end:]

    return True

# -----------------------------------------------------------------
# ---- generate HTML code
# -----------------------------------------------------------------

VERBOSE = False

def generate_html_code(lst,outfile):

    que = my_lifo_queue()

    # ---- open the output file (assumes outfile is a file name)
    fout = open(outfile,'w')

    # ---- process lexical tokens
    for tok in lst:

        state = tok[0]

        # ---- plain text
        if state == 0:
            if fout : fout.write(tok[1] + '\n')
            if VERBOSE: print(tok[1])
            continue

        # ---- line break
        elif state == 7:
            if fout : fout.write('<br>' + '\n')
            if VERBOSE: print('<br>')
            continue

        # ---- Remember, a blank line is the end of one paragraph
        # ---- and the start of another
        elif state == 1:
            if que.state() == 1:
                if fout:
                    fout.write('</p>' + '\n')
                    fout.write('<p>' + '\n')
                if VERBOSE:
                    print('</p>')
                    print('<p>')
            else:
                if fout : fout.write('<p>' + '\n')
                if VERBOSE: print('<p>')
                que.push(tok)
            continue

        # ---- header 1
        elif state == 2:
            if que.state() == 2:
                if fout : fout.write('</h1>' + '\n')
                if VERBOSE: print('</h1>')
                que.pop()
            else:
                if fout : fout.write('<h1>' + '\n')
                if VERBOSE: print('<h1>')
                que.push(tok)
            continue

        ...
        ...
        ...

    # ---- a final </p> required?
    if que.state() == 1:
        if fout : fout.write('</p>' + '\n')
        if VERBOSE: print('</p>')
        que.pop()

    # ---- queue/stack should have zero entries?
    if que.length() > 0:
        print()
        print('Error: state queue is not empty')
        print('       MD tag error - exit program')
        print_que(que,'-------------- queue/stack -------------')
        # ---- close and delete output file
        terminate_output(fout,outfile)
        sys.exit()

    # ---- success, close the output file and tell the caller
    fout.close()
    return True

# -----------------------------------------------------------------
# ---- terminate output file
# -----------------------------------------------------------------

import os

def terminate_output(fout,filename=None):

    if fout:
        fout.close()

    if filename is not None:
        if os.path.exists(filename):
            os.remove(filename)
            print(f'File {filename} deleted')
        else:
            print(f'File {filename} does not exist')

Question: Are there failure tests missing?
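As one possible direction for the question, the sketch below adds a few failure checks. The name `terminate_output_safe` and the True/False return value are assumptions made for this example. It catches `OSError`, the base class Python raises when a file cannot be closed or removed (for example permission denied, or the name is a directory).

```python
import os
import tempfile

def terminate_output_safe(fout, filename=None):
    """Close and delete the output file; return True if nothing failed."""
    ok = True
    if fout:
        try:
            fout.close()
        except OSError as e:
            print(f'Could not close output file: {e}')
            ok = False
    if filename is not None:
        try:
            os.remove(filename)
            print(f'File {filename} deleted')
        except FileNotFoundError:
            # a missing file is reported but not treated as a failure
            print(f'File {filename} does not exist')
        except OSError as e:
            # e.g. permission denied, or filename is a directory
            print(f'Could not delete {filename}: {e}')
            ok = False
    return ok

# ---- small demonstration using a temporary file
fd, path = tempfile.mkstemp()
fout = os.fdopen(fd, 'w')
print(terminate_output_safe(fout, path))
```

Trying the removal and handling the exception also avoids the small window between `os.path.exists` and `os.remove` in which the file could disappear.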