Introduction
Here is my doodling/thinking about an MD parser and HTML generator. Consider the MD
file text as a series of tokens. The tokens consist of Markdown (tag) tokens and plain text tokens.
(Plain text tokens contain the text between and around Markdown tokens.)
For the sake of simplicity, I am assuming Markdown tags must be completed within a single line.
Therefore, I parse individual lines, not paragraphs.
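As a rough illustration of the token idea, one line can be split into Markdown tokens and plain text tokens with `re.split` and a capturing group. This is only a sketch: the pattern `TAGS` and the function `split_line` are names made up here, and the pattern covers just the three inline tags.

```python
import re

# Hypothetical pattern covering only the inline tags **, //, and __
TAGS = re.compile(r"(\*\*|//|__)")

def split_line(line):
    """Split one line into ('tag', text) and ('text', text) tokens."""
    tokens = []
    for piece in TAGS.split(line):
        if not piece:
            continue  # re.split yields '' between adjacent tags
        kind = "tag" if TAGS.fullmatch(piece) else "text"
        tokens.append((kind, piece))
    return tokens

print(split_line("plain **bold** text"))
```

Because the pattern is wrapped in a capturing group, `re.split` keeps the tags in its result instead of throwing them away, so the text between and around the tags comes out in order.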
The two main parts of the program are the Lexical Analyzer
and the HTML Code Generator.
Because the test files are small, the Lexical Analyzer saves the tokens in a list. After processing the file,
the list is passed to the HTML Code Generator and HTML is written to an output file. If an error
is detected, an error message is displayed and the program halts.
List entries (tokens) contain the parser state and the associated text. For example, if the lexical analyzer
finds a bold (**) Markdown tag, it "remembers" the starting tag using a LIFO stack.
When it finds another bold (**) Markdown tag, it looks in the LIFO stack to see if
the end tag matches the start tag. (Remember, tags must be nested. Out-of-order tags are an error.)
To see a more detailed explanation of how the LIFO stack "remembers", click HERE.
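The "remembering" can be sketched with a Python list used as a stack. The class `LifoStack` and the function `check_tags` are hypothetical names for illustration. The method names `state`, `push`, `pop`, and `length` mirror the calls made on `my_lifo_queue` in the code further down, but that class is not shown on this page, so its real behavior may differ. The numbers are the parser states from the table below (4 is bold, 5 is italic).

```python
class LifoStack:
    """Hypothetical stand-in for a LIFO stack of parser states."""

    def __init__(self):
        self._items = []

    def push(self, state):
        self._items.append(state)

    def pop(self):
        return self._items.pop()

    def state(self):
        # State on top of the stack, or 0 (plain text) when empty.
        return self._items[-1] if self._items else 0

    def is_open(self, state):
        # True if a start tag with this state is still waiting for its end tag.
        return state in self._items

    def length(self):
        return len(self._items)


def check_tags(states):
    """Return True if every start tag is closed in properly nested order."""
    stack = LifoStack()
    for s in states:
        if stack.state() == s:
            stack.pop()          # end tag matches the most recent start tag
        elif stack.is_open(s):
            return False         # tag is open, but not the most recent one
        else:
            stack.push(s)        # a new start tag: remember it
    return stack.length() == 0   # anything left over was never closed


print(check_tags([4, 5, 5, 4]))  # **//text//**  -> True
print(check_tags([4, 5, 4, 5]))  # **//text**//  -> False
print(check_tags([4]))           # **text        -> False
```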
Parser state | Markdown | HTML | Description |
0 | | | plain text |
1 | blank line | <p> ... </p> | start/end paragraph |
2 | # | <h1> ... </h1> | header 1 |
3 | ## | <h2> ... </h2> | header 2 |
4 | **bold** | <b> ... </b> | start/end bold text |
5 | //italic// | <i> ... </i> | start/end italic text |
6 | __underlined__ | <u> ... </u> | start/end underline text |
7 | \\ + [' ' or EOL] | <br> | line break |
You can **__//combine//__** all of these.
Matching must be nested; if it is not, it is an error.
Note:
- EOL is end-of-line
- EOF is end-of-file
- The parser state is used to indicate searching for a matching Markdown token
- Paragraphs are started by a blank line and end with a blank line or EOF
- The markdown tags are 1, 2, or in one case 3 characters in length
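One possible way to hold the table above in code is a dictionary keyed by parser state. This is a sketch, not part of the program shown on this page; the names `HTML_TAGS` and `html_for` are made up for illustration.

```python
# Parser state -> (opening HTML, closing HTML), copied from the table above.
HTML_TAGS = {
    1: ("<p>", "</p>"),
    2: ("<h1>", "</h1>"),
    3: ("<h2>", "</h2>"),
    4: ("<b>", "</b>"),
    5: ("<i>", "</i>"),
    6: ("<u>", "</u>"),
    7: ("<br>", None),  # a line break has no closing tag
}

def html_for(state, closing=False):
    """Return the opening or closing HTML for a parser state."""
    opening, close = HTML_TAGS[state]
    if closing:
        if close is None:
            raise ValueError(f"state {state} has no closing tag")
        return close
    return opening

print(html_for(4), "bold", html_for(4, closing=True))
```

State 0 (plain text) is left out on purpose because it produces no HTML tag.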
Bits and Bobs of Code
Please note that the code is obviously incomplete.
It is here to give you hints.
# -----------------------------------------------------------------
# ---- compile regx
# -----------------------------------------------------------------
import re
# ---- regx to find md tags __, //, **, \\, ##, #
REGX01 = re.compile(r"(__"
                    r"|//"
                    r"|\*\*"
                    r"|\\\\$"
                    r"|\\\\\s"
                    r"|##"
                    r"|#)")
Question: Is the order of the regex patterns important? If so, why?
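One way to explore this question is to try the two header alternatives in both orders. Python's `re` module tries alternatives from left to right and takes the first one that matches at a given position. The small experiment below is not part of the program; the names `longer_first` and `shorter_first` are made up for illustration.

```python
import re

line = "## Header 2"

longer_first = re.compile(r"(##|#)")   # longer tag listed first
shorter_first = re.compile(r"(#|##)")  # shorter tag listed first

print(longer_first.search(line).group(1))   # prints ##
print(shorter_first.search(line).group(1))  # prints #
```

With the shorter alternative first, the `##` alternative can never win, so a header 2 line would be reported as header 1.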
# -----------------------------------------------------------------
# ---- main
# ---- tf = true/false flag
# -----------------------------------------------------------------
lst = process_md_file(infile)
##print_list(lst)
tf = generate_html_code(lst,outfile)
if tf:
    print(f'output HTML file {outfile} created')
else:
    print(f'output HTML file {outfile} not created')
# -----------------------------------------------------------------
# ---- process MD file
# -----------------------------------------------------------------
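import sys
def process_md_file(infile):
    lst = []
    for line in infile:
        tf = process_line(line,lst)
        # ---- stop at the first line that fails to parse
        if not tf:
            sys.exit()
    # ---- hand the token list back to the caller
    return lst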
# -----------------------------------------------------------------
# ---- process a line
# ---- list (lst) contains tuples
# ---- a tuple contains a parse state and text
# -----------------------------------------------------------------
VERBOSE = False
def process_line(line,lst):
    # ---- a blank line starts/ends a paragraph
    # ---- (checked once, before the loop, so a line that merely
    # ---- ends with a md tag is not mistaken for a blank line)
    if not line.strip():
        lst.append((1,''))
        return True
    # ---- loop until the whole line has been consumed
    while line:
        if VERBOSE:
            print()
            print(f'processing line "{line}"')
        # ---- search for a md tag
        res = re.search(REGX01,line)
        # ---- no md tag found, the rest of the line is plain text
        if not res:
            lst.append((0,line))
            return True
        # ---- extract info from search results
        ln = len(line)
        end = res.end()
        start = res.start()
        tag = res.groups()[0]
        if VERBOSE:
            print(f'ln={ln},start={start},end={end},tag="{tag}"')
        # --- bold
        if tag == r'**':
            if ln > 2:
                lst.append((0,line[0:end-2]))
            lst.append((4,'bold'))
        # --- italic
        elif tag == r'//':
            if ln > 2:
                lst.append((0,line[0:end-2]))
            lst.append((5,'italic'))
        # --- h2
        elif tag == r'##':
            lst.append((3,'H2'))
            lst.append((0,line[end:].strip()))
            lst.append((3,'H2'))
            return True
        ...
        ...
        ...
        # --- line break
        elif tag == r'\\ ':
            lst.append((0,line[0:end-3]))
            lst.append((7,'br'))
        # --- unknown tag
        else:
            print(f'internal error - unknown tag "{tag}"')
            print_list(lst)
            return False
        # ---- make the line shorter skipping
        # ---- the stuff we have already seen
        line = line[end:]
    return True
# -----------------------------------------------------------------
# ---- generate HTML code
# -----------------------------------------------------------------
VERBOSE = False
def generate_html_code(lst,outfile):
    que = my_lifo_queue()
    # ---- open the output file
    # ---- (assumption: the original does not show where fout
    # ---- comes from; None means no file output)
    fout = open(outfile,'w') if outfile else None
    # ---- process lexical tokens
    for tok in lst:
        state = tok[0]
        # ---- plain text
        if state == 0:
            if fout : fout.write(tok[1] + '\n')
            if VERBOSE: print(tok[1])
            continue
        # ---- line break
        elif state == 7:
            if fout : fout.write('<br>' + '\n')
            if VERBOSE: print('<br>')
            continue
        # ---- Remember, a blank line is the end of one paragraph
        # ---- and the start of another
        elif state == 1:
            if que.state() == 1:
                if fout:
                    fout.write('</p>' + '\n')
                    fout.write('<p>' + '\n')
                if VERBOSE:
                    print('</p>')
                    print('<p>')
            else:
                if fout : fout.write('<p>' + '\n')
                if VERBOSE: print('<p>')
                que.push(tok)
            continue
        # ---- header 1
        elif state == 2:
            if que.state() == 2:
                if fout : fout.write('</h1>' + '\n')
                if VERBOSE: print('</h1>')
                que.pop()
            else:
                if fout : fout.write('<h1>' + '\n')
                if VERBOSE: print('<h1>')
                que.push(tok)
            continue
        ...
        ...
        ...
    # ---- a final </p> required?
    if que.state() == 1:
        if fout : fout.write('</p>' + '\n')
        if VERBOSE: print('</p>')
        que.pop()
    # ---- queue/stack should have zero entries?
    if que.length() > 0:
        print()
        print(f'Error: state queue is not empty')
        print(' MD tag error - exit program')
        print_que(que,'-------------- queue/stack -------------')
        # ---- close and delete output file
        terminate_output(fout,outfile)
        sys.exit()
    # ---- success: close the output file and tell the caller
    # ---- (assumption: main expects a true/false flag back)
    if fout : fout.close()
    return True
# -----------------------------------------------------------------
# ---- terminate output file
# -----------------------------------------------------------------
import os
def terminate_output(fout,filename=None):
    if fout:
        fout.close()
    if filename is not None:
        if os.path.exists(filename):
            os.remove(filename)
            print(f'File {filename} deleted')
        else:
            print(f'File {filename} does not exist')
Question: Are there failure tests missing?