Building a command line tool for literate programming
By John Lekberg on November 08, 2020.
This week's Python blog post is about building a command line tool for literate programming. You will learn:
- How to parse XML using Python's xml.etree.ElementTree module.
- How to traverse XML by using XPath queries.
- How to use regular expressions to parse macro definitions.
- How to use string methods to normalize text.
Literate programming is a programming paradigm created by Donald Knuth. Dr. Knuth, Professor Emeritus at Stanford University, is better known as the creator of TeX (see also LaTeX) and the book series "The Art of Computer Programming".
The goal of literate programming is to allow programs to be displayed in a narrative order -- as opposed to being limited by the constraints of a compiler or interpreter. Literate programming tools allow writers to use macros to abstract and reorder source code to best suit the narrative.
There are two things to do with a literate program:
- You can tangle the literate program into source code -- which can then be compiled and executed.
- Or, you can weave the literate program into documentation -- which can be read.
Script source code
litprog
#!/usr/bin/env python3
import pathlib
import re
import sys
import xml.etree.ElementTree as ET
def MAIN():
import argparse
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(
dest="mode",
help="Weave or tangle source XML. (Required.)",
)
parser_weave = subparsers.add_parser(
"weave",
description="Weave source XML into HTML output.",
)
parser_weave.add_argument(
"file", help="CommonMark XML file"
)
parser_tangle = subparsers.add_parser(
"tangle",
description="Tangle source XML into code output.",
)
parser_tangle.add_argument(
"file", help="CommonMark XML file"
)
parser_tangle.add_argument(
"target", help="Root macro name"
)
args = parser.parse_args()
if args.mode is None:
parser.print_help()
sys.exit(1)
elif args.mode == "weave":
xml_input = pathlib.Path(args.file).read_text()
html_output = litprog_weave(xml_input)
print(html_output)
elif args.mode == "tangle":
xml_input = pathlib.Path(args.file).read_text()
code_output = litprog_tangle(
xml_input, root_target=args.target
)
print(code_output)
else:
raise RuntimeError(f"Unrecognized mode {mode!r}.")
macro_definition_re = re.compile(r"""(?x)
<<
(?P<name>[^\n>]+)
>>=
""")
macro_block_re = re.compile(r"""(?mx)
^
(?P<indent>[ \t]*)
<<
(?P<name>[^\n>]+)
>>
(?:[ \t]*)
$
""")
macro_inline_re = re.compile(r"""(?x)
<<
(?P<name>[^\n>]+)
>>
""")
def normal_spacing(text):
"""Normalize the white space in a string."""
return " ".join(text.split())
def litprog_weave(source_xml):
"""Weave CommonMark XML into HTML.
source_xml -- str.
"""
document = ET.fromstring(source_xml)
namespace = "http://commonmark.org/xml/1.0"
def elem_like(ppath):
"""Iterate over elements that match an XPath
query like ".//{namespace}ppath".
"""
return document.iterfind(
".//{%s}%s" % (namespace, ppath)
)
# NOTE: I only included the tag changes that I needed.
# For the full Document Type Definition (DTD), see:
# > https://github.com/commonmark/commonmark-spec/blob/master/CommonMark.dtd
# Easy tag changes.
tag_changes = [
("paragraph", "p"),
("thematic_break", "hr"),
("text", "span"),
("code", "code"),
]
for old_tag, new_tag in tag_changes:
for elem in elem_like(old_tag):
elem.tag = new_tag
elem.attrib.clear()
# Headings.
for elem in elem_like("heading"):
level = elem.attrib["level"]
elem.tag = f"h{level}"
elem.attrib.clear()
# Code blocks.
for elem in elem_like("code_block"):
info = elem.attrib.get("info", "")
code = elem.text
elem.tag = "figure"
elem.clear()
match_macro = macro_definition_re.search(info)
if match_macro:
caption = ET.Element("figcaption")
elem.append(caption)
caption.text = f"« {normal_spacing(match_macro['name'])} »:"
code = macro_inline_re.sub(
lambda m: f"«{normal_spacing(m['name'])}»",
code,
)
tag_pre = ET.Element("pre")
elem.append(tag_pre)
tag_code = ET.Element("code")
tag_pre.append(tag_code)
tag_code.text = code
document.tag = "div"
return ET.tostring(document).decode()
def litprog_tangle(source_xml, *, root_target):
"""Tangle CommonMark XML into source code.
source_xml -- str.
target -- str. Name of macro to tangle. e.g. "MAIN".
"""
document = ET.fromstring(source_xml)
namespace = "http://commonmark.org/xml/1.0"
def elem_like(ppath):
"""Iterate over elements that match an XPath
query like ".//{namespace}ppath".
"""
return document.iterfind(
".//{%s}%s" % (namespace, ppath)
)
body = {}
for elem in elem_like("code_block"):
info = elem.attrib.get("info", "")
match_macro = macro_definition_re.search(info)
if match_macro:
target = normal_spacing(
match_macro["name"]
).casefold()
body[target] = elem.text
def expand(target):
"""Recursively expand a macro.
target -- str. Macro name.
"""
original_target = target
target = normal_spacing(original_target).casefold()
text = body[target]
def expand_block(match):
text = expand(match["name"]).rstrip()
text = re.sub(r"(?m)^", match["indent"], text)
return text
def expand_inline(match):
text = expand(match["name"]).strip()
assert len(text.splitlines()) <= 1
return text
text = macro_block_re.sub(expand_block, text)
text = macro_inline_re.sub(expand_inline, text)
return text
return expand(root_target)
if __name__ == "__main__":
MAIN()
$ litprog -h
usage: litprog [-h] {weave,tangle} ...
positional arguments:
{weave,tangle} Weave or tangle source XML. (Required.)
optional arguments:
-h, --help show this help message and exit
$ litprog weave -h
usage: litprog weave [-h] file
Weave source XML into HTML output.
positional arguments:
file CommonMark XML file
optional arguments:
-h, --help show this help message and exit
$ litprog tangle -h
usage: litprog tangle [-h] file target
Tangle source XML into code output.
positional arguments:
file CommonMark XML file
target Root macro name
optional arguments:
-h, --help show this help message and exit
A sample literate program
Here is a sample literate program for Khan's Algorithm:
sample.markdown
# Topological Sort with Khan's Algorithm
``` <<MAIN>>=
<<imports>>
def khans_algorithm(*, V, E):
<< init graph >>
<< init topological order >>
<< init source nodes >>
while << source nodes exist >>:
<< take source node >>
<< add node to topological order >>
<< add neighboring source nodes >>
if << edges remain >>:
raise << cycle error >>
else:
return << topological order >>
```
I initialize the graph indices from vertices `V` and edges `E`:
``` << init graph >>=
E_idx0 = defaultdict(set)
E_idx1 = defaultdict(set)
def index(n, m):
E_idx0[n].add(m)
E_idx1[m].add(n)
def deindex(n, m):
E_idx0[n].remove(m)
E_idx1[m].remove(n)
def indegree(n):
return len(E_idx1[n])
def outdegree(n):
return len(E_idx0[n])
def is_source(n):
return indegree(n) == 0
for n, m in E:
index(n, m)
```
I use the `indegree` and `outdegree` helper functions to determine
whether any edges remain:
``` << edges remain >>=
any((indegree(n) > 0) or (outdegree(n) > 0) for n in V)
```
I represent the topological order as a list:
``` << topological order >>=
L
```
``` << init topological order >>=
<< topological order >> = []
```
``` << add node to topological order >>=
<< topological order >>.append(n)
```
I represent the source nodes as a set:
``` << source nodes >>=
S
```
``` << init source nodes >>=
<< source nodes >> = set(filter(is_source, V))
```
``` << take source node >>=
n = << source nodes >>.pop()
```
``` << source nodes exist >>=
len(<< source nodes >>) > 0
```
Given a node, I use the graph indices to find the
neighboring source nodes:
``` << neighbors >>=
tuple(E_idx0[n])
```
``` << add neighboring source nodes >>=
for m in << neighbors >>:
deindex(n, m)
if is_source(m):
<< source nodes >>.add(m)
```
If the graph contains a cycle, I will raise an exception:
``` << cycle error >>=
RuntimeError("Graph contains a cycle.")
```
* * *
Here are the required imports:
``` << imports >>=
from collections import defaultdict
```
To write literate programs, I use CommonMark, a style of markdown.
I turn this markdown file into XML using the cmark
command line tool:
$ cmark --to xml sample.markdown > sample.xml
To support macro definitions, I use the "info strings" on fenced code blocks. E.g.
$ echo ' ``` hello world a b ``` ' | cmark --to xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
<code_block info="hello world" xml:space="preserve">a
b
</code_block>
</document>
Using the script to weave documention
$ litprog weave sample.xml
<div xmlns:ns0="http://commonmark.org/xml/1.0">
<h1>
<span>Topological Sort with Khan's Algorithm</span>
</h1>
<figure><figcaption>« MAIN »:</figcaption><pre><code>«imports»
def khans_algorithm(*, V, E):
«init graph»
«init topological order»
«init source nodes»
while «source nodes exist»:
«take source node»
«add node to topological order»
«add neighboring source nodes»
if «edges remain»:
raise «cycle error»
else:
return «topological order»
</code></pre></figure><p>
<span>I initialize the graph indices from vertices </span>
<code>V</code>
<span> and edges </span>
<code>E</code>
<span>:</span>
</p>
[...]
Using the script to tangle source code
$ litprog tangle sample.xml MAIN
from collections import defaultdict
def khans_algorithm(*, V, E):
E_idx0 = defaultdict(set)
E_idx1 = defaultdict(set)
def index(n, m):
E_idx0[n].add(m)
E_idx1[m].add(n)
def deindex(n, m):
E_idx0[n].remove(m)
E_idx1[m].remove(n)
def indegree(n):
return len(E_idx1[n])
def outdegree(n):
return len(E_idx0[n])
def is_source(n):
return indegree(n) == 0
for n, m in E:
index(n, m)
L = []
S = set(filter(is_source, V))
while len(S) > 0:
n = S.pop()
L.append(n)
for m in tuple(E_idx0[n]):
deindex(n, m)
if is_source(m):
S.add(m)
if any((indegree(n) > 0) or (outdegree(n) > 0) for n in V):
raise RuntimeError("Graph contains a cycle.")
else:
return L
How the script works
I use argparse to parse the command line arguments.
To handle the XML, I use Python's xml.etree.ElementTree module. This allows me to parse the XML and traverse it using XPath queries.
To handle the macros, I use several regular expressions:
-
(?x)
activates the flag re.VERBOSE. -
(?m)
activates the flag re.MULTILINE. -
(?P<...>...)
creates named capture groups. -
I also use re.sub with a replacement function (instead of a replacement string). E.g.,
def f(match): return match.group()[::-1] re.sub(r"[a-z]+", f, "123 abc 343")
'123 cba 343'
In the function elem_like
, I use printf-style string formatting because
-- in this specific scenario -- I found it more readable than using the format
string syntax:
namespace = "http://commonmark.org/xml/1.0" ppath = "paragraph" ".//{%s}%s" % (namespace, ppath)
'.//{http://commonmark.org/xml/1.0}paragraph'
f".//{{{namespace}}}{ppath}"
'.//{http://commonmark.org/xml/1.0}paragraph'
I normalize the spacing by using str.split and str.join to make sure that changes in white space don't mess up macro references:
message = " hello world yeah " message.split()
['hello', 'world', 'yeah']
" ".join(message.split())
'hello world yeah'
I also use str.casefold to make sure that referring to a macro as MAIN
or
main
doesn't make a difference:
"MAIN" == "main"
False
"MAIN".casefold() == "main".casefold()
True
In conclusion...
In this week's blog post, you learned how to build a command line tool for literate programming. You parsed XML using Python's xml.etree.ElementTree module, and traversed it using XPath queries. You used regular expressions to parse macro definitions. And you used string methods (str.split, str.join, and str.casefold) to normalize text.
My challenge to you:
Modify
litprog_weave
to add macro hyperlinks to the output HTML. So, when you click on a macro, like «macro», you are taken the definition of that macro.
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)