Building a command line tool for literate programming

By John Lekberg on November 08, 2020.

This week's Python blog post is about building a command line tool for literate programming. You will learn:

How to parse XML using Python's xml.etree.ElementTree module.
How to traverse XML by using XPath queries.
How to use regular expressions to parse macro definitions.
How to use string methods to normalize text.

Literate programming is a programming paradigm created by Donald Knuth. Dr. Knuth, Professor Emeritus at Stanford University, is better known as the creator of TeX (see also LaTeX) and the book series "The Art of Computer Programming".

The goal of literate programming is to allow programs to be displayed in a narrative order -- as opposed to being limited by the constraints of a compiler or interpreter. Literate programming tools allow writers to use macros to abstract and reorder source code to best suit the narrative.

There are two things to do with a literate program:

You can tangle the literate program into source code -- which can then be compiled and executed.
Or, you can weave the literate program into documentation -- which can be read.

Script source code

litprog

#!/usr/bin/env python3

import pathlib
import re
import sys
import xml.etree.ElementTree as ET


def MAIN():
    import argparse

    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(
        dest="mode",
        help="Weave or tangle source XML. (Required.)",
    )

    parser_weave = subparsers.add_parser(
        "weave",
        description="Weave source XML into HTML output.",
    )
    parser_weave.add_argument(
        "file", help="CommonMark XML file"
    )

    parser_tangle = subparsers.add_parser(
        "tangle",
        description="Tangle source XML into code output.",
    )
    parser_tangle.add_argument(
        "file", help="CommonMark XML file"
    )
    parser_tangle.add_argument(
        "target", help="Root macro name"
    )

    args = parser.parse_args()
    if args.mode is None:
        parser.print_help()
        sys.exit(1)
    elif args.mode == "weave":
        xml_input = pathlib.Path(args.file).read_text()
        html_output = litprog_weave(xml_input)
        print(html_output)
    elif args.mode == "tangle":
        xml_input = pathlib.Path(args.file).read_text()
        code_output = litprog_tangle(
            xml_input, root_target=args.target
        )
        print(code_output)
    else:
        raise RuntimeError(f"Unrecognized mode {mode!r}.")


macro_definition_re = re.compile(r"""(?x)
    <<
    (?P<name>[^\n>]+)
    >>=
""")

macro_block_re = re.compile(r"""(?mx)
    ^
    (?P<indent>[ \t]*)
    <<
    (?P<name>[^\n>]+)
    >>
    (?:[ \t]*)
    $
""")

macro_inline_re = re.compile(r"""(?x)
    <<
    (?P<name>[^\n>]+)
    >>
""")


def normal_spacing(text):
    """Normalize the white space in a string."""
    return " ".join(text.split())


def litprog_weave(source_xml):
    """Weave CommonMark XML into HTML.

    source_xml -- str.
    """
    document = ET.fromstring(source_xml)
    namespace = "http://commonmark.org/xml/1.0"

    def elem_like(ppath):
        """Iterate over elements that match an XPath
        query like ".//{namespace}ppath".
        """
        return document.iterfind(
            ".//{%s}%s" % (namespace, ppath)
        )

    # NOTE: I only included the tag changes that I needed.
    # For the full Document Type Definition (DTD), see:
    # > https://github.com/commonmark/commonmark-spec/blob/master/CommonMark.dtd

    # Easy tag changes.

    tag_changes = [
        ("paragraph", "p"),
        ("thematic_break", "hr"),
        ("text", "span"),
        ("code", "code"),
    ]
    for old_tag, new_tag in tag_changes:
        for elem in elem_like(old_tag):
            elem.tag = new_tag
            elem.attrib.clear()

    # Headings.

    for elem in elem_like("heading"):
        level = elem.attrib["level"]
        elem.tag = f"h{level}"
        elem.attrib.clear()

    # Code blocks.

    for elem in elem_like("code_block"):
        info = elem.attrib.get("info", "")
        code = elem.text

        elem.tag = "figure"
        elem.clear()

        match_macro = macro_definition_re.search(info)
        if match_macro:
            caption = ET.Element("figcaption")
            elem.append(caption)
            caption.text = f"« {normal_spacing(match_macro['name'])} »:"
            code = macro_inline_re.sub(
                lambda m: f"«{normal_spacing(m['name'])}»",
                code,
            )

        tag_pre = ET.Element("pre")
        elem.append(tag_pre)
        tag_code = ET.Element("code")
        tag_pre.append(tag_code)
        tag_code.text = code

    document.tag = "div"

    return ET.tostring(document).decode()


def litprog_tangle(source_xml, *, root_target):
    """Tangle CommonMark XML into source code.

    source_xml -- str.
    target -- str. Name of macro to tangle. e.g. "MAIN".
    """
    document = ET.fromstring(source_xml)
    namespace = "http://commonmark.org/xml/1.0"

    def elem_like(ppath):
        """Iterate over elements that match an XPath
        query like ".//{namespace}ppath".
        """
        return document.iterfind(
            ".//{%s}%s" % (namespace, ppath)
        )

    body = {}

    for elem in elem_like("code_block"):
        info = elem.attrib.get("info", "")

        match_macro = macro_definition_re.search(info)
        if match_macro:
            target = normal_spacing(
                match_macro["name"]
            ).casefold()
            body[target] = elem.text

    def expand(target):
        """Recursively expand a macro.

        target -- str. Macro name.
        """
        original_target = target
        target = normal_spacing(original_target).casefold()

        text = body[target]

        def expand_block(match):
            text = expand(match["name"]).rstrip()
            text = re.sub(r"(?m)^", match["indent"], text)
            return text

        def expand_inline(match):
            text = expand(match["name"]).strip()
            assert len(text.splitlines()) <= 1
            return text

        text = macro_block_re.sub(expand_block, text)
        text = macro_inline_re.sub(expand_inline, text)

        return text

    return expand(root_target)


if __name__ == "__main__":
    MAIN()

$ litprog -h

usage: litprog [-h] {weave,tangle} ...

positional arguments:
  {weave,tangle}  Weave or tangle source XML. (Required.)

optional arguments:
  -h, --help      show this help message and exit

$ litprog weave -h

usage: litprog weave [-h] file

Weave source XML into HTML output.

positional arguments:
  file        CommonMark XML file

optional arguments:
  -h, --help  show this help message and exit

$ litprog tangle -h

usage: litprog tangle [-h] file target

Tangle source XML into code output.

positional arguments:
  file        CommonMark XML file
  target      Root macro name

optional arguments:
  -h, --help  show this help message and exit

A sample literate program

Here is a sample literate program for Khan's Algorithm:

sample.markdown

# Topological Sort with Khan's Algorithm

``` <<MAIN>>=
<<imports>>

def khans_algorithm(*, V, E):
    << init graph >>
    << init topological order >>
    << init source nodes >>
    
    while << source nodes exist >>:
        << take source node >>
        << add node to topological order >>
        << add neighboring source nodes >>
    
    if << edges remain >>:
        raise << cycle error >>
    else:
        return << topological order >>
```

I initialize the graph indices from vertices `V` and edges `E`:

``` << init graph >>=
E_idx0 = defaultdict(set)
E_idx1 = defaultdict(set)

def index(n, m):
    E_idx0[n].add(m)
    E_idx1[m].add(n)

def deindex(n, m):
    E_idx0[n].remove(m)
    E_idx1[m].remove(n)

def indegree(n):
    return len(E_idx1[n])

def outdegree(n):
    return len(E_idx0[n])

def is_source(n):
    return indegree(n) == 0

for n, m in E:
    index(n, m)
```

I use the `indegree` and `outdegree` helper functions to determine
whether any edges remain:

``` << edges remain >>=
any((indegree(n) > 0) or (outdegree(n) > 0) for n in V)
```

I represent the topological order as a list:

``` << topological order >>=
L
```

``` << init topological order >>=
<< topological order >> = []
```

``` << add node to topological order >>=
<< topological order >>.append(n)
```

I represent the source nodes as a set:

``` << source nodes >>=
S
```

``` << init source nodes >>=
<< source nodes >> = set(filter(is_source, V))
```

``` << take source node >>=
n = << source nodes >>.pop()
```

``` << source nodes exist >>=
len(<< source nodes >>) > 0
```

Given a node, I use the graph indices to find the
neighboring source nodes:

``` << neighbors >>=
tuple(E_idx0[n])
```

``` << add neighboring source nodes >>=
for m in << neighbors >>:
    deindex(n, m)
    if is_source(m):
        << source nodes >>.add(m)
```

If the graph contains a cycle, I will raise an exception:

``` << cycle error >>=
RuntimeError("Graph contains a cycle.")
```

* * *

Here are the required imports:

``` << imports >>=
from collections import defaultdict
```

To write literate programs, I use CommonMark, a style of markdown. I turn this markdown file into XML using the cmark command line tool:

$ cmark --to xml sample.markdown > sample.xml

To support macro definitions, I use the "info strings" on fenced code blocks. E.g.

$ echo '
``` hello world
a 
b
```
' | cmark --to xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <code_block info="hello world" xml:space="preserve">a
b
</code_block>
</document>

Using the script to weave documention

$ litprog weave sample.xml

<div xmlns:ns0="http://commonmark.org/xml/1.0">
  <h1>
    <span>Topological Sort with Khan's Algorithm</span>
  </h1>
  <figure><figcaption>&#171; MAIN &#187;:</figcaption><pre><code>&#171;imports&#187;

def khans_algorithm(*, V, E):
    &#171;init graph&#187;
    &#171;init topological order&#187;
    &#171;init source nodes&#187;
    
    while &#171;source nodes exist&#187;:
        &#171;take source node&#187;
        &#171;add node to topological order&#187;
        &#171;add neighboring source nodes&#187;
    
    if &#171;edges remain&#187;:
        raise &#171;cycle error&#187;
    else:
        return &#171;topological order&#187;
</code></pre></figure><p>
    <span>I initialize the graph indices from vertices </span>
    <code>V</code>
    <span> and edges </span>
    <code>E</code>
    <span>:</span>
  </p>

[...]

Using the script to tangle source code

$ litprog tangle sample.xml MAIN

from collections import defaultdict

def khans_algorithm(*, V, E):
    E_idx0 = defaultdict(set)
    E_idx1 = defaultdict(set)
    
    def index(n, m):
        E_idx0[n].add(m)
        E_idx1[m].add(n)
    
    def deindex(n, m):
        E_idx0[n].remove(m)
        E_idx1[m].remove(n)
    
    def indegree(n):
        return len(E_idx1[n])
    
    def outdegree(n):
        return len(E_idx0[n])
    
    def is_source(n):
        return indegree(n) == 0
    
    for n, m in E:
        index(n, m)
    L = []
    S = set(filter(is_source, V))
    
    while len(S) > 0:
        n = S.pop()
        L.append(n)
        for m in tuple(E_idx0[n]):
            deindex(n, m)
            if is_source(m):
                S.add(m)
    
    if any((indegree(n) > 0) or (outdegree(n) > 0) for n in V):
        raise RuntimeError("Graph contains a cycle.")
    else:
        return L

How the script works

I use argparse to parse the command line arguments.

To handle the XML, I use Python's xml.etree.ElementTree module. This allows me to parse the XML and traverse it using XPath queries.

To handle the macros, I use several regular expressions:

(?x) activates the flag re.VERBOSE.
(?m) activates the flag re.MULTILINE.
(?P<...>...) creates named capture groups.

I also use re.sub with a replacement function (instead of a replacement string). E.g.,

def f(match):
    return match.group()[::-1]


re.sub(r"[a-z]+", f, "123 abc 343")

'123 cba 343'

In the function elem_like, I use printf-style string formatting because -- in this specific scenario -- I found it more readable than using the format string syntax:

namespace = "http://commonmark.org/xml/1.0"
ppath = "paragraph"

".//{%s}%s" % (namespace, ppath)

'.//{http://commonmark.org/xml/1.0}paragraph'

f".//{{{namespace}}}{ppath}"

'.//{http://commonmark.org/xml/1.0}paragraph'

I normalize the spacing by using str.split and str.join to make sure that changes in white space don't mess up macro references:

message = " hello     world  yeah "

message.split()

['hello', 'world', 'yeah']

" ".join(message.split())

'hello world yeah'

I also use str.casefold to make sure that referring to a macro as MAIN or main doesn't make a difference:

"MAIN" == "main"

False

"MAIN".casefold() == "main".casefold()

True

In conclusion...

In this week's blog post, you learned how to build a command line tool for literate programming. You parsed XML using Python's xml.etree.ElementTree module, and traversed it using XPath queries. You used regular expressions to parse macro definitions. And you used string methods (str.split, str.join, and str.casefold) to normalize text.

My challenge to you:

Modify litprog_weave to add macro hyperlinks to the output HTML. So, when you click on a macro, like «macro», you are taken the definition of that macro.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!

(If you spot any errors or typos on this post, contact me via my contact page.)