
Building a Pandoc filter in Python that turns CSV data into formatted tables

By John Lekberg on November 27, 2020.


This week's post is about building a Pandoc filter in Python that turns Comma-Separated Value (CSV) data into formatted tables.

Pandoc is a document conversion system that converts between markup formats, e.g., from Markdown to HTML, from LaTeX to PDF, or from Microsoft Word to HTML.

To install Pandoc, follow the installation instructions on its website:

"Installing pandoc" via pandoc.org (https://pandoc.org/installing.html)

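(I'm using Pandoc version 2.9.2.1. Check your version with $ pandoc --version. Pandoc 2.10 changed the JSON layout of Table elements, so the filter below targets the 2.9.x format.)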

Pandoc has a filter system that allows you to modify the abstract syntax tree (AST) that it creates. This AST acts as an intermediate document format, and it has a JSON representation, which can be parsed and modified by Python. For more details on Pandoc's filter system, see:

"Pandoc filters" via pandoc.org (https://pandoc.org/filters.html)
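
To make the round trip concrete, here is a minimal sketch of the read, transform, write loop that a JSON filter follows. The names run_filter and shout are hypothetical, and the sample document is hand-written to resemble Pandoc's JSON output rather than captured from Pandoc itself.

```python
import io
import json


def run_filter(transform, stdin, stdout):
    """Read a Pandoc JSON document from stdin, apply transform to
    each top-level block, and write the result to stdout.
    """
    document = json.load(stdin)
    document["blocks"] = [
        transform(block) for block in document["blocks"]
    ]
    json.dump(document, stdout)


def shout(block):
    """Hypothetical transform: upper-case Str elements that sit
    directly inside a Para block. Other blocks pass through.
    """
    if block["t"] != "Para":
        return block
    inlines = [
        {"t": "Str", "c": inline["c"].upper()}
        if inline["t"] == "Str"
        else inline
        for inline in block["c"]
    ]
    return {"t": "Para", "c": inlines}


# Hand-written stand-in for the output of `pandoc -t json`.
sample = json.dumps(
    {
        "pandoc-api-version": [1, 20],
        "meta": {},
        "blocks": [
            {
                "t": "Para",
                "c": [
                    {"t": "Str", "c": "hello"},
                    {"t": "Space"},
                    {"t": "Str", "c": "world"},
                ],
            }
        ],
    }
)

out = io.StringIO()
run_filter(shout, io.StringIO(sample), out)
result = json.loads(out.getvalue())
print([inline.get("c") for inline in result["blocks"][0]["c"]])
```

In a real filter, sys.stdin and sys.stdout would take the place of the io.StringIO objects, which is what MAIN does in the source code below.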

Filter source code

csv-code-table

#!/usr/bin/env python3

from copy import copy
import csv
import io
import json
import sys


def MAIN():
    data = json.load(sys.stdin)
    data = pandoc_map(CodeBlock_to_Table, data)
    json.dump(data, sys.stdout)


def pandoc_map(func, data):
    """Map a function over a Pandoc document. Returns a
    copy of the modified document.

    func -- callable. A function to transform parts of a
        Pandoc document.
    data -- dict. A Pandoc document, read from json.load.
    """
    assert callable(func)
    assert isinstance(data, dict)
    assert "blocks" in data

    def can_walk(x):
        """Return if a part of the document tree is
        walkable (True/False).
        """
        return (
            ("c" in x)
            and isinstance(x["c"], list)
            and all(isinstance(y, dict) for y in x["c"])
        )

    def walk(x):
        """Recursively apply func throughout the document
        tree.
        """
        y = func(x)
        if can_walk(y):
            y = copy(y)
            y["c"] = [walk(z) for z in y["c"]]
        return y

    result = copy(data)
    result["blocks"] = [walk(x) for x in result["blocks"]]

    assert isinstance(result, dict)
    assert "blocks" in result

    return result


def CodeBlock_to_Table(x):
    """Turn CodeBlock elements marked with "csv" into Table
    elements. Meant to be used by pandoc_map.
    """
    if x["t"] == "CodeBlock":
        # x["c"] is [[identifier, classes, attributes], text].
        classes = x["c"][0][1]
        if "csv" in classes:
            text = x["c"][1]
            reader = csv.reader(io.StringIO(text))
            data_header = next(reader)
            data_rows = list(reader)
            n_columns = len(data_header)
            # Every row must have as many cells as the header.
            assert all(
                len(row) == n_columns for row in data_rows
            )

            def Plain_Str(string):
                """Wrap a string in a Plain block holding one Str."""
                return {
                    "t": "Plain",
                    "c": [{"t": "Str", "c": string}],
                }
            result = {
                "t": "Table",
                "c": [
                    [],
                    [{"t": "AlignDefault"}] * n_columns,
                    [0] * n_columns,
                    [
                        [Plain_Str(column)]
                        for column in data_header
                    ],
                    [
                        [
                            [Plain_Str(cell.strip())]
                            for cell in row
                        ]
                        for row in data_rows
                    ],
                ],
            }
            return result
    return x


if __name__ == "__main__":
    MAIN()

An example of using the filter

Here is a sample Markdown document with a CSV code block:

document.md

Here are ratings for the 7 movies that you requested:

``` csv
Movie, Year, Rotten Tomatoes® rating
Detachment, 2011, 57%
Horrible Bosses 2, 2014, 36%
Intermission, 2003, 73%
October Sky, 1999, 90%
Blackfish, 2013, 98%
Hotel Transylvania, 2012, 45%
Interstellar, 2014, 73%
```

Please let me know which movie you'd like to watch.

And here's how to use csv-code-table as a filter on the JSON AST:

$ cat document.md |
  pandoc -f markdown -t json |
  ./csv-code-table |
  pandoc -f json -t markdown
Here are ratings for the 7 movies that you requested:

  Movie                 Year    Rotten Tomatoes® rating
  -------------------- ------- --------------------------
  Detachment           2011    57%
  Horrible Bosses 2    2014    36%
  Intermission         2003    73%
  October Sky          1999    90%
  Blackfish            2013    98%
  Hotel Transylvania   2012    45%
  Interstellar         2014    73%

Please let me know which movie you'd like to watch.

How the filter works

I use the json module to read and write the JSON documents produced by Pandoc. (See json.load and json.dump for details.)

The function pandoc_map is a higher-order function that recursively applies a function to a Pandoc document. It uses a helper function, walk, to do this. I also use copy.copy from the copy module to make a shallow copy (cf. a deep copy) of parts of the document.
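
The difference between the two kinds of copy matters here, so a small sketch may help. It assumes a hand-written Para element; the variable names are only for illustration. A shallow copy creates a new outer dict but shares the nested objects, which is enough for walk because walk rebinds the "c" key instead of mutating the shared list.

```python
from copy import copy, deepcopy

original = {"t": "Para", "c": [{"t": "Str", "c": "hi"}]}

# Shallow copy: a new outer dict, but the inner list is shared.
shallow = copy(original)
shares_children = shallow["c"] is original["c"]

# Rebinding a key on the copy (as walk does) leaves the original alone.
shallow["c"] = [{"t": "Str", "c": "bye"}]
text_after_rebind = original["c"][0]["c"]

# Deep copy: every nested object is duplicated, so even in-place
# mutation of the copy cannot reach the original.
deep = deepcopy(original)
deep["c"][0]["c"] = "changed"
text_after_deep_edit = original["c"][0]["c"]

print(shares_children)       # True
print(text_after_rebind)     # hi
print(text_after_deep_edit)  # hi
```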

The function CodeBlock_to_Table is to be used by pandoc_map. It checks each element to see if it is a CodeBlock element and if it is marked with "csv". I learned the structure of CodeBlock and Table elements by observing Pandoc's output on some sample data. E.g.

sample.md

``` csv
1, 2, 3
```

Col1  Col2
----  ----
   A     B
   C     D

$ pandoc -f markdown -t json sample.md
{
  "blocks": [
    {
      "c": [
        [ "", [ "csv" ], [] ],
        "1, 2, 3"
      ],
      "t": "CodeBlock"
    },
    {
      "c": [
        [],
        [ { "t": "AlignDefault" }, { "t": "AlignDefault" } ],
        [ 0, 0 ],
        [ [ { "c": [ { "c": "Col1", "t": "Str" } ],
              "t": "Plain" } ],
          [ { "c": [ { "c": "Col2", "t": "Str" } ],
              "t": "Plain" } ] ],
        [ [ [ { "c": [ { "c": "A", "t": "Str" } ],
                "t": "Plain" } ],
            [ { "c": [ { "c": "B", "t": "Str" } ],
                "t": "Plain" } ] ],
          [ [ { "c": [ { "c": "C", "t": "Str" } ],
                "t": "Plain" } ],
            [ { "c": [ { "c": "D", "t": "Str" } ],
                "t": "Plain" } ] ] ]
      ],
      "t": "Table"
    }
  ],
  "meta": {},
  "pandoc-api-version": [ 1, 20 ]
}
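
Based on the JSON above, a CodeBlock's "c" field holds an attribute triple followed by the block's text. The sketch below unpacks that shape; code_block_parts is a hypothetical helper, and the names identifier, classes, and attributes are my labels for the three slots of the triple.

```python
def code_block_parts(element):
    """Split a CodeBlock element into its identifier, classes,
    attributes, and text.
    """
    (identifier, classes, attributes), text = element["c"]
    return identifier, classes, attributes, text


# The CodeBlock element from the pandoc output above.
code_block = {
    "t": "CodeBlock",
    "c": [["", ["csv"], []], "1, 2, 3"],
}

identifier, classes, attributes, text = code_block_parts(code_block)

# classes is a list, so `in` tests for an exact class name,
# not for a substring.
is_csv = "csv" in classes

print(classes)  # ['csv']
print(text)     # 1, 2, 3
print(is_csv)   # True
```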

For generating some repetitive parts of the Table element, I use Python's sequence-repetition syntax. E.g.,

>>> [3] * 7
[3, 3, 3, 3, 3, 3, 3]
>>> "abc" * 2
'abcabc'
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
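
One caveat worth knowing: repeating a list that contains a mutable object repeats the reference, not the object. The filter never mutates the alignment dicts, so the repetition is safe there, but the sketch below shows what would happen if it did. AlignRight is one of Pandoc's alignment names; the variable names are only for illustration.

```python
n_columns = 3

# Repetition copies the reference, so all three entries are one dict.
repeated = [{"t": "AlignDefault"}] * n_columns
same_object = all(item is repeated[0] for item in repeated)

# Mutating one entry therefore appears to change every entry.
repeated[0]["t"] = "AlignRight"
repeated_tags = [item["t"] for item in repeated]

# A list comprehension builds a separate dict per column.
independent = [{"t": "AlignDefault"} for _ in range(n_columns)]
independent[0]["t"] = "AlignRight"
independent_tags = [item["t"] for item in independent]

print(same_object)       # True
print(repeated_tags)     # ['AlignRight', 'AlignRight', 'AlignRight']
print(independent_tags)  # ['AlignRight', 'AlignDefault', 'AlignDefault']
```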

To read the CSV data, I use Python's csv and io modules. csv.reader expects an iterable of lines, such as a file-like object, and io.StringIO allows me to turn a string object into a file-like object.
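
As a small sketch of that combination, the snippet below parses a two-line CSV string. It also shows csv.reader's skipinitialspace option, which is an alternative (not what the filter uses) to calling strip on each cell when the data has a space after every comma.

```python
import csv
import io

text = "Movie, Year\nOctober Sky, 1999\n"

# io.StringIO wraps the string so csv.reader can iterate its lines.
rows = list(csv.reader(io.StringIO(text)))

# skipinitialspace=True drops whitespace that follows a delimiter.
stripped = list(
    csv.reader(io.StringIO(text), skipinitialspace=True)
)

print(rows)      # [['Movie', ' Year'], ['October Sky', ' 1999']]
print(stripped)  # [['Movie', 'Year'], ['October Sky', '1999']]
```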

In conclusion...

In this week's post, you learned how to build a Pandoc filter in Python that turns CSV data into formatted tables. You used the json module to read and write JSON documents. You used the copy module to copy data and modify it without changing the original -- this makes it easy to express document transformations. And you used the csv module to parse embedded CSV data, which was made available using the io module.

My challenge to you:

Learn how Pandoc handles table alignment (e.g. right-aligned, left-aligned).

Modify the Python function CodeBlock_to_Table to support aligning the columns (e.g. "column 1 is right-aligned, column 2 is left-aligned").

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!


(If you spot any errors or typos on this post, contact me via my contact page.)