Building a Pandoc filter in Python that turns CSV data into formatted tables
By John Lekberg on November 27, 2020.
This week's post is about building a Pandoc filter in Python that turns Comma-Separated Value (CSV) data into formatted tables. You will learn:
- How to use Python's json module to read and write JSON documents.
- How to use Python's copy module to copy data structures.
- How to read embedded CSV data using Python's csv and io modules.
Pandoc is a document conversion system that converts between markup formats: for example, from Markdown to HTML, from LaTeX to PDF, or from Microsoft Word to HTML.
To install Pandoc, follow the installation instructions on its website:
"Installing pandoc" via pandoc.org (https://pandoc.org/installing.html)
(I'm using Pandoc version 2.9.2.1. Check your version with $ pandoc --version.)
Pandoc has a filter system that allows you to modify the abstract syntax tree (AST) that it creates. This AST acts as an intermediate document format, and it has a JSON representation, which can be parsed and modified by Python. For more details on Pandoc's filter system, see:
"Pandoc filters" via pandoc.org (https://pandoc.org/filters.html)
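As a rough sketch of what such a filter looks like at its smallest, the following reads a JSON AST, leaves it untouched, and writes it back. The function name identity_filter and the hand-written sample_ast are illustrative assumptions, not part of Pandoc. A real filter would read from sys.stdin and write to sys.stdout instead of in-memory streams.

```python
import io
import json


def identity_filter(in_stream, out_stream):
    """Read a Pandoc JSON AST, leave it unchanged, and write it back."""
    document = json.load(in_stream)
    # A real filter would transform document["blocks"] at this point.
    json.dump(document, out_stream)


# A tiny hand-written AST (hypothetical sample) standing in for the
# output of `pandoc -t json`.
sample_ast = '{"pandoc-api-version": [1, 20], "meta": {}, "blocks": []}'

out = io.StringIO()
identity_filter(io.StringIO(sample_ast), out)
round_tripped = json.loads(out.getvalue())
```

The filter below follows the same read, transform, write shape, with the transform step doing the interesting work.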
Filter source code
csv-code-table
#!/usr/bin/env python3

from copy import copy
import csv
import io
import json
import sys


def MAIN():
    data = json.load(sys.stdin)
    data = pandoc_map(CodeBlock_to_Table, data)
    json.dump(data, sys.stdout)


def pandoc_map(func, data):
    """Map a function over a Pandoc document. Returns a
    copy of the modified document.

    func -- callable. A function to transform parts of a
            Pandoc document.
    data -- dict. A Pandoc document, read from json.load.
    """
    assert callable(func)
    assert isinstance(data, dict)
    assert "blocks" in data

    def can_walk(x):
        """Return if a part of the document tree is
        walkable (True/False).
        """
        return (
            ("c" in x)
            and isinstance(x["c"], list)
            and all(isinstance(y, dict) for y in x["c"])
        )

    def walk(x):
        """Recursively apply func throughout the document
        tree.
        """
        y = func(x)
        if can_walk(y):
            y = copy(y)
            y["c"] = [walk(z) for z in y["c"]]
        return y

    result = copy(data)
    result["blocks"] = [walk(x) for x in result["blocks"]]

    assert isinstance(result, dict)
    assert "blocks" in result
    return result


def CodeBlock_to_Table(x):
    """Turn CodeBlock elements marked with "csv" into Table
    element. Meant to be used by pandoc_map.
    """
    if x["t"] == "CodeBlock":
        infostring = x["c"][0][1]
        if "csv" in infostring:
            text = x["c"][1]
            reader = csv.reader(io.StringIO(text))
            data_header = next(reader)
            data_rows = list(reader)

            n_columns = len(data_header)
            assert all(
                len(row) == n_columns for row in data_rows
            )

            Plain_Str = lambda x: {
                "t": "Plain",
                "c": [{"t": "Str", "c": x}],
            }

            result = {
                "t": "Table",
                "c": [
                    [],
                    [{"t": "AlignDefault"}] * n_columns,
                    [0] * n_columns,
                    [
                        [Plain_Str(column)]
                        for column in data_header
                    ],
                    [
                        [
                            [Plain_Str(cell.strip())]
                            for cell in row
                        ]
                        for row in data_rows
                    ],
                ],
            }

            return result
    return x


if __name__ == "__main__":
    MAIN()
An example of using the filter
Here is a sample Markdown document with a CSV code block:
document.md
Here are ratings for the 7 movies that you requested:
``` csv
Movie, Year, Rotten Tomatoes® rating
Detachment, 2011, 57%
Horrible Bosses 2, 2014, 36%
Intermission, 2003, 73%
October Sky, 1999, 90%
Blackfish, 2013, 98%
Hotel Transylvania, 2012, 45%
Interstellar, 2014, 73%
```
Please let me know which movie you'd like to watch.
And here's how to use csv-code-table as a filter on the JSON AST:
$ cat document.md | pandoc -f markdown -t json | ./csv-code-table | pandoc -f json -t markdown
Here are ratings for the 7 movies that you requested:
Movie                Year    Rotten Tomatoes® rating
-------------------- ------- --------------------------
Detachment           2011    57%
Horrible Bosses 2    2014    36%
Intermission         2003    73%
October Sky          1999    90%
Blackfish            2013    98%
Hotel Transylvania   2012    45%
Interstellar         2014    73%
Please let me know which movie you'd like to watch.
How the filter works
I use the json module to read and write the JSON documents produced by Pandoc. (See json.load and json.dump for details.)
The function pandoc_map is a higher-order function that recursively applies a function to a Pandoc document. It uses a helper function, walk, to do this. I also use copy.copy from the copy module to make a shallow copy (cf. a deep copy) of parts of the document.
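To make the shallow-versus-deep distinction concrete, here is a small sketch using a hand-written element (the Para/Str dictionary is an illustrative assumption shaped like Pandoc's JSON, not output copied from Pandoc):

```python
from copy import copy, deepcopy

element = {"t": "Para", "c": [{"t": "Str", "c": "hello"}]}

# Shallow copy: a new outer dict, but the nested list is the same object.
shallow = copy(element)
shares_children = shallow["c"] is element["c"]

# Rebinding a key on the copy does not touch the original. This is the
# pattern walk() relies on: it assigns a brand-new list to y["c"].
shallow["c"] = [{"t": "Str", "c": "goodbye"}]
original_text = element["c"][0]["c"]

# Deep copy: nested objects are duplicated too, so in-place edits to the
# copy never reach the original.
deep = deepcopy(element)
deep["c"][0]["c"] = "changed"
text_after_deep_edit = element["c"][0]["c"]
```

Because walk only ever replaces the "c" key with a freshly built list, a shallow copy is enough to keep the input document unmodified.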
The function CodeBlock_to_Table is meant to be used by pandoc_map. It checks each element to see if it is a CodeBlock element and if it is marked with "csv". I learned the structure of CodeBlock and Table elements by observing Pandoc's output on some sample data. E.g.
sample.md
``` csv
1, 2, 3
```
Col1 Col2
---- ----
A    B
C    D
$ pandoc -f markdown -t json sample.md
{
  "blocks": [
    {
      "c": [
        [ "", [ "csv" ], [] ],
        "1, 2, 3"
      ],
      "t": "CodeBlock"
    },
    {
      "c": [
        [],
        [ { "t": "AlignDefault" }, { "t": "AlignDefault" } ],
        [ 0, 0 ],
        [ [ { "c": [ { "c": "Col1", "t": "Str" } ],
              "t": "Plain" } ],
          [ { "c": [ { "c": "Col2", "t": "Str" } ],
              "t": "Plain" } ] ],
        [ [ [ { "c": [ { "c": "A", "t": "Str" } ],
                "t": "Plain" } ],
            [ { "c": [ { "c": "B", "t": "Str" } ],
                "t": "Plain" } ] ],
          [ [ { "c": [ { "c": "C", "t": "Str" } ],
                "t": "Plain" } ],
            [ { "c": [ { "c": "D", "t": "Str" } ],
                "t": "Plain" } ] ] ]
      ],
      "t": "Table"
    }
  ],
  "meta": {},
  "pandoc-api-version": [ 1, 20 ]
}
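The indexing in CodeBlock_to_Table (x["c"][0][1] and x["c"][1]) follows directly from this layout. A minimal sketch that unpacks the CodeBlock portion of the output above; the variable names identifier, classes, and attributes are my own labels for the three slots, chosen for illustration:

```python
import json

# The CodeBlock portion of the Pandoc output above, retyped by hand.
ast_json = """
{"blocks": [{"t": "CodeBlock", "c": [["", ["csv"], []], "1, 2, 3"]}],
 "meta": {}, "pandoc-api-version": [1, 20]}
"""

document = json.loads(ast_json)
block = document["blocks"][0]

# "c" holds a three-item attribute list followed by the block's text.
(identifier, classes, attributes), text = block["c"]

# This is the same membership test the filter performs.
is_csv_block = block["t"] == "CodeBlock" and "csv" in classes
```

Note that block["c"][0][1] is a list of class names, so the filter's "csv" in ... check is a list membership test rather than a substring search.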
For generating some repetitive parts of the Table element, I use Python's sequence-repetition syntax. E.g.,
>>> [3] * 7
[3, 3, 3, 3, 3, 3, 3]
>>> "abc" * 2
'abcabc'
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
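One caveat worth knowing when repeating a list of mutable objects: repetition copies references, not the objects themselves. The filter never mutates its alignment dictionaries, so it is unaffected, but the following sketch (with n_columns fixed at 3 purely for illustration) shows what happens if you do:

```python
n_columns = 3

# Repetition copies references: every slot holds the same dict object.
repeated = [{"t": "AlignDefault"}] * n_columns
same_object = repeated[0] is repeated[1]

# Mutating one slot in place therefore shows up in all of them.
repeated[0]["t"] = "AlignRight"
all_changed = [a["t"] for a in repeated]

# A list comprehension builds a separate dict for each column instead.
independent = [{"t": "AlignDefault"} for _ in range(n_columns)]
independent[0]["t"] = "AlignRight"
only_first_changed = [a["t"] for a in independent]
```

If you later want per-column settings, building the list with a comprehension (or assigning whole new dicts to individual slots) avoids this sharing.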
To read the CSV data, I use Python's csv and io modules. csv.reader expects a file-like object, and io.StringIO allows me to turn a string object into a file-like object.
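Here is a small standalone sketch of that combination, using a two-line CSV string made up for illustration. It also shows why the filter calls strip() on cells: with the default dialect, the space after each comma is kept as part of the field. The skipinitialspace option of csv.reader is an alternative the filter does not use.

```python
import csv
import io

text = "Movie, Year\nDetachment, 2011\n"

# io.StringIO wraps the string so csv.reader can iterate over its lines.
reader = csv.reader(io.StringIO(text))
header = next(reader)   # first row
rows = list(reader)     # remaining rows

# Fields keep the space that follows each comma, so strip them.
stripped_rows = [[cell.strip() for cell in row] for row in rows]

# Alternative: ask csv.reader to drop whitespace right after a delimiter.
trimmed = list(csv.reader(io.StringIO(text), skipinitialspace=True))
```

Stripping handles both leading and trailing whitespace, while skipinitialspace only removes whitespace immediately after the delimiter.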
In conclusion...
In this week's post, you learned how to build a Pandoc filter in Python that turns CSV data into formatted tables. You used the json module to read and write JSON documents. You used the copy module to copy data and modify it without changing the original -- this makes it easy to express document transformations. And you used the csv module to parse embedded CSV data, which was made available using the io module.
My challenge to you:
Learn how Pandoc handles table alignment (e.g. right-aligned, left-aligned).
Modify the Python function CodeBlock_to_Table to support aligning the columns (e.g. "column 1 is right-aligned, column 2 is left-aligned").
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)