Building a command line tool to convert CSV into JSON
By John Lekberg on December 18, 2020.
This week's post is about building a command line tool for that converts CSV data into JSON.
Comma-Separated Values (CSV) (see RFC-4180) and JavaScript Object Notation (JSON) (see RFC-8259) are common data formats. When I create datasets by hand, I use Microsoft Excel. I can export my data into a CSV file, but it's easier for my programs to process JSON data. So, I built this tool to help myself out!
You will learn:
- How to parse CSV data and guess the "dialect" using csv.Sniffer.
- How to suppress exceptions with contextlib.suppress.
Script source code
csv2json
#!/usr/bin/env python3
from collections import defaultdict
from contextlib import suppress
from io import StringIO
import csv
import json
import sys
def MAIN():
from argparse import ArgumentParser, FileType
parser = ArgumentParser(
description="""
Convert a CSV file into JSON, either into "Structure of
Arrays" (soa) format -- {x:[...],y:[...]} -- or
"Array of Structures" (aos) format -- [{x,y},{x,y},...].
"""
)
parser.add_argument(
"--format",
choices=["soa", "aos"],
default="aos",
help="JSON format. Default=aos.",
)
parser.add_argument(
"--convert",
action="store_true",
help="Convert strings into numbers when possible.",
)
parser.add_argument(
"file",
type=FileType(),
default=sys.stdin,
help="CSV file. Default=STDIN.",
)
args = parser.parse_args()
if args.convert:
converter = convert_numbers
else:
converter = lambda x: x
with args.file:
csv_input = args.file.read()
try:
dialect = csv.Sniffer().sniff(csv_input)
except csv.Error:
dialect = csv.excel
reader = csv.DictReader(
StringIO(csv_input), dialect=dialect
)
if args.format == "aos":
result = [
{
field: converter(value)
for field, value in row.items()
}
for row in reader
]
elif args.format == "soa":
result = defaultdict(list)
for row in reader:
for field, value in row.items():
result[field].append(converter(value))
print(json.dumps(result))
def convert_numbers(x):
"""Attempt to convert a string into a number, otherwise
return the original value.
x -- str. E.g. "3", "hello".
"""
with suppress(ValueError):
return int(x)
with suppress(ValueError):
return float(x)
return x
if __name__ == "__main__":
MAIN()
$ csv2json --help
usage: csv2json [-h] [--format {soa,aos}] [--convert] file
Convert a CSV file into JSON, either into "Structure of Arrays" (soa) format
-- {x:[...],y:[...]} -- or "Array of Structures" (aos) format --
[{x,y},{x,y},...].
positional arguments:
file CSV file. Default=STDIN.
optional arguments:
-h, --help show this help message and exit
--format {soa,aos} JSON format. Default=aos.
--convert Convert strings into numbers when possible.
Using the script to convert CSV data into JSON
Here is a CSV file:
data.csv
"Name", "Age", "Height (in)", "Weight (lbs)"
"Alex", 41, 74, 170
"Bert", 42, 68, 166
"Carl", 32, 70, 155
"Dave", 39, 72, 167
"Elly", 30, 66, 124
"Fran", 33, 66, 115
"Gwen", 26, 64, 121
I turn it into an "Array of Structures" (AoS):
$ csv2json data.csv --convert --format aos | python3 -m json.tool
[
{
"Name": "Alex",
"Age": 41,
"Height (in)": 74,
"Weight (lbs)": 170
},
{
"Name": "Bert",
"Age": 42,
"Height (in)": 68,
"Weight (lbs)": 166
},
{
"Name": "Carl",
"Age": 32,
"Height (in)": 70,
"Weight (lbs)": 155
},
{
"Name": "Dave",
"Age": 39,
"Height (in)": 72,
"Weight (lbs)": 167
},
{
"Name": "Elly",
"Age": 30,
"Height (in)": 66,
"Weight (lbs)": 124
},
{
"Name": "Fran",
"Age": 33,
"Height (in)": 66,
"Weight (lbs)": 115
},
{
"Name": "Gwen",
"Age": 26,
"Height (in)": 64,
"Weight (lbs)": 121
}
]
And I turn it into a "Structure of Arrays" (SoA):
$ csv2json data.csv --convert --format soa | python3 -m json.tool
{
"Name": [
"Alex",
"Bert",
"Carl",
"Dave",
"Elly",
"Fran",
"Gwen"
],
"Age": [
41,
42,
32,
39,
30,
33,
26
],
"Height (in)": [
74,
68,
70,
72,
66,
66,
64
],
"Weight (lbs)": [
170,
166,
155,
167,
124,
115,
121
]
}
How the script works
In the MAIN
function, I use the argparse module to parse the command line
arguments. I use argparse.FileType to open the CSV data file given by the
user. csv.Sniffer.sniff allows me to guess the format of the CSV data -- if
that fails, then I default to using csv.excel. csv.DictReader is like
csv.reader, but exposes the rows as dictionary objects instead of list
objects, which means that I can access data by field name (e.g. "weight")
instead of by position (e.g. "column 3").
In the convert_numbers
function, I use contextlib.suppress to suppress
ValueError exceptions raised by int and
float. By suppressing the exceptions, the function continues to
the next alternative: attempting int-conversion, then
float-conversion, then falling back to returning the original
value.
In conclusion...
In this week's post, you learned how to use Python's csv module to parse CSV data using csv.DictReader and how to guess the "dialect" of CSV data using csv.Sniffer.sniff. You also learned how to suppress exceptions with contextlib.suppress.
My challenge to you:
Add support to
csv2json
for an "Array of Arrays" (AoA) format. Something like this:$ csv2json data.csv --format aoa
{ "columns": [ "Name", "Age", "Height (in)", "Weight (lbs)" ], "values": [ [ "Alex", "41", "74", "170" ], [ "Bert", "42", "68", "166" ], [ "Carl", "32", "70", "155" ], [ "Dave", "39", "72", "167" ], [ "Elly", "30", "66", "124" ], [ "Fran", "33", "66", "115" ], [ "Gwen", "26", "64", "121" ] ] }
If you enjoyed this week's post, share it with your friends.
The blog will finish next week. Stay tuned.
(If you spot any errors or typos on this post, contact me via my contact page.)