Return to Blog

Building a command line tool to convert CSV into JSON

By John Lekberg on December 18, 2020.


This week's post is about building a command line tool for that converts CSV data into JSON.

Comma-Separated Values (CSV) (see RFC-4180) and JavaScript Object Notation (JSON) (see RFC-8259) are common data formats. When I create datasets by hand, I use Microsoft Excel. I can export my data into a CSV file, but it's easier for my programs to process JSON data. So, I built this tool to help myself out!

You will learn:

Script source code

csv2json

#!/usr/bin/env python3

from collections import defaultdict
from contextlib import suppress
from io import StringIO
import csv
import json
import sys


def MAIN():
    from argparse import ArgumentParser, FileType

    parser = ArgumentParser(
        description="""
    Convert a CSV file into JSON, either into "Structure of
    Arrays" (soa) format -- {x:[...],y:[...]} -- or
    "Array of Structures" (aos) format -- [{x,y},{x,y},...].
    """
    )
    parser.add_argument(
        "--format",
        choices=["soa", "aos"],
        default="aos",
        help="JSON format. Default=aos.",
    )
    parser.add_argument(
        "--convert",
        action="store_true",
        help="Convert strings into numbers when possible.",
    )
    parser.add_argument(
        "file",
        type=FileType(),
        default=sys.stdin,
        help="CSV file. Default=STDIN.",
    )
    args = parser.parse_args()

    if args.convert:
        converter = convert_numbers
    else:
        converter = lambda x: x

    with args.file:
        csv_input = args.file.read()

    try:
        dialect = csv.Sniffer().sniff(csv_input)
    except csv.Error:
        dialect = csv.excel

    reader = csv.DictReader(
        StringIO(csv_input), dialect=dialect
    )

    if args.format == "aos":
        result = [
            {
                field: converter(value)
                for field, value in row.items()
            }
            for row in reader
        ]
    elif args.format == "soa":
        result = defaultdict(list)
        for row in reader:
            for field, value in row.items():
                result[field].append(converter(value))

    print(json.dumps(result))


def convert_numbers(x):
    """Attempt to convert a string into a number, otherwise
    return the original value.

    x -- str. E.g. "3", "hello".
    """
    with suppress(ValueError):
        return int(x)
    with suppress(ValueError):
        return float(x)
    return x


if __name__ == "__main__":
    MAIN()
$ csv2json --help
usage: csv2json [-h] [--format {soa,aos}] [--convert] file

Convert a CSV file into JSON, either into "Structure of Arrays" (soa) format
-- {x:[...],y:[...]} -- or "Array of Structures" (aos) format --
[{x,y},{x,y},...].

positional arguments:
  file                CSV file. Default=STDIN.

optional arguments:
  -h, --help          show this help message and exit
  --format {soa,aos}  JSON format. Default=aos.
  --convert           Convert strings into numbers when possible.

Using the script to convert CSV data into JSON

Here is a CSV file:

data.csv

"Name", "Age", "Height (in)", "Weight (lbs)"
"Alex", 41, 74, 170
"Bert", 42, 68, 166
"Carl", 32, 70, 155
"Dave", 39, 72, 167
"Elly", 30, 66, 124
"Fran", 33, 66, 115
"Gwen", 26, 64, 121

I turn it into an "Array of Structures" (AoS):

$ csv2json data.csv --convert --format aos |
      python3 -m json.tool
[
    {
        "Name": "Alex",
        "Age": 41,
        "Height (in)": 74,
        "Weight (lbs)": 170
    },
    {
        "Name": "Bert",
        "Age": 42,
        "Height (in)": 68,
        "Weight (lbs)": 166
    },
    {
        "Name": "Carl",
        "Age": 32,
        "Height (in)": 70,
        "Weight (lbs)": 155
    },
    {
        "Name": "Dave",
        "Age": 39,
        "Height (in)": 72,
        "Weight (lbs)": 167
    },
    {
        "Name": "Elly",
        "Age": 30,
        "Height (in)": 66,
        "Weight (lbs)": 124
    },
    {
        "Name": "Fran",
        "Age": 33,
        "Height (in)": 66,
        "Weight (lbs)": 115
    },
    {
        "Name": "Gwen",
        "Age": 26,
        "Height (in)": 64,
        "Weight (lbs)": 121
    }
]

And I turn it into a "Structure of Arrays" (SoA):

$ csv2json data.csv --convert --format soa |
      python3 -m json.tool
{
    "Name": [
        "Alex",
        "Bert",
        "Carl",
        "Dave",
        "Elly",
        "Fran",
        "Gwen"
    ],
    "Age": [
        41,
        42,
        32,
        39,
        30,
        33,
        26
    ],
    "Height (in)": [
        74,
        68,
        70,
        72,
        66,
        66,
        64
    ],
    "Weight (lbs)": [
        170,
        166,
        155,
        167,
        124,
        115,
        121
    ]
}

How the script works

In the MAIN function, I use the argparse module to parse the command line arguments. I use argparse.FileType to open the CSV data file given by the user. csv.Sniffer.sniff allows me to guess the format of the CSV data -- if that fails, then I default to using csv.excel. csv.DictReader is like csv.reader, but exposes the rows as dictionary objects instead of list objects, which means that I can access data by field name (e.g. "weight") instead of by position (e.g. "column 3").

In the convert_numbers function, I use contextlib.suppress to suppress ValueError exceptions raised by int and float. By suppressing the exceptions, the function continues to the next alternative: attempting int-conversion, then float-conversion, then falling back to returning the original value.

In conclusion...

In this week's post, you learned how to use Python's csv module to parse CSV data using csv.DictReader and how to guess the "dialect" of CSV data using csv.Sniffer.sniff. You also learned how to suppress exceptions with contextlib.suppress.

My challenge to you:

Add support to csv2json for an "Array of Arrays" (AoA) format. Something like this:

$ csv2json data.csv --format aoa
{
    "columns": [ "Name", "Age", "Height (in)", "Weight (lbs)" ],
    "values": [
        [ "Alex", "41", "74", "170" ],
        [ "Bert", "42", "68", "166" ],
        [ "Carl", "32", "70", "155" ],
        [ "Dave", "39", "72", "167" ],
        [ "Elly", "30", "66", "124" ],
        [ "Fran", "33", "66", "115" ],
        [ "Gwen", "26", "64", "121" ]
    ]
}

If you enjoyed this week's post, share it with your friends.

The blog will finish next week. Stay tuned.


(If you spot any errors or typos on this post, contact me via my contact page.)