Recreating the command line tool uniq in Python

By John Lekberg on March 18, 2020.

This week's post will cover recreating the command line tool uniq in Python. You will learn:

How to use the argparse module to build command line interfaces.
How to use enumerations (enums) to represent different "modes of execution".
How to use the itertools.groupby function to group related data.

Uniq is a command line tool that removes adjacent repeated lines from text. For example, I have this data file:

data.txt

a
a
a
b
b
b
c
c
a
b
b
b
b
b
c

I use uniq to remove repeated lines:

$ uniq data.txt

a
b
c
a
b
c

Uniq is specified by the POSIX standard. (See "uniq" for details.)

Recreating uniq in Python is a good way to practice using Python's standard library to build command line tools.

Script source code

uniq

#!/usr/bin/env python3
"""
A recreation of the POSIX uniq tool.
"""

import enum
import functools
import itertools
import re
import sys

UniqMode = enum.Enum(
    "UniqMode",
    [
        "normal",
        "count",
        "suppress_repeated",
        "suppress_not_repeated",
    ],
)
UniqMode.__doc__ = """
UniqMode enumerates the different modes that the script can
run in.

- UniqMode.normal is the default mode.
- UniqMode.count corresponds to the "-c" flag.
- UniqMode.suppress_repeated corresponds to the "-u" flag.
- UniqMode.suppress_not_repeated corresponds to the "-d"
  flag.
"""


def run(*, mode, input_file, output_file, fields, char):
    """Run the uniq script.
    
    mode -- the UniqMode of the script.
    input_file -- read lines from this file.
    output_file -- output information to this file.
    fields -- the number of fields to skip.
    char -- the number of characters to skip after fields.

    """
    lines = map(remove_newline, input_file)
    line_key = functools.partial(
        preprocess_line, fields=fields, char=char
    )

    for _, duplicates in itertools.groupby(lines, line_key):
        duplicates = tuple(duplicates)

        repeated = len(duplicates) > 1
        if mode is UniqMode.suppress_repeated and repeated:
            continue
        if (
            mode is UniqMode.suppress_not_repeated
            and not repeated
        ):
            continue

        if mode is UniqMode.count:
            message = f"{len(duplicates)} {duplicates[0]}"
        else:
            message = f"{duplicates[0]}"

        print(message, file=output_file)


def preprocess_line(line, *, fields, char):
    """Preprocess a line by removing initial fields and
    characters.

    fields -- the number of fields to remove.
    char -- the number of characters to remove after the
            fields.

    """
    if fields > 0:
        line = re.sub(r"\s*\S*", "", line, count=fields)
    line = line[char:]
    return line


def remove_newline(text):
    """Remove a trailing newline from text."""
    if text.endswith("\n"):
        text = text[:-1]
    return text


def positive_integer(text):
    """Parse a positive integer from text.

    Raises an Exception for bad text.
    """
    n = int(text)
    if n <= 0:
        raise ValueError()
    return n


def input_FileType(filename):
    """Like argparse.FileType("r") but the filename "-" is
    turned into sys.stdin.
    """
    if filename == "-":
        return sys.stdin
    else:
        return open(filename, "r")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="report or filter out repeated lines in a file",
        prefix_chars="-+",
    )
    parser.set_defaults(mode=UniqMode.normal)
    mode_group = parser.add_mutually_exclusive_group()
    mode_group.add_argument(
        "-c",
        "+c",
        help="""
        Precede each output line with a count of the number
        of times the line occurred in the input.
        """,
        action="store_const",
        const=UniqMode.count,
        dest="mode",
    )
    mode_group.add_argument(
        "-d",
        "+d",
        help="""
        Suppress the writing of lines that are not repeated
        in the input.
        """,
        action="store_const",
        const=UniqMode.suppress_not_repeated,
        dest="mode",
    )
    mode_group.add_argument(
        "-u",
        "+u",
        help="""
        Suppress the writing of lines that are repeated in
        the input.
        """,
        action="store_const",
        const=UniqMode.suppress_repeated,
        dest="mode",
    )
    parser.add_argument(
        "-f",
        "+f",
        metavar="fields",
        type=positive_integer,
        help="""
        Ignore the first fields fields on each input line
        when doing comparisons, where fields is a positive
        decimal integer. A field is the maximal string
        matched by the basic regular expression:
        /[[:blank:]]*[^[:blank:]]*/. If the fields
        option-argument specifies more fields than appear on
        an input line, a null string shall be used for
        comparison.
        """,
        default=0,
    )
    parser.add_argument(
        "-s",
        "+s",
        metavar="char",
        type=positive_integer,
        help="""
        Ignore the first chars characters when doing
        comparisons, where chars shall be a positive decimal
        integer. If specified in conjunction with the -f
        option, the first chars characters after the first
        fields fields shall be ignored. If the chars
        option-argument specifies more characters than
        remain on an input line, a null string shall be used
        for comparison.
        """,
        default=0,
    )
    parser.add_argument(
        "input_file",
        help="""
        A pathname of the input file. If the input_file
        operand is not specified, or if the input_file is
        '-', the standard input shall be used.
        """,
        nargs="?",
        type=input_FileType,
        default=sys.stdin,
    )
    parser.add_argument(
        "output_file",
        help="""
        A pathname of the output file. If the output_file
        operand is not specified, the standard output shall
        be used. The results are unspecified if the file
        named by output_file is the file named by
        input_file.
        """,
        nargs="?",
        type=argparse.FileType("w"),
        default=sys.stdout,
    )
    args = parser.parse_args()

    with args.input_file, args.output_file:
        run(
            mode=args.mode,
            input_file=args.input_file,
            output_file=args.output_file,
            fields=args.f,
            char=args.s,
        )

$ ./uniq --help

usage: uniq [-h] [-c | -d | -u] [-f fields] [-s char]
            [input_file] [output_file]

report or filter out repeated lines in a file

positional arguments:
  ...

optional arguments:
  ...

Using the script on data

I have a sorted list of words from the Declaration of Independence:

declaration-words.txt

a
all
among
among
and
and
and
and
another
are
are
are
assume
bands
be
[...]
truths
unalienable
we
when
which
which
which
with
with

(Some data is omitted with "[...]" for presentation.)

I use my uniq script to count how frequently each word appears:

$ ./uniq -c declaration-words.txt

1 a
1 all
2 among
4 and
1 another
3 are
1 assume
1 bands
1 be
[...]
1 truths
1 unalienable
1 we
1 when
3 which
2 with

(Some data is omitted with "[...]" for presentation.)

How the script works

I use the argparse module to build the command line interface. (Argparse was introduced in PEP 389.)

Uniq supports 4 different ways of operating.

Remove repeated lines.
Remove repeated lines and print how many repetitions occurred.
Remove repeated lines and only print lines that are repeated.
Remove repeated lines and only print lines that are not repeated.

Only one of these 4 "modes" can be active at once, so I use an enum, UniqMode. Enums are useful for representing something that has a finite set of states. I prefer using enums to strings because Python raises an AttributeError if I mistype an enum member (e.g. UniqMode.norml), while a mistyped string goes unnoticed.
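Here is a minimal sketch of that difference. The enum is defined the same way as in the script; the variable names (string_mode, silently_wrong, typo_error) are only for illustration.

```python
import enum

# Same functional-API definition as in the script.
UniqMode = enum.Enum(
    "UniqMode",
    ["normal", "count", "suppress_repeated", "suppress_not_repeated"],
)

# With strings, a typo is just another string. The comparison
# evaluates to False and nothing tells me I made a mistake.
string_mode = "normal"
silently_wrong = string_mode == "norml"

# With an enum, looking up a member that does not exist fails
# immediately with an AttributeError.
try:
    UniqMode.norml
except AttributeError as error:
    typo_error = error
else:
    typo_error = None

print(silently_wrong)
print(type(typo_error).__name__)
```

The string comparison quietly takes the wrong branch, but the enum lookup stops the script at the line with the typo.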

Different command line switches activate different modes: -c, -u, and -d. I use ArgumentParser.add_mutually_exclusive_group to ensure that at most one mode switch is activated. Because these three switches all set the mode, I use keyword arguments to have ArgumentParser.add_argument send the mode value to the same place:

dest names the destination. E.g. dest="mode".
action="store_const" means that using the flag will store a constant value in the destination.
const names the value to store. E.g. const=UniqMode.suppress_not_repeated.

I use ArgumentParser.set_defaults to set the default mode as UniqMode.normal.
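The pieces above can be tried in isolation. This is a cut-down sketch, not the full script: build_parser and the program name "uniq-sketch" are hypothetical, and I pass the arguments to parse_args as an explicit list instead of reading sys.argv.

```python
import argparse
import enum

UniqMode = enum.Enum(
    "UniqMode",
    ["normal", "count", "suppress_repeated", "suppress_not_repeated"],
)


def build_parser():
    """Build a parser that only knows about the three mode flags."""
    parser = argparse.ArgumentParser(prog="uniq-sketch")
    parser.set_defaults(mode=UniqMode.normal)
    group = parser.add_mutually_exclusive_group()
    flags = [
        ("-c", UniqMode.count),
        ("-d", UniqMode.suppress_not_repeated),
        ("-u", UniqMode.suppress_repeated),
    ]
    for flag, mode in flags:
        # Every flag stores its own constant in the same destination.
        group.add_argument(
            flag, action="store_const", const=mode, dest="mode"
        )
    return parser


parser = build_parser()

# No flag: the value from set_defaults is used.
default_mode = parser.parse_args([]).mode

# One flag: its constant is stored in args.mode.
count_mode = parser.parse_args(["-c"]).mode

# Two flags from the same group: argparse prints an error message
# to standard error and exits.
try:
    parser.parse_args(["-c", "-d"])
except SystemExit as exit_info:
    conflict_exit_code = exit_info.code
else:
    conflict_exit_code = None

print(default_mode, count_mode, conflict_exit_code)
```

Because all three flags write to args.mode, the rest of the script only needs to check one attribute.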

Argparse also allows arguments to be processed and validated, using a function supplied in the type keyword-parameter. In this script:

I use type=positive_integer to parse an integer argument and assert that it is a positive value.
I use type=input_FileType to open a file for input, treating the filename "-" as standard input.
I use type=argparse.FileType("w") to open a file for output.

argparse.FileType(mode) returns a callable that opens the named file, much like calling the built-in function open with that mode.
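To see how a type function behaves, here is a small sketch that reuses positive_integer from the script on a single option. The program name "type-sketch" is hypothetical, and the arguments are passed as explicit lists.

```python
import argparse


def positive_integer(text):
    """Parse a positive integer from text (same as in the script)."""
    n = int(text)
    if n <= 0:
        raise ValueError()
    return n


parser = argparse.ArgumentParser(prog="type-sketch")
parser.add_argument(
    "-f", metavar="fields", type=positive_integer, default=0
)

# Good input: the string "3" is converted to the integer 3.
fields = parser.parse_args(["-f", "3"]).f

# No input: the default is used. argparse only runs the type
# function on string defaults, so the integer 0 is kept as-is.
default_fields = parser.parse_args([]).f

# Bad input: the ValueError raised by positive_integer is turned
# into a usage error and argparse exits.
exit_codes = []
for bad_value in ["0", "abc"]:
    try:
        parser.parse_args(["-f", bad_value])
    except SystemExit as exit_info:
        exit_codes.append(exit_info.code)

print(fields, default_fields, exit_codes)
```

The validation happens while parsing, so run never sees an invalid value for fields or char.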

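The core of this script is the itertools.groupby function. Groupby collects consecutive elements of an iterable into groups, using a key function to decide which elements belong together. Each group is one run of repeated lines, and the script prints the first line of each run.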
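A short sketch of groupby on its own, using a made-up list of lines. The key function str.lower is only for illustration (the script uses preprocess_line as its key instead).

```python
import itertools

lines = ["a", "a", "b", "a", "B", "b"]

# Without a key function, elements are compared directly. Only
# neighbouring equal elements end up in the same group, so the
# second run of "a" becomes its own group.
runs = [
    (key, list(group)) for key, group in itertools.groupby(lines)
]

# With a key function, elements are grouped by the key's result.
# Here "B" and "b" share the key "b" and form one group.
folded_runs = [
    (key, list(group))
    for key, group in itertools.groupby(lines, key=str.lower)
]

# Like the script, keep the first original line of every group.
first_of_each_run = [group[0] for _, group in folded_runs]

print(runs)
print(folded_runs)
print(first_of_each_run)
```

Since groupby only looks at neighbouring elements, it never has to hold more than one run in memory, which is why uniq expects sorted input when you want to remove every duplicate.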

In conclusion...

The argparse module allows you to build command line interfaces. Using argparse is more robust than manually parsing sys.argv.
Enums can be used to represent a finite set of states (different "modes") and are more robust than using strings because Python will raise an AttributeError if you misspell an enum member.
itertools.groupby allows you to group consecutive repeated elements in an iterable using a key function and is the core of this uniq script.

My challenge to you:

Build a command line interface, using argparse, that simulates the command line interface of awk. (This will be a bit simpler than the interface used by uniq.)

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!

(If you spot any errors or typos on this post, contact me via my contact page.)