Building a command line tool to check hyperlinks

By John Lekberg on January 15, 2020.


This week's post is about building a command line tool to check if links (e.g. https://www.google.com) are broken. My writing uses a lot of hyperlinks.

But the problem I run into is that I don't know a link has gone bad until I check it myself, which means I often discover a broken link exactly when I need the page it pointed to. Checking links by hand takes a lot of time, so I wrote a link-checking tool in Python.

Script source code

check-links

#!/usr/bin/env python3

import urllib.request


def follow(link, *, timeout=None):
    user_agent = (
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;"
        " rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
    )
    header = {"User-Agent": user_agent}
    request = urllib.request.Request(link, None, header)
    urllib.request.urlopen(request, timeout=timeout)


if __name__ == "__main__":
    import argparse
    import sys

    parser = argparse.ArgumentParser(
        description="check STDIN for broken links"
    )
    parser.add_argument(
        "--timeout",
        metavar="SECONDS",
        default=None,
        type=int,
    )
    args = parser.parse_args()

    for line in sys.stdin:
        link = line.strip().rstrip(".")
        try:
            follow(link, timeout=args.timeout)
        except Exception as e:
            print(link, str(e), sep="\t")

$ check-links --help
usage: check-links [-h] [--timeout SECONDS]

check STDIN for broken links

optional arguments:
  -h, --help         show this help message and exit
  --timeout SECONDS

An example of using the script

Here's a document that contains several links:

python-notes.txt

Python Notes

main documentation at https://docs.python.org/3/. useful links:

importlib.reload
  https://docs.python.org/3/library/importlib.html#importlib.reload

sqlite3.Row
  https://docs.python.org/3/library/sqlite3.html#sqlite3.Row

numpy main website at https://numpy.org.

my python blog: https://johnlekberg.com/blogs.

another good blog to follow: https://thebestpythonblogever.com.

I extract the links using grep:

$ grep -oE 'https?://\S+' python-notes.txt
https://docs.python.org/3/.
https://docs.python.org/3/library/importlib.html#importlib.reload
https://docs.python.org/3/library/sqlite3.html#sqlite3.Row
https://numpy.org.
https://johnlekberg.com/blogs.
https://thebestpythonblogever.com.

I run the check-links script on these links:

$ grep -oE 'https?://\S+' python-notes.txt |
  check-links --timeout 5
https://johnlekberg.com/blogs   HTTP Error 404: Not Found
https://thebestpythonblogever.com       <urlopen error timed out>

How the script works

The core of this script is the follow function:

def follow(link, *, timeout=None):
    user_agent = (
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;"
        " rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
    )
    header = {"User-Agent": user_agent}
    request = urllib.request.Request(link, None, header)
    urllib.request.urlopen(request, timeout=timeout)

It does nothing when called on a working link:

>>> follow("https://www.google.com")
>>>

But raises an exception on a broken link:

>>> follow("https://johnlekberg.com/blogs")
Traceback (most recent call last):
  ...
urllib.error.HTTPError: HTTP Error 404: Not Found
>>>

I use the urllib package to send the request. I set a custom user agent because the default user agent for urllib causes many websites to respond with an HTTP 403 error code, a false positive in this case. I include a timeout parameter to allow setting the timeout length. (On my system, the default timeout is 60 seconds. I prefer a shorter timeout.)
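To see why the custom user agent matters, you can inspect the default one that urllib would otherwise send, without touching the network. A quick sketch (the exact version string depends on your Python installation):

```python
import urllib.request

# urllib's default opener advertises itself as "Python-urllib/X.Y",
# which some servers reject outright with HTTP 403.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])  # e.g. Python-urllib/3.11
```

Passing an explicit User-Agent header in the Request, as follow does, overrides this default for that request.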

To create a script that works on the command line, I check __name__ == "__main__" and I use the argparse module to let the user supply a --timeout argument on the command line:

parser = argparse.ArgumentParser(
    description="check STDIN for broken links"
)
parser.add_argument(
    "--timeout",
    metavar="SECONDS",
    default=None,
    type=int,
)
args = parser.parse_args()
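A quick way to check the argument handling is to call parse_args with an explicit list instead of letting it read sys.argv. Since --timeout defaults to None, omitting the flag means urlopen falls back to the system default timeout:

```python
import argparse

parser = argparse.ArgumentParser(description="check STDIN for broken links")
parser.add_argument("--timeout", metavar="SECONDS", default=None, type=int)

# Without the flag, timeout stays None (urlopen's "use the default").
print(parser.parse_args([]).timeout)  # None

# type=int converts the command line string "5" to an integer.
print(parser.parse_args(["--timeout", "5"]).timeout)  # 5
```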

Since my method of extracting links using grep sometimes leaves trailing punctuation, such as the . in https://docs.python.org/3/., I use str.rstrip to remove it:

link = line.strip().rstrip(".")
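For example, grep's \S+ pattern keeps a sentence-ending period attached to the URL, and rstrip removes it (note that rstrip(".") strips all trailing periods, not just one):

```python
line = "https://docs.python.org/3/.\n"

# strip() drops the trailing newline; rstrip(".") drops trailing periods.
link = line.strip().rstrip(".")
print(link)  # https://docs.python.org/3/
```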

That's it for this week's blog post. See you next week!