Building a command line tool to check hyperlinks
By John Lekberg on January 15, 2020.
This week's post is about building a command line tool to check if links (e.g.
https://www.google.com
) are broken.
My writing uses a lot of hyperlinks:
- I write technical articles for this blog.
- I maintain a personal wiki.
- I write technical documentation at my job.
But the problem I run into is that I don't know if a link has gone bad until I check it myself. This means that I often find out a link is broken when I really need access to the page that the link was pointing at. Checking links by hand takes a lot of time, so I wrote a link checking tool in Python.
Script source code
check-links
#!/usr/bin/env python3
import urllib.request


def follow(link, *, timeout=None):
    user_agent = (
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;"
        " rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
    )
    header = {"User-Agent": user_agent}
    request = urllib.request.Request(link, None, header)
    urllib.request.urlopen(request, timeout=timeout)


if __name__ == "__main__":
    import argparse
    import sys

    parser = argparse.ArgumentParser(
        description="check STDIN for broken links"
    )
    parser.add_argument(
        "--timeout",
        metavar="SECONDS",
        default=None,
        type=int,
    )
    args = parser.parse_args()

    for line in sys.stdin:
        link = line.strip().rstrip(".")
        try:
            follow(link, timeout=args.timeout)
        except Exception as e:
            print(link, str(e), sep="\t")
$ check-links --help
usage: check-links [-h] [--timeout SECONDS]
check STDIN for broken links
optional arguments:
-h, --help show this help message and exit
--timeout SECONDS
An example of using the script
Here's a document that contains several links:
python-notes.txt
Python Notes
main documentation at https://docs.python.org/3/. useful links:
importlib.reload
https://docs.python.org/3/library/importlib.html#importlib.reload
sqlite3.Row
https://docs.python.org/3/library/sqlite3.html#sqlite3.Row
numpy main website at https://numpy.org.
my python blog: https://johnlekberg.com/blogs.
another good blog to follow: https://thebestpythonblogever.com.
I extract the links using grep:
$ grep -oE 'https?://\S+' python-notes.txt
https://docs.python.org/3/.
https://docs.python.org/3/library/importlib.html#importlib.reload
https://docs.python.org/3/library/sqlite3.html#sqlite3.Row
https://numpy.org.
https://johnlekberg.com/blogs.
https://thebestpythonblogever.com.
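(If you would rather stay in Python than shell out to grep, the same extraction can be sketched with the re module, using the same pattern; extract_links is a hypothetical helper, not part of the script:)

```python
import re

# Same pattern as the grep command: "http" or "https", "://",
# then a run of non-whitespace characters.
LINK_RE = re.compile(r"https?://\S+")


def extract_links(text):
    """Return all link-like substrings in text, in order."""
    return LINK_RE.findall(text)


line = "main documentation at https://docs.python.org/3/. useful links:"
print(extract_links(line))  # ['https://docs.python.org/3/.']
```

Like the grep pattern, this keeps trailing punctuation attached to the link, which is why the script strips it afterwards.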
I run the check-links
script on these links:
$ grep -oE 'https?://\S+' python-notes.txt | check-links --timeout 5
https://johnlekberg.com/blogs HTTP Error 404: Not Found
https://thebestpythonblogever.com <urlopen error timed out>
How the script works
The core of this script is the follow
function:
def follow(link, *, timeout=None):
    user_agent = (
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;"
        " rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
    )
    header = {"User-Agent": user_agent}
    request = urllib.request.Request(link, None, header)
    urllib.request.urlopen(request, timeout=timeout)
It does nothing when called on a working link:
>>> follow("https://www.google.com")
>>>
But raises an exception on a broken link:
>>> follow("https://johnlekberg.com/blogs")
Traceback (most recent call last):
...
urllib.error.HTTPError: HTTP Error 404: Not Found
>>>
I use the urllib
package to send the request.
I set a custom user agent because the default user agent for urllib causes many
websites to respond with an
HTTP 403 error code, a
false positive in this case.
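You can verify that the custom header is attached to the request without sending anything over the network. One detail worth knowing: urllib.request.Request stores header names with only the first letter capitalized, so the key becomes "User-agent". A small sketch (https://example.com is a placeholder URL):

```python
import urllib.request

user_agent = "Mozilla/5.0 (example user agent)"
request = urllib.request.Request(
    "https://example.com", None, {"User-Agent": user_agent}
)

# Request normalizes header names with str.capitalize, so the
# stored key is "User-agent", not "User-Agent".
print(request.get_header("User-agent"))  # Mozilla/5.0 (example user agent)
```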
I include a timeout
parameter to allow setting the timeout length.
(On my system, the default timeout is 60 seconds. I prefer a shorter
timeout.)
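When the timeout expires, urlopen raises urllib.error.URLError wrapping a socket timeout, which is exactly where the "<urlopen error timed out>" message in the example output comes from. A minimal sketch of how that message is rendered, without any network access:

```python
import socket
import urllib.error

# URLError's string form is "<urlopen error {reason}>", and the
# reason for an expired timeout is a socket.timeout exception.
err = urllib.error.URLError(socket.timeout("timed out"))
print(str(err))  # <urlopen error timed out>
```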
To create a script that works on the command line, I
check __name__ == "__main__"
and I use the
argparse module to let the
user supply a --timeout
argument on the command line:
parser = argparse.ArgumentParser(
    description="check STDIN for broken links"
)
parser.add_argument(
    "--timeout",
    metavar="SECONDS",
    default=None,
    type=int,
)
args = parser.parse_args()
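This parsing can be exercised outside the script, because argparse accepts an explicit argument list in place of sys.argv (a small sketch, not part of check-links itself):

```python
import argparse

parser = argparse.ArgumentParser(description="check STDIN for broken links")
parser.add_argument("--timeout", metavar="SECONDS", default=None, type=int)

# parse_args can take an explicit list, which is handy for testing.
args = parser.parse_args(["--timeout", "5"])
print(args.timeout)  # 5 (an int, because of type=int)

# When the flag is omitted, the default (None) is used.
print(parser.parse_args([]).timeout)  # None
```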
Because my method of extracting links with grep sometimes leaves
trailing punctuation, such as the .
in https://docs.python.org/3/.,
I use str.rstrip to remove it:
link = line.strip().rstrip(".")
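Note that str.rstrip removes every trailing character from the given set, not just one, so it also handles links that picked up several dots. (The trade-off, which this script accepts, is that a dot that is genuinely the last character of a URL would be stripped too.) A quick sketch:

```python
# rstrip(".") removes all trailing dots; the "/" in the second
# example stops the stripping, so the path is preserved.
print("https://numpy.org.".rstrip("."))          # https://numpy.org
print("https://docs.python.org/3/.".rstrip("."))  # https://docs.python.org/3/
print("https://example.com...".rstrip("."))       # https://example.com
```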
That's it for this week's blog post. See you next week!
(If you spot any errors or typos on this post, contact me via my contact page.)