
Handling encoding and decoding errors in Python

By John Lekberg on April 03, 2020.


This week's blog post is about handling errors when encoding and decoding data. You will learn 6 different ways to handle these errors, ranging from strictly requiring all data to be valid, to skipping over malformed data.

Codecs

Python uses coder-decoders (codecs) to encode str objects into bytes objects and to decode bytes objects into str objects.

(Shift JIS is a codec for the Japanese language.)
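As a rough illustration of a codec round trip, here is a minimal sketch that encodes a string with Shift JIS and decodes it again. The sample text is the same one used later in this post, and the variable names are only for illustration.

```python
# Encode a str into bytes, then decode those bytes back into a str.
text = "小島 秀夫 (Hideo Kojima)"

encoded = text.encode("shift_jis")     # str -> bytes
decoded = encoded.decode("shift_jis")  # bytes -> str

print(encoded)
print(decoded == text)
```

The encoded bytes only make sense to a reader who knows which codec produced them, which is why decoding them with a different codec (such as ASCII) can fail.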

What happens when a codec operation fails?

When a codec operation encounters malformed data, that's an error:

"小島 秀夫 (Hideo Kojima)".encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode characters in
    position 0-1: ordinal not in range(128)

b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in
    position 0: ordinal not in range(128)
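If you want to react to these failures instead of letting the program stop, one option is to catch the exception and inspect it. The sketch below is only one way to do this; it relies on the start, end, reason, encoding, and object attributes that UnicodeEncodeError and UnicodeDecodeError carry, and the variable names are made up for this example.

```python
text = "小島 秀夫 (Hideo Kojima)"
data = b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)"

# Encoding failure: the exception records which slice of the string was bad.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    encode_failure = (exc.encoding, exc.start, exc.end, exc.reason)
    bad_chars = exc.object[exc.start:exc.end]
    print("could not encode:", bad_chars, encode_failure)

# Decoding failure: same idea, but the offending object is a bytes value.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    bad_bytes = exc.object[exc.start:exc.end]
    print("could not decode:", bad_bytes)

# Both exceptions are subclasses of UnicodeError, so one except clause
# for UnicodeError would catch either of them.
print(issubclass(UnicodeEncodeError, UnicodeError))
print(issubclass(UnicodeDecodeError, UnicodeError))
```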

How can I deal with codec operation failures?

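Besides raising a UnicodeError exception, there are 5 other ways to deal with codec operation errors: ignoring the malformed data, or replacing it with a replacement character, a backslashed escape sequence, an XML character reference, or a \N{...} named character escape.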

(There is another error handler, "surrogateescape", that is out of the scope of this blog post.)

Different error handling strategies are useful in different contexts. Here's a table of the 6 different error handlers:

errors=...             Do ... with malformed data
"strict"               Raise UnicodeError
"ignore"               Ignore and continue
"replace"              Replace with replacement character
"backslashreplace"     Replace with backslashed escape sequence
"xmlcharrefreplace"    Replace with XML character reference
"namereplace"          Replace with \N{...} (named unicode character)

"strict" is the default error handler.
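To make the table concrete, here is a small sketch that runs the sample text from earlier through each non-strict handler when encoding to ASCII, and then tries a few handlers when decoding. The dictionary and variable names are just for illustration. As far as I know, "xmlcharrefreplace" and "namereplace" only apply to encoding, so they are left out of the decoding half.

```python
text = "小島 秀夫 (Hideo Kojima)"
data = b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)"

# Encoding: every handler except "strict" produces some bytes.
encode_handlers = [
    "ignore",
    "replace",
    "backslashreplace",
    "xmlcharrefreplace",
    "namereplace",
]
encoded = {h: text.encode("ascii", errors=h) for h in encode_handlers}
for handler, result in encoded.items():
    print(f"{handler:>18}: {result}")

# Decoding: the malformed bytes are dropped or replaced in the str.
decode_handlers = ["ignore", "replace", "backslashreplace"]
decoded = {h: data.decode("ascii", errors=h) for h in decode_handlers}
for handler, result in decoded.items():
    print(f"{handler:>18}: {result!r}")

# Leaving out errors= behaves like errors="strict": an exception is raised.
try:
    text.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print("strict raised:", strict_raised)
```

Note that in the decoding half the bytes for "G" and "v" happen to be valid ASCII, so they survive even though they were originally part of two-byte Shift JIS characters.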

Besides str.encode and bytes.decode, error handling is available ...
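The list of places is cut short above, so the sketch below should not be read as that list. It only shows a few standard library calls that I know accept an errors argument: the str and bytes constructors, codecs.decode, and io.TextIOWrapper (the built-in open function takes the same encoding and errors parameters for text files). The variable names are made up for this example.

```python
import codecs
import io

text = "小島 秀夫 (Hideo Kojima)"
data = b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)"

# The str and bytes constructors take encoding= and errors=.
from_str = str(data, encoding="ascii", errors="replace")
from_bytes = bytes(text, encoding="ascii", errors="replace")

# The codecs module offers function versions of encode and decode.
from_codecs = codecs.decode(data, "ascii", errors="ignore")

# Text streams apply the handler while reading. An in-memory buffer
# stands in for a real file here; open(path, encoding=..., errors=...)
# accepts the same two parameters.
stream = io.TextIOWrapper(
    io.BytesIO(data), encoding="ascii", errors="backslashreplace"
)
from_stream = stream.read()

print(from_str)
print(from_bytes)
print(from_codecs)
print(from_stream)
```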

In conclusion...

In this post you learned 6 different ways to handle codec operation errors. The default strategy (errors="strict") raises an exception when an error occurs. But sometimes you want your program to continue processing data, either by omitting bad data (errors="ignore") or by replacing bad data with replacement characters (errors="replace"). If you are generating an HTML or XML document, you can replace malformed data with XML character references (errors="xmlcharrefreplace").

My challenge to you:

This post discussed 6 different ways to handle codec operation errors. There is another way, "surrogateescape". Learn how to use "surrogateescape" and create an example of decoding-then-encoding a file using it.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!


(If you spot any errors or typos on this post, contact me via my contact page.)