Handling encoding and decoding errors in Python
By John Lekberg on April 03, 2020.
This week's blog post is about handling errors when encoding and decoding data. You will learn 6 different ways to handle these errors, ranging from strictly requiring all data to be valid, to skipping over malformed data.
Codecs
Python uses coder-decoders (codecs) to
-
Encode str objects into bytes objects.
"小島 秀夫 (Hideo Kojima)".encode("shift_jis")
b'\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)'
-
Decode bytes objects into str objects.
b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("shift_jis")
'小島 秀夫 (Hideo Kojima)'
(Shift JIS is a character encoding for Japanese text.)
What happens when a codec operation fails?
When a codec operation encounters malformed data, that's an error:
"小島 秀夫 (Hideo Kojima)".encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-1: ordinal not in range(128)
b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in
position 0: ordinal not in range(128)
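Both exceptions are subclasses of UnicodeError, and they carry attributes that describe what went wrong. Here is a minimal sketch of catching the decode error above and inspecting it (the variable names `failure` and `bad_bytes` are only for illustration):

```python
data = b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # The exception records which codec failed, where, and why.
    failure = (exc.encoding, exc.start, exc.end, exc.reason)
    # exc.object is the input; slice it to see the offending bytes.
    bad_bytes = exc.object[exc.start:exc.end]

print(failure)
print(bad_bytes)
```

Catching UnicodeError instead would handle encoding and decoding failures with a single except clause.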
How can I deal with codec operation failures?
Besides raising a UnicodeError exception, there are 5 other ways to deal with codec operation errors:
-
When encoding and decoding, ignore malformed data:
"小島 秀夫 (Hideo Kojima)".encode("ascii", errors="ignore")
b'  (Hideo Kojima)'
b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("ascii", errors="ignore")
' Gv (Hideo Kojima)'
-
When encoding and decoding, replace malformed data with a replacement character ("�" when decoding and b"?" when encoding):
"小島 秀夫 (Hideo Kojima)".encode("ascii", errors="replace")
b'?? ?? (Hideo Kojima)'
b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("ascii", errors="replace")
'���� �G�v (Hideo Kojima)'
-
When encoding and decoding, replace malformed data with backslashed escape sequences:
"小島 秀夫 (Hideo Kojima)".encode("ascii", errors="backslashreplace")
b'\\u5c0f\\u5cf6 \\u79c0\\u592b (Hideo Kojima)'
b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)".decode("ascii", errors="backslashreplace")
'\\x8f\\xac\\x93\\x87 \\x8fG\\x95v (Hideo Kojima)'
-
When encoding, replace malformed data with XML character references:
"小島 秀夫 (Hideo Kojima)".encode("ascii", errors="xmlcharrefreplace")
b'&#23567;&#23798; &#31168;&#22827; (Hideo Kojima)'
-
When encoding, replace malformed data with \N{...} (named Unicode characters):
"小島 秀夫 (Hideo Kojima)".encode("ascii", errors="namereplace")
b'\\N{CJK UNIFIED IDEOGRAPH-5C0F}\\N{CJK UNIFIED IDEOGRAPH-5CF6} \\N{CJK UNIFIED IDEOGRAPH-79C0}\\N{CJK UNIFIED IDEOGRAPH-592B} (Hideo Kojima)'
(There is another error handler, "surrogateescape", that is outside the scope of this blog post.)
Different error handling strategies are useful in different contexts. Here's a table of the 6 different error handlers:
errors=... | Do ... with malformed data |
---|---|
"strict" | Raise UnicodeError |
"ignore" | Ignore and continue |
"replace" | Replace with replacement character |
"backslashreplace" | Replace with backslashed escape sequence |
"xmlcharrefreplace" | Replace with XML character reference |
"namereplace" | Replace with \N{...} (named unicode character) |
"strict" is the default error handler.
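To compare the handlers side by side, here is a small sketch that runs the same string through each non-strict handler from the table. The `handlers` list and `results` dictionary are names made up for this example:

```python
text = "小島 秀夫 (Hideo Kojima)"

# Every handler from the table except "strict", which raises instead.
handlers = [
    "ignore",
    "replace",
    "backslashreplace",
    "xmlcharrefreplace",
    "namereplace",
]

results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:<18} {encoded!r}")

# "strict" is the default, so leaving out errors= behaves the same way.
try:
    text.encode("ascii", errors="strict")
except UnicodeEncodeError as exc:
    strict_reason = exc.reason
    print(f"{'strict':<18} raised UnicodeEncodeError: {strict_reason}")
```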
Besides str.encode and bytes.decode, error handling is available ...
- With the built-in function open.
- With the pathlib module methods Path.open and Path.read_text.
- With the codecs module functions and classes decode, encode, open, EncodedFile, iterencode, iterdecode, Codec.encode, Codec.decode, IncrementalEncoder, IncrementalDecoder, StreamWriter, StreamReader, StreamReaderWriter, and StreamRecoder.
- With the io module functions and classes open and TextIOWrapper.
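As a rough illustration of the first two items, this sketch writes the Shift JIS bytes from earlier to a temporary file and reads them back as ASCII with two different error handlers. The file name `name.txt` is arbitrary and only exists for this example:

```python
import tempfile
from pathlib import Path

raw = b"\x8f\xac\x93\x87 \x8fG\x95v (Hideo Kojima)"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "name.txt"  # hypothetical file name
    path.write_bytes(raw)

    # Built-in open: undecodable bytes become U+FFFD.
    with open(path, encoding="ascii", errors="replace") as f:
        via_open = f.read()

    # Path.read_text: undecodable bytes become \xNN escapes.
    via_pathlib = path.read_text(encoding="ascii", errors="backslashreplace")

print(via_open)
print(via_pathlib)
```

In both calls the errors argument means the same thing as it does for bytes.decode.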
In conclusion...
In this post you learned 6 different ways to handle codec operation errors.
The default strategy (errors="strict") raises an exception when an error occurs.
But sometimes you want your program to continue processing data, either by omitting bad data (errors="ignore") or by replacing bad data with replacement characters (errors="replace").
If you are generating an HTML or XML document, you can replace malformed data with XML character references (errors="xmlcharrefreplace").
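Here is a minimal sketch of that last idea, assuming you want an ASCII-only HTML fragment. The `<p>` wrapper and the variable names are made up for illustration, and html.unescape stands in for what a browser does when it resolves the references:

```python
import html

title = "小島 秀夫 (Hideo Kojima)"

# Escape markup characters first, then build the fragment as a str.
fragment = f"<p>{html.escape(title)}</p>"

# Non-ASCII characters become numeric character references.
ascii_bytes = fragment.encode("ascii", errors="xmlcharrefreplace")

# Resolving the references gives back the original text.
recovered = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)
print(recovered)
```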
My challenge to you:
This post discussed 6 different ways to handle codec operation errors. There is another way, "surrogateescape". Learn how to use "surrogateescape" and create an example of decoding-then-encoding a file using it.
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)