Encoding
Unicode
We have worked with strings quite a bit so far.
In Python 3, the default str type represents a Unicode string.
Unicode is a standard that attempts to include the characters of every written language.
Unicode itself does not dictate how these characters should be represented in actual bytes. For that, we need a character encoding. UTF-8 is the most popular encoding and the one that we will be using primarily.
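To see why the encoding matters, note that the same character maps to different bytes under different codecs. A small sketch using only the standard encode method (the codec names are all built into Python):

```python
# The sparkle character U+2728, encoded with a few common codecs.
# Each encoding produces a different byte sequence for the same character.
char = "\u2728"

print(char.encode('utf-8'))     # three bytes
print(char.encode('utf-16le'))  # two bytes
print(char.encode('utf-32le'))  # four bytes

# An encoding that lacks the character raises UnicodeEncodeError.
try:
    char.encode('ascii')
except UnicodeEncodeError as e:
    print(e)
```

So bytes on their own are meaningless as text: we always need to know which encoding produced them.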
Bytes
In Python 3, strings are Unicode. For Python 2 compatibility, string literals also support a u prefix to denote Unicode:
>>> "hello"
'hello'
>>> u"hello"
'hello'
>>> type("hello")
<class 'str'>
>>> type(u"hello")
<class 'str'>
If we want to represent our string as a UTF-8 byte string, we can encode it:
>>> word = "hello"
>>> byte_word = word.encode('utf-8')
>>> byte_word
b'hello'
>>> type(byte_word)
<class 'bytes'>
We can also make a byte string by using the b prefix:
>>> b"hello"
b'hello'
This pretty much looks the same except there’s a b prefix. ASCII characters are boring. Let’s make a Unicode sparkle character:
>>> sparkles = "\u2728"
>>> print(sparkles)
✨
>>> sparkle_bytes = sparkles.encode('utf-8')
>>> sparkle_bytes
b'\xe2\x9c\xa8'
>>> len(sparkles)
1
>>> len(sparkle_bytes)
3
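Iterating a bytes object yields one integer per byte, which makes it easy to see the three bytes UTF-8 used for the sparkle:

```python
sparkle_bytes = "\u2728".encode('utf-8')

# Iterating a bytes object yields integers, one per byte.
for b in sparkle_bytes:
    print(hex(b))
```

This is why the lengths above differ: len of a str counts characters, while len of a bytes object counts bytes.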
We can also take a byte string and decode it to get Unicode characters:
>>> party_bytes = b"\xf0\x9f\x8e\x88\xf0\x9f\x8e\x80"
>>> party = party_bytes.decode('utf-8')
>>> print(party)
🎈🎀
So decode takes a byte string of known encoding and converts it to a Unicode string. Likewise, encode takes a Unicode string and converts it to a byte string, representing it in a particular encoding.
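A useful sanity check is the round trip: encoding and then decoding with the same codec always returns the original string. A small sketch:

```python
text = "hello \u2728"

# Encode with one codec, decode with the same codec: a lossless round trip.
assert text.encode('utf-8').decode('utf-8') == text
assert text.encode('utf-16le').decode('utf-16le') == text

# Decoding with a different codec than the one used for encoding is a bug,
# even in cases where it happens not to raise an error.
```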
Encodings
Why do we care about encodings? Can’t we just try decoding and if it fails, try a different encoding?
Sometimes this works:
>>> surprise_bytes = b"=\xd8\xa9\xdc"
>>> surprise_bytes.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 3: unexpected end of data
>>> surprise_bytes.decode('utf-16le')
'💩'
Sometimes this doesn’t work:
>>> surprise_bytes = b"\xf0\x9f\x92\xa9"
>>> surprise_bytes.decode('utf-16le')
'\u9ff0ꦒ'
>>> surprise_bytes.decode('utf-8')
'💩'
In that case we could probably guess that we got some nonsense characters, but what if we get characters that seem to make sense?
>>> surprise_bytes = b"\xd8<\xdf\x86"
>>> surprise = surprise_bytes.decode('utf-16le')
>>> surprise
'㳘蛟'
>>> surprise_bytes.decode('utf-16be')
'🎆'
Pragmatic Unicode
When our programs store text on disk, send text over the network, or convert text to bytes for any other reason, we need to specify an encoding.
Ned Batchelder gave a talk called Pragmatic Unicode which is listed in the lessons for today.
The facts of life from Ned’s talk:
All input and output of your program is bytes. We can only send bytes to people.
The world needs more than 256 symbols to communicate.
Your program needs to deal with both bytes and Unicode.
You can’t infer encoding from a stream of bytes. Encoding is out-of-band.
Declared encodings can be wrong.
Pro tips from Ned’s talk:
Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.
Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and for your byte strings, what encoding they use.
Test your Unicode support. Use exotic strings throughout your test suites to be sure you’re covering all the cases.
A Unicode sandwich has bytes on the outside and Unicode on the inside. With a Unicode sandwich, we:
read data in as raw bytes
convert data to Unicode as soon as possible
convert data back to bytes only right before we need to
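The three steps above can be sketched with io.BytesIO standing in for a file or socket (the variable names are just illustrative):

```python
import io

# Stand-in for a network connection or a file opened in binary mode.
incoming = io.BytesIO("caf\u00e9 \u2728".encode('utf-8'))

# Bottom slice of bread: decode bytes to Unicode at the edge.
text = incoming.read().decode('utf-8')

# The filling: all processing happens on Unicode strings.
shouted = text.upper()

# Top slice: encode back to bytes only when the data leaves the program.
outgoing = io.BytesIO()
outgoing.write(shouted.encode('utf-8'))

print(outgoing.getvalue())
```

Everything in the middle works with str objects; encode and decode appear only at the edges.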