Bytes vs unicode codepoint handling in Python 3

2021-02-21


Yesterday there was an interesting discussion on Twitter about Python 2 vs Python 3's handling of byte strings and unicode strings.

I gave this some more thought and so you get to read (if you want) Walls Of Text about UTF-8, bytes and unicode right now, right here.

One interesting point that Andre made was that if we ignored all the legacy (non-UTF-8) byte encodings, the Python 2 approach of implicitly converting strings to bytes (always using UTF-8 as the byte encoding) would be better.

In my opinion (and experience), Python 3's way of dealing with this is actually quite sane and The Right Thing if you think about the operations you do on strings (and byte strings) and how they differ in behavior.

The Python 2 way was (even if we ignore non-UTF-8 byte encodings) error-prone and caused lots of headaches.

tl;dr: Because Python overloads len() and [] based on the type of data, the distinction between bytes and unicode strings needs to be encoded in the type system: the programmer has to pick how they want to interpret the data.


Update 2021-02-25: Found a link to the Rust documentation and how it handles strings: Bytes and Scalar Values and Grapheme Clusters! Oh My!

Different trade-offs -- you can't index into a String in Rust (because it stores everything as UTF-8 encoded bytes), but you can iterate over unicode codepoints with .chars() (since the UTF-8 sequence can be decoded on the fly while iterating over the string).


Some assumptions

This discussion ignores that the language might have to deal with legacy encodings (e.g. locale-based filesystem encoding). If you're running with a non-UTF-8 locale on your own machine in 2021, rethink your life choices. Then again, your code might need to run on some customer's legacy hardware, but that could be solved via third-party libraries, etc.

So we're left with a clean environment where all byte-based text I/O is encoded with UTF-8 (this includes filenames in the filesystem, file contents as well as network sockets, etc.). Yes, this includes HTTP sockets -- let's pretend HTTP had been defined with UTF-8 as its default/fallback encoding instead of Latin-1 (or again, it could be solved in libraries anyway). And let's ignore that an HTML document can specify its encoding in a meta tag, which means it has to be parsed as Latin-1 first and then re-parsed once that meta tag is encountered, and it's all bad anyway. That's The Web for you.

Automagic conversion

So let's assume that byte strings (str in Py2, bytes in Py3) could always be auto-converted to/from unicode strings (unicode in Py2, str in Py3) whenever needed, and the only encoding used for this would be UTF-8. This would actually work and solve some problems: you couldn't deal with Latin-1 "in memory", but would have to convert it at the interface points (local file, device I/O and network socket I/O), while the core language and types would only ever support UTF-8.
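
As a rough sketch, this kind of implicit coercion boils down to something like the following happening behind the scenes whenever a bytes value meets a str (the helper name as_text is made up for illustration):

def as_text(value):
    # bytes are always assumed to be UTF-8 encoded text
    if isinstance(value, bytes):
        return value.decode('utf-8')  # raises UnicodeDecodeError for invalid UTF-8
    return value

>>> as_text(b'w\xc3\xb6rld') == as_text('wörld')
True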

This would indeed work fine. But this assumes that the only thing we're ever doing with text strings is copying them as a whole. So let's see what operations we would do on a string.

Splitting on a separator

This would work fine for both UTF-8 encoded byte strings as well as unicode strings. Space (ASCII 0x20, encoded as 0x20 in UTF-8, also unicode codepoint U+0020) would have a unique representation, and splitting on it at the byte level would always work correctly (Python 3.9.1 here):

>>> hey = 'hello wörld'.encode()
>>> hey
b'hello w\xc3\xb6rld'
>>> hey.split()
[b'hello', b'w\xc3\xb6rld']
>>> [x.decode() for x in hey.split()]
['hello', 'wörld']

Similarly, splitting on any other ASCII character (or any 7-bit codepoint, which has a single-byte representation in UTF-8) would work independently of whether bytes or str is used -- this works because in UTF-8, bytes below 0x80 never appear inside a multi-byte sequence.

Assuming we can split on arbitrary-length strings and not just a single byte, splitting on the UTF-8-encoded representation of a string would still work fine:

>>> 'aöböc'.split('ö')
['a', 'b', 'c']
>>> [x.decode() for x in 'aöböc'.encode().split('ö'.encode())]
['a', 'b', 'c']

Finding substrings

This would always work fine. If haystack and needle had different representations, auto-converting either one to the type of the other would give the same result (continuing from above):

>>> hey.find('wörld'.encode())
6
>>> hey.decode().find('wörld')
6

Note that depending on whether the UTF-8 representation or the unicode string representation is used, the offset might be different:

>>> 'ö abc'.find('abc')
2
>>> 'ö abc'.encode().find('abc'.encode())
3

Which brings us to the next topic...

Bytes, Codepoints, On-Screen Characters

From this point, we have to think of different "units":

  • Bytes: Treat each byte as one "unit", this is useful for working with binary data / C structs, etc...
  • Codepoints: Each unicode codepoint is a "unit", this is useful if you care about "text" and e.g. an umlaut (which would be encoded as 2 bytes in UTF-8) should be treated as a single character
  • Characters on screen: Unicode has modifiers, a zero-width joiner and other fun stuff. For example, 👩🏿‍🏫 ("dark skin tone female teacher") is in some cases - depending on the font and operating system used - a single character on screen, but is composed of 4 unicode codepoints: U+1F469, U+1F3FF, U+200D, U+1F3EB, which encode into 15(!) UTF-8 bytes. On another machine, this 15-byte sequence could be rendered as 2 characters on screen: 👩🏿 ("U+1F469, U+1F3FF dark skin tone woman") and 🏫 ("U+1F3EB school"), or - depending on the font - as 3 characters: 👩 ("U+1F469 woman"), 🏿 ("U+1F3FF dark skin tone") and 🏫 ("U+1F3EB school"). The "U+200D zero width joiner" is not rendered on screen, but with an old enough font renderer it might show up as tofu, so you could even get 4 characters like □□□□ there (the snippet right below this list takes these codepoints apart in the REPL).

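To make the difference between bytes and codepoints concrete, here is the teacher emoji from the list above taken apart with the stdlib unicodedata module (how it looks on screen is the one thing the REPL can't tell us):

>>> import unicodedata
>>> teacher = '\U0001F469\U0001F3FF\u200D\U0001F3EB'
>>> [f'U+{ord(c):04X} {unicodedata.name(c)}' for c in teacher]
['U+1F469 WOMAN', 'U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6', 'U+200D ZERO WIDTH JOINER', 'U+1F3EB SCHOOL']
>>> [len(c.encode('utf-8')) for c in teacher]
[4, 4, 3, 4]
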
For 7-bit data (ASCII characters), all 3 definitions usually coincide: a lowercase "a" is encoded as byte 0x61 in UTF-8, has the unicode codepoint U+0061 and is one character on screen.

For the unicode codepoints in Latin-1's upper range (codepoints 128-255), UTF-8 encodes each one as 2 bytes (0xC2 0x80 for 128, 0xC3 0xBF for 255, the "ÿ", and so on). Those would, however, still be rendered as a single character per unicode codepoint on the screen (ignoring control characters here).
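
A quick check in the REPL:

>>> chr(128).encode('utf-8')
b'\xc2\x80'
>>> 'ÿ'.encode('utf-8')
b'\xc3\xbf'
>>> len('ÿ'), len('ÿ'.encode('utf-8'))
(1, 2)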

This by the way also explains why UTF-8 encoded data interpreted as Latin-1 leads to well-known garbage that you might have seen in shipping labels:

>>> 'Währinger Straße'.encode('utf-8').decode('latin-1')
'WÃ¤hringer StraÃ\x9fe'

Codepoint 228 ("ä", U+00E4) encodes as 0xC3 0xA4 in UTF-8, and 0xC3 in Latin-1 is "Ã" while 0xA4 is "¤". This is why you should only ever use UTF-8. Maybe you NEED to convert to Latin-1 at the point where you communicate with your legacy serial label printer or whatever, but don't let your choice of label printer dictate how you store data on disk and in memory.
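
You can reproduce the mechanism byte by byte:

>>> 'ä'.encode('utf-8')
b'\xc3\xa4'
>>> b'\xc3\xa4'.decode('latin-1')
'Ã¤'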

Byte count, string length, length "on screen"

In Python, the len() function is overloaded to work with the built-in container types: you can get the length of a list, a tuple, a dict, a bytes object or a (unicode) string with the same function.

Depending on the type of object, len() does different things (first is a (unicode) string, second is the string encoded as UTF-8 byte string):

>>> len('Währinger Straße')
16
>>> len('Währinger Straße'.encode())
18

So if the type system automatically converted between bytes and strings, should len() count bytes or codepoints?

And what about our good old emoji with modifiers case?

>>> len('👩🏿‍🏫')
4
>>> len('👩🏿‍🏫'.encode())
15

It prints the length as 4 unicode codepoints, which is not wrong, but at the same time doesn't really correspond to what's rendered on screen (depending on your terminal, OS and font, it might not even render as a single character; it does for me on macOS in the built-in Terminal.app as well as in Safari when viewing this web page).

The thing is, at the point where the REPL of Python "talks" to the system (usually some pseudo TTY), it doesn't know much about the terminal and OS/font that is used. There's the $TERM variable and various locale-specific variables (that hopefully make Python use a UTF-8 locale, or it would all be even worse!), and you can make some assumptions, but if I SSH into another machine, use a serial line or even inspect the console output of a program running in CI in a web browser (which itself might use OS-specific fonts and rendering), all bets are off in terms of "how is this set of Unicode codepoints displayed to the user?".

So as far as I'm aware, we can't really answer the "how many characters will this be on screen?" question properly from the language -- you can still do it if you use a GUI toolkit and/or have some control over how the characters are rendered, but for command-line tools, you cannot generally determine how things will be displayed.
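
If you need a best-effort estimate of terminal width anyway, one common approach is the third-party wcwidth package, which sums up per-codepoint widths from the Unicode tables -- but it's still only an estimate and can disagree with what your terminal actually renders:

>>> from wcwidth import wcswidth  # third-party: pip install wcwidth
>>> wcswidth('Währinger Straße')
16
>>> wcswidth('日本語')  # full-width CJK characters take 2 columns each
6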

String slicing / substr

Let's now assume that we want to get parts of a string, like "get the first N characters". In Python, this can be accomplished with the slicing operator, and just like len() it works not only on strings and bytes, but also on lists (and tuples):

>>> a = [1, 2, 3, 4, 5]
>>> a[:3]
[1, 2, 3]
>>> a[1:3]
[2, 3]
>>> a[2:]
[3, 4, 5]

The problem is, just like with len(), slicing has different output depending on whether the string is UTF-8 encoded bytes or a (unicode) string:

>>> 'brötchen'[6:]
'en'
>>> 'brötchen'.encode()[6:]
b'hen'

There's no Right or Wrong variant; both make sense, you just have to specify whether your units are "bytes" or "unicode codepoints" (and again, "characters on screen" are a hard problem to solve, see above).

If your bytes are actually binary data, trying to interpret them as UTF-8 and "counting" codepoints this way will be very wrong. "Give me the first 6 bytes" and "give me the first 6 unicode codepoints" are two different requests which will give two different results.
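
For example, with a few bytes of actual binary data (here, the start of the PNG file signature), interpreting them as UTF-8 simply fails:

>>> header = bytes([0x89, 0x50, 0x4E, 0x47])
>>> header
b'\x89PNG'
>>> header.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte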

This gets even worse if you expect to be able to cut UTF-8 byte strings at any point, which does not work.

For example, back in the Python 2 days, I had a rare bug in my code (which elided strings) that happened only in some corner cases. Consider that you want to limit your string to 15 characters, and if it's longer than that, elide it (I am aware that there is an ellipsis character and it's different from three dots, but we'll go with the three dots here for simplicity):

>>> s = 'Sprachenbenützungsproblem'
>>> if len(s) > 15:
...     s = s[:12] + '...'
...
>>> s
'Sprachenbenü...'

Sweet! Now how would this work with bytes?

>>> s = 'Sprachenbenützungsproblem'.encode()
>>> if len(s) > 15:
...     s = s[:12] + b'...'
...
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 11: invalid continuation byte

Argh, here we are -- the dreaded UnicodeDecodeError!

The reason is simple: "ü" gets encoded as 2 bytes:

>>> 'ü'.encode()
b'\xc3\xbc'

If you decode those two bytes back, it works just fine:

>>> b'\xc3\xbc'.decode()
'ü'

However, if you just decode the first byte, it will result in weirdness (the error is slightly different here, because the string ends and the decoder expects another byte, whereas above it tried to decode the first "." (in b'...') as a continuation byte, which fails):

>>> b'\xc3'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

In any case, the point is that you can't just arbitrarily cut UTF-8 byte strings and expect to get a valid UTF-8 sequence back. Here again, you need to know whether you're dealing with bytes or text data: it is totally okay to cut a byte string between 0xC3 and 0xBC if it's not UTF-8 encoded data, but rather e.g. two bytes encoding the numbers 195 and 188, and you just care about the first number:

>>> import struct
>>> packed = struct.pack('<BB', 195, 188)
>>> packed
b'\xc3\xbc'
>>> struct.unpack('<B', packed[:1])
(195,)
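
Conversely, if you do need to cut text down to a byte budget (say, a length-limited protocol field) without producing invalid UTF-8, you have to drop any trailing partial sequence after the cut. A minimal sketch (the helper name truncate_utf8 is made up; note that errors='ignore' would also silently drop any other invalid bytes, so only use this on data you know started out as valid UTF-8):

def truncate_utf8(data: bytes, limit: int) -> bytes:
    # cut at the byte limit, then drop a trailing partial codepoint (if any)
    return data[:limit].decode('utf-8', errors='ignore').encode('utf-8')

>>> truncate_utf8('Sprachenbenützungsproblem'.encode(), 12)
b'Sprachenben'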

Collation

Another thing that is actually very, very tricky is sorting alphabetically. How things are sorted might differ from locale to locale. This is mostly a localization "problem", but since strings in Python are sorted "by codepoint" by default, it means that if you sort a list of strings, you might not get what you expect:

>>> sorted(['Argentinien', 'Österreich', 'Zimbabwe', 'Oman'])
['Argentinien', 'Oman', 'Zimbabwe', 'Österreich']

These are German-language country names, and Austria is "Österreich", which I kind of expect to be sorted next to other names starting with "O", but "Ö" has codepoint 214, whereas "O" is 79 and "Z" is 90. Not to mention that e.g. local web shops would probably have 99% of their customers order to Austria, so it might make sense to sort it all the way at the top.

This is a UX problem and doesn't directly relate to encoding, but since the technical default sorting order is unicode codepoints, it kinda relates to encoding and shows that it might make sense to have some "higher-level" representation where e.g. comparing two strings takes care of locale-specific collation and on-screen representation (the "how many characters on the screen" question above).
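
For the collation part specifically, the standard library's locale module can already help, provided the relevant locale data is installed on the system (the exact ordering depends on the system's collation tables, so treat the output as illustrative):

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
'de_DE.UTF-8'
>>> sorted(['Argentinien', 'Österreich', 'Zimbabwe', 'Oman'], key=locale.strxfrm)
['Argentinien', 'Oman', 'Österreich', 'Zimbabwe']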

Closing remarks

It's true that all the problems above come from the simple fact that functions like len() and operators like the slicing operator (e.g. [:3]) are overloaded (the same syntax causes different behavior depending on the type of the object it's called on).

If you had e.g. n_bytes() and n_codepoints() as separate functions, you could indeed treat UTF-8 encoded byte strings and text as one and the same type and decide at call time how you want to interpret the data (but again, n_codepoints() called on an arbitrary bytes object can fail if it's not valid UTF-8, so you can never fully "hide" the encoding).
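
As a sketch of what those two hypothetical functions could look like for UTF-8 encoded byte strings (reusing the numbers from the Währinger Straße example above):

def n_bytes(data: bytes) -> int:
    return len(data)

def n_codepoints(data: bytes) -> int:
    # raises UnicodeDecodeError if data is not valid UTF-8
    return len(data.decode('utf-8'))

>>> n_bytes('Währinger Straße'.encode()), n_codepoints('Währinger Straße'.encode())
(18, 16)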

In a language like C where there's no overloading of functions, you basically only get "byte strings" and have to deal with UTF-8 separately, but that's fine: you can have functions that parse it as UTF-8 and return the length in codepoints. And IMHO that's a much nicer solution than wchar_t - for C.

But since Python has overloaded functions and operators that work on bytes and strings (and you still need to differentiate between when you want to work on bytes and when you want to work on codepoints), it kind of requires that the type system encodes whether some value is currently bytes or unicode "text".

I do agree that UTF-8 should be the One True Encoding (if it isn't already...), but similarly I enjoy that Python 3 deals with it in a much saner way than Python 2 did, given its language design choices (overloaded len(), index [i] and slicing [n:m] operators).

And yes, it means you get some one-time migration pains when going from Python 2 to Python 3, but the approach in Python 3 is much saner.

And yes, it means you get to hate on Python if some tooling you depend on was still on Python 2 and now it breaks.

Less magic, more thinking required from the developer where thinking -- as you can see above -- is actually needed and cannot be abstracted away fully.

In fact, maybe it even makes sense to have a "higher-level" type on top of unicode strings that can include features such as "visible characters on screen" for its len(), group those characters for slicing (so the dark skin tone female teacher is considered a "single item" and not 4 codepoints) and have other higher-level things abstracted away for those cases where you just want to e.g. "show the first 10 visible characters on screen".
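
Grapheme cluster segmentation (the closest thing to "visible characters" that is actually specified, in Unicode's text segmentation rules, UAX #29) isn't in the standard library, but the third-party regex module can do it via \X. A sketch of what such a higher-level len() and slicing could build on (results depend on the Unicode version the library implements):

>>> import regex  # third-party "regex" module, not the stdlib "re"
>>> regex.findall(r'\X', 'brötchen')
['b', 'r', 'ö', 't', 'c', 'h', 'e', 'n']
>>> len(regex.findall(r'\X', '👩🏿‍🏫'))
1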

Thomas Perl · 2021-02-21