Yesterday there was an interesting discussion on Twitter about Python 2 vs Python 3's handling of byte strings and unicode strings.
I gave this some more thought and so you get to read (if you want) Walls Of Text about UTF-8, bytes and unicode right now, right here.
One interesting point that Andre made was that if we ignored all the legacy (non-UTF-8) byte encodings, the Python 2 approach of implicitly converting strings to bytes (always using UTF-8 as the byte encoding) would be better.
In my opinion (and experience), Python 3's way of dealing with this is actually quite sane and The Right Thing if you think about the operations you do on strings (and byte strings) and how they differ in behavior.
The Python 2 way was (even if we would ignore non-UTF-8 byte encodings) error-prone and caused lots of headache.
tl;dr: Because Python overloads len() and [] based on the type of data, the distinction between bytes and unicode strings needs to be encoded in the type system: the programmer has to pick how they want to interpret the data.
Update 2021-02-25: Found a link to the Rust documentation and how it handles strings: Bytes and Scalar Values and Grapheme Clusters! Oh My!
Different trade-offs -- you can't index into a String in Rust (because it stores everything as UTF-8 encoded bytes), but you can iterate over unicode codepoints with .chars() (because UTF-8 can be decoded incrementally while iterating over the string). Tradeoffs.
Some assumptions
This discussion ignores that the language might have to deal with legacy encodings (e.g. locale-based filesystem encoding). If you're running with a non-UTF-8 locale on your own machine in 2021, rethink your life choices. Then again, your code might need to run on some legacy hardware of some customer, but that could be solved via third party libraries, etc...
So we're left with a clean environment, where all byte-based text I/O is encoded with UTF-8 (this includes filenames in the filesystem, file contents as well as network sockets, etc.). Yes, this includes HTTP sockets; let's pretend that HTTP had been defined with UTF-8 as its default/fallback encoding, not Latin-1 (or again, it could be solved in libraries anyway). And let's ignore that an HTML document can specify its encoding in a meta tag, which means it has to be parsed as Latin-1 first and then re-parsed when that meta tag is encountered, and it's all bad anyway. That's The Web for you.
Automagic conversion
So let's assume that byte strings (str in Py2, bytes in Py3) could always be auto-converted to/from unicode strings (unicode in Py2, str in Py3) whenever needed, and the only encoding used for this would be UTF-8. This would actually work and solve some problems: you couldn't deal with Latin-1 "in memory", but would have to convert it at the interface points (local file, device I/O and network socket I/O), while the core language and types would only ever support UTF-8.
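To make this concrete: in today's Python 3, mixing the two types is a hard error; the hypothetical auto-conversion rule would instead do the .encode() (always UTF-8) for you:
>>> b'hello ' + 'wörld'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> b'hello ' + 'wörld'.encode()
b'hello w\xc3\xb6rld'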
This would indeed work fine. But this assumes that the only thing we're ever doing with text strings is copying them as a whole. So let's see what operations we would do on a string.
Splitting on a separator
This would work fine for both UTF-8 encoded byte strings and unicode strings. Space (ASCII 0x20, encoded as 0x20 in UTF-8, also unicode codepoint U+0020) has a unique representation, and splitting on it at the byte level would always work correctly (Python 3.9.1 here):
>>> hey = 'hello wörld'.encode()
>>> hey
b'hello w\xc3\xb6rld'
>>> hey.split()
[b'hello', b'w\xc3\xb6rld']
>>> [x.decode() for x in hey.split()]
['hello', 'wörld']
Similarly, splitting on any other ASCII character (or 7-bit
codepoint that has a single-byte representation in UTF-8)
would work independent of whether bytes
or str
is used.
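For example, splitting on a comma gives the same pieces whether we operate on str or on the UTF-8 encoded bytes:
>>> 'a,ö,b'.split(',')
['a', 'ö', 'b']
>>> 'a,ö,b'.encode().split(b',')
[b'a', b'\xc3\xb6', b'b']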
Assuming we can split on arbitrary-length strings and not just a single byte, splitting on the UTF-8-encoded representation of a string would still work fine:
>>> 'aöböc'.split('ö')
['a', 'b', 'c']
>>> [x.decode() for x in 'aöböc'.encode().split('ö'.encode())]
['a', 'b', 'c']
Finding substrings
This would always work fine. If haystack and needle had different representations, auto-converting either one to the other's type would give the same result (continuing from above):
>>> hey.find('wörld'.encode())
6
>>> hey.decode().find('wörld')
6
Note that depending on whether the UTF-8 representation or the unicode string representation is used, the offset might be different:
>>> 'ö abc'.find('abc')
2
>>> 'ö abc'.encode().find('abc'.encode())
3
Which brings us to the next topic...
Bytes, Codepoints, On-Screen Characters
From this point, we have to think of different "units":
- Bytes: Treat each byte as one "unit", this is useful for working with binary data / C structs, etc...
- Codepoints: Each unicode codepoint is a "unit", this is useful if you care about "text" and e.g. an umlaut (which would be encoded as 2 bytes in UTF-8) should be treated as a single character
- Character on screen: Unicode has modifiers, a zero-width joiner and other fun stuff. For example, a 👩🏿🏫 "dark skin tone female teacher" is in some cases - depending on the font and operating system used - a single character on screen, but is composed of 4 unicode codepoints: U+1F469, U+1F3FF, U+200D, U+1F3EB, which encode into 15(!) UTF-8 bytes (see the snippet right after this list). On another machine, this 15-byte sequence could be rendered as 2 characters on screen: 👩🏿 "U+1F469, U+1F3FF dark skin tone woman" and 🏫 "U+1F3EB school", or, depending on the font, as 3 characters on screen: 👩 "U+1F469 woman", 🏿 "U+1F3FF dark skin tone" and 🏫 "U+1F3EB school". The "U+200D zero width joiner" itself is not rendered on-screen, but I guess if you have an old enough font renderer, it might be rendered as tofu in some cases, so you would get 4 characters like □□□□ there.
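Spelled out with escape sequences (so the invisible joiner shows up), this is what that sequence looks like in Python:
>>> teacher = '\U0001F469\U0001F3FF\u200D\U0001F3EB'
>>> [hex(ord(c)) for c in teacher]
['0x1f469', '0x1f3ff', '0x200d', '0x1f3eb']
>>> [len(c.encode()) for c in teacher]
[4, 4, 3, 4]
>>> len(teacher.encode())
15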
For 7-bit data (ASCII characters), all 3 definitions are usually the same, a lowercase "a" is encoded as byte 0x61 in UTF-8, has the unicode codepoint U+0061 and is one character on-screen.
For the unicode codepoints that also exist in Latin-1, i.e. codepoints 128-255, UTF-8 encodes each one as 2 bytes (0xC2 0x80 for 128, 0xC3 0xBF for 255, the "ÿ", and so on). Those would, however, still be rendered as a single character per unicode codepoint on the screen (ignoring control characters here).
This by the way also explains why UTF-8 encoded data interpreted as Latin-1 leads to well-known garbage that you might have seen in shipping labels:
>>> 'Währinger Straße'.encode('utf-8').decode('latin-1')
'WÃ¤hringer StraÃ\x9fe'
Codepoint 228 ("ä", U+00E4) encodes as 0xC3 0xA4 in UTF-8, and 0xC3 in Latin-1 is "Ã" while 0xA4 is "¤". This is why you should only ever use UTF-8. Maybe you NEED to convert to Latin-1 at the point where you communicate with your legacy serial label printer or whatever, but don't make your choice of label printer dictate how you store data on disk and in memory.
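You can watch the mechanics on a single character:
>>> 'ä'.encode('utf-8')
b'\xc3\xa4'
>>> 'ä'.encode('utf-8').decode('latin-1')
'Ã¤'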
Byte count, string length, length "on screen"
In Python, the len() function is overloaded to work with many of its built-in types: you can get the length of a list, a tuple, a dict, a bytes object or a (unicode) string with the same function.
Depending on the type of object, len()
does different
things (first is a (unicode) string, second is the string
encoded as UTF-8 byte string):
>>> len('Währinger Straße')
16
>>> len('Währinger Straße'.encode())
18
So if the type system automatically converted between bytes and strings, should len() always work on the bytes or on the string?
And what about our good old emoji with modifiers case?
>>> len('👩🏿🏫')
4
>>> len('👩🏿🏫'.encode())
15
It prints the length as 4 unicode codepoints, which is not wrong, but at the same time doesn't really correspond to what's rendered on-screen (depending on your terminal, OS and font, it might not even render as a single character; it does for me on macOS with the built-in Terminal.app as well as in Safari when viewing this web page).
The thing is, at the point where the REPL of Python "talks" to the system
(usually some pseudo TTY), it doesn't know much about the terminal and OS/font
that is used. There's the $TERM
variable and various locale-specific
variables (that hopefully make Python use a UTF-8 locale, or it would all be even
worse!), and you can make some assumptions, but if I SSH into another machine,
use a serial line or even inspect the console output of a program running in CI
in a web browser (which itself might use OS-specific fonts and rendering), all
bets are off in terms of "how is this set of Unicode codepoints displayed to
the user?".
So as far as I'm aware, we can't really answer the "how many characters will this be on screen?" question properly from the language -- you can still do it if you use a GUI toolkit and/or have some control over how the characters are rendered, but for command-line tools, you cannot generally determine how things will be displayed.
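If you only care about terminals, a common approximation is the third-party wcwidth package, which estimates how many terminal cells a string will occupy -- but even that is a guess about the terminal, not a property of the string:
>>> from wcwidth import wcswidth  # third-party package, not in the stdlib
>>> wcswidth('abc')
3
>>> wcswidth('コンニチハ')
10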
String slicing / substr
Let's now assume that we want to get parts of a string,
like "get the first N characters". In Python, this can
be accomplished with the slicing operator, and just like
len()
it works not only on strings and bytes, but also
on lists (and tuples):
>>> a = [1, 2, 3, 4, 5]
>>> a[:3]
[1, 2, 3]
>>> a[1:3]
[2, 3]
>>> a[2:]
[3, 4, 5]
The problem is, just like with len(), slicing has different output depending on whether the string is UTF-8 encoded bytes or a (unicode) string:
>>> 'brötchen'[6:]
'en'
>>> 'brötchen'.encode()[6:]
b'hen'
There's no Right or Wrong variant; both make sense, you just have to specify whether your units are "bytes" or "unicode codepoints" (and again, "characters on screen" are a hard problem to solve, see above).
If your bytes are actually binary data, trying to interpret them as UTF-8 and "counting" codepoints this way will be very wrong. "Give me the first 6 bytes" and "give me the first 6 unicode codepoints" are two different requests which will give two different results.
This gets even worse if you expect to be able to cut UTF-8 byte strings at any point, which does not work.
For example, back in the Python 2 days, I had a rare bug in my code (which elided strings) that happened only in some corner cases. Consider that you want to limit your string to 15 characters, and if it's longer than that, elide it (I am aware that there is an ellipsis character and it's different from three dots, but we'll go with the three dots here for simplicity):
>>> s = 'Sprachenbenützungsproblem'
>>> if len(s) > 15:
... s = s[:12] + '...'
...
>>> s
'Sprachenbenü...'
Sweet! Now how would this work with bytes?
>>> s = 'Sprachenbenützungsproblem'.encode()
>>> if len(s) > 15:
... s = s[:12] + b'...'
...
>>> s.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 11: invalid continuation byte
Argh, here we are -- the dreaded UnicodeDecodeError!
The reason is simple: "ü" gets encoded as 2 bytes:
>>> 'ü'.encode()
b'\xc3\xbc'
If you decode those two bytes back, it works just fine:
>>> b'\xc3\xbc'.decode()
'ü'
However, if you decode just the first byte, it will result in weirdness (the error is slightly different here, because here the string ends and the decoder expects another byte, whereas above it tries to decode the first "." (in "...") as a continuation byte, which fails):
>>> b'\xc3'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
In any case, the point is that you can't just arbitrarily cut off UTF-8 byte strings and expect to get a valid UTF-8 sequence back. Here again, you need to know whether you're dealing with bytes or text data: it is totally okay to cut a byte string between 0xC3 and 0xBC if it's not UTF-8 encoded data, but rather e.g. two bytes encoding the numbers 195 and 188, and you just care about the first number:
>>> import struct
>>> packed = struct.pack('<BB', 195, 188)
>>> packed
b'\xc3\xbc'
>>> struct.unpack('<B', packed[:1])
(195,)
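And if, on the other hand, your bytes are UTF-8 text and what you actually want is "the first 12 codepoints", the safe route is to decode first, slice at the codepoint level, and only then re-encode:
>>> s = 'Sprachenbenützungsproblem'.encode()
>>> (s.decode()[:12] + '...').encode()
b'Sprachenben\xc3\xbc...'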
Collation
Another thing that is actually very, very tricky is sorting alphabetically. How things are sorted can differ from locale to locale. This is mostly a localization "problem", but since strings in Python are sorted "by codepoint" by default, it means that if you sort a list of strings, you might not get what you expect:
>>> sorted(['Argentinien', 'Österreich', 'Zimbabwe', 'Oman'])
['Argentinien', 'Oman', 'Zimbabwe', 'Österreich']
These are German-language country names, and Austria is "Österreich", which I kind of expect to be sorted next to other names starting with "O", but "Ö" has codepoint 214, whereas "O" is 79 and "Z" is 90. Not to mention that e.g. local web shops would probably have 99% of their customers order to Austria, so it might make sense to sort it all the way at the top.
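If you want locale-aware sorting, the standard library's locale module can do it via strxfrm -- assuming the locale (de_DE.UTF-8 here, just as an example) is actually installed; the exact order may vary by platform:
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
'de_DE.UTF-8'
>>> sorted(['Argentinien', 'Österreich', 'Zimbabwe', 'Oman'], key=locale.strxfrm)
['Argentinien', 'Oman', 'Österreich', 'Zimbabwe']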
This is a UX problem and doesn't directly relate to encoding, but since the technical default sorting order is unicode codepoints, it kinda relates to encoding and shows that it might make sense to have some "higher-level" representation where e.g. comparing two strings takes care of locale-specific collation and on-screen representation (the "how many characters on the screen" question above).
Closing remarks
It's true that all the problems above come from the simple fact that functions like len() and operators like the slicing operator (e.g. [:3]) are overloaded (the same syntax causes different behavior depending on the type of the object it's called on).
If you had e.g. n_bytes() and n_codepoints() as separate functions, you could indeed treat UTF-8 encoded byte strings and text strings as one and the same type, and decide at call time how you want to interpret the data (but again, n_codepoints() called on an arbitrary bytes object can fail if it's not valid UTF-8, so you can never fully "hide" the encoding).
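A minimal sketch of how that could look, with everything kept as bytes and the interpretation picked at call time (n_bytes() and n_codepoints() are of course made-up names from the idea above, not real Python APIs):
>>> def n_bytes(data):
...     return len(data)
...
>>> def n_codepoints(data):
...     return len(data.decode())  # can raise UnicodeDecodeError for non-UTF-8 bytes
...
>>> n_bytes('brötchen'.encode())
9
>>> n_codepoints('brötchen'.encode())
8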
In a language like C where there's no overloading of functions, you basically only get "byte strings" and have to deal with UTF-8 separately, but that's fine: you can have functions that parse the data as UTF-8 and return the length in codepoints. And IMHO that's a much nicer solution than wchar_t - for C.
But since Python has overloaded functions and operators that work on bytes and strings (and you still need to differentiate between when you want to work on bytes and when you want to work on codepoints), it kind of requires that the type system encodes whether some value is currently bytes or unicode "text".
I do agree that UTF-8 should be the One True Encoding
(if it isn't already...), but similarly I enjoy that
Python 3 deals with it in a much more sane way than
what Python 2 did given its language design choices
(overloaded len(), index [i] and slicing [n:m] operator).
And yes, it means you get some one-time migration pains when going from Python 2 to Python 3, but the approach in Python 3 is much saner.
And yes, it means you get to hate on Python if some tooling you depend on was still on Python 2 and now it breaks.
Less magic, more thinking required from the developer where thinking -- as you can see above -- is actually needed and cannot be abstracted away fully.
In fact, maybe it even makes sense to have a "higher-level"
type on top of unicode strings that can include features such
as "visible characters on screen" for its len()
, group
those characters for slicing (so the dark skin tone female
teacher is considered a "single item" and not 4 codepoints)
and have other higher-level things abstracted away for those
cases where you just want to e.g. "show the first 10 visible
characters on screen".
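Third-party libraries already point in that direction; for example, the regex package (an external module, not the built-in re) can split text into grapheme clusters with \X -- a quick sketch of the idea:
>>> import regex  # third-party package
>>> combined = 'e\u0301'  # "e" followed by a combining acute accent: 2 codepoints
>>> len(combined)
2
>>> regex.findall(r'\X', combined)
['é']
>>> len(regex.findall(r'\X', combined))
1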