One of the most important things to understand is the difference between a character set and an encoding. Confusing the two is probably the source of most misunderstandings about how all this works.
As an example, these are character sets:
- Windows-1252 (also known as CP-1252)
- ISO-8859-1 (also known as Latin-1)
while these are encodings:
- UTF-8
- UTF-16
What's the difference?!
If you want to deal with a piece of text (i.e. a "string"), the first problem you face is that computers can only handle numbers, not letters. The obvious solution is to come up with a system that converts letters to numbers (known as "code points"), e.g. A=1, B=2, C=3, etc. One such character set is ASCII, in which the string "Hello" is represented by the numbers: 72 101 108 108 111.
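The A=1, B=2 mapping above is just for illustration; real ASCII assigns different numbers. A quick sketch in Python (the language used later in this tutorial) shows the actual ASCII code points for "Hello", and how to convert them back:

```python
# Convert each letter of "Hello" to its ASCII code point.
code_points = [ord(ch) for ch in "Hello"]
print(code_points)  # [72, 101, 108, 108, 111]

# Convert the numbers back into letters.
text = "".join(chr(n) for n in code_points)
print(text)  # Hello
```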
Another character set is BCD (it was very popular in the early days of computing, although it's not used much these days - see below), and the same string "Hello" would be represented by the numbers: 24 21 35 35 38.
Note that when you want to convert a series of numbers back into letters, you must know what character set is being used. For example, if we tried to decode the ASCII numbers above as BCD, or vice versa, we wouldn't get the correct string back, since each number corresponds to different letters in the two character sets.
ASCII only allows for 128 different numbers (and hence a maximum of 128 different letters) and BCD even fewer (64), so while they might be OK for English, they're woefully inadequate for languages such as Chinese or Japanese, which have many thousands of characters between them. So, a new character set called Unicode was created, which allows for over a million characters (although only about 12% of these are currently in use).
If you want to write code that can handle international text, Unicode is the only character set you need to worry about. For example, if you wanted to write "日本" ("Japan" in Japanese), these 2 characters would be represented by the numbers 26085 and 26412 (since Unicode has over a million characters, they can't all be shown in a single chart as for ASCII and BCD, but there are plenty of places online where you can look them up), or in hexadecimal: 0x65E5 and 0x672C.
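You can check these code points yourself in Python, using the built-in ord() and chr() functions:

```python
# Unicode code points for the characters in 日本 ("Japan")
print(ord("日"))       # 26085
print(hex(ord("日")))  # 0x65e5
print(ord("本"))       # 26412

# And back the other way: code point to character.
print(chr(0x672C))     # 本
```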
However, there's a snag: if we want to store these numbers in memory, it's not so simple, since memory is a series of bytes, which can only hold numbers from 0 to 255, and these numbers are both much bigger than that. One solution is to store each number spread over two bytes, high byte first, like this: 0x65 0xE5 and 0x67 0x2C.
However, some CPUs were designed to store 2-byte values the other way around (you might ask why anyone would want to store numbers the "wrong" way around, but in the early days of computing, memory diagrams were drawn vertically, going upwards, and storing the low byte first makes sense there), like this: 0xE5 0x65 and 0x2C 0x67.
This is known as a little-endian system, since the low-value (or little) byte is stored first, while the first system above is known as big-endian.
The important thing to understand here is that there are 2 different ways we can store the numbers in memory (known as "encodings"), and as with character sets, it's crucial to know which one is being used when you're trying to convert the numbers back to text. If you take a string that was encoded big-endian (i.e. the first diagram above) and try to read it back as little-endian, the numbers will come out wrong (i.e. 0xE565 and 0x2C67), and hence will be converted to the wrong letters.
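Python's int.to_bytes() and int.from_bytes() make this easy to see in action. The sketch below stores the code point for 日 (0x65E5) both ways, then shows what happens if you read the big-endian bytes back with the wrong byte order:

```python
n = 0x65E5  # code point for 日

big = n.to_bytes(2, "big")        # b'\x65\xe5' - high byte first
little = n.to_bytes(2, "little")  # b'\xe5\x65' - low byte first

# Reading big-endian bytes back as little-endian gives the
# wrong number - and hence the wrong character.
wrong = int.from_bytes(big, "little")
print(hex(wrong))  # 0xe565
```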
Back in the day, when Unicode was smaller and only had 65,535 characters, the 2 encodings above were, in fact, used (they are known as UCS-2BE and UCS-2LE), but they are now obsolete, since Unicode has grown and would now require at least 3 bytes to store all the possible values. Furthermore, encodings like these are very wasteful when storing plain ASCII text (they would use 3 bytes to store every letter, when each one really only needs 1 byte), so new encodings have been devised to address these problems. By far the most common is UTF-8, but there are others, e.g. UTF-16 or EUC-JP.
How each encoding works is not important for the purpose of this discussion, but it's crucial to remember that they all do the same thing: convert code points (i.e. the numbers that represent each letter of your string) into bytes that can be stored in memory (or in a file, or in a network packet - anything that deals with a series of bytes). However, the way each encoding does this is different, so if you store a string in memory using one encoding, and then try to read it back using another, the code points will be wrong, and so the string read back will be wrong. You can think of it like compressing files: if you compress a file to a ZIP, then try to decompress it as a TAR, it won't work.
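Here's a quick illustration in Python: encode "日本" as UTF-8, then decode the bytes using the wrong encoding (CP-1252, the Windows character set mentioned earlier). CP-1252 maps nearly every byte to *some* character, so the decode "succeeds" - but the result is gibberish:

```python
data = "日本".encode("utf-8")
print(data)  # b'\xe6\x97\xa5\xe6\x9c\xac' - 6 bytes

# Decoding with the right encoding recovers the string...
print(data.decode("utf-8"))   # 日本

# ...but decoding with the wrong one produces mojibake.
print(data.decode("cp1252"))  # æ—¥æœ¬
```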
Putting it all together
In practice, there is only one rule: every time you do something with a string, you must know what character set is being used, and how the string was encoded. And just to make things more difficult, it's not always obvious when "every time you do something with a string" applies.
As an example, we'll write some Python code that gets information out of Awasu, some of which is not English, and generate an HTML page listing it, first using Python 2, and then again using Python 3, taking a look at the coding issues that come up, and how the two Pythons differ.
But Python 3 uses Unicode strings...?
As an aside, I suspect a lot of confusion people have about using strings in Python is because they keep hearing the phrase "Python 3 uses Unicode strings", which is a little misleading.
In Python 2, string variables are stored in memory as a series of bytes, nothing more, and can be used to store Unicode text (e.g. Chinese, Japanese, or other non-ASCII stuff), but you have to manage the encoding yourself. If you adopt a convention of always using, say, UTF-8, then you can quite happily manage Unicode text, even in Python 2. However, Python 2 string variables can also be used to store ASCII strings (or text in any character set, since they are, after all, just stored as a series of bytes), so you need to be very aware of what character set each string variable is using, since it's very easy to write code that looks like it's working (because you only tested it with ASCII text), only to have it fail when you push Unicode text through it.
In Python 3, string variables always use the Unicode character set, and if you try to store ASCII text in one, it will be converted to Unicode first (this will always work, since ASCII is a subset of Unicode). These strings are then stored in memory using an internal encoding chosen by the interpreter; depending on your platform, build, and version of Python, this might be UTF-16, UTF-32, or (since Python 3.3) a representation that adapts to the string's contents. If you really want to store the string as a series of bytes (for example, because you really want an ASCII string, or a Unicode string encoded using UTF-8), you need to use a bytes variable.
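A short sketch of the str/bytes split in Python 3 - note that the str knows nothing about bytes or encodings until you explicitly encode it:

```python
s = "日本"  # a str: a sequence of Unicode code points
print(type(s), len(s))  # <class 'str'> 2

b = s.encode("utf-8")   # a bytes object: the encoded form
print(type(b), len(b))  # <class 'bytes'> 6

# Going back from bytes to str requires knowing the encoding.
print(b.decode("utf-8") == s)  # True
```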
So, the difference is:
- in Python 2, strings are stored as a series of bytes (in variables of type str), and if you want to handle Unicode text, you have to manage the encoding yourself. Or, just use variables of type unicode.
- in Python 3, strings (variables of type str) always use the Unicode character set, with the encoding managed internally by the Python interpreter (which encoding it uses is not really important, unless you ever need access to the underlying bytes), and if you want a series-of-bytes string, you have to use a bytes variable, and manage the encoding yourself.
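In practice, this means that in Python 3 you only deal with encodings at the boundaries, when converting between str and bytes; a minimal sketch:

```python
text = "日本"

# Encode at the boundary (e.g. before writing to a file or socket).
raw = text.encode("utf-8")

# Decode at the boundary (e.g. after reading bytes back in).
restored = raw.decode("utf-8")
assert restored == text

# Mixing the two types is an error in Python 3, which catches
# many bugs that Python 2 would silently let through.
try:
    text + raw  # str + bytes
except TypeError:
    print("can't mix str and bytes")
```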