Yeah, Unicode and encodings are difficult.
You really need to understand the basics of how it all works to have any chance of working with this stuff - this is a good place to start.
The TL;DR is that every time you deal with text, you need to know (1) what characters are in the text and (2) what encoding it is in.
When the Python interpreter reads your script (so that it can run it), it's reading text, so it needs to know what encoding it's in. When you put an encoding declaration at the top of the script (something like # -*- coding: utf-8 -*-), you're telling Python "the text in this file is encoded using UTF8". This lets you have non-English text in your script file, e.g. for string literals (generally not a good idea, but probably OK if you declare the encoding like this).
When Python downloads the HTML page from a URL, it's reading text, so it needs to know what encoding it's in. This line:
Code:
text = data.decode("utf8","ignore")
converts the downloaded data from raw bytes to text, on the understanding that the text has been encoded using UTF8.
NOTE: This will work most of the time since UTF8 is, by far, the most common encoding, but if you ever come across a page that has been encoded using something else, this line won't work properly because it will be using the wrong encoding (and because of the "ignore", it won't raise an error, it will just silently drop any bytes it can't decode). The correct way to do this is to check the HTTP headers and/or the HTML to find out what encoding the page is in.
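As a rough sketch of what "checking the headers and/or the HTML" could look like: the helper names below (charset_from_header, charset_from_html, decode_page) are made up for illustration, and the <meta> sniffing is deliberately simplistic (real browsers follow a more involved algorithm), so treat this as a starting point rather than a complete solution.

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def charset_from_html(data):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    m = META_CHARSET.search(data[:4096])
    return m.group(1).decode("ascii").lower() if m else None

def decode_page(data, content_type=None, default="utf-8"):
    """Decode raw page bytes: HTTP header first, then <meta>, then a default."""
    encoding = charset_from_header(content_type) or charset_from_html(data) or default
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong encoding name: fall back, marking bad bytes visibly.
        return data.decode(default, "replace")
```

If you're downloading with urllib.request.urlopen(), the header value would come from something like response.headers.get("Content-Type"), and you would pass it in as content_type along with the bytes from response.read().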
When a browser displays a web page, it needs to read the HTML, which is text, so it needs to know what encoding it's in. This line:
Code:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
tells the browser that the file is encoded using UTF8.
You didn't show the code that outputs the HTML itself, but I'll bet you're using plain old print statements, and this is where your problem almost certainly is. You included a <meta> tag that says that the HTML is encoded using UTF8, but when you print the HTML out, you also need to encode it!
In Python 2, you could just encode the string and print the resulting bytes, but this doesn't work in Python 3 (printing a bytes object just shows its b'...' representation) because it insists on doing everything using "Unicode", even when you don't want it to.
There are a few ways around this, but the easiest is to write raw output (i.e. bytes) to stdout, e.g.
Code:
import sys
sys.stdout.buffer.write("日本".encode("utf8"))
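Another of the "few ways", if you'd rather keep using print(), is to wrap the binary stream in a text wrapper that does the UTF8 encoding for you. A minimal sketch follows (utf8_writer is just a name I've made up). On Python 3.7+ you can probably get the same effect with sys.stdout.reconfigure(encoding="utf-8").

```python
import io

def utf8_writer(binary_stream):
    """Wrap a binary stream so that any str written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        newline="\n",        # don't translate "\n" into the platform line ending
        write_through=True,  # pass writes straight through to the binary stream
    )

# Typical use (keep a reference to the wrapper for as long as you need stdout):
#   import sys
#   out = utf8_writer(sys.stdout.buffer)
#   print("<p>日本</p>", file=out)
```

The wrapper takes over the underlying stream, so create it once near the top of the script and send all your output through it, rather than mixing it with plain print() calls.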
This will work, but things are never this easy.
Your console also deals with text, both when it's displaying output and when it's accepting input, so it also has an associated encoding, and if you're on Windows, it's probably not going to be UTF8. If you're running Windows in English or some other Western language, it will typically be a legacy code page such as CP437, CP850 or Windows-1252, but it doesn't really matter which: unless it's UTF8, your program will be outputting text in UTF8 while your console is trying to interpret it as something else, which is not going to work.
NOTE: Unfortunately, if your script happens to only output ASCII text, it will look like it's working (since this encoding mismatch won't come up), but it will just be by luck, and will break if it ever outputs any non-English text.
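You can simulate this mismatch without a console at all. The function below is only an illustration (as_seen_by_console is a made-up name, and CP437 is just an assumed example of a non-UTF8 console encoding): it encodes the text the way your program would, then decodes it the way the console would.

```python
def as_seen_by_console(text, program_encoding="utf-8", console_encoding="cp437"):
    """Encode text as the program does, then decode it as the console would."""
    raw = text.encode(program_encoding)
    return raw.decode(console_encoding, errors="replace")

# ASCII survives the round trip, which is why a mismatch can go unnoticed.
print(as_seen_by_console("hello") == "hello")
# Non-ASCII text does not: each UTF-8 byte gets shown as a separate character.
print(as_seen_by_console("日本") == "日本")
# With matching encodings, everything is fine.
print(as_seen_by_console("日本", console_encoding="utf-8") == "日本")
```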
However, if you pipe the output of your script to a file, then open that file as a UTF8 file, you'll see that it has worked properly. This is what Awasu does, so everything will work if you do things this way; you just need to be aware that while you're working on your script, you need to pipe the output to a file (or somehow set the console's encoding to UTF8).
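If you want to check this without leaving Python, here's a small sketch that runs a child script with its stdout attached to a pipe instead of a console, and decodes what comes back as UTF8. The names CHILD and run_and_capture are made up, and this only imitates what a program reading your script's output would do; it isn't Awasu's actual code.

```python
import subprocess
import sys

# A tiny child script that writes UTF-8 bytes straight to stdout.
# (\u65e5\u672c is "日本", written with escapes to keep the command line ASCII.)
CHILD = r'import sys; sys.stdout.buffer.write("\u65e5\u672c".encode("utf8"))'

def run_and_capture(code):
    """Run a Python snippet in a child process and return its raw stdout bytes."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout

raw = run_and_capture(CHILD)
# Decoding strictly (no "ignore") will raise if the bytes are not valid UTF-8.
text = raw.decode("utf-8")
```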