When crawling a web page with Python, you may get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0. In this tutorial, we will show how to fix it.
Code that generates this error
content = crawl_response.read().decode("utf-8")
Run this code and you may get the following error:
content = crawl_response.read().decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0: invalid start byte
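The error can be reproduced without any network access, which helps show why it happens. In binary, 0x8b is 10001011, and UTF-8 reserves the 10xxxxxx pattern for continuation bytes, so it can never begin a character. The sketch below is only an illustration: the two-byte value of raw is an assumed stand-in for the start of a compressed response body, not real crawl data.

```python
# Assumed stand-in for the first bytes of a compressed response body.
raw = bytes([0x8B, 0x00])

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # Keep the details, because `err` is cleared when the except block ends.
    message = str(err)
    position = err.start
    print(message)
    print("failing byte offset:", position)
```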
If you skip decode("utf-8") and print the raw bytes, you may get this output.
From the output, you can see that the content is not plain UTF-8 text.
Check the Content-Encoding response header
We find:
Content-Encoding: br
This means the response body is compressed with the Brotli algorithm. To print it correctly, you must decompress it first.
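Here is a minimal sketch of checking the header in code. It assumes crawl_response comes from urllib.request.urlopen, in which case crawl_response.headers behaves like an email.message.Message. The helper name content_encodings and the fake_headers object are hypothetical, used only so the example runs offline.

```python
from email.message import Message


def content_encodings(headers):
    """Return the Content-Encoding codings as a lowercase list, in header order."""
    value = headers.get("Content-Encoding", "")
    return [token.strip().lower() for token in value.split(",") if token.strip()]


# Hypothetical stand-in for crawl_response.headers.
fake_headers = Message()
fake_headers["Content-Encoding"] = "br"

encodings = content_encodings(fake_headers)
print(encodings)
if "br" in encodings:
    print("Body is Brotli-compressed; decompress it before decoding.")
```

With a real response, you would pass crawl_response.headers instead of fake_headers.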
Related tutorial: Understand Content-Encoding: br and Decompress String – Python Web Crawler Tutorial
Here is a simple example that decompresses Brotli-compressed content. It uses the third-party brotli package, which you can install with pip install brotli.
import brotli

content = crawl_response.read()
content = brotli.decompress(content)
content = content.decode("utf-8")
print(content)
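A crawler usually meets more than one encoding, so it may help to choose the decompression step from the header value instead of always calling brotli. The following is a sketch under assumptions, not a complete implementation: the function name decode_body is hypothetical, it handles only br, gzip, and uncompressed bodies, and it assumes the text is UTF-8 unless told otherwise.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw bytes according to Content-Encoding, then decode to str."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "br":
        import brotli  # third-party package, imported only when needed
        raw = brotli.decompress(raw)
    elif encoding in ("gzip", "x-gzip"):
        raw = gzip.decompress(raw)
    elif encoding != "identity":
        raise ValueError("unsupported Content-Encoding: " + encoding)
    return raw.decode(charset)


# Offline check using gzip, which is in the standard library.
sample = gzip.compress("hello crawler".encode("utf-8"))
text = decode_body(sample, "gzip")
print(text)
```

With a real response, the call would look like decode_body(crawl_response.read(), crawl_response.headers.get("Content-Encoding")).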