When reading content from a text file in Python, we may get the invalid character \ufeff (the Unicode byte order mark, BOM). In this tutorial, we will introduce how to remove it.
For example:
We may use the code below to read a file.
with open("test.txt", 'rb') as f: for line in f: line = line.decode('utf-8', 'ignore') line = line.strip().split('\t')
Here, line is one line of the content in test.txt, split on tab characters.
However, we may find \ufeff in line.
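As a quick check of where the character comes from, here is a minimal sketch; the file name demo.txt and its content are made up for illustration. A file saved with a BOM shows \ufeff when its first line is decoded as plain utf-8:

# Create a small file with a BOM; "demo.txt" and its content are only placeholders.
with open("demo.txt", "w", encoding="utf-8-sig") as f:
    f.write("name\tage\nTom\t20\n")

# Decoding the first line with plain utf-8 leaves the BOM visible as \ufeff.
with open("demo.txt", "rb") as f:
    first = f.readline().decode("utf-8", "ignore")

print(repr(first))  # expected: '\ufeffname\tage\n'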
How to remove \ufeff?
The simplest way is to use the utf-8-sig encoding, which consumes the BOM while decoding.
For example:
with open("test.txt", 'rb') as f: for line in f: line = line.decode('utf-8-sig', 'ignore') line = line.strip().split('\t')
Then, we will find that the \ufeff is removed.
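As a sketch of an equivalent approach, the file can also be opened in text mode with encoding='utf-8-sig', so there is no manual decode() call; open() consumes the BOM while decoding:

# Equivalent approach: open in text mode and let utf-8-sig consume the BOM if present.
with open("test.txt", encoding="utf-8-sig") as f:
    for line in f:
        fields = line.strip().split("\t")
        # process fields ...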
Hello,
there seems to be a better way, without possibly destroying the encoding by down-converting to utf-8.
I had this problem when loading utf-16le files.
with open("", encoding="utf-16le") as f:
    line = f.readline().lstrip("\ufeff")
    ...
Some remarks:
– The \ufeff is only found in the first line, because it sits at the very beginning of the file.
– Because I don’t know which encoding an incoming file has, I did the following. Surely there is a better way but it works (on Linux):
output = subprocess.check_output(["file", "--mime-encoding", ""], universal_newlines=True)
encoding = output.split(" ")[1].rstrip()
with open("", encoding=encoding) as f:
    ...
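Putting these remarks together, a self-contained sketch might look like the code below. The file name data.txt is only a placeholder, and the approach depends on the file command, so it is Linux-specific as noted above.

import subprocess

# Hypothetical input file; "data.txt" is only a placeholder for illustration.
path = "data.txt"

# Ask the `file` command for the encoding (Linux-specific, as noted above).
output = subprocess.check_output(
    ["file", "--mime-encoding", path], universal_newlines=True
)
encoding = output.split(" ")[1].rstrip()

with open(path, encoding=encoding) as f:
    for i, line in enumerate(f):
        if i == 0:
            # The BOM can only appear at the start of the first line.
            line = line.lstrip("\ufeff")
        fields = line.strip().split("\t")
        # process fields ...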
Yes, that is a good solution.