Reading a Unicode file in Python that declares its encoding the same way Python source code does

I wish to write a Python program which reads files containing Unicode text. These files are normally encoded in UTF-8, but might not be; if they aren't, the alternate encoding will be explicitly declared at the beginning of the file. More precisely, it will be declared using exactly the same rules that Python itself uses to let Python source code declare its encoding explicitly (as in PEP 0263; see https://www.python.org/dev/peps/pep-0263/ for more details). Just to be clear, the files being processed are not actually Python source, but they do declare their encodings (when not in UTF-8) using the same rules.

If one knows the encoding of a file before opening it, Python provides a very easy way to read the file with automatic decoding: the codecs.open function. For instance, one might do:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)

and each line we get in the loop will be a unicode string. Is there a Python library which does a similar thing, but choosing the encoding according to the rules above (which are Python 3.0's rules, I think)? (e.g. does Python expose the 'read file with self-declared encoding' it uses to read source to the language?) If not, what's the easiest way to achieve the desired effect?

One thought is to open the file using the usual open, read the first two lines, interpret them as UTF-8, look for a coding declaration using the regexp in the PEP, and, if one is found, decode all subsequent lines using the declared encoding. For this to be sure to work, we need to know that, for all the encodings Python allows in Python source, the usual Python readline will correctly split the file into lines. That is, we need to know that in every such encoding the byte string '\n' always really means newline, and isn't part of some multi-byte sequence encoding another character. (In fact I need to worry about '\r\n' as well.) Does anyone know if this is true? The docs were not very specific.
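
The newline concern can be checked empirically for any particular codec. Below is a minimal sketch, assuming Python 3; the helper name `newline_bytes_are_safe` and the choice of codecs are illustrative only, and the scan covers just the Basic Multilingual Plane. For what it is worth, PEP 263 itself appears to exclude encodings that use two or more bytes for every character, such as UTF-16, from use as source encodings.

```python
# Sketch only (Python 3). The helper name and the sample codecs are
# illustrative assumptions, not part of any library.
NEWLINE_BYTES = (0x0A, 0x0D)  # the byte values of '\n' and '\r'


def newline_bytes_are_safe(encoding, max_codepoint=0xFFFF):
    """Return True if no character other than '\\n' or '\\r' encodes to a
    byte sequence containing 0x0A or 0x0D in the given codec."""
    for codepoint in range(max_codepoint + 1):
        char = chr(codepoint)
        if char in '\n\r':
            continue
        try:
            encoded = char.encode(encoding)
        except UnicodeEncodeError:
            continue  # character is not representable in this codec
        if any(byte in NEWLINE_BYTES for byte in encoded):
            return False
    return True


for name in ('utf-8', 'cp1252', 'utf-16-le'):
    print(name, newline_bytes_are_safe(name))
```

A codec for which this returns False (UTF-16 is one, since a character such as U+010A contains the byte 0x0A) cannot be split into lines safely at the byte level, so the file would have to be decoded before splitting.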

Another thought is to look in the Python sources. Does anyone know where in the Python source the source-code-encoding-processing is done?

5 solutions

#1

You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII the code below should work as-is.

If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark. First, generate a few test files which advertise their encoding:

import codecs

# Write one test file per encoding; each file declares its own encoding
# on the first line.
for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write('# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()

Then write the decoder:

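import codecs, re

def open_detect(path):
    fin = open(path, 'rb')
    # Look for a "# coding = <name>" declaration in the first 80 bytes.
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
    # Fall back to UTF-8 when no declaration is found.
    encoding = encs[0] if encs else 'utf-8'
    fin.seek(0)
    # Reads are transcoded from the declared encoding to UTF-8 byte strings.
    return codecs.EncodedFile(fin, 'utf-8', encoding)

for path in ('utf-8.txt','cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())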

Output:

['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
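
For comparison, on Python 3 the standard library's tokenize module exposes the detection that the interpreter applies to source files: tokenize.detect_encoding looks for a BOM or a PEP 263 declaration in the first two lines, and tokenize.open opens a file using the result. The following is a minimal sketch under that assumption; the helper name `read_lines` and the demo file are made up for illustration. Note that the PEP 263 pattern expects `coding:` or `coding=` with no space before the separator, so a header written as `# coding = cp1252` would likely not be recognized and the file would be treated as UTF-8.

```python
# Sketch only (Python 3.2+). read_lines and the demo file are
# illustrative; tokenize.open and tokenize.detect_encoding are stdlib.
import os
import tempfile
import tokenize


def read_lines(path):
    """Return the decoded lines of a file whose encoding is declared
    PEP 263-style (UTF-8 is assumed when nothing is declared)."""
    with tokenize.open(path) as f:
        return f.readlines()


# Demo: write a cp1252 file that declares its own encoding, then read it.
text = '# coding: cp1252\n\u201chello se\u00f1or\u201d\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cp1252.txt')
    with open(path, 'wb') as out:
        out.write(text.encode('cp1252'))
    # detect_encoding reports which encoding was found in the header.
    with open(path, 'rb') as raw:
        declared, _ = tokenize.detect_encoding(raw.readline)
    lines = read_lines(path)

print(declared)
print(lines)
```

Unlike the EncodedFile approach above, which yields UTF-8 byte strings, this returns decoded text strings.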
