Converting a unicode object to str as-is in Python: unicode-escape and string_escape

While scraping data with BS4, I ended up with a string like this:

text = u'\xe9\x95\xbf\xe5\x9f\x8e'

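Strange, isn't it? These are the UTF-8 encoded bytes of two Chinese characters, but the literal carries a u prefix, so Python 2 treats it as a unicode object whose code points happen to be the byte values. Calling decode('utf8') on it directly therefore fails.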

The first step is to convert this unicode object to a str as-is, that is, to get

text = '\xe9\x95\xbf\xe5\x9f\x8e'

For this, use

text = text.encode('unicode-escape')

Now the value of text is

text = '\\xe9\\x95\\xbf\\xe5\\x9f\\x8e'

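Well, it has become a str, but the backslashes were carried over literally too: each \xe9 is now four separate characters rather than one byte.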
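To make the effect of the escaping step concrete, here is a small check, offered as a sketch rather than part of the original recipe. The variable name `escaped` is only illustrative, and the snippet is written so that it should run unchanged on Python 2 and Python 3.

```python
# Six code points, each one holding a UTF-8 byte value by mistake.
text = u'\xe9\x95\xbf\xe5\x9f\x8e'

# unicode-escape spells each non-ASCII code point out as literal
# backslash-x text, so one code point becomes four ASCII bytes.
escaped = text.encode('unicode-escape')

print(len(text))             # 6
print(len(escaped))          # 24
print(escaped[:1] == b'\\')  # True: the first byte is a real backslash
```

The length growing from 6 to 24 is why a second step is still needed to turn the escape text back into raw bytes.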

Next, use

text = text.decode('string_escape')

Now the value of text is

text = '\xe9\x95\xbf\xe5\x9f\x8e'

Done, that is exactly what was needed.

Complete code

text = u'\xe9\x95\xbf\xe5\x9f\x8e'
text = text.encode('unicode-escape').decode('string_escape')

print text.decode('utf8')

长城
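The code above is Python 2 only: Python 3 has no string_escape codec and print is a function. As a hedged alternative that is not the method used above, the same repair can be sketched in Python 3 by encoding with latin-1, which maps code points 0 to 255 onto single bytes one-to-one. The helper name `fix_mojibake` is hypothetical, and the sketch assumes every character in the input is in that range.

```python
# Python 3 sketch; fix_mojibake is a hypothetical helper name.
def fix_mojibake(text: str) -> str:
    # latin-1 turns each code point 0x00-0xFF into the byte with the
    # same value, recovering the original UTF-8 byte sequence.
    raw = text.encode('latin-1')
    # Those bytes can now be decoded as the UTF-8 they really are.
    return raw.decode('utf-8')


text = '\xe9\x95\xbf\xe5\x9f\x8e'
print(fix_mojibake(text))  # 长城
```

If the input contains a character above U+00FF, the encode step raises UnicodeEncodeError, which is a useful signal that the string was not mis-decoded in this particular way.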
