I have this Python script where I am using the nltk library to parse, tokenize, tag and chunk some (let's say) random text from the web.

I need to format and write to a file the output of chunked1, chunked2, chunked3. These are of type nltk.tree.Tree.

More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, chunkGram3.

How can I do that?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

For the time being, when I try to run it I get this error:

Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found
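
For reference, the TypeError comes from passing a tuple to outfile.write(): iterating a Tree yields plain (word, tag) tuples for unmatched tokens and Tree objects for the matched chunks, so each item has to be turned into a string first. A rough sketch of one way to write only the matched chunks (an illustration, not necessarily the approach from @Alvas's answer):

with codecs.open('output.txt', 'w', encoding='utf8') as outfile:
    for item in chunked1:
        # RegexpParser wraps every match in a Tree labelled 'Chunk';
        # unmatched tokens remain plain (word, tag) tuples
        if isinstance(item, nltk.tree.Tree):
            outfile.write(u' '.join(word for word, tag in item.leaves()) + u'\n')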

edit: After @Alvas's answer I managed to do what I wanted. However, now I would like to know how I could strip all non-ASCII characters from a text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The above is taken from another answer here on S/O. However, it doesn't seem to work. What might be wrong? The error I am getting is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)

2 Answers

#1 (score: 6)

Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write paths like this with a single \, as \t will be interpreted as a tab character and \f as a form feed character. You must double them. I know it was an example here, but such confusion often arises:

with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
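
A raw string literal (or forward slashes, which Windows also accepts) is another way to avoid doubling the backslashes; this is general Python behaviour, shown here as an extra illustration:

with open(r'path\to\file.txt', 'r') as infile:   # raw string: \t and \f stay literal
    xstring = infile.readlines()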

The following infile.close line is wrong. It does not call the close method; it does not actually do anything. Furthermore, your file was already closed by the with clause. If you see this line in any answer anywhere, please just downvote the answer outright with a comment saying that file.close is wrong and should be file.close().
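
The difference is easy to see in the interpreter; f.close merely looks up the bound method object, while f.close() actually calls it (a small illustration added here):

f = open('file.txt')
f.close            # evaluates the method object, the file stays open
print(f.closed)    # False
f.close()          # actually closes the file
print(f.closed)    # True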

The following should work, but you need to be aware that by replacing every non-ASCII character with ' ' it will break words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
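
For instance, assuming the line has already been decoded to a unicode string (with a raw UTF-8 byte string each accented character is two bytes, so it would become two spaces):

remove_non_ascii(u'naïve café')   # -> u'na ve caf '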

But here is the reason why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. That is, you are calculating the line with the non-ASCII characters removed, yes, but that is a new value that is never stored back into the list:

for i, line in enumerate(xstring):
   line = remove_non_ascii(line)

Instead it should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

or, my preferred, very Pythonic version:

xstring = [remove_non_ascii(line) for line in xstring]

These Unicode errors occur mainly because you are using Python 2.7 to handle pure Unicode text, something at which recent Python 3 versions are way ahead. Thus, if you are at the very beginning of your task, I'd recommend that you upgrade to Python 3.4+ soon.
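
In Python 3 the same file handling needs neither the codecs module nor manual decoding, because open() takes an encoding argument and every line is already a Unicode str; a minimal sketch under that assumption:

# Python 3: the file object decodes for you, each line is already a str (Unicode)
with open('path/to/file.txt', 'r', encoding='utf8') as infile:
    xstring = [remove_non_ascii(line) for line in infile]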
