I am using the CountVectorizer and don't want to separate hyphenated words into different tokens. I have tried passing different regex patterns into the token_pattern argument, but haven't been able to get the desired result.


Here's what I have tried:


pattern = r''' (?x)         # set flag to allow verbose regexps 
([A-Z]\.)+          # abbreviations (e.g. U.S.A.)
| \w+(-\w+)*        # words with optional internal hyphens
| \$?\d+(\.\d+)?%?  # currency & percentages
| \.\.\.            # ellipses '''

from sklearn.feature_extraction.text import CountVectorizer

text = 'I hate traffic-ridden streets.'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)

I have also tried to use nltk's regexp_tokenize, as suggested in an earlier question, but its behaviour seems to have changed as well.

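For reference, a minimal sketch of that nltk route (assuming nltk's regexp_tokenize, which is findall-based, so the hyphen group must be written as a non-capturing (?:...) group to keep whole matches as tokens):

>>> from nltk.tokenize import regexp_tokenize
>>> text = 'I hate traffic-ridden streets.'
>>> regexp_tokenize(text, r'\w+(?:-\w+)*')
['I', 'hate', 'traffic-ridden', 'streets']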

1 Answer

#1

There are a couple of things to note. The first is that all of those spaces, line breaks and comments in your pattern string become part of the regular expression itself. See here:


>>> import re
>>> re.match("[0-9]","3")
<_sre.SRE_Match object at 0x104caa920>
>>> re.match("[0-9] #a","3")
>>> re.match("[0-9] #a","3 #a")
<_sre.SRE_Match object at 0x104caa718>
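(Python's re.VERBOSE flag, which is what the inline (?x) in the question requests, is what tells the engine to ignore such whitespace and # comments; a minimal sketch:)

>>> re.match("[0-9] #a", "3", re.VERBOSE)
<_sre.SRE_Match object at 0x...>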

The second is that you need to escape special sequences when constructing your regex pattern within a plain string. For example, pattern = "\w" really needs to be pattern = "\\w". Once you account for those things, you should be able to write the regex for your desired tokenizer. For example, if you just want to add in hyphens, something like this will work:


>>> from sklearn.feature_extraction.text import CountVectorizer
>>> pattern = "(?u)\\b[\\w-]+\\b"
>>> 
>>> text = 'I hate traffic-ridden streets.'
>>> vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
>>> analyze = vectorizer.build_analyzer()
>>> analyze(text)
[u'hate', u'traffic-ridden', u'streets']
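As a follow-up sketch: a raw string avoids the double escaping, and actually fitting the vectorizer shows the hyphenated token surviving (get_feature_names is the old, pre-1.0 scikit-learn accessor; feature names come back alphabetically sorted):

>>> pattern == r"(?u)\b[\w-]+\b"  # raw-string spelling of the same pattern
True
>>> X = vectorizer.fit_transform([text])
>>> vectorizer.get_feature_names()
[u'hate', u'streets', u'traffic-ridden']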
