WeChat official account: 数据运营人
This series consists of the blogger's study notes from reading; please credit the source when reposting.

Chapter 3: Processing Raw Text

3.8 Segmentation

Sentence Segmentation

Manipulating text at the level of individual words often presupposes the ability to divide the text into individual sentences. Some corpora already provide access at the sentence level. For example, we can compute the average number of words per sentence in the Brown Corpus:

import nltk

len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922

import pprint

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[171:181])

['"Nonsense!','" said Gregory, who was very rational when anyone else\nattempted paradox.','"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,…','I will\ntell you.','It is because they know that the train is going right.','It\nis because they know that whatever place they have taken a ticket\nfor that …','It is because after they have\npassed Sloane Square they know that the next stat…','Oh, their wild rapture!','oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w…''"\n\n"It is you who are unpoetical," replied the poet Syme.']

Word Segmentation

In Chinese, the three-character string 爱国人 (ai4 "love" [verb], guo2 "country", ren2 "person") can be segmented as 爱国 / 人, "country-loving person", or as 爱 / 国人, "love country-person".
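
The two readings can be written out directly as Python lists; the slicing below is purely illustrative and is not a segmentation algorithm:

s = "爱国人"
print([s[:2], s[2:]])   # ['爱国', '人']  -> "country-loving person"
print([s[:1], s[1:]])   # ['爱', '国人']  -> "love country-person"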

Example 1-1: Reconstructing segmented text from the segmentation strings seg1 and seg2. seg1 and seg2 represent the initial and final segmentations of some hypothetical child-directed speech. The function segment() can use them to reproduce the segmented text.

def segment(text, segs):
    # Split text into words: segs holds one '0'/'1' flag per character
    # position, and a '1' marks a word boundary after that character.
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
print(segment(text, seg1))
print(segment(text, seg2))

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
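
The boundary string is easy to reason about: it holds one flag per character of the text except the last, and a '1' marks the end of a word. As a sanity check, the inverse of segment() can be sketched; the helper name encode() below is an illustration and is not part of the original example:

def encode(words):
    # Rebuild the boundary string: '1' after the last character of every
    # word except the final one, '0' everywhere else.
    segs = ''
    for word in words[:-1]:
        segs += '0' * (len(word) - 1) + '1'
    segs += '0' * (len(words[-1]) - 1)
    return segs

print(encode(segment(text, seg2)) == seg2)   # True: the two representations round-trip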

Example 1-2: Computing the cost of storing the lexicon and reconstructing the source text.

def evaluate(text, segs):
    # Score a segmentation: the number of derived words plus the number of
    # characters needed to store the lexicon (distinct words joined by spaces).
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = len(' '.join(list(set(words))))
    return text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
print(segment(text, seg3))
print(evaluate(text, seg3))
print(evaluate(text, seg2))
print(evaluate(text, seg1))

['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']
46
47
63
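
These scores can be checked by hand. For seg1, segment() yields 4 words; the lexicon (the distinct words joined by spaces) takes 16 + 11 + 17 + 12 characters plus 3 separators, i.e. 59, so the total is 4 + 59 = 63. The same arithmetic in code:

words = segment(text, seg1)
lexicon = ' '.join(set(words))
print(len(words))                  # 4 derived words
print(len(lexicon))                # 59 characters to store the lexicon
print(len(words) + len(lexicon))   # 63, matching evaluate(text, seg1)
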
Example 1-3: Non-deterministic search using simulated annealing: begin searching with phrase segmentations only; randomly perturb the zeros and ones in proportion to the "temperature"; with each iteration the temperature is lowered and the amount of boundary perturbation is reduced.

from random import randint

def flip(segs, pos):
    # Toggle the boundary flag at position pos.
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Toggle n randomly chosen boundary flags.
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            # Perturb more boundaries while the temperature is high.
            guess = flip_n(segs, int(round(temperature)))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

60 ['doyouseetheki', 'tty', 'see', 'thedoggy', 'doyouliketh', 'ekittylike', 'thedoggy']
58 ['doy', 'ouseetheki', 'ttysee', 'thedoggy', 'doy', 'o', 'ulikethekittylike', 'thedoggy']
56 ['doyou', 'seetheki', 'ttysee', 'thedoggy', 'doyou', 'liketh', 'ekittylike', 'thedoggy']
54 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'likethekittylike', 'thedoggy']
53 ['doyou', 'seethekit', 'tysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
51 ['doyou', 'seethekittysee', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
42 ['doyou', 'see', 'thekitty', 'see', 'thedoggy', 'doyou', 'like', 'thekitty', 'like', 'thedoggy']
'0000100100000001001000000010000100010000000100010000000'
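
Note that flip_n() draws positions with randint(), so each run of anneal() follows a different random trajectory and may stop at a different local optimum; the trace above is one possible run. If a repeatable run is wanted, one option is to seed the standard-library generator first. A minimal sketch, not part of the original example:

import random

random.seed(0)   # fix the global RNG used by randint() so the run is reproducible
final_segs = anneal(text, seg1, 5000, 1.2)
print(evaluate(text, final_segs), segment(text, final_segs))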
