Fastest Python method for search and replace on a large string
I'm looking for the fastest way to replace a large number of sub-strings inside a very large string. Here are two examples I've used.
findall() feels simpler and more elegant, but it takes an astounding amount of time.
finditer() blazes through a large file, but I'm not sure this is the right way to do it.
Here's some sample code. Note that the actual text I'm interested in is a single string around 10MB in size, and there's a huge difference in these two methods.
    import re

    def findall_replace(text, reg, rep):
        # Every str.replace() call scans and copies the whole text.
        for match in reg.findall(text):
            output = text.replace(match, rep)
        return output

    def finditer_replace(text, reg, rep):
        # Walk the matches once, stitching the unmatched pieces together.
        cursor_pos = 0
        output = ''
        for match in reg.finditer(text):
            output += "".join([text[cursor_pos:match.start(1)], rep])
            cursor_pos = match.end(1)
        output += "".join([text[cursor_pos:]])
        return output

    reg = re.compile(r'(dog)')
    rep = 'cat'
    text = 'dog cat dog cat dog cat'

    finditer_replace(text, reg, rep)
    findall_replace(text, reg, rep)
UPDATE: Added a re.sub method to the tests:
    def sub_replace(reg, rep, text):
        output = re.sub(reg, rep, text)
        return output
Results
re.sub() - 0:00:00.031000
finditer() - 0:00:00.109000
findall() - 0:01:17.260000
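A comparison like the one above could be reproduced with a sketch along these lines. The question does not show its timing harness, so the use of `timeit`, the synthetic `text`, and the repeat count are all assumptions, and `findall_replace` here starts from `output = text` so that it also works when nothing matches. Absolute numbers will differ from those reported.

```python
import re
import timeit

reg = re.compile(r'(dog)')
rep = 'cat'
# Assumption: a small synthetic text standing in for the real 10 MB string.
text = 'dog cat ' * 2000

def findall_replace(text, reg, rep):
    output = text
    for match in reg.findall(text):
        # One full scan and copy of the text per match.
        output = text.replace(match, rep)
    return output

def finditer_replace(text, reg, rep):
    cursor_pos = 0
    output = ''
    for match in reg.finditer(text):
        output += text[cursor_pos:match.start(1)] + rep
        cursor_pos = match.end(1)
    output += text[cursor_pos:]
    return output

def sub_replace(reg, rep, text):
    return re.sub(reg, rep, text)

# Time each approach a few times on the same input.
timings = {
    're.sub()': timeit.timeit(lambda: sub_replace(reg, rep, text), number=3),
    'finditer()': timeit.timeit(lambda: finditer_replace(text, reg, rep), number=3),
    'findall()': timeit.timeit(lambda: findall_replace(text, reg, rep), number=3),
}
for name, seconds in timings.items():
    print(f'{name} - {seconds:.6f} s')
```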
3 Answers
#1
14
The standard method is to use the built-in:
re.sub(reg, rep, text)
Incidentally, the reason for the performance difference between your versions is that each replacement in your first version (the findall() one) causes the entire string to be recopied. A single copy is fast, but at 10 MB per copy, enough of them add up to a slow run.
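As a minimal sketch of that point, using the pattern and sample text from the question: `reg.sub()` builds the result in a single pass, while the loop-based version pays for roughly one full-length copy per match. The helper `chars_copied_by_loop` is hypothetical and gives only a rough estimate, not a measurement.

```python
import re

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

# One pass over the text; the result string is built once.
result = reg.sub(rep, text)
print(result)  # cat cat cat cat cat cat

def chars_copied_by_loop(text, reg):
    # Hypothetical helper: assumes one full-length copy per match,
    # which is what calling text.replace() inside the loop amounts to.
    n_matches = sum(1 for _ in reg.finditer(text))
    return n_matches * len(text)

print(chars_copied_by_loop(text, reg))  # 3 matches * 23 characters = 69
```

Scaled up, if a 10 MB text held, say, 5,000 matches (an assumed figure, not one from the question), the same estimate comes to roughly 50 GB of characters copied, which fits the gap between the findall() timing and the other two.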