I'm trying to read a huge file (~2 GB) from disk and split each line into multiple strings:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines

The problem is that it tries to allocate tens of GB of memory. I found that this doesn't happen if I change my code as follows:

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]    # no splitting
    return split_lines

I.e., if I do not split the lines, memory usage drastically goes down. Is there any way to handle this problem, maybe some smart way to store split lines without filling up the main memory?

Thank you for your time.

2 solutions

#1


After the split, you have multiple objects per line: a list plus some number of string objects. Each object carries its own overhead on top of the characters that made up the original string.

Rather than reading the entire file into memory, use a generator.

def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    # Do something with the list t
    pass

This does not preclude you from writing something like

lines = list(get_split_lines(file_path))

if you really need to read the entire file into memory.
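When the full list is not actually needed, the generator can be consumed one line at a time. The sketch below assumes the goal is a simple aggregate (a total token count); `count_tokens` is a hypothetical helper, and the temporary file is only a small stand-in for the real 2 GB input.

```python
import os
import tempfile

def get_split_lines(file_path):
    # Yield one list of tokens per line; only the current line is held in memory.
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

def count_tokens(file_path):
    # Hypothetical helper: aggregates while streaming, never builds the full list.
    return sum(len(tokens) for tokens in get_split_lines(file_path))

# Small stand-in for the large file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("a b c\nd e\n\nf\n")
    demo_path = tmp.name

total = count_tokens(demo_path)
os.remove(demo_path)
print(total)
```

Peak memory here depends on the longest line rather than on the file size, because each token list becomes garbage as soon as the next line is read.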
