I have a dataset in which each element of one column is a list. I would like to flatten it so that every list element gets a row of its own.

I managed to solve it with iterrows, dict and append (see below), but it is too slow on my real DataFrame, which is large. Is there a way to make it faster?

I could replace the list-per-element column with another format (maybe a hierarchical df?) if that would make more sense.

EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.

A minimal example, creating a df to play with:


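import pandas as pd
from io import StringIO

# Build a small pipe-separated sample frame
df = pd.read_csv(StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
# Strip the surrounding brackets and split each string into a list
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df)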

resulting df:


   id name        fields
0   1  abc  [qq, ww, rr]
1   2  efg  [zz, xx, rr]

my (slow) solution:


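new_df = pd.DataFrame(index=[], columns=df.columns)

# Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas
for _, row in df.iterrows():
    # One dict per list element, copying the other columns of the row
    flattened_d = [dict(row.to_dict(), fields=c) for c in row.fields]
    new_df = new_df.append(flattened_d)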

This results in:

    id name fields
0  1.0  abc     qq
1  1.0  abc     ww
2  1.0  abc     rr
0  2.0  efg     zz
1  2.0  efg     xx
2  2.0  efg     rr

3 Solutions

#1



You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result to id and name, like so:

cols = df.columns[df.columns != 'fields'].tolist()  # adapted from @jezrael
# pandas is imported as pd in the question, so use pd.Series here
df = df[cols].join(df.fields.apply(pd.Series))

Then you can melt the resulting new columns into rows using set_index and stack, and then reset the index:

df = df.set_index(cols).stack().reset_index()

Finally, drop the redundant column generated by reset_index and rename the generated column to "field":


df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
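
Putting the three steps together, here is a minimal end-to-end sketch on the sample frame from the question. The helper names `make_sample` and `flatten_fields` are hypothetical and only used for illustration. The last lines are not part of the answer above: they assume pandas 0.25 or newer, where `DataFrame.explode` is available, and show it as a possible alternative that may be worth timing on a large frame.

```python
from io import StringIO

import pandas as pd


def make_sample():
    """Rebuild the sample frame from the question."""
    raw = "id|name|fields\n1|abc|[qq,ww,rr]\n2|efg|[zz,xx,rr]\n"
    df = pd.read_csv(StringIO(raw), sep="|")
    df["fields"] = df["fields"].apply(lambda s: s[1:-1].split(","))
    return df


def flatten_fields(df, list_col="fields", out_col="field"):
    """Hypothetical helper: one row per list element, other columns repeated."""
    cols = [c for c in df.columns if c != list_col]
    # Step 1: spread each list across numbered columns 0, 1, 2, ...
    wide = df[cols].join(df[list_col].apply(pd.Series))
    # Step 2: move the numbered columns into rows
    long = wide.set_index(cols).stack().reset_index()
    # Step 3: drop the helper level column and name the value column
    return long.drop(long.columns[-2], axis=1).rename(columns={0: out_col})


flat = flatten_fields(make_sample())
print(flat)

# Alternative (assumption: pandas >= 0.25), not from the original answer
exploded = (
    make_sample()
    .explode("fields")
    .rename(columns={"fields": "field"})
    .reset_index(drop=True)
)
print(exploded)
```

Both approaches keep every other column without naming it, which matches the constraint that only the fields column is known in advance.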
