How to efficiently expand/flatten a pandas DataFrame

I have a dataset in which each element of one column is a list. I would like to flatten it, so that every list element gets a row of its own.

I managed to solve it with iterrows, dict and append (see below), but it is too slow on my real DataFrame, which is large. Is there a way to make it faster?

I could consider replacing the list-per-element column with another format (maybe a hierarchical df?) if that would make more sense.

EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.

A minimal example, creating a df to play with:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
# Turn the "[qq,ww,rr]" strings into real Python lists
df['fields'] = df['fields'].apply(lambda s: s[1:-1].split(','))
print(df)

Resulting df:

   id name        fields
0   1  abc  [qq, ww, rr]
1   2  efg  [zz, xx, rr]

My (slow) solution:

new_df = pd.DataFrame(index=[], columns=df.columns)

# One DataFrame.append call per source row (DataFrame.append was removed in pandas 2.0)
for _, i in df.iterrows():
    flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
    new_df = new_df.append(flattened_d)

Resulting in:

    id name fields
0  1.0  abc     qq
1  1.0  abc     ww
2  1.0  abc     rr
0  2.0  efg     zz
1  2.0  efg     xx
2  2.0  efg     rr

3 Solutions

#1

You can split the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result back onto id and name, like so:

cols = df.columns[df.columns != 'fields'].tolist()  # adapted from @jezrael
df = df[cols].join(df.fields.apply(pd.Series))

Then you can melt the resulting new columns using set_index and stack, and then reset the index:

df = df.set_index(cols).stack().reset_index()

Finally, drop the redundant column generated by reset_index and rename the generated column to "field":

df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
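
Putting the three steps together, here is a minimal sketch of the whole approach on the question's toy data. The helper name flatten_fields and its parameters are made up for illustration, and the frame is built directly rather than through read_csv. For comparison only, the last lines use DataFrame.explode, a built-in available from pandas 0.25 onward that is not part of this answer; on this example it should produce the same rows.

```python
import pandas as pd

# Same toy frame as in the question, built directly instead of via read_csv.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["abc", "efg"],
    "fields": [["qq", "ww", "rr"], ["zz", "xx", "rr"]],
})


def flatten_fields(frame, list_col="fields", out_col="field"):
    """Hypothetical helper: one output row per element of frame[list_col]."""
    cols = [c for c in frame.columns if c != list_col]
    # Step 1: spread each list across integer-named columns 0, 1, 2, ...
    wide = frame[cols].join(frame[list_col].apply(pd.Series))
    # Step 2: stack those columns back into rows.
    long = wide.set_index(cols).stack().reset_index()
    # Step 3: drop the level column added by reset_index, name the values.
    return long.drop(long.columns[-2], axis=1).rename(columns={0: out_col})


flat = flatten_fields(df)
print(flat)

# Comparison only (not from the answer above): the built-in explode.
exploded = (
    df.explode("fields")
      .rename(columns={"fields": "field"})
      .reset_index(drop=True)
)
print(exploded)
```

Because the helper derives cols from whatever columns are present, it only assumes that the list column exists, which matches the constraint stated in the question's EDIT.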
