How to efficiently expand/flatten a pandas DataFrame
I have a dataset where, in one of its columns, each element is a list. I would like to flatten it so that every list element gets a row of its own.
I managed to solve it with iterrows, dict and append (see below), but it is too slow on my real DataFrame, which is large. Is there a way to make things faster?
I could also consider storing the list-per-element column in another format (maybe a hierarchical df?) if that would make more sense.
EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.
A minimal example, creating a df to play with:
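import io
import pandas as pd

# Build a small pipe-separated sample frame.
df = pd.read_csv(io.StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
# Parse each "[a,b,c]" string into a real Python list.
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df)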
Resulting df:
   id name        fields
0   1  abc  [qq, ww, rr]
1   2  efg  [zz, xx, rr]
My (slow) solution:
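new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
    # One dict per list element, copying the row's other columns.
    flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
    # Note: DataFrame.append was removed in pandas 2.0.
    new_df = new_df.append(flattened_d)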
Resulting in:
    id name fields
0  1.0  abc     qq
1  1.0  abc     ww
2  1.0  abc     rr
0  2.0  efg     zz
1  2.0  efg     xx
2  2.0  efg     rr
3 solutions
#1
1
You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result to id and name, like so:
cols = df.columns[df.columns != 'fields'].tolist() # adapted from @jezrael
df = df[cols].join(df.fields.apply(pd.Series))  # pd is the pandas alias used in the question
Then you can melt the resulting new columns into rows using set_index and stack, and then reset the index:
df = df.set_index(cols).stack().reset_index()
Finally, drop the redundant column generated by reset_index and rename the generated column to "field":
df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
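For convenience, here is a rough end-to-end sketch of the three steps above, run on the sample data from the question. The helper name flatten_fields and its parameters are my own, not part of pandas or of the original answer. The last lines compare against DataFrame.explode, which is a different technique from the one described here and assumes pandas 0.25 or later.

```python
import pandas as pd


def flatten_fields(df, list_col="fields", out_col="field"):
    """Hypothetical helper: one output row per element of df[list_col]."""
    # Every column except the list column is carried along unchanged.
    cols = df.columns[df.columns != list_col].tolist()
    # Step 1: spread each list across numbered columns 0, 1, 2, ...
    wide = df[cols].join(df[list_col].apply(pd.Series))
    # Step 2: move the numbered columns into rows.
    long = wide.set_index(cols).stack().reset_index()
    # Step 3: drop the position column created by stack()/reset_index()
    # and give the value column a readable name.
    return long.drop(long.columns[-2], axis=1).rename(columns={0: out_col})


# Same content as the question's sample frame, built directly.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["abc", "efg"],
    "fields": [["qq", "ww", "rr"], ["zz", "xx", "rr"]],
})

flat = flatten_fields(df)
print(flat)

# Alternative (not the method above; assumes pandas >= 0.25):
# explode() repeats the other columns once per list element.
alt = df.explode("fields").rename(columns={"fields": "field"}).reset_index(drop=True)
print(alt)
```

Since cols is computed from whatever columns are present, this should keep working if other columns are added or renamed later, as long as the fields column exists. I have only tried to reason about lists of equal length here; with ragged lists the padding values introduced in step 1 may need an explicit dropna, depending on the pandas version.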