I'm just starting with NumPy so I may be missing some core concepts...

What's the best way to create a NumPy array from a dictionary whose values are lists?

Something like this:

d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }

Should turn into something like:

data = [
  [10,20,30,?,?],
  [50,60,?,?,?],
  [100,200,300,400,500]
]

I'm going to do some basic statistics on each row, e.g.:

deviations = numpy.std(data, axis=1)

Questions:

  • What's the best / most efficient way to create the numpy array from the dictionary? The dictionary is large: a couple of million keys, each with ~20 items.

  • The number of values in each 'row' is different. If I understand correctly, NumPy wants a uniform size, so what do I fill in for the missing items to make std() happy? (One possible padding approach is sketched after this list.)

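A minimal sketch of one possible padding approach, assuming numpy.nan as the filler value and numpy.nanstd (which ignores NaNs) for the per-row statistics; this assumes a NumPy version that provides numpy.full and numpy.nanstd:

import numpy

# The widest row decides the padded array's second dimension.
width = max(len(row) for row in d.values())

# Build a rectangular array, padding the shorter rows with NaN.
data = numpy.full((len(d), width), numpy.nan)
for i, row in enumerate(d.values()):
    data[i, :len(row)] = row

# nanstd ignores the NaN padding when computing the per-row statistics.
deviations = numpy.nanstd(data, axis=1)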

Update: One thing I forgot to mention: while the Python techniques are reasonable (e.g. looping over a few million items is fast), they're constrained to a single CPU. NumPy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.

3 Solutions

#1 (score: 8)

You don't need to create NumPy arrays to call numpy.std(). You can call numpy.std() in a loop over all the values of your dictionary. Each list will be converted to a NumPy array on the fly to compute the standard deviation.

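A minimal sketch of that per-row loop, assuming the dictionary is named d as in the question and that a per-key result is wanted:

import numpy

deviations = {}
for key, values in d.items():
    # numpy.std() accepts a plain Python list and converts it on the fly.
    deviations[key] = numpy.std(values)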

The downside of this method is that the main loop will be in Python and not in C. But I guess this should be fast enough: you will still compute the std at C speed, and you will save a lot of memory because you won't have to store 0 values to pad the variable-size rows.

  • If you want to optimize this further, you can store your values in a list of NumPy arrays, so that the Python list -> NumPy array conversion is done only once (a minimal sketch follows the grouping example below).
  • If you find that this is still too slow, try using Psyco to optimize the Python loop.
  • If this is still too slow, try using Cython together with the numpy module. This Tutorial claims impressive speed improvements for image processing. Or simply implement the whole std function in Cython (see this for benchmarks and examples with the sum function).
  • An alternative to Cython would be to use SWIG with numpy.i.
  • If you want to use only NumPy and have everything computed at C level, try grouping all the records of the same size together in different arrays and calling numpy.std() on each of them. It should look like the following example.

example with O(N) complexity:

import numpy

# Group the rows by length so that each group forms a rectangular array;
# 'data' is the dictionary of lists, and the pattern below is extended
# with one list per row length that actually occurs.
list_size_1 = []
list_size_2 = []
for row in data.values():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)
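
For the earlier suggestion of storing the values in a list of NumPy arrays, so that the Python list -> NumPy array conversion happens only once, a minimal sketch might look like this (the variable names rows and deviations are assumptions for illustration):

import numpy

# Convert every row to a NumPy array once, up front.
rows = [numpy.array(row, dtype=float) for row in data.values()]

# Later passes over the data reuse the already-converted arrays.
deviations = numpy.array([row.std() for row in rows])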
