sklearn.LabelEncoder with never-seen-before values

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
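
A minimal sketch of the failure, using toy fruit labels that are assumptions for illustration and not data from the question. It fits an encoder on three labels and then asks it to transform a label it has never seen:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, chosen only for illustration.
le = LabelEncoder()
le.fit(["apple", "banana", "cherry"])

# A label seen during fit is encoded by its position in the sorted classes.
known_code = le.transform(["banana"]).tolist()

# "durian" was not present during fit, so transform raises ValueError.
try:
    le.transform(["banana", "durian"])
    error = None
except ValueError as exc:
    error = exc

print(known_code)
print(type(error).__name__)
```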

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

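import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas.DataFrames and c is whatever column
le = LabelEncoder()
le.fit(train[c])
# replace every value the encoder has not seen with a placeholder label
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# register the placeholder as an extra class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])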

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

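import bisect
import numpy as np

le_classes = le.classes_.tolist()
# insert the placeholder at its sorted position instead of appending it
bisect.insort_left(le_classes, '<unknown>')
# store a NumPy array again, as fit does, rather than a plain list
le.classes_ = np.array(le_classes)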
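
To see why the position of '<unknown>' matters, here is a small sketch that uses the standard-library bisect module as a stand-in for a searchsorted-style lookup. It is not scikit-learn's actual code, and the fruit labels and the encode helper are hypothetical. A binary search assumes sorted input, so a class appended out of order is looked up at the wrong position:

```python
import bisect

def encode(classes, value):
    """Return the index a binary search assigns to value in classes."""
    return bisect.bisect_left(classes, value)

# Hypothetical classes, sorted the way a fitted encoder would store them.
fitted = ["apple", "banana", "cherry"]

# Appending puts '<unknown>' last, although '<' sorts before any letter.
appended = fitted + ["<unknown>"]

# Sorted insertion keeps the list ordered.
inserted = list(fitted)
bisect.insort_left(inserted, "<unknown>")

code_appended = encode(appended, "<unknown>")
code_inserted = encode(inserted, "<unknown>")

print(appended[code_appended])  # apple: the lookup lands on the wrong class
print(inserted[code_inserted])  # <unknown>: the lookup finds the placeholder
```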

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.

6 solutions

#1

I ended up switching to Pandas' get_dummies due to this problem of unseen data.

Create the dummies on the training data:
dummy_train = pd.get_dummies(train)

Create the dummies on the new (unseen) data:
dummy_new = pd.get_dummies(new_data)

Re-index the new data to the columns of the training data, filling the missing values with 0 (reindex returns a new DataFrame, so the result is assigned back):
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

Effectively, any categorical values that appear only in the new data are dropped and never reach the classifier, but I think that should not cause problems, since the classifier would not know what to do with them anyway.
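
A small end-to-end sketch of this approach, assuming a toy single-column frame (the color column and its values are made up for illustration). The category that appears only in the new data, "purple", loses its dummy column, while the training columns missing from the new data are filled with 0:

```python
import pandas as pd

# Toy frames; the column name and values are illustrative assumptions.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new_data = pd.DataFrame({"color": ["green", "purple"]})

dummy_train = pd.get_dummies(train)
dummy_new = pd.get_dummies(new_data)

# Align the new data with the training columns. reindex returns a new
# DataFrame, so the result has to be assigned.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

columns = list(dummy_new.columns)
rows = dummy_new.astype(int).values.tolist()

print(columns)
print(rows)
```

The second row, which held "purple", ends up all zeros: the classifier sees it as belonging to none of the known categories.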
