sklearn.LabelEncoder with never seen before values
If a sklearn.LabelEncoder has been fitted on a training set, it might break (transform raises a ValueError) if it encounters new values when used on a test set.
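A minimal reproduction of the failure described above, as a sketch only. The color labels are made-up illustration data, not taken from the question.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels, used only for illustration.
le = LabelEncoder()
le.fit(["red", "green", "blue"])

try:
    # "purple" was never seen by fit(), so transform() rejects it.
    le.transform(["green", "purple"])
    raised = False
except ValueError as err:
    raised = True
    print("transform failed:", err)
```

The encoder has no slot for a label it did not see during fit, which is why the workarounds below add an explicit "<unknown>" class.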
The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:
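import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas.DataFrames and c is whatever column
le = LabelEncoder()
le.fit(train[c])
# replace every test value the encoder has not seen with '<unknown>'
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# register '<unknown>' as an extra class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])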
This works, but is there a better solution?
Update
As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if that was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:
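import bisect
import numpy as np

# insert '<unknown>' at its sorted position rather than at the end
le_classes = le.classes_.tolist()
bisect.insort_left(le_classes, '<unknown>')
# keep classes_ a NumPy array, as it is after fit()
le.classes_ = np.array(le_classes)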
However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
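To illustrate why the insertion position matters, here is a small sketch using np.searchsorted directly. It assumes, as the update suggests, that labels are looked up by binary search over the class array; the class names are invented for the example.

```python
import numpy as np

# Hypothetical classes as fit() would store them: sorted.
classes = np.array(["blue", "green", "red"])

# Appending puts '<unknown>' last, which breaks the sorted order
# ('<' sorts before lowercase letters).
appended = np.append(classes, "<unknown>")

# Inserting in sorted order keeps the array valid for binary search.
inserted = np.sort(appended)

values = np.array(["<unknown>", "green"])
codes_appended = np.searchsorted(appended, values)
codes_inserted = np.searchsorted(inserted, values)

# Decode the codes again to see whether the lookup round-trips.
print(appended[codes_appended])
print(inserted[codes_inserted])
```

With the appended array the code returned for "<unknown>" does not point back at "<unknown>", while the sorted array round-trips both labels.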
6 Answers
#1
21
I ended up switching to Pandas' get_dummies due to this problem of unseen data.
- create the dummies on the training data
dummy_train = pd.get_dummies(train)
- create the dummies on the new (unseen) data
dummy_new = pd.get_dummies(new_data)
- re-index the new data to the columns of the training data, filling the missing values with 0
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
Effectively, any new categories in the unseen data will not go into the classifier, because the re-index drops their dummy columns, but I think that should not cause problems as it would not know what to do with them.
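The three steps above can be sketched end to end as follows. The frames and the "color" column are hypothetical example data; the final astype(int) is an addition here to give all columns one dtype, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: "purple" appears only in the new data.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new_data = pd.DataFrame({"color": ["green", "purple"]})

dummy_train = pd.get_dummies(train)
dummy_new = pd.get_dummies(new_data)

# Align the new data to the training columns: columns missing from the
# new data are filled with 0, columns unknown to training are dropped.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

# Normalize dtypes (get_dummies may return bool or uint8 by version).
dummy_new = dummy_new.astype(int)
print(dummy_new)
```

The row for the unseen category ends up all zeros, so the classifier sees it as "none of the known categories" rather than failing.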