If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
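
For illustration, a minimal sketch of the failure, assuming a recent scikit-learn; the labels "cat", "dog" and "bird" are made up for this example:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["cat", "dog"])  # the classes seen at training time

try:
    # "bird" was never seen during fit
    le.transform(["cat", "bird"])
    failed = False
except ValueError as err:
    failed = True
    print(f"transform failed: {err}")
```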

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

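# train and test are pandas DataFrames and c is the name of a column
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train[c])
# map every value that was not seen during fit to a placeholder label
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# register the placeholder as an additional class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])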

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

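import bisect
import numpy as np

le_classes = le.classes_.tolist()
# insert the placeholder at its sorted position instead of appending it
bisect.insort_left(le_classes, '<unknown>')
# assign back as an array, matching what LabelEncoder itself stores
le.classes_ = np.array(le_classes)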

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
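
To see why the position of the placeholder matters, here is a minimal sketch that uses only NumPy and calls np.searchsorted directly. It assumes that transform looks labels up with a binary search over classes_, as described in the update; the labels "a", "b" and "c" are made up for this example:

```python
import bisect
import numpy as np

# Stand-in for le.classes_ after fit: unique labels in sorted order.
classes = np.array(["a", "b", "c"])

# Appending puts '<unknown>' last, although it sorts before 'a' ('<' < 'a').
appended = np.append(classes, "<unknown>")

# Inserting at the sorted position keeps the array valid for binary search.
as_list = classes.tolist()
bisect.insort_left(as_list, "<unknown>")
inserted = np.array(as_list)

labels = np.array(["a", "<unknown>"])
codes_appended = np.searchsorted(appended, labels)
codes_inserted = np.searchsorted(inserted, labels)

print(appended.tolist(), codes_appended.tolist())
print(inserted.tolist(), codes_inserted.tolist())
```

In this sketch the appended variant gives "a" and "<unknown>" the same code, because the binary search assumes sorted input, while the sorted variant gives each label its own index.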

6 Answers

#1


21

I ended up switching to Pandas' get_dummies due to this problem of unseen data.

  • create the dummies on the training data
    dummy_train = pd.get_dummies(train)
  • create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)
  • re-index the new data to the columns of the training data, filling the missing values with 0 (reindex returns a new DataFrame, so assign the result)
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
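
The three steps above can be sketched end to end as follows. This is only an illustration: the column name "color" and its values are made up, and train and new_data stand in for the real DataFrames.

```python
import pandas as pd

# Hypothetical data: "purple" appears only in the new data,
# "green" and "blue" appear only in the training data.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new_data = pd.DataFrame({"color": ["red", "purple"]})

# One indicator column per category seen in each frame.
dummy_train = pd.get_dummies(train)
dummy_new = pd.get_dummies(new_data)

# Align the new data to the training columns: categories missing from
# the new data are filled with 0, unseen categories are dropped.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

print(dummy_new)
```

After the reindex, the row for "purple" contains only zeros, since no training column corresponds to it.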

Effectively, any category that appears only in the new data is dropped and never reaches the classifier, but I think that should not cause problems, as the classifier would not know what to do with it anyway.
