TargetEncoder

class sklearn.preprocessing.TargetEncoder(categories='auto', target_type='auto', smooth='auto', cv=5, shuffle=True, random_state=None)

Target encoder for regression and classification targets.

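Each category is encoded based on a shrunk estimate of the average target value for the observations belonging to that category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category (see [MIC]).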

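When the target type is "multiclass", encodings are based on the conditional probability estimate for each class. The target is first binarized using the "one-vs-all" scheme via LabelBinarizer, then the average target value for each class and each category is used for encoding, resulting in n_features * n_classes encoded output features.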

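TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_.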

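For a demo on the importance of the TargetEncoder internal cross fitting, see Target Encoder's Internal Cross fitting. For a comparison of different encoders, refer to Comparing Target Encoder with Other Encoders. Read more in the User Guide.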

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.

Added in version 1.3.

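Parameters:

categories ("auto" or list of shape (n_features,) of array-like, default="auto")

Categories (unique values) per feature:

"auto": Determine categories automatically from the training data.
list: categories[i] holds the categories expected in the i-th column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The categories used are stored in the categories_ fitted attribute.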

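target_type ({"auto", "continuous", "binary", "multiclass"}, default="auto")

Type of target.

"auto": Type of target is inferred with type_of_target.
"continuous": Continuous target.
"binary": Binary target.
"multiclass": Multiclass target.

Note

The type of target inferred with "auto" may not be the desired target type used for modeling. For example, if the target consists of integers between 0 and 100, then type_of_target will infer the target as "multiclass". In this case, setting target_type="continuous" will specify the target as a regression problem. The target_type_ attribute gives the target type used by the encoder.

Changed in version 1.4: Added the option "multiclass".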

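smooth ("auto" or float, default="auto")

The amount of mixing of the target mean conditioned on the value of the category with the global target mean. A larger smooth value will put more weight on the global target mean. If "auto", then smooth is set to an empirical Bayes estimate.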

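cv (int, default=5)

Determines the number of folds in the cross fitting strategy used in fit_transform. For classification targets, StratifiedKFold is used, and for continuous targets, KFold is used.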

shuffle (bool, default=True)

Whether to shuffle the data in fit_transform before splitting into folds. Note that the samples within each split will not be shuffled.

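random_state (int, RandomState instance or None, default=None)

When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold. Otherwise, this parameter has no effect. Pass an int for reproducible output across multiple function calls. See Glossary.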

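Attributes:

encodings_ (list of shape (n_features,) or (n_features * n_classes) of ndarray): Encodings learnt on all of X. For feature i, encodings_[i] are the encodings matching the categories listed in categories_[i]. When target_type_ is "multiclass", the encoding for feature i and class j is stored in encodings_[j + (i * len(classes_))]. For example, for 2 features (f) and 3 classes (c), encodings are ordered: f0_c0, f0_c1, f0_c2, f1_c0, f1_c1, f1_c2.
categories_ (list of shape (n_features,) of ndarray): The categories of each input feature determined during fitting or specified in categories (in order of the features in X and corresponding with the output of transform).
target_type_ (str): Type of target.
target_mean_ (float): The overall mean of the target. This value is only used in transform to encode categories.
n_features_in_ (int): Number of features seen during fit.
feature_names_in_ (ndarray of shape (n_features_in_,)): Names of features seen during fit. Defined only when X has feature names that are all strings.
classes_ (ndarray or None): If target_type_ is 'binary' or 'multiclass', holds the label for each class, otherwise None.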

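See also

OrdinalEncoder: Performs an ordinal (integer) encoding of the categorical features. Contrary to TargetEncoder, this encoding is not supervised. Treating the resulting encoding as a numerical feature therefore yields arbitrarily ordered values, which typically leads to lower predictive performance when used as preprocessing for a classifier or regressor.
OneHotEncoder: Performs a one-hot encoding of categorical features. This unsupervised encoding is better suited for low-cardinality categorical variables, as it generates one new feature per unique category.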

References

[MIC]

Micci-Barreca, Daniele. "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems" SIGKDD Explor. Newsl. 3, 1 (July 2001), 27–32.

Examples

With smooth="auto", the smoothing parameter is set to an empirical Bayes estimate:

>>> import numpy as np
>>> from sklearn.preprocessing import TargetEncoder
>>> X = np.array([["dog"] * 20 + ["cat"] * 30 + ["snake"] * 38], dtype=object).T
>>> y = [90.3] * 5 + [80.1] * 15 + [20.4] * 5 + [20.1] * 25 + [21.2] * 8 + [49] * 30
>>> enc_auto = TargetEncoder(smooth="auto")
>>> X_trans = enc_auto.fit_transform(X, y)

>>> # A high `smooth` parameter puts more weight on global mean on the categorical
>>> # encodings:
>>> enc_high_smooth = TargetEncoder(smooth=5000.0).fit(X, y)
>>> enc_high_smooth.target_mean_
np.float64(44...)
>>> enc_high_smooth.encodings_
[array([44..., 44..., 44...])]

>>> # On the other hand, a low `smooth` parameter puts more weight on target
>>> # conditioned on the value of the categorical:
>>> enc_low_smooth = TargetEncoder(smooth=1.0).fit(X, y)
>>> enc_low_smooth.encodings_
[array([20..., 80..., 43...])]

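fit(X, y)

Fit the TargetEncoder to X and y.

Parameters:

X (array-like of shape (n_samples, n_features)): The data to determine the categories of each feature.
y (array-like of shape (n_samples,)): The target data used to encode the categories.

Returns:

self (object): Fitted encoder.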

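fit_transform(X, y)

Fit TargetEncoder and transform X with the target encoding.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.

Parameters:

X (array-like of shape (n_samples, n_features)): The data to determine the categories of each feature.
y (array-like of shape (n_samples,)): The target data used to encode the categories.

Returns:

X_trans (ndarray of shape (n_samples, n_features) or (n_samples, (n_features * n_classes))): Transformed input.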

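get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None): Not used, present here for API consistency by convention.

Returns:

feature_names_out (ndarray of str objects): Transformed feature names. feature_names_in_ is used unless it is not defined, in which case the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"]. When target_type_ is "multiclass", the names are of the format "<feature_name>_<class_name>".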

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

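get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.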

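set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None)

Configure output of transform and fit_transform.

"default": Default output format of a transformer.
"pandas": DataFrame output.
"polars": Polars output.
None: Transform configuration is unchanged.

Added in version 1.4: The "polars" option was added.

Returns:

self (estimator instance): Estimator instance.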

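set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.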

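transform(X)

Transform X with the target encoding.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross fitting scheme is used in fit_transform for encoding. See the User Guide for details.

Parameters:

X (array-like of shape (n_samples, n_features)): The data to determine the categories of each feature.

Returns:

X_trans (ndarray of shape (n_samples, n_features) or (n_samples, (n_features * n_classes))): Transformed input.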