
Introducing the `set_output` API

This example demonstrates the set_output API for configuring transformers to output pandas DataFrames. set_output can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas"). For details, see SLEP018.

First, we load the iris dataset as a DataFrame to demonstrate the set_output API.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
60	5.0	2.0	3.5	1.0
1	4.9	3.0	1.4	0.2
8	4.4	2.9	1.4	0.2
93	5.0	2.3	3.3	1.0
106	4.9	2.5	4.5	1.7

To configure an estimator such as preprocessing.StandardScaler to return DataFrames, call set_output. This feature requires pandas to be installed.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
39	-0.894264	0.798301	-1.271411	-1.327605
12	-1.244466	-0.086944	-1.327407	-1.459074
48	-0.660797	1.462234	-1.271411	-1.327605
23	-0.894264	0.576989	-1.159419	-0.933197
81	-0.427329	-1.414810	-0.039497	-0.275851

set_output can also be called after fit to configure transform after the fact.

scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"Default output type: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"Configured pandas output type: {type(X_test_df).__name__}")

Default output type: ndarray
Configured pandas output type: DataFrame

In a pipeline.Pipeline, set_output configures all steps to output DataFrames.

from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
)
clf.set_output(transform="pandas")
clf.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectpercentile', SelectPercentile(percentile=75)),
                ('logisticregression', LogisticRegression())])


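Each transformer in the pipeline is configured to return DataFrames. This means that the final logistic regression step contains the feature names of the input.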

clf[-1].feature_names_in_

array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Note

If one uses the method set_params, the transformer will be replaced by a new one with the default output format.

clf.set_params(standardscaler=StandardScaler())
clf.fit(X_train, y_train)
clf[-1].feature_names_in_

array(['x0', 'x2', 'x3'], dtype=object)

To keep the intended behavior, call set_output on the new transformer beforehand:

scaler = StandardScaler().set_output(transform="pandas")
clf.set_params(standardscaler=scaler)
clf.fit(X_train, y_train)
clf[-1].feature_names_in_

array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Next, we load the titanic dataset to demonstrate set_output with compose.ColumnTransformer and heterogeneous data.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

The set_output API can be configured globally by using set_config and setting transform_output to "pandas".

from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.801829268292683

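With the global configuration, all transformers output DataFrames. This allows us to easily plot the logistic regression coefficients with the corresponding feature names.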

import pandas as pd

log_reg = clf[-1]
coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
_ = coef.sort_values().plot.barh()

To demonstrate the config_context functionality below, let us first reset transform_output to its default value.

set_config(transform_output="default")

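When configuring the output type with config_context, what counts is the configuration in effect at the time transform or fit_transform is called. Setting it only while constructing or fitting the transformer has no effect.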

from sklearn import config_context

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

StandardScaler()


with config_context(transform_output="pandas"):
    # the output of transform will be a Pandas DataFrame
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()

	age	fare
629	0.628306	-0.063210
688	-0.057984	-0.515704
439	1.314596	0.566624
664	-0.675645	-0.512279
669	-0.744274	-0.496950

Outside of the context manager, the output will be a NumPy array:

X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled[:5]

array([[ 0.62830616, -0.06320955],
       [-0.05798371, -0.51570367],
       [ 1.31459603,  0.56662405],
       [-0.6756446 , -0.51227857],
       [-0.74427358, -0.49694966]])

Total running time of the script: (0 minutes 0.155 seconds)
