Introducing the set_output API

This example demonstrates the set_output API, which configures transformers to output pandas DataFrames. set_output can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas"). For details, see SLEP018.

First, we load the iris dataset as a DataFrame to demonstrate the set_output API.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
60                 5.0               2.0                3.5               1.0
1                  4.9               3.0                1.4               0.2
8                  4.4               2.9                1.4               0.2
93                 5.0               2.3                3.3               1.0
106                4.9               2.5                4.5               1.7


To configure an estimator such as preprocessing.StandardScaler to return DataFrames, call set_output. This feature requires pandas to be installed.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
39          -0.894264          0.798301          -1.271411         -1.327605
12          -1.244466         -0.086944          -1.327407         -1.459074
48          -0.660797          1.462234          -1.271411         -1.327605
23          -0.894264          0.576989          -1.159419         -0.933197
81          -0.427329         -1.414810          -0.039497         -0.275851


set_output can be called after fit to configure transform after the fact.

scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"Default output type: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"Configured pandas output type: {type(X_test_df).__name__}")
Default output type: ndarray
Configured pandas output type: DataFrame

In a pipeline.Pipeline, set_output configures all steps to output DataFrames.

from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
)
clf.set_output(transform="pandas")
clf.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectpercentile', SelectPercentile(percentile=75)),
                ('logisticregression', LogisticRegression())])


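Each transformer in the pipeline is configured to return DataFrames. This means that the final logistic regression step contains the feature names of the input.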

clf[-1].feature_names_in_
array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Note

If one uses the method set_params, the transformer will be replaced by a new one with the default output format.

clf.set_params(standardscaler=StandardScaler())
clf.fit(X_train, y_train)
clf[-1].feature_names_in_
array(['x0', 'x2', 'x3'], dtype=object)

To keep the intended behavior, call set_output on the new transformer beforehand:

scaler = StandardScaler().set_output(transform="pandas")
clf.set_params(standardscaler=scaler)
clf.fit(X_train, y_train)
clf[-1].feature_names_in_
array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Next we load the titanic dataset to demonstrate set_output with compose.ColumnTransformer and heterogeneous data.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

The set_output API can be configured globally by using set_config and setting transform_output to "pandas".

from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.801829268292683

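With the global configuration, all transformers output DataFrames. This allows us to easily plot the logistic regression coefficients with the corresponding feature names.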

import pandas as pd

log_reg = clf[-1]
coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
_ = coef.sort_values().plot.barh()
[Figure: horizontal bar plot of the logistic regression coefficients, labeled by feature name]

To demonstrate the config_context functionality below, let us first reset transform_output to its default value.

set_config(transform_output="default")

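When configuring the output type with config_context, the configuration that is active when transform or fit_transform is called is what counts. Setting it only when you construct or fit the transformer has no effect.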

from sklearn import config_context

scaler = StandardScaler()
scaler.fit(X_train[num_cols])
StandardScaler()


with config_context(transform_output="pandas"):
    # the output of transform will be a Pandas DataFrame
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()
           age       fare
629   0.628306  -0.063210
688  -0.057984  -0.515704
439   1.314596   0.566624
664  -0.675645  -0.512279
669  -0.744274  -0.496950


Outside of the context manager, the output will be a NumPy array.

X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled[:5]
array([[ 0.62830616, -0.06320955],
       [-0.05798371, -0.51570367],
       [ 1.31459603,  0.56662405],
       [-0.6756446 , -0.51227857],
       [-0.74427358, -0.49694966]])

Total running time of the script: (0 minutes 0.155 seconds)

Related examples

Displaying Pipelines

Release Highlights for scikit-learn 1.4

Column Transformer with Mixed Types

SVM-Anova: SVM with univariate feature selection
