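Pipeline ANOVA SVM

This example shows how feature selection can be integrated into a machine learning pipeline.

It also shows how to inspect individual parts of the pipeline.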

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

We start by generating a binary classification dataset. We then split it into two subsets: a training set and a test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
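
As a quick sanity check on the split, the sketch below prints the shapes of the two subsets. It relies on the default of 100 samples in make_classification and the default 25% hold-out in train_test_split, and reuses the variable names of this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same data generation and split as in the example.
X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Rows are samples, columns are the 20 features.
print("train:", X_train.shape)
print("test: ", X_test.shape)
```

The test set size matches the total support of 25 shown in the classification report further down.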

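A common mistake with feature selection is to search for a subset of discriminative features on the full dataset instead of using only the training set. Using a scikit-learn Pipeline prevents this mistake.

Here, we demonstrate how to build a pipeline whose first step is feature selection.

When fit is called on the training data, a subset of features is selected and the indices of the selected features are stored. The feature selector then reduces the number of features and passes this subset to the classifier, which is trained on it.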
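
To make this pitfall concrete, here is a minimal sketch that is not part of the original example. It uses pure-noise data, so no feature is truly informative. The names X_noise, y_noise, leaky_score and honest_score, as well as the sizes (100 samples, 2000 features, k=20), are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Pure noise: the labels are independent of the features.
rng = np.random.RandomState(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.randint(0, 2, size=100)

# Leaky: features are selected using ALL samples, then cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(LinearSVC(), X_leaky, y_noise, cv=5).mean()

# Honest: selection is refit inside each training fold via the pipeline.
honest = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC())
honest_score = cross_val_score(honest, X_noise, y_noise, cv=5).mean()

print(f"selection on full data: {leaky_score:.2f}")
print(f"selection in pipeline:  {honest_score:.2f}")
```

Because the labels are random, an unbiased estimate should sit near chance level. The leaky variant typically reports a much higher score because the selected features were chosen with knowledge of the held-out labels.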

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

anova_filter = SelectKBest(f_classif, k=3)
clf = LinearSVC()
anova_svm = make_pipeline(anova_filter, clf)
anova_svm.fit(X_train, y_train)

Pipeline(steps=[('selectkbest', SelectKBest(k=3)), ('linearsvc', LinearSVC())])

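Once training is complete, we can predict on new, unseen samples. At prediction time, the feature selector keeps only the features it found most discriminative during training, using the information stored at fit time. The reduced data is then passed to the classifier, which makes the prediction.

Here, we report the final metrics with a classification report.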

from sklearn.metrics import classification_report

y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.80      0.86        15
           1       0.75      0.90      0.82        10

    accuracy                           0.84        25
   macro avg       0.84      0.85      0.84        25
weighted avg       0.85      0.84      0.84        25
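
The prediction path described above can also be reproduced by hand. The sketch below applies the fitted selector and then the fitted classifier separately and compares the result with the pipeline's own prediction. The names selector, classifier, X_test_selected and manual_pred are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20, n_informative=3, n_redundant=0,
    n_classes=2, n_clusters_per_class=2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Step 1: keep only the columns chosen during fit (no refitting here).
selector, classifier = anova_svm[0], anova_svm[-1]
X_test_selected = selector.transform(X_test)

# Step 2: feed the reduced data to the already-trained classifier.
manual_pred = classifier.predict(X_test_selected)

print(X_test_selected.shape)
print(np.array_equal(manual_pred, anova_svm.predict(X_test)))
```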

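Note that you can inspect any step of the pipeline. For instance, we might be interested in the parameters of the classifier. Since we selected three features, we expect three coefficients.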

anova_svm[-1].coef_

array([[0.75788833, 0.27161955, 0.26113448]])
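
Besides integer indexing, steps can be looked up by name. With make_pipeline, step names are the lowercased class names, as the pipeline representation shown earlier suggests. The following sketch shows a few equivalent access patterns; the variable names svc and k_value are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20, n_informative=3, n_redundant=0,
    n_classes=2, n_clusters_per_class=2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Step names generated by make_pipeline.
print(list(anova_svm.named_steps))

# Access by name returns the same fitted object as access by position.
svc = anova_svm.named_steps["linearsvc"]
print(svc is anova_svm[-1], anova_svm["linearsvc"] is svc)

# Nested parameters use the "<step>__<param>" convention.
k_value = anova_svm.get_params()["selectkbest__k"]
print(k_value, svc.coef_.shape)
```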

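However, the coefficients alone do not tell us which features of the original dataset were selected. There are several ways to find out. Here, we invert the transformation on the coefficients to map them back to the original feature space.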

anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)

array([[0.        , 0.        , 0.75788833, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.27161955,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.26113448]])

We can see that the features with non-zero coefficients are the ones selected by the first step.
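
Another of the possible approaches is to ask the selector directly through get_support. The sketch below compares the indices it returns with the positions of the non-zero entries obtained from inverse_transform. The names selected, coef_full and nonzero are illustrative, and the comparison assumes no fitted coefficient is exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20, n_informative=3, n_redundant=0,
    n_classes=2, n_clusters_per_class=2, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Indices of the columns kept by the fitted selector.
selected = anova_svm[0].get_support(indices=True)

# Positions of non-zero coefficients after mapping back to 20 features.
coef_full = anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)
nonzero = np.flatnonzero(coef_full[0])

print(selected)
print(np.array_equal(selected, nonzero))
```

get_support is the more direct route when only the indices are needed, while inverse_transform keeps each coefficient aligned with its original column.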
