Selecting dimensionality reduction with Pipeline and GridSearchCV

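This example constructs a pipeline that performs dimensionality reduction followed by prediction with a support vector classifier. It demonstrates the use of GridSearchCV and Pipeline to optimize over different classes of estimators in a single CV run: unsupervised PCA and NMF dimensionality reductions are compared to univariate feature selection during the grid search.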

Additionally, Pipeline can be instantiated with the memory argument to memoize the transformers within the pipeline, which avoids fitting the same transformers over and over.

Note that using memory to enable caching only pays off when fitting a transformer is costly.
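As a rough sketch of how a single grid can cover different classes of estimators: a pipeline step can be replaced as a whole through set_params, and the parameters of the new step can be set in the same call. This is approximately what happens for each candidate in a grid whose keys are the step name (reduce_dim) and a nested parameter (reduce_dim__n_components or reduce_dim__k). The two-step pipeline below is a simplified stand-in, not the pipeline used later in this example.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Simplified stand-in pipeline: the reduce_dim step starts as a placeholder
pipe = Pipeline(
    [("reduce_dim", "passthrough"), ("classify", LinearSVC(dual=False))]
)

# Replace the whole step with PCA and set one of its parameters
pipe.set_params(reduce_dim=PCA(), reduce_dim__n_components=4)
print(type(pipe.named_steps["reduce_dim"]).__name__)

# Swap in a different class of estimator with a different parameter name
pipe.set_params(reduce_dim=SelectKBest(mutual_info_classif), reduce_dim__k=4)
print(type(pipe.named_steps["reduce_dim"]).__name__)
```

Because PCA and NMF share the parameter name n_components while SelectKBest uses k, the two families need separate dictionaries in the parameter grid.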

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Illustration of Pipeline and GridSearchCV

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)
GridSearchCV(estimator=Pipeline(steps=[('scaling', MinMaxScaler()),
                                       ('reduce_dim', 'passthrough'),
                                       ('classify',
                                        LinearSVC(dual=False,
                                                  max_iter=10000))]),
             n_jobs=1,
             param_grid=[{'classify__C': [1, 10, 100, 1000],
                          'reduce_dim': [PCA(iterated_power=7),
                                         NMF(max_iter=1000)],
                          'reduce_dim__n_components': [2, 4, 8]},
                         {'classify__C': [1, 10, 100, 1000],
                          'reduce_dim': [SelectKBest(score_func=<function mutual_info_classif at 0x7fa298083b00>)],
                          'reduce_dim__k': [2, 4, 8]}])
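Once the search is fitted, the winning combination can be read from best_params_ and best_score_. The sketch below shows this on a deliberately reduced setup so that it runs quickly: the 300-sample subset, the smaller grid and cv=3 are assumptions made for illustration and are not the settings of this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # assumption: small subset to keep the run short

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
# Reduced grid (assumption): 2 * 2 candidates per dictionary, 8 in total
param_grid = [
    {
        "reduce_dim": [PCA()],
        "reduce_dim__n_components": [2, 8],
        "classify__C": [1, 10],
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": [2, 8],
        "classify__C": [1, 10],
    },
]
search = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1)
search.fit(X, y)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

The reduce_dim entry of best_params_ holds the selected estimator object itself, so the chosen technique can be identified from its type.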


import pandas as pd

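mean_scores = np.array(grid.cv_results_["mean_test_score"])
# scores follow the param_grid iteration order: the PCA/NMF sub-grid comes
# first, then the SelectKBest one, each with C varying slowest
n_first = len(C_OPTIONS) * 2 * len(N_FEATURES_OPTIONS)
first = mean_scores[:n_first].reshape(len(C_OPTIONS), 2, -1)
second = mean_scores[n_first:].reshape(len(C_OPTIONS), 1, -1)
mean_scores = np.concatenate([first, second], axis=1)
# select score for best C
mean_scores = mean_scores.max(axis=0)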
# create a dataframe to ease plotting
mean_scores = pd.DataFrame(
    mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
)

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("Digit classification accuracy")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

plt.show()
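Figure: Comparing feature reduction techniques (bar chart of digit classification accuracy against the reduced number of features, one bar per reduction technique)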
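The entries of cv_results_ follow the order in which ParameterGrid enumerates the candidates. A small sketch to check which candidate sits at which position; the strings "PCA", "NMF" and "KBest" are placeholders standing in for the estimator objects, purely for readability.

```python
from sklearn.model_selection import ParameterGrid

C_OPTIONS = [1, 10, 100, 1000]
N_FEATURES_OPTIONS = [2, 4, 8]
# Strings stand in for the estimator objects (placeholders for illustration)
param_grid = [
    {
        "reduce_dim": ["PCA", "NMF"],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": ["KBest"],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

candidates = list(ParameterGrid(param_grid))
print("number of candidates:", len(candidates))
# Positions where the reducer, the value of C, or the sub-grid changes
for position in (0, 3, 6, 24):
    print(position, candidates[position])
```

Within each dictionary the keys are visited in sorted order, so classify__C varies slowest and the number of features fastest, and the second dictionary is only enumerated after the first one is exhausted. Arranging the scores as a (C, reducer, number of features) array therefore has to treat the first 24 and the last 12 entries separately.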

Caching transformers within a Pipeline

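It is sometimes worthwhile to store the state of a specific transformer, since it could be used again. Using a pipeline in GridSearchCV triggers such situations. Therefore, we use the memory argument to enable caching.

Warning

Note, however, that this example is only an illustration: in this specific case, fitting PCA is not necessarily slower than loading the cache. Hence, use the memory constructor parameter only when fitting a transformer is costly.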

from shutil import rmtree

from joblib import Memory

# Create a temporary folder to store the transformers of the pipeline
location = "cachedir"
memory = Memory(location=location, verbose=10)
cached_pipe = Pipeline(
    [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
    memory=memory,
)

# This time, a cached pipeline will be used within the grid search


# Delete the temporary cache before exiting
memory.clear(warn=False)
rmtree(location)

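The PCA fitting is only computed at the evaluation of the first configuration of the C parameter of the LinearSVC classifier. The other configurations of C trigger the loading of the cached PCA estimator data, which saves processing time. Therefore, caching the pipeline with memory is highly beneficial when fitting a transformer is costly.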
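The listing above builds the cached pipeline and cleans up the cache, but does not show the search call itself. A minimal sketch of how a cached pipeline could be passed to GridSearchCV follows; the reduced grid, cv=3, verbose=0 and the temporary directory created with mkdtemp are assumptions chosen to keep the run short and self-contained.

```python
from shutil import rmtree
from tempfile import mkdtemp

from joblib import Memory
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Temporary cache directory (assumption: mkdtemp instead of a fixed folder name)
location = mkdtemp()
memory = Memory(location=location, verbose=0)
cached_pipe = Pipeline(
    [("reduce_dim", PCA()), ("classify", LinearSVC(dual=False, max_iter=10000))],
    memory=memory,
)

# Reduced grid (assumption): 2 values of n_components times 3 values of C
small_grid = {
    "reduce_dim__n_components": [4, 8],
    "classify__C": [1, 10, 100],
}
search = GridSearchCV(cached_pipe, param_grid=small_grid, cv=3, n_jobs=1)
search.fit(X, y)
print("best parameters:", search.best_params_)

# Delete the temporary cache before exiting
memory.clear(warn=False)
rmtree(location)
```

In this sketch a PCA fit for a given n_components and training fold should only be computed for the first value of C; the remaining values of C can reuse the cached result, since only the classifier changes.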

Total running time of the script: (0 minutes 41.371 seconds)

Related examples

Incremental PCA

Feature agglomeration vs. univariate selection

Caching nearest neighbors

Concatenating multiple feature extraction methods
