Sample pipeline for text feature extraction and evaluation#

The dataset used in this example is the 20 newsgroups text dataset, which will be automatically downloaded, cached and reused for the document classification example.

In this example, we tune the hyperparameters using RandomizedSearchCV. For a demo on the performance of some other classifiers, see the Classification of text documents using sparse features notebook.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Data loading#

We load two categories from the training set. You can adjust the number of categories by adding their names to the list, or set categories=None when calling the dataset loader fetch_20newsgroups to get all 20 of them.

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
]

data_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

data_test = fetch_20newsgroups(
    subset="test",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

print(f"Loading 20 newsgroups dataset for {len(data_train.target_names)} categories:")
print(data_train.target_names)
print(f"{len(data_train.data)} documents")
Loading 20 newsgroups dataset for 2 categories:
['alt.atheism', 'talk.religion.misc']
857 documents
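As mentioned above, passing categories=None to fetch_20newsgroups loads all 20 categories instead of just these two. A minimal sketch, not run as part of this example (data_train_all is a hypothetical variable name):

# Sketch: load the full training set with all 20 newsgroups categories.
data_train_all = fetch_20newsgroups(
    subset="train",
    categories=None,  # None means "use all 20 categories"
    remove=("headers", "footers", "quotes"),
)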

Pipeline with hyperparameter tuning#

We define a pipeline combining a text feature vectorizer with a simple classifier that is nevertheless effective for text classification.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer()),
        ("clf", ComplementNB()),
    ]
)
pipeline
Pipeline(steps=[('vect', TfidfVectorizer()), ('clf', ComplementNB())])


We define a grid of hyperparameters to be explored by the RandomizedSearchCV. Using a GridSearchCV instead would explore all the possible combinations on the grid, which can be costly to compute, whereas the parameter n_iter of the RandomizedSearchCV controls the number of different random combinations that are evaluated. Notice that setting n_iter larger than the number of possible combinations in a grid would lead to repeating already-explored combinations. We search for the best parameter combination for both the feature extraction (vect__) and the classifier (clf__).

import numpy as np

parameter_grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),  # unigrams or bigrams
    "vect__norm": ("l1", "l2"),
    "clf__alpha": np.logspace(-6, 6, 13),
}
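As a quick sanity check (a sketch, not part of the original example), we can count the total number of combinations in this grid and compare it with the n_iter budget used below; only a small fraction of the grid is actually sampled:

from math import prod

# 5 (max_df) * 4 (min_df) * 2 (ngram_range) * 2 (norm) * 13 (alpha) = 1040
n_combinations = prod(len(values) for values in parameter_grid.values())
print(n_combinations)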

In this case n_iter=40 is not an exhaustive search over the grid of hyperparameters. In practice it would be interesting to increase the parameter n_iter to get a more informative analysis; as a consequence, the computation time increases. We can reduce it by taking advantage of the parallelisation of the parameter-combination evaluations, increasing the number of CPUs used via the parameter n_jobs.

from pprint import pprint

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=parameter_grid,
    n_iter=40,
    random_state=0,
    n_jobs=2,
    verbose=1,
)

print("Performing grid search...")
print("Hyperparameters to be evaluated:")
pprint(parameter_grid)
Performing grid search...
Hyperparameters to be evaluated:
{'clf__alpha': array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]),
 'vect__max_df': (0.2, 0.4, 0.6, 0.8, 1.0),
 'vect__min_df': (1, 3, 5, 10),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__norm': ('l1', 'l2')}
from time import time

t0 = time()
random_search.fit(data_train.data, data_train.target)
print(f"Done in {time() - t0:.3f}s")
Fitting 5 folds for each of 40 candidates, totalling 200 fits
Done in 33.566s
print("Best parameters combination found:")
best_parameters = random_search.best_estimator_.get_params()
for param_name in sorted(parameter_grid.keys()):
    print(f"{param_name}: {best_parameters[param_name]}")
Best parameters combination found:
clf__alpha: 0.01
vect__max_df: 0.2
vect__min_df: 1
vect__ngram_range: (1, 1)
vect__norm: l1
test_accuracy = random_search.score(data_test.data, data_test.target)
print(
    "Accuracy of the best parameters using the inner CV of "
    f"the random search: {random_search.best_score_:.3f}"
)
print(f"Accuracy on test set: {test_accuracy:.3f}")
Accuracy of the best parameters using the inner CV of the random search: 0.816
Accuracy on test set: 0.709

The prefixes vect and clf are required to avoid possible ambiguities in the pipeline, but are not necessary for visualizing the results. Because of this, we define a function that renames the tuned hyperparameters to improve readability.

import pandas as pd


def shorten_param(param_name):
    """Remove components' prefixes in param_name."""
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


cv_results = pd.DataFrame(random_search.cv_results_)
cv_results = cv_results.rename(shorten_param, axis=1)

We can use a plotly.express.scatter to visualize the trade-off between scoring time and mean test score (i.e. the "CV score"). Passing the cursor over a given point displays the corresponding parameters. Error bars correspond to one standard deviation as computed in the different folds of the cross-validation.

import plotly.express as px

param_names = [shorten_param(name) for name in parameter_grid.keys()]
labels = {
    "mean_score_time": "CV Score time (s)",
    "mean_test_score": "CV score (accuracy)",
}
fig = px.scatter(
    cv_results,
    x="mean_score_time",
    y="mean_test_score",
    error_x="std_score_time",
    error_y="std_test_score",
    hover_data=param_names,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "trade-off between scoring time and mean test score",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig


Notice that the cluster of models in the upper-left corner of the plot has the best trade-off between accuracy and scoring time. In this case, using bigrams increases the required scoring time without improving the accuracy of the pipeline considerably.
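One way to check this observation directly from the numbers (a minimal sketch using the cv_results dataframe defined above) is to average the scoring time and test score per ngram_range setting:

# Sketch: mean scoring time and mean test score grouped by ngram_range.
# The tuples are converted to strings so that they can serve as group keys.
print(
    cv_results.assign(ngram_range=cv_results["ngram_range"].astype(str))
    .groupby("ngram_range")[["mean_score_time", "mean_test_score"]]
    .mean()
)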

Note

For more information on how to customize automated tuning to maximize score and minimize scoring time, see the example notebook Custom refit strategy of a grid search with cross-validation.

We can also use a plotly.express.parallel_coordinates to further visualize the mean test score as a function of the tuned hyperparameters. This helps finding interactions between more than two hyperparameters and provides intuition on their relevance for improving the performance of a pipeline.

We apply a math.log10 transformation to the alpha axis to spread its active range and improve the readability of the plot. A value \(x\) on that axis is to be understood as \(10^x\).

import math

column_results = param_names + ["mean_test_score", "mean_score_time"]

transform_funcs = dict.fromkeys(column_results, lambda x: x)
# Using a logarithmic scale for alpha
transform_funcs["alpha"] = math.log10
# L1 norms are mapped to index 1, and L2 norms to index 2
transform_funcs["norm"] = lambda x: 2 if x == "l2" else 1
# Unigrams are mapped to index 1 and bigrams to index 2
transform_funcs["ngram_range"] = lambda x: x[1]

fig = px.parallel_coordinates(
    cv_results[column_results].apply(transform_funcs),
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis_r,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "Parallel coordinates plot of text classifier pipeline",
        "y": 0.99,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig


The parallel coordinates plot displays the values of the hyperparameters on different columns while the performance metric is color coded. It is possible to select a range of results by clicking and holding on any axis of the parallel coordinate plot. You can then slide (move) the range selection and cross two selections to see the intersections. You can undo a selection by clicking once again on the same axis.

In particular for this hyperparameter search, it is interesting to notice that the best performing models do not seem to depend on the regularization norm, but they do depend on max_df, min_df and the regularization strength alpha. The reason is that including noisy features (i.e. max_df close to \(1.0\) or min_df close to \(0\)) tends to cause overfitting and therefore requires stronger regularization to compensate. Having fewer features requires less regularization and less scoring time.

The best accuracy scores are obtained when alpha is between \(10^{-6}\) and \(10^0\), regardless of the hyperparameter norm.
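These observations can be double-checked from the raw search results (a minimal sketch, not part of the original example) by sorting the candidates by mean test score:

# Sketch: show the five best candidates of the random search.
top_columns = param_names + ["mean_test_score"]
print(cv_results.sort_values("mean_test_score", ascending=False)[top_columns].head())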

Total running time of the script: (0 minutes 35.067 seconds)

Related examples

Classification of text documents using sparse features

Comparing Random Forests and Histogram Gradient Boosting models

Column Transformer with Mixed Types

Comparing randomized search and grid search for hyperparameter estimation
