Comparing Random Forests and Histogram Gradient Boosting models

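In this example we compare the performance of Random Forest (RF) and Histogram Gradient Boosting (HGBT) models in terms of score and computation time on a regression dataset, though all the concepts presented here apply to classification as well.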

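The comparison is made by varying the parameter that controls the number of trees in each estimator:

n_estimators controls the number of trees in the forest. It is a fixed number.
max_iter is the maximum number of iterations in a gradient boosting based model. The number of iterations corresponds to the number of trees for regression and binary classification problems. Furthermore, the actual number of trees required by the model depends on the stopping criteria.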

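HGBT uses gradient boosting to iteratively improve the model's performance by fitting each tree to the negative gradient of the loss function with respect to the predicted value. RFs, on the other hand, are based on bagging and use a majority vote to predict the outcome (for regression, as in this example, the predictions of the trees are averaged).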
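The difference between the two strategies can be made concrete with a hand-rolled sketch. This is not how scikit-learn implements either estimator; it only assumes the half squared error loss, for which the negative gradient is the residual, and uses illustrative names (`X_demo`, `learning_rate`, 50 trees of depth 3) that do not come from the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem (illustrative only).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
rng = np.random.default_rng(0)

# Boosting: trees are built sequentially, each one fit to the negative
# gradient of the loss. For 0.5 * (y - pred) ** 2 that is the residual.
learning_rate = 0.1
boosted_pred = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - boosted_pred) ** 2)
for _ in range(50):
    negative_gradient = y_demo - boosted_pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, negative_gradient)
    boosted_pred += learning_rate * tree.predict(X_demo)
boosted_mse = np.mean((y_demo - boosted_pred) ** 2)

# Bagging: trees are built independently on bootstrap samples and their
# predictions are averaged (regression counterpart of a majority vote).
bagged_preds = []
for _ in range(50):
    idx = rng.integers(0, len(y_demo), size=len(y_demo))
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    bagged_preds.append(tree.predict(X_demo))
bagged_pred = np.mean(bagged_preds, axis=0)
bagged_mse = np.mean((y_demo - bagged_pred) ** 2)

print(f"Training MSE before boosting: {initial_mse:.1f}")
print(f"Training MSE after boosting:  {boosted_mse:.1f}")
print(f"Training MSE of bagged trees: {bagged_mse:.1f}")
```

In the boosting loop each tree depends on all the previous ones, whereas the bagged trees could be fit in any order, which is why RF parallelizes naturally over trees.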

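See the User Guide for more information on ensemble models, or see Features in Histogram Gradient Boosting Trees for an example showcasing some other features of HGBT models.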

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Load dataset

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

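HGBT uses a histogram-based algorithm on binned feature values that can efficiently handle large datasets (tens of thousands of samples or more) with a high number of features (see Why it's Faster). The scikit-learn implementation of RF does not use binning and relies on exact splitting, which can be computationally expensive.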

print(f"The dataset consists of {n_samples} samples and {n_features} features")

The dataset consists of 20640 samples and 8 features
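To give an intuition for why binning helps, the following sketch bins one continuous feature into at most 255 integer bins using quantiles and compares the number of candidate split thresholds before and after. It is only an illustration written with NumPy, not scikit-learn's internal binning code; the synthetic `feature` and the variable names are assumptions, while 255 matches the default `max_bins` of the HGBT estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.lognormal(size=10_000)  # one continuous feature (synthetic)

max_bins = 255
# Interior quantiles serve as bin edges, so each bin holds a similar
# number of samples.
quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
bin_edges = np.unique(np.quantile(feature, quantiles))

# Map every value to the index of its bin; indices fit in one byte.
binned = np.searchsorted(bin_edges, feature, side="right").astype(np.uint8)

# Exact splitting considers a threshold between every pair of distinct
# consecutive values; binned splitting only considers the bin edges.
n_exact_candidates = np.unique(feature).size - 1
n_binned_candidates = bin_edges.size

print(f"Candidate thresholds without binning: {n_exact_candidates}")
print(f"Candidate thresholds with binning:    {n_binned_candidates}")
```

The number of candidate thresholds per feature no longer grows with the number of samples once the values are binned, which is the main source of the speed-up on large datasets.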

Compute score and computation times

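Notice that many parts of the implementation of HistGradientBoostingClassifier and HistGradientBoostingRegressor are parallelized by default.

The implementation of RandomForestRegressor and RandomForestClassifier can also be run on multiple cores by using the n_jobs parameter, here set to match the number of physical cores on the host machine. See Parallelism for more information.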

import joblib

N_CORES = joblib.cpu_count(only_physical_cores=True)
print(f"Number of physical cores: {N_CORES}")

Number of physical cores: 1

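Unlike RF, HGBT models offer an early-stopping option (see Early stopping in Gradient Boosting) to avoid adding new unnecessary trees. Internally, the algorithm uses an out-of-sample set to compute the generalization performance of the model each time a tree is added. Thus, if the generalization performance has not improved for more than n_iter_no_change iterations, it stops adding trees.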

The other parameters of both models were tuned, but the procedure is not shown here to keep the example simple.

import pandas as pd

from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}
param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}
cv = KFold(n_splits=4, shuffle=True, random_state=0)

results = []
for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)
    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)

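Note

Tuning n_estimators for RF generally results in a waste of computing power. In practice, one just needs to ensure that it is large enough that doubling its value does not lead to a significant improvement of the test score.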
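The doubling rule in this note can be sketched as a simple loop. The `tolerance` threshold, the starting value of 10 trees, the cap of 320 trees, and the synthetic data are all hypothetical choices made for illustration; what counts as a "significant" improvement depends on the problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)

tolerance = 0.005  # hypothetical: smallest R2 gain considered significant
n_trees = 10
previous_score = -float("inf")
history = []  # (n_estimators, mean cross-validated R2) pairs

while n_trees <= 320:
    rf = RandomForestRegressor(
        n_estimators=n_trees, min_samples_leaf=5, random_state=0
    )
    score = cross_val_score(rf, X_demo, y_demo, cv=4).mean()
    history.append((n_trees, score))
    if score - previous_score < tolerance:
        break  # doubling no longer helps: the previous size was enough
    previous_score = score
    n_trees *= 2

for n, s in history:
    print(f"n_estimators={n:>3}: mean R2 = {s:.3f}")
```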

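Plot results

We can use plotly.express.scatter to visualize the trade-off between elapsed computing time and mean test score. Hovering over a given point displays the corresponding parameter value. Error bars correspond to one standard deviation, computed across the folds of the cross-validation.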

import plotly.colors as colors
import plotly.express as px
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = next(iter(param_grids[model_name].keys()))
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=param_name,
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=param_name,
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)

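Both HGBT and RF models improve when increasing the number of trees in the ensemble. However, the scores reach a plateau where adding new trees just makes fitting and scoring slower. The RF model reaches such a plateau earlier and never reaches the test score of the largest HGBT model.

Note that the results shown in the plot above can change slightly across runs, and even more significantly when running on other machines: try running this example on your own local machine.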

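Overall, one should often observe that the histogram-based gradient boosting models uniformly dominate the Random Forest models in the "test score vs training speed" trade-off (the HGBT curve should lie to the top left of the RF curve, without ever crossing it). The "test score vs prediction speed" trade-off is more open to dispute, but it is most often favorable to HGBT. It is always a good idea to check both kinds of model (with hyperparameter tuning) and compare their performance on your specific problem to determine which one is the best fit, but HGBT almost always offers a more favorable speed-accuracy trade-off than RF, either with the default hyperparameters or when including the hyperparameter tuning cost.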

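There is one exception to this rule of thumb, though: when training a multiclass classification model with a large number of possible classes, HGBT internally fits one tree per class at each boosting iteration, while the trees used by RF models are naturally multiclass, which should improve the speed-accuracy trade-off of the RF models in this case.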

Total running time of the script: (1 minute 21.297 seconds)