备注

Go to the end 下载完整的示例代码。或者通过浏览器中的MysterLite或Binder运行此示例

梯度增强树的梯度中的功能#

基于直方图的梯度提升（HGBT）模型可能是scikit-learn中最有用的监督学习模型之一。它们基于与LightGBM和XGboost相当的现代梯度增强实施。因此，HGBT模型比随机森林等替代模型特征更丰富，并且往往优于随机森林等替代模型，特别是当样本数量超过一万个时（请参阅比较随机森林和柱状图梯度增强模型 ).

HGBT型号的顶级可用性功能包括：

均值和分位数回归任务的几个可用损失函数，请参阅 Quantile loss .
类别功能支持，看到了梯度提升中的分类特征支持 .
提前停止。
缺失值支持，这避免了对估算器的需要。
单调约束 .
互动限制 .

此示例旨在展示现实生活中除2和6之外的所有点。

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

准备数据#

的 electricity dataset 由从澳大利亚新南威尔士州电力市场收集的数据组成。在这个市场中，价格不是固定的，受到供需的影响。它们每五分钟设置一次。往返邻近维多利亚州的电力转移是为了缓解波动。

该数据集最初命名为ELEC2，包含1996年5月7日至1998年12月5日的45，312个实例。数据集的每个样本是指30分钟的时间段，即一天的每个时间段有48个实例。数据集上的每个样本有7列：

日期：1996年5月7日至1998年12月5日。标准化在0和1之间;
天：一周中的哪一天（1-7）;
周期：24小时内每隔半小时。标准化在0和1之间;
nswprice/nswdemand：新南威尔士州的电价/需求;
vicprice/vicdemand：维多利亚州的电价/需求。

最初，这是一项分类任务，但这里我们将其用于回归任务来预测各州之间的预定电力传输。

from sklearn.datasets import fetch_openml

electricity = fetch_openml(
    name="electricity", version=1, as_frame=True, parser="pandas"
)
df = electricity.frame

这个特定的数据集对于前17，760个样本有一个逐步恒定的目标：

df["transfer"][:17_760].unique()

array([0.414912, 0.500526])

让我们放弃这些条目，探索一周中不同日子的每小时电力传输：

import matplotlib.pyplot as plt
import seaborn as sns

df = electricity.frame.iloc[17_760:]
X = df.drop(columns=["transfer", "class"])
y = df["transfer"]

fig, ax = plt.subplots(figsize=(15, 10))
pointplot = sns.lineplot(x=df["period"], y=df["transfer"], hue=df["day"], ax=ax)
handles, labels = ax.get_legend_handles_labels()
ax.set(
    title="Hourly energy transfer for different days of the week",
    xlabel="Normalized time of the day",
    ylabel="Normalized energy transfer",
)
_ = ax.legend(handles, ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"])

Hourly energy transfer for different days of the week

注意，能量转移在周末系统地增加。

树木数量和提前停止的影响#

为了说明（最大）树木数量的影响，我们训练了一个 HistGradientBoostingRegressor 使用整个数据集进行日常电力传输。然后我们根据 max_iter 参数.在这里，我们不会试图评估模型的性能及其概括能力，而是评估其从训练数据中学习的能力。

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=False)

print(f"Training sample size: {X_train.shape[0]}")
print(f"Test sample size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

Training sample size: 16531
Test sample size: 11021
Number of features: 7

max_iter_list = [5, 50]
average_week_demand = (
    df.loc[X_test.index].groupby(["day", "period"], observed=False)["transfer"].mean()
)
colors = sns.color_palette("colorblind")
fig, ax = plt.subplots(figsize=(10, 5))
average_week_demand.plot(color=colors[0], label="recorded average", linewidth=2, ax=ax)

for idx, max_iter in enumerate(max_iter_list):
    hgbt = HistGradientBoostingRegressor(
        max_iter=max_iter, categorical_features=None, random_state=42
    )
    hgbt.fit(X_train, y_train)

    y_pred = hgbt.predict(X_test)
    prediction_df = df.loc[X_test.index].copy()
    prediction_df["y_pred"] = y_pred
    average_pred = prediction_df.groupby(["day", "period"], observed=False)[
        "y_pred"
    ].mean()
    average_pred.plot(
        color=colors[idx + 1], label=f"max_iter={max_iter}", linewidth=2, ax=ax
    )

ax.set(
    title="Predicted average energy transfer during the week",
    xticks=[(i + 0.2) * 48 for i in range(7)],
    xticklabels=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    xlabel="Time of the week",
    ylabel="Normalized energy transfer",
)
_ = ax.legend()

Predicted average energy transfer during the week

只需几次迭代，HGBT模型就可以实现融合（请参阅比较随机森林和柱状图梯度增强模型），这意味着添加更多树不会再改进模型。在上图中，5次迭代不足以获得良好的预测。经过50次迭代，我们已经能够做得很好。

设置 max_iter 太高可能降低预测质量并花费大量可避免的计算资源。因此，scikit-learn中的HGBT实现提供了一个自动的 early stopping 战略通过它，该模型使用一小部分训练数据作为内部验证集 (validation_fraction ），如果验证分数没有提高（或下降），则停止训练 n_iter_no_change 迭代达到一定公差 (tol ).

请注意，两者之间存在权衡 learning_rate 和 max_iter ：一般来说，更小的学习率是可取的，但需要更多的迭代才能收敛到最小损失，而更大的学习率收敛得更快（需要更少的迭代/树），但以更大的最小损失为代价。

由于学习率与迭代次数之间的高度相关性，一个好的做法是调整学习率以及所有（重要的）其他超参数，将HBGT适应训练集中的足够大的值 max_iter 并确定最佳的 max_iter 通过提前停止和一些明确的 validation_fraction .

common_params = {
    "max_iter": 1_000,
    "learning_rate": 0.3,
    "validation_fraction": 0.2,
    "random_state": 42,
    "categorical_features": None,
    "scoring": "neg_root_mean_squared_error",
}

hgbt = HistGradientBoostingRegressor(early_stopping=True, **common_params)
hgbt.fit(X_train, y_train)

_, ax = plt.subplots()
plt.plot(-hgbt.validation_score_)
_ = ax.set(
    xlabel="number of iterations",
    ylabel="root mean squared error",
    title=f"Loss of hgbt with early stopping (n_iter={hgbt.n_iter_})",
)

Loss of hgbt with early stopping (n_iter=392)

然后我们可以重写的值 max_iter 达到合理的值并避免内部验证的额外计算成本。四舍五入迭代次数可能会解释训练集的可变性：

import math

common_params["max_iter"] = math.ceil(hgbt.n_iter_ / 100) * 100
common_params["early_stopping"] = False
hgbt = HistGradientBoostingRegressor(**common_params)

备注

提前停止期间进行的内部验证对于时间序列来说并不是最佳的。

对缺失值的支持#

HGBT模型对缺失值有原生支持。在训练过程中，树生长者根据潜在的增益决定在每次拆分时缺失值的样本应该去哪里（左或右孩子）。当预测时，这些样本被相应地发送给学习的孩子。如果一个特征在训练过程中没有缺失值，那么对于预测，该特征具有缺失值的样本将被发送给具有最多样本的子样本（如在拟合过程中所见）。

本示例显示了HGBT回归如何处理完全随机缺失值（MCAR），即缺失不依赖于观察数据或未观察数据。我们可以通过将随机选择的要素中的值随机替换为 nan 价值观

import numpy as np

from sklearn.metrics import root_mean_squared_error

rng = np.random.RandomState(42)
first_week = slice(0, 336)  # first week in the test set as 7 * 48 = 336
missing_fraction_list = [0, 0.01, 0.03]


def generate_missing_values(X, missing_fraction):
    total_cells = X.shape[0] * X.shape[1]
    num_missing_cells = int(total_cells * missing_fraction)
    row_indices = rng.choice(X.shape[0], num_missing_cells, replace=True)
    col_indices = rng.choice(X.shape[1], num_missing_cells, replace=True)
    X_missing = X.copy()
    X_missing.iloc[row_indices, col_indices] = np.nan
    return X_missing


fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(y_test.values[first_week], label="Actual transfer")

for missing_fraction in missing_fraction_list:
    X_train_missing = generate_missing_values(X_train, missing_fraction)
    X_test_missing = generate_missing_values(X_test, missing_fraction)
    hgbt.fit(X_train_missing, y_train)
    y_pred = hgbt.predict(X_test_missing[first_week])
    rmse = root_mean_squared_error(y_test[first_week], y_pred)
    ax.plot(
        y_pred[first_week],
        label=f"missing_fraction={missing_fraction}, RMSE={rmse:.3f}",
        alpha=0.5,
    )
ax.set(
    title="Daily energy transfer predictions on data with MCAR values",
    xticks=[(i + 0.2) * 48 for i in range(7)],
    xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    xlabel="Time of the week",
    ylabel="Normalized energy transfer",
)
_ = ax.legend(loc="lower right")

Daily energy transfer predictions on data with MCAR values

正如预期的那样，模型随着缺失值比例的增加而退化。

支持分位数损失#

回归中的分位数损失可以查看目标变量的变异性或不确定性。例如，预测第5百分位数和第95百分位数可以提供90%的预测区间，即我们预计新观察值以90%的可能性落在该范围内的范围。

from sklearn.metrics import mean_pinball_loss

quantiles = [0.95, 0.05]
predictions = []

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(y_test.values[first_week], label="Actual transfer")

for quantile in quantiles:
    hgbt_quantile = HistGradientBoostingRegressor(
        loss="quantile", quantile=quantile, **common_params
    )
    hgbt_quantile.fit(X_train, y_train)
    y_pred = hgbt_quantile.predict(X_test[first_week])

    predictions.append(y_pred)
    score = mean_pinball_loss(y_test[first_week], y_pred)
    ax.plot(
        y_pred[first_week],
        label=f"quantile={quantile}, pinball loss={score:.2f}",
        alpha=0.5,
    )

ax.fill_between(
    range(len(predictions[0][first_week])),
    predictions[0][first_week],
    predictions[1][first_week],
    color=colors[0],
    alpha=0.1,
)
ax.set(
    title="Daily energy transfer predictions with quantile loss",
    xticks=[(i + 0.2) * 48 for i in range(7)],
    xticklabels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    xlabel="Time of the week",
    ylabel="Normalized energy transfer",
)
_ = ax.legend(loc="lower right")

Daily energy transfer predictions with quantile loss

我们观察到高估能量转移的倾向。这可以通过计算经验覆盖率（如在 calibration of confidence intervals section .请记住，这些预测的百分位数只是模型的估计。人们仍然可以通过以下方式提高此类估计的质量：

收集更多数据点;
更好地调整模型超参数，请参阅梯度Boosting回归的预测区间 ;
从相同的数据中设计更多的预测特征，请参阅与时间相关的特征工程 .

单调约束#

给定要求特征与目标之间的关系单调增加或减少的特定领域知识，可以使用单调约束在HGBT模型的预测中强制执行此类行为。这使得模型更具可解释性，并且可以在存在偏差增加的风险的情况下减少其方差（并可能减轻过度匹配）。单调约束还可用于执行特定的监管要求、确保合规性并符合道德考虑。

In the present example, the policy of transferring energy from Victoria to New South Wales is meant to alleviate price fluctuations, meaning that the model predictions have to enforce such goal, i.e. transfer should increase with price and demand in New South Wales, but also decrease with price and demand in Victoria, in order to benefit both populations.

如果训练数据具有特征名称，则可以通过传递具有约定的字典来指定单调约束：

1：单调增加
0：无约束
-1：单调减少

或者，可以传递按位置编码上述约定的类数组对象。

from sklearn.inspection import PartialDependenceDisplay

monotonic_cst = {
    "date": 0,
    "day": 0,
    "period": 0,
    "nswdemand": 1,
    "nswprice": 1,
    "vicdemand": -1,
    "vicprice": -1,
}
hgbt_no_cst = HistGradientBoostingRegressor(
    categorical_features=None, random_state=42
).fit(X, y)
hgbt_cst = HistGradientBoostingRegressor(
    monotonic_cst=monotonic_cst, categorical_features=None, random_state=42
).fit(X, y)

fig, ax = plt.subplots(nrows=2, figsize=(15, 10))
disp = PartialDependenceDisplay.from_estimator(
    hgbt_no_cst,
    X,
    features=["nswdemand", "nswprice"],
    line_kw={"linewidth": 2, "label": "unconstrained", "color": "tab:blue"},
    ax=ax[0],
)
PartialDependenceDisplay.from_estimator(
    hgbt_cst,
    X,
    features=["nswdemand", "nswprice"],
    line_kw={"linewidth": 2, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
disp = PartialDependenceDisplay.from_estimator(
    hgbt_no_cst,
    X,
    features=["vicdemand", "vicprice"],
    line_kw={"linewidth": 2, "label": "unconstrained", "color": "tab:blue"},
    ax=ax[1],
)
PartialDependenceDisplay.from_estimator(
    hgbt_cst,
    X,
    features=["vicdemand", "vicprice"],
    line_kw={"linewidth": 2, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
_ = plt.legend()

观察到 nswdemand 和 vicdemand 看起来已经单调无限制了。这是一个很好的例子，表明具有单调性约束的模型是“过度约束”的。

此外，我们可以验证模型的预测质量不会通过引入单调约束而显着降低。为此目的，我们使用 TimeSeriesSplit 交叉验证以估计测试分数的方差。通过这样做，我们可以保证训练数据不会接替测试数据，这在处理具有时间关系的数据时至关重要。

from sklearn.metrics import make_scorer, root_mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, cross_validate

ts_cv = TimeSeriesSplit(n_splits=5, gap=48, test_size=336)  # a week has 336 samples
scorer = make_scorer(root_mean_squared_error)

cv_results = cross_validate(hgbt_no_cst, X, y, cv=ts_cv, scoring=scorer)
rmse = cv_results["test_score"]
print(f"RMSE without constraints = {rmse.mean():.3f} +/- {rmse.std():.3f}")

cv_results = cross_validate(hgbt_cst, X, y, cv=ts_cv, scoring=scorer)
rmse = cv_results["test_score"]
print(f"RMSE with constraints    = {rmse.mean():.3f} +/- {rmse.std():.3f}")

RMSE without constraints = 0.103 +/- 0.030
RMSE with constraints    = 0.107 +/- 0.034

话虽如此，请注意，比较是在两个不同模型之间进行的，这些模型可以通过不同的超参数组合进行优化。这就是为什么我们不使用 common_params 在本节中与之前一样。

Total running time of the script: （0分15.239秒）