Release Highlights for scikit-learn 1.4

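We are pleased to announce the release of scikit-learn 1.4! It brings many bug fixes and improvements, as well as some key new features. We detail a few of the major features of this release below. For an exhaustive list of all the changes, please refer to the release notes.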

To install the latest version (with pip):

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn
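
After upgrading, it can be worth confirming which version is actually importable in the current environment. The snippet below is a minimal sketch of such a check; the variable name `has_1_4_features` is only illustrative.

```python
import sklearn

# Take the first two numeric components of a version string such as "1.4.0".
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
has_1_4_features = (major, minor) >= (1, 4)

print(f"scikit-learn {sklearn.__version__}")
print(f"1.4 features available: {has_1_4_features}")
```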

HistGradientBoosting Natively Supports Categorical DTypes in DataFrames

ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now directly support dataframes with categorical features. Here we have a dataset with a mixture of categorical and numerical features:

from sklearn.datasets import fetch_openml

X_adult, y_adult = fetch_openml("adult", version=2, return_X_y=True)

# Remove redundant and non-feature columns
X_adult = X_adult.drop(["education-num", "fnlwgt"], axis="columns")
X_adult.dtypes

age                  int64
workclass         category
education         category
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
dtype: object

By setting categorical_features="from_dtype", the gradient boosting classifier treats the columns with categorical dtypes as categorical features in the algorithm:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_adult, y_adult, random_state=0)
hist = HistGradientBoostingClassifier(categorical_features="from_dtype")

hist.fit(X_train, y_train)
y_decision = hist.decision_function(X_test)
print(f"ROC AUC score is {roc_auc_score(y_test, y_decision)}")

ROC AUC score is 0.9285143440735038

Polars output in `set_output`

scikit-learn's transformers now support polars output with the set_output API.

import polars as pl

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pl.DataFrame(
    {"height": [120, 140, 150, 110, 100], "pet": ["dog", "cat", "dog", "cat", "cat"]}
)
preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), ["height"]),
        ("categorical", OneHotEncoder(sparse_output=False), ["pet"]),
    ],
    verbose_feature_names_out=False,
)
preprocessor.set_output(transform="polars")

df_out = preprocessor.fit_transform(df)
df_out

shape: (5, 3)

height	pet_cat	pet_dog
f64	f64	f64
-0.215666	0.0	1.0
0.862662	1.0	0.0
1.401826	0.0	1.0
-0.754829	1.0	0.0
-1.293993	1.0	0.0

print(f"Output type: {type(df_out)}")

Output type: <class 'polars.dataframe.frame.DataFrame'>

Missing value support for Random Forest

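The classes ensemble.RandomForestClassifier and ensemble.RandomForestRegressor now support missing values. When training each individual tree, the splitter evaluates every potential threshold twice: once with the missing values sent to the left node and once with them sent to the right node. More details are in the User Guide.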

import numpy as np

from sklearn.ensemble import RandomForestClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

forest = RandomForestClassifier(random_state=0).fit(X, y)
forest.predict(X)

array([0, 0, 1, 1])

Add support for monotonic constraints in tree-based models

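While we added support for monotonic constraints in histogram-based gradient boosting in scikit-learn 0.23, we now support this feature for all other tree-based models, such as trees, random forests, extra-trees, and exact gradient boosting. Here, we show this feature for a random forest on a regression problem.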

import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise

rf_no_cst = RandomForestRegressor().fit(X, y)
rf_cst = RandomForestRegressor(monotonic_cst=[1, 0]).fit(X, y)

disp = PartialDependenceDisplay.from_estimator(
    rf_no_cst,
    X,
    features=[0],
    feature_names=["feature 0"],
    line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
PartialDependenceDisplay.from_estimator(
    rf_cst,
    X,
    features=[0],
    line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
disp.axes_[0, 0].plot(
    X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
disp.axes_[0, 0].legend()
plt.show()

Enriched estimator displays

Estimator displays have been enriched. Consider forest, defined above:

forest

RandomForestClassifier(random_state=0)


You can access the documentation of the estimator by clicking the "?" icon in the top right corner of the diagram.

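In addition, the display changes color, from orange to blue, when the estimator is fitted. You can also get this information by hovering over the "i" icon.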

from sklearn.base import clone

clone(forest)  # the clone is not fitted

RandomForestClassifier(random_state=0)
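
The color only helps in a rendered notebook. Outside of one, a comparable check can be done in code with `check_is_fitted`; the helper `is_fitted` below is a hypothetical convenience wrapper written for this sketch, not a function introduced by this release.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    """Return True if check_is_fitted accepts the estimator (illustrative helper)."""
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True


X = np.array([[0.0], [1.0], [6.0], [7.0]])
y = [0, 0, 1, 1]
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(f"forest fitted: {is_fitted(forest)}")
# clone() copies the hyperparameters but not the learned state.
print(f"clone fitted: {is_fitted(clone(forest))}")
```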


Metadata Routing Support

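Many meta-estimators and cross-validation routines now support metadata routing; they are listed in the user guide. For instance, this is how you can do a nested cross-validation with sample weights and GroupKFold: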

import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_validate

# For now, metadata routing is disabled by default and needs to be explicitly
# enabled.
sklearn.set_config(enable_metadata_routing=True)

n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=5, noise=0.5)
rng = np.random.RandomState(7)
groups = rng.randint(0, 10, size=n_samples)
sample_weights = rng.rand(n_samples)
estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {"alpha": [0.1, 0.5, 1.0, 2.0]}
scoring_inner_cv = get_scorer("neg_mean_squared_error").set_score_request(
    sample_weight=True
)
inner_cv = GroupKFold(n_splits=5)

grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=hyperparameter_grid,
    cv=inner_cv,
    scoring=scoring_inner_cv,
)

outer_cv = GroupKFold(n_splits=5)
scorers = {
    "mse": get_scorer("neg_mean_squared_error").set_score_request(sample_weight=True)
}
results = cross_validate(
    grid_search,
    X,
    y,
    cv=outer_cv,
    scoring=scorers,
    return_estimator=True,
    params={"sample_weight": sample_weights, "groups": groups},
)
print("cv error on test sets:", results["test_mse"])

# Setting the flag to the default `False` to avoid interference with other
# scripts.
sklearn.set_config(enable_metadata_routing=False)

cv error on test sets: [-0.59599627 -0.35906833 -0.35244508 -0.1604721  -0.15021137]

Improved memory and runtime efficiency for PCA on sparse data

PCA is now able to handle sparse matrices natively for the arpack solver by leveraging scipy.sparse.linalg.LinearOperator to avoid materializing large sparse matrices when performing the eigenvalue decomposition of the data set covariance matrix.

from time import time

import scipy.sparse as sp

from sklearn.decomposition import PCA

X_sparse = sp.random(m=1000, n=1000, random_state=0)
X_dense = X_sparse.toarray()

t0 = time()
PCA(n_components=10, svd_solver="arpack").fit(X_sparse)
time_sparse = time() - t0

t0 = time()
PCA(n_components=10, svd_solver="arpack").fit(X_dense)
time_dense = time() - t0

print(f"Speedup: {time_dense / time_sparse:.1f}x")

Speedup: 11.4x

Total running time of the script: (0 minutes 1.674 seconds)