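Release Highlights for scikit-learn 1.4#
We are pleased to announce the release of scikit-learn 1.4! Many bug fixes and improvements were added, as well as some new key features. We detail below a few of the major features of this release. For an exhaustive list of all the changes, please refer to the release notes.
To install the latest version (with pip):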
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
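HistGradientBoosting Natively Supports Categorical DTypes in DataFrames#
ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now directly support dataframes with categorical features. Here we have a dataset with a mixture of categorical and numerical features: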
from sklearn.datasets import fetch_openml
X_adult, y_adult = fetch_openml("adult", version=2, return_X_y=True)
# Remove redundant and non-feature columns
X_adult = X_adult.drop(["education-num", "fnlwgt"], axis="columns")
X_adult.dtypes
age int64
workclass category
education category
marital-status category
occupation category
relationship category
race category
sex category
capital-gain int64
capital-loss int64
hours-per-week int64
native-country category
dtype: object
By setting categorical_features="from_dtype", the gradient boosting classifier treats the columns with categorical dtypes as categorical features in the algorithm:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_adult, y_adult, random_state=0)
hist = HistGradientBoostingClassifier(categorical_features="from_dtype")
hist.fit(X_train, y_train)
y_decision = hist.decision_function(X_test)
print(f"ROC AUC score is {roc_auc_score(y_test, y_decision)}")
ROC AUC score is 0.9285143440735038
Polars output in set_output#
scikit-learn's transformers now support polars output with the set_output API.
import polars as pl
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pl.DataFrame(
{"height": [120, 140, 150, 110, 100], "pet": ["dog", "cat", "dog", "cat", "cat"]}
)
preprocessor = ColumnTransformer(
[
("numerical", StandardScaler(), ["height"]),
("categorical", OneHotEncoder(sparse_output=False), ["pet"]),
],
verbose_feature_names_out=False,
)
preprocessor.set_output(transform="polars")
df_out = preprocessor.fit_transform(df)
df_out
print(f"Output type: {type(df_out)}")
Output type: <class 'polars.dataframe.frame.DataFrame'>
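Missing value support for Random Forest#
The classes ensemble.RandomForestClassifier and ensemble.RandomForestRegressor now support missing values. When training each individual tree, the splitter evaluates every potential threshold with the missing values sent either to the left node or to the right node. More details are available in the User Guide.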
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
forest = RandomForestClassifier(random_state=0).fit(X, y)
forest.predict(X)
array([0, 0, 1, 1])
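Add support for monotonic constraints in tree-based models#
While we added support for monotonic constraints in histogram-based gradient boosting in scikit-learn 0.23, we now support this feature for all other tree-based models, such as trees, random forests, extra-trees, and exact gradient boosting. Here, we show this feature for a random forest on a regression problem.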
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise
rf_no_cst = RandomForestRegressor().fit(X, y)
rf_cst = RandomForestRegressor(monotonic_cst=[1, 0]).fit(X, y)
disp = PartialDependenceDisplay.from_estimator(
rf_no_cst,
X,
features=[0],
feature_names=["feature 0"],
line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
PartialDependenceDisplay.from_estimator(
rf_cst,
X,
features=[0],
line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
ax=disp.axes_,
)
disp.axes_[0, 0].plot(
X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
disp.axes_[0, 0].legend()
plt.show()

Enriched estimator displays#
Estimator displays have been enriched. Consider forest, defined above:
forest
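One can access the documentation of the estimator by clicking on the icon "?" in the top right corner of the diagram.
In addition, the display changes color, from orange to blue, when the estimator is fitted. You can also get this information by hovering over the icon "i".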
from sklearn.base import clone
clone(forest) # the clone is not fitted
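Metadata Routing Support#
Many meta-estimators and cross-validation routines now support metadata routing; they are listed in the user guide. For instance, this is how you can do a nested cross-validation with sample weights and GroupKFold: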
import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_validate
# For now, metadata routing is disabled by default and needs to be explicitly
# enabled.
sklearn.set_config(enable_metadata_routing=True)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=5, noise=0.5)
rng = np.random.RandomState(7)
groups = rng.randint(0, 10, size=n_samples)
sample_weights = rng.rand(n_samples)
estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {"alpha": [0.1, 0.5, 1.0, 2.0]}
scoring_inner_cv = get_scorer("neg_mean_squared_error").set_score_request(
sample_weight=True
)
inner_cv = GroupKFold(n_splits=5)
grid_search = GridSearchCV(
estimator=estimator,
param_grid=hyperparameter_grid,
cv=inner_cv,
scoring=scoring_inner_cv,
)
outer_cv = GroupKFold(n_splits=5)
scorers = {
"mse": get_scorer("neg_mean_squared_error").set_score_request(sample_weight=True)
}
results = cross_validate(
grid_search,
X,
y,
cv=outer_cv,
scoring=scorers,
return_estimator=True,
params={"sample_weight": sample_weights, "groups": groups},
)
print("cv error on test sets:", results["test_mse"])
# Setting the flag to the default `False` to avoid interference with other
# scripts.
sklearn.set_config(enable_metadata_routing=False)
cv error on test sets: [-0.59599627 -0.35906833 -0.35244508 -0.1604721 -0.15021137]
Improved memory and runtime efficiency for PCA on sparse data#
PCA is now able to handle sparse matrices natively for the arpack solver by leveraging scipy.sparse.linalg.LinearOperator to avoid materializing large sparse matrices when performing the eigenvalue decomposition of the data set covariance matrix.
from time import time
import scipy.sparse as sp
from sklearn.decomposition import PCA
X_sparse = sp.random(m=1000, n=1000, random_state=0)
X_dense = X_sparse.toarray()
t0 = time()
PCA(n_components=10, svd_solver="arpack").fit(X_sparse)
time_sparse = time() - t0
t0 = time()
PCA(n_components=10, svd_solver="arpack").fit(X_dense)
time_dense = time() - t0
print(f"Speedup: {time_dense / time_sparse:.1f}x")
Speedup: 11.4x
Total running time of the script: (0 minutes 1.674 seconds)