Note
Go to the end to download the full example code, or run this example in your browser via JupyterLite or Binder.
Release Highlights for scikit-learn 1.1#
We are pleased to announce the release of scikit-learn 1.1! Many bug fixes and improvements were added, as well as some key new features. We detail below a few of the major features of this release. For an exhaustive list of all the changes, please refer to the release notes.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install -c conda-forge scikit-learn
Quantile loss in HistGradientBoostingRegressor#
HistGradientBoostingRegressor can model quantiles with loss="quantile" and the new parameter quantile.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
# Simple regression function for X * cos(X)
rng = np.random.RandomState(42)
X_1d = np.linspace(0, 10, num=2000)
X = X_1d.reshape(-1, 1)
y = X_1d * np.cos(X_1d) + rng.normal(scale=X_1d / 3)
quantiles = [0.95, 0.5, 0.05]
parameters = dict(loss="quantile", max_bins=32, max_iter=50)
hist_quantiles = {
    f"quantile={quantile:.2f}": HistGradientBoostingRegressor(
        **parameters, quantile=quantile
    ).fit(X, y)
    for quantile in quantiles
}
fig, ax = plt.subplots()
ax.plot(X_1d, y, "o", alpha=0.5, markersize=1)
for quantile, hist in hist_quantiles.items():
    ax.plot(X_1d, hist.predict(X), label=quantile)
_ = ax.legend(loc="lower left")

For a usecase example, see Features in Histogram Gradient Boosting Trees.
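As a quick sanity check (not part of the original example), we can verify that the empirical fraction of training targets falling at or below each predicted quantile curve roughly matches the requested level:

for quantile, hist in hist_quantiles.items():
    # The coverage should be close to the requested level
    # (0.95, 0.5, 0.05), up to estimation error.
    coverage = np.mean(y <= hist.predict(X))
    print(f"{quantile}: empirical coverage = {coverage:.3f}")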
get_feature_names_out Available in all Transformers#
get_feature_names_out is now available in all transformers, thereby completing SLEP007. This enables Pipeline to construct the output feature names for more complex pipelines:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
numeric_features = ["age", "fare"]
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_features = ["embarked", "pclass"]
preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            categorical_features,
        ),
    ],
    verbose_feature_names_out=False,
)
log_reg = make_pipeline(preprocessor, SelectKBest(k=7), LogisticRegression())
log_reg.fit(X, y)
Here we slice the pipeline to include all steps except the last one. The output feature names of this pipeline slice are the features that are fed into the logistic regression. These names correspond directly to the coefficients in the logistic regression:
import pandas as pd
log_reg_input_features = log_reg[:-1].get_feature_names_out()
pd.Series(log_reg[-1].coef_.ravel(), index=log_reg_input_features).plot.bar()
plt.tight_layout()

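The same accessor also works on any single fitted transformer; here is a minimal illustration (not part of the original example) reusing the numeric Titanic columns from above (StandardScaler disregards NaNs during fit):

# Every fitted transformer now exposes get_feature_names_out; with a
# DataFrame input, the output names are the input column names.
scaler = StandardScaler().fit(X[numeric_features])
print(scaler.get_feature_names_out())  # expected: ['age' 'fare']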
Grouping infrequent categories in OneHotEncoder#
OneHotEncoder supports aggregating infrequent categories into a single output for each feature. The parameters to enable gathering infrequent categories are min_frequency and max_categories. See the User Guide for more details.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)
enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
Since dog and snake are infrequent categories, they are grouped together when transformed:
encoded = enc.transform(np.array([["dog"], ["snake"], ["cat"], ["rabbit"]]))
pd.DataFrame(encoded, columns=enc.get_feature_names_out())
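Alternatively, max_categories caps the number of output features per input feature; a minimal sketch (not part of the original example):

# With max_categories=3, only the two most frequent categories ('cat' and
# 'rabbit') keep their own column; 'dog' and 'snake' share the infrequent one.
enc_max = OneHotEncoder(max_categories=3, sparse_output=False).fit(X)
print(enc_max.get_feature_names_out())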
Performance improvements#
Pairwise distance computations on dense float64 datasets have been refactored to better leverage non-blocking thread parallelism. For example, neighbors.NearestNeighbors.kneighbors and neighbors.NearestNeighbors.radius_neighbors can respectively be up to 20× and 5× faster than previously. Overall, a number of other functions and estimators now benefit from this improved performance.
To learn more about the technical details of this work, you can read this suite of blog posts.
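A rough way to observe the effect locally is to time a query on dense float64 data; the sketch below is not part of the original example, the dataset sizes are arbitrary, and absolute timings depend heavily on the machine and threading setup:

from time import perf_counter

from sklearn.neighbors import NearestNeighbors

rng_bench = np.random.RandomState(0)
X_dense = rng_bench.uniform(size=(20_000, 100))  # dense float64 dataset

nn = NearestNeighbors(n_neighbors=10).fit(X_dense)
tic = perf_counter()
nn.kneighbors(X_dense[:1_000])
print(f"kneighbors on 1000 queries: {perf_counter() - tic:.3f}s")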
Moreover, the computation of the loss functions has been refactored with Cython, improving the performance of several estimators.
MiniBatchNMF: an online version of NMF#
The new class MiniBatchNMF implements a faster but less accurate version of non-negative matrix factorization (NMF). MiniBatchNMF divides the data into mini-batches and optimizes the NMF model in an online manner by cycling over the mini-batches, making it better suited for large datasets. In particular, it implements partial_fit, which can be used for online learning when the data is not readily available from the start, or when the data does not fit into memory.
import numpy as np
from sklearn.decomposition import MiniBatchNMF
rng = np.random.RandomState(0)
n_samples, n_features, n_components = 10, 10, 5
true_W = rng.uniform(size=(n_samples, n_components))
true_H = rng.uniform(size=(n_components, n_features))
X = true_W @ true_H
nmf = MiniBatchNMF(n_components=n_components, random_state=0)
for _ in range(10):
    nmf.partial_fit(X)
W = nmf.transform(X)
H = nmf.components_
X_reconstructed = W @ H
print(
    "relative reconstruction error: ",
    f"{np.sum((X - X_reconstructed) ** 2) / np.sum(X**2):.5f}",
)
relative reconstruction error: 0.00364
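Since partial_fit updates the model one batch at a time, the same X could also be consumed in chunks, as it would arrive in a stream; a sketch reusing the arrays above (the chunk size of 5 is arbitrary):

# Update the model incrementally on consecutive chunks of 5 samples.
nmf_stream = MiniBatchNMF(n_components=n_components, random_state=0)
for start in range(0, n_samples, 5):
    nmf_stream.partial_fit(X[start : start + 5])
W_stream = nmf_stream.transform(X)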
BisectingKMeans: divide and cluster#
The new class BisectingKMeans is a variant of KMeans, using divisive hierarchical clustering. Instead of creating all centroids at once, centroids are picked progressively based on a previous clustering: a cluster is repeatedly split into two new clusters until the target number of clusters is reached, giving a hierarchical structure to the clustering.
import matplotlib.pyplot as plt
from sklearn.cluster import BisectingKMeans, KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, centers=2, random_state=0)
km = KMeans(n_clusters=5, random_state=0, n_init="auto").fit(X)
bisect_km = BisectingKMeans(n_clusters=5, random_state=0).fit(X)
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)
ax[0].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=20, c="r")
ax[0].set_title("KMeans")
ax[1].scatter(X[:, 0], X[:, 1], s=10, c=bisect_km.labels_)
ax[1].scatter(
    bisect_km.cluster_centers_[:, 0], bisect_km.cluster_centers_[:, 1], s=20, c="r"
)
_ = ax[1].set_title("BisectingKMeans")

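BisectingKMeans also exposes a bisecting_strategy parameter that controls which cluster is split next ("biggest_inertia" by default, or "largest_cluster"); a minimal sketch (not part of the original example):

# Split the cluster with the most samples instead of the one with the
# largest inertia.
bisect_lc = BisectingKMeans(
    n_clusters=5, bisecting_strategy="largest_cluster", random_state=0
).fit(X)
print(bisect_lc.cluster_centers_.shape)  # (5, 2)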
Total running time of the script: (0 minutes 0.716 seconds)
Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>