Release Highlights for scikit-learn 1.1

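We are pleased to announce the release of scikit-learn 1.1! It brings many bug fixes and improvements, as well as some new key features. We detail a few of the major features of this release below. For an exhaustive list of all the changes, please refer to the release notes.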

To install the latest version (with pip):

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn
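
After upgrading, it can be useful to confirm which version Python actually imports. The snippet below is a minimal sketch: `parse_major_minor` and `has_1_1_features` are hypothetical names introduced for illustration, and the regular expression only assumes that the version string starts with `major.minor`.

```python
import re

import sklearn


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.1.3' or '1.2rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


# Hypothetical helper variable: True when the features described on this page
# (introduced in 1.1) should be available in the imported scikit-learn.
installed = parse_major_minor(sklearn.__version__)
has_1_1_features = installed >= (1, 1)
print(f"scikit-learn {sklearn.__version__}, 1.1 features available: {has_1_1_features}")
```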

Quantile loss in HistGradientBoostingRegressor

HistGradientBoostingRegressor can model quantiles with loss="quantile" and the new parameter quantile.

import matplotlib.pyplot as plt
import numpy as np

from sklearn.ensemble import HistGradientBoostingRegressor

# Simple regression function for X * cos(X)
rng = np.random.RandomState(42)
X_1d = np.linspace(0, 10, num=2000)
X = X_1d.reshape(-1, 1)
y = X_1d * np.cos(X_1d) + rng.normal(scale=X_1d / 3)

quantiles = [0.95, 0.5, 0.05]
parameters = dict(loss="quantile", max_bins=32, max_iter=50)
hist_quantiles = {
    f"quantile={quantile:.2f}": HistGradientBoostingRegressor(
        **parameters, quantile=quantile
    ).fit(X, y)
    for quantile in quantiles
}

fig, ax = plt.subplots()
ax.plot(X_1d, y, "o", alpha=0.5, markersize=1)
for quantile, hist in hist_quantiles.items():
    ax.plot(X_1d, hist.predict(X), label=quantile)
_ = ax.legend(loc="lower left")

For a use case example, see Features in Histogram Gradient Boosting Trees.

get_feature_names_out Available in all Transformers

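get_feature_names_out is now available in all transformers, which completes the implementation of SLEP007. This enables Pipeline to construct the output feature names for more complex pipelines: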

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
numeric_features = ["age", "fare"]
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
categorical_features = ["embarked", "pclass"]

preprocessor = ColumnTransformer(
    [
        ("num", numeric_transformer, numeric_features),
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            categorical_features,
        ),
    ],
    verbose_feature_names_out=False,
)
log_reg = make_pipeline(preprocessor, SelectKBest(k=7), LogisticRegression())
log_reg.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['embarked', 'pclass'])],
                                   verbose_feature_names_out=False)),
                ('selectkbest', SelectKBest(k=7)),
                ('logisticregression', LogisticRegression())])


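Here we slice the pipeline to include all the steps except the last one. The output feature names of this pipeline slice are the features passed to the logistic regression. These names correspond directly to the coefficients of the logistic regression: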

import pandas as pd

log_reg_input_features = log_reg[:-1].get_feature_names_out()
pd.Series(log_reg[-1].coef_.ravel(), index=log_reg_input_features).plot.bar()
plt.tight_layout()

Grouping infrequent categories in OneHotEncoder

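OneHotEncoder supports aggregating infrequent categories into a single output for each feature. The parameters that enable the gathering of infrequent categories are min_frequency and max_categories. See the User Guide for more details.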

import numpy as np

from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)
enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]

Since dog and snake are infrequent categories, they are grouped together when transformed:

encoded = enc.transform(np.array([["dog"], ["snake"], ["cat"], ["rabbit"]]))
pd.DataFrame(encoded, columns=enc.get_feature_names_out())
   x0_cat  x0_rabbit  x0_infrequent_sklearn
0     0.0        0.0                    1.0
1     0.0        0.0                    1.0
2     1.0        0.0                    0.0
3     0.0        1.0                    0.0


Performance improvements

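Reductions on pairwise distances for dense float64 datasets have been refactored to better take advantage of non-blocking thread parallelism. For example, neighbors.NearestNeighbors.kneighbors and neighbors.NearestNeighbors.radius_neighbors can be up to 20 times and 5 times faster than before, respectively. As a result, a range of functions and estimators that rely on these computations now benefit from improved performance.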

To learn more about the technical details of this work, you can read this suite of blog posts.
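
No code change is needed to benefit from this work. As an illustration, the sketch below runs the two methods mentioned above on dense float64 data. Choosing algorithm="brute" is our assumption, made so that the queries go through pairwise distance computations; any speedup actually observed depends on the installed version, the hardware and the number of threads, so no timing is claimed here.

```python
import numpy as np

from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_dense = rng.uniform(size=(2000, 50))  # NumPy produces float64 by default
queries = rng.uniform(size=(100, 50))

# Brute-force search computes distances between every query and every sample.
nn = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X_dense)

# 5 nearest neighbors of each query, sorted by increasing distance.
distances, indices = nn.kneighbors(queries)

# All neighbors within a fixed radius; one variable-length array per query.
radius_distances, radius_indices = nn.radius_neighbors(queries, radius=2.0)

print(distances.shape, indices.shape, len(radius_indices))
```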

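In addition, the computation of loss functions has been refactored using Cython, resulting in performance improvements for several estimators.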

MiniBatchNMF: an online version of NMF

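The new class MiniBatchNMF implements a faster but less accurate version of non-negative matrix factorization (NMF). MiniBatchNMF divides the data into mini-batches and optimizes the NMF model in an online manner by cycling over the mini-batches, which makes it better suited for large datasets. In particular, it implements partial_fit, which can be used for online learning when the data is not readily available from the start, or when the data does not fit into memory.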

import numpy as np

from sklearn.decomposition import MiniBatchNMF

rng = np.random.RandomState(0)
n_samples, n_features, n_components = 10, 10, 5
true_W = rng.uniform(size=(n_samples, n_components))
true_H = rng.uniform(size=(n_components, n_features))
X = true_W @ true_H

nmf = MiniBatchNMF(n_components=n_components, random_state=0)

for _ in range(10):
    nmf.partial_fit(X)

W = nmf.transform(X)
H = nmf.components_
X_reconstructed = W @ H

print(
    "relative reconstruction error: ",
    f"{np.sum((X - X_reconstructed) ** 2) / np.sum(X**2):.5f}",
)
relative reconstruction error:  0.00364

BisectingKMeans: divide and cluster

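The new class BisectingKMeans is a variant of KMeans that uses divisive hierarchical clustering. Instead of creating all centroids at once, centroids are picked progressively based on a previous clustering: a cluster is repeatedly split into two new clusters until the target number of clusters is reached, which gives a hierarchical structure to the clustering.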

import matplotlib.pyplot as plt

from sklearn.cluster import BisectingKMeans, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=2, random_state=0)

km = KMeans(n_clusters=5, random_state=0, n_init="auto").fit(X)
bisect_km = BisectingKMeans(n_clusters=5, random_state=0).fit(X)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)
ax[0].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=20, c="r")
ax[0].set_title("KMeans")

ax[1].scatter(X[:, 0], X[:, 1], s=10, c=bisect_km.labels_)
ax[1].scatter(
    bisect_km.cluster_centers_[:, 0], bisect_km.cluster_centers_[:, 1], s=20, c="r"
)
_ = ax[1].set_title("BisectingKMeans")

Total running time of the script: (0 minutes 0.716 seconds)

Related examples

Release Highlights for scikit-learn 1.0

Release Highlights for scikit-learn 0.23

Release Highlights for scikit-learn 1.3

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
