Getting Started#

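The purpose of this guide is to illustrate some of the main features that scikit-learn provides. It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.). Please refer to our installation instructions for installing scikit-learn.

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.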

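Fitting and predicting: estimator basics#

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

Here is a simple example where we fit a RandomForestClassifier to some very basic data: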

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],  # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]  # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)

The fit method generally accepts 2 inputs:

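The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1D array where the i-th entry corresponds to the target of the i-th sample (row) of X.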

Both X and y are usually expected to be NumPy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.
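As a small illustration of these input conventions, the sketch below builds X and y as NumPy arrays, checks their shapes, and then fits the same kind of estimator on a SciPy sparse copy of X. The toy data and the choice of LogisticRegression (one estimator that accepts sparse input) are assumptions made for this example only.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): 4 samples as rows, 3 features as columns.
X = np.array([[1.0, 2.0, 3.0],
              [11.0, 12.0, 13.0],
              [2.0, 3.0, 4.0],
              [12.0, 13.0, 14.0]])
y = np.array([0, 1, 0, 1])  # one target per row of X

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)

# Fit on the dense array.
clf_dense = LogisticRegression().fit(X, y)
pred_dense = clf_dense.predict(X)

# Fit a second estimator on a sparse (CSR) copy of the same data.
X_sparse = sparse.csr_matrix(X)
clf_sparse = LogisticRegression().fit(X_sparse, y)
pred_sparse = clf_sparse.predict(X_sparse)
```

Not every estimator supports sparse input, so it is worth checking the documentation of the estimator you plan to use.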

Once the estimator is fitted, it can be used to predict target values of new data. You don't need to re-train the estimator:

>>> clf.predict(X)  # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
array([0, 1])
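Beyond hard class labels, many classifiers can also report class probabilities. The following sketch reuses the toy data from above and calls predict_proba, which RandomForestClassifier provides; it is meant only as an illustration that a fitted estimator can be queried repeatedly without refitting.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
clf = RandomForestClassifier(random_state=0).fit(X, y)

X_new = [[4, 5, 6], [14, 15, 16]]
labels = clf.predict(X_new)       # predicted class for each new sample
proba = clf.predict_proba(X_new)  # one row per sample, one column per class

print(clf.classes_)  # column order of predict_proba
print(labels)
print(proba)
```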

You can check Choosing the right estimator to learn how to choose the right model for your use case.

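Transformers and pre-processors#

Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.

In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). Transformer objects don't have a predict method but rather a transform method that outputs a newly transformed sample matrix X: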

>>> from sklearn.preprocessing import StandardScaler
>>> X = [[0, 15],
...      [1, -10]]
>>> # scale data according to computed scaling values
>>> StandardScaler().fit(X).transform(X)
array([[-1.,  1.],
       [ 1., -1.]])
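To make the split between fit and transform more concrete, here is a minimal sketch that fits a StandardScaler once, inspects the statistics it learned, and then applies them to a new sample. The extra sample X_new is an assumption added for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 15.0],
                    [1.0, -10.0]])

# fit() learns the per-feature mean and standard deviation.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature means of X_train
print(scaler.scale_)  # per-feature standard deviations of X_train

# transform() reuses those statistics on data that was not seen during fit.
X_new = np.array([[0.5, 2.5]])  # hypothetical new sample
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Because X_new happens to sit exactly at the training means, it is mapped to zeros in both columns.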

Sometimes, you want to apply different transformations to different features: the ColumnTransformer is designed for these use cases.
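A minimal sketch of that idea follows. It assumes a small made-up array in which the first two columns are continuous and the third is a category code, then scales the former and one-hot encodes the latter. The step names "scale" and "onehot" are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: columns 0 and 1 are continuous, column 2 is a category code.
X = np.array([[0.0, 15.0, 0.0],
              [1.0, -10.0, 1.0],
              [2.0, 5.0, 0.0]])

ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), [0, 1]),  # standardize the continuous columns
        ("onehot", OneHotEncoder(), [2]),     # expand the category column
    ],
    sparse_threshold=0,  # always return a dense array
)

X_t = ct.fit_transform(X)
print(X_t.shape)  # 2 scaled columns + 2 one-hot columns
print(X_t)
```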

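Pipelines: chaining pre-processors and estimators#

Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. As we will see later, using a pipeline will also protect you from data leakage, i.e. disclosing some testing data in your training data.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data: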

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
...
>>> # create a pipeline object
>>> pipe = make_pipeline(
...     StandardScaler(),
...     LogisticRegression()
... )
...
>>> # load the iris dataset and split it into train and test sets
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # fit the whole pipeline
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
>>> # we can now use it like any other estimator
>>> accuracy_score(y_test, pipe.predict(X_test))
0.97...
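To show what the pipeline does internally, the sketch below compares it with the equivalent manual steps: fit the scaler on the training data only, then reuse it to transform the test data. It is an illustration under the same setup as the example above, not a description of the library internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# The scaler inside the pipeline only saw the training data.
fitted_scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))

# Manual equivalent: fit the scaler on the train set, reuse it on the test set.
manual_scaler = StandardScaler().fit(X_train)
manual_clf = LogisticRegression().fit(manual_scaler.transform(X_train), y_train)
manual_pred = manual_clf.predict(manual_scaler.transform(X_test))

same_predictions = np.array_equal(manual_pred, pipe.predict(X_test))
print(uses_train_only, same_predictions)
```

The pipeline version is shorter and removes the risk of accidentally fitting the scaler on the test data.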

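Model evaluation#

Fitting a model to some data does not mean that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation.

We briefly show here how to perform cross-validation using the cross_validate helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. Please refer to our User Guide for more details: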

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_validate
...
>>> X, y = make_regression(n_samples=1000, random_state=0)
>>> lr = LinearRegression()
...
>>> result = cross_validate(lr, X, y)  # defaults to 5-fold CV
>>> result['test_score']  # r_squared score is high because dataset is easy
array([1., 1., 1., 1., 1.])
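The options mentioned above (manual iteration over folds, an explicit splitting strategy, and a different scoring function) can be sketched as follows. The smaller noisy dataset and the choice of the "neg_mean_absolute_error" scorer are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical, slightly noisy regression problem.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
lr = LinearRegression()
kf = KFold(n_splits=5)  # explicit splitting strategy

# Manual iteration over the folds: fit on the train part, score on the test part.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# The same folds through cross_validate, with the default scorer (R^2 here).
result = cross_validate(lr, X, y, cv=kf)

# A different scoring function, selected by name.
mae = cross_validate(lr, X, y, cv=kf, scoring="neg_mean_absolute_error")

print(np.round(manual_scores, 3))
print(np.round(result["test_score"], 3))
print(np.round(mae["test_score"], 3))
```

Scorers follow a "higher is better" convention, which is why the mean absolute error is reported as a negative number.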

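Automatic parameter searches#

All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a RandomForestRegressor has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.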
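As a quick illustration, parameters are set in the constructor and can be read or changed later through get_params and set_params. The particular values below are arbitrary.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyper-parameters are passed to the constructor...
reg = RandomForestRegressor(n_estimators=10, max_depth=3)

# ...can be inspected as a dictionary...
params = reg.get_params()
print(params["n_estimators"], params["max_depth"])

# ...and can be updated before (re)fitting.
reg.set_params(max_depth=5)
print(reg.max_depth)
```

Search tools rely on this same interface to try out candidate parameter values.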

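Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has been fitted with the best set of parameters. Read more in the User Guide: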

>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
...
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
...
>>> # define the parameter space that will be searched over
>>> param_distributions = {'n_estimators': randint(1, 5),
...                        'max_depth': randint(5, 10)}
...
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': ...,
                                        'n_estimators': ...},
                   random_state=0)
>>> search.best_params_
{'max_depth': 9, 'n_estimators': 4}

>>> # the search object now acts like a normal random forest estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...
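After fitting, the search object also exposes the per-candidate results and the refitted best model. The sketch below shows one way to inspect them; it uses a small synthetic dataset from make_regression instead of the California housing data so that it runs without a download, which is an assumption of this example.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical stand-in dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

param_distributions = {"n_estimators": randint(1, 5),
                       "max_depth": randint(5, 10)}
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0,
)
search.fit(X, y)

# One entry per sampled candidate, with its mean cross-validated score.
mean_scores = search.cv_results_["mean_test_score"]
for candidate, score in zip(search.cv_results_["params"], mean_scores):
    print(candidate, round(float(score), 3))

# The estimator refitted on all the data with the best parameters found.
best_model = search.best_estimator_
print(search.best_params_, round(float(search.best_score_), 3))
```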

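Note

In practice, you almost always want to search over a pipeline, instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you would be breaking the fundamental assumption of independence between training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets is available to the train sets. This will lead to over-estimating the generalization power of the estimator (you can read more in this Kaggle post).

Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.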
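A minimal sketch of searching over a pipeline is given below. It assumes the Iris data and a StandardScaler plus LogisticRegression pipeline, and tunes the regularization strength C. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, where the step names are the lowercased class names generated by make_pipeline.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The search range for C is an arbitrary choice for this illustration.
param_distributions = {"logisticregression__C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=5, random_state=0)
search.fit(X_train, y_train)  # the scaler is refitted inside every CV split

score = search.score(X_test, y_test)
print(search.best_params_)
print(round(score, 3))
```

Because the scaler is part of the estimator being cross-validated, its statistics are computed only from the training portion of each split.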

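Next steps#

We have briefly covered estimator fitting and predicting, pre-processing steps, pipelines, cross-validation tools and automatic hyper-parameter searches. This guide should give you an overview of some of the main features of the library, but there is much more to scikit-learn!

Please refer to our User Guide for details on all the tools that we provide. You can also consult the API Reference.

You can also look at our numerous examples that illustrate the use of scikit-learn in many different contexts.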