HalvingRandomSearchCV#

class sklearn.model_selection.HalvingRandomSearchCV(estimator, param_distributions, *, n_candidates='exhaust', factor=3, resource='n_samples', max_resources='auto', min_resources='smallest', aggressive_elimination=False, cv=5, scoring=None, refit=True, error_score=nan, return_train_score=True, random_state=None, n_jobs=None, verbose=0)[source]#

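Randomized search on hyper parameters.

The search strategy starts by evaluating all the candidates with a small amount of resources, then iteratively selects the best candidates, using more and more resources.

The candidates are sampled at random from the parameter space, and the number of sampled candidates is determined by n_candidates.

Read more in the User guide.

Note

This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_halving_search_cv: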

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_halving_search_cv # noqa
>>> # now you can import normally from model_selection
>>> from sklearn.model_selection import HalvingRandomSearchCV

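Parameters:

estimator : estimator object

This is assumed to implement the scikit-learn estimator interface. Either the estimator needs to provide a score function, or scoring must be passed.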

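param_distributions : dict or list of dicts

Dictionary with parameter names (str) as keys and distributions or lists of parameters to try as values. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.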
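The sampling rules above can be previewed with ParameterSampler, which accepts the same kind of specification. This is only a sketch: the parameter names and ranges are illustrative assumptions, and using ParameterSampler to inspect candidates is a convenience chosen here, not something this class requires.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# A single dict: lists are sampled uniformly, distributions through rvs.
param_distributions = {
    "max_depth": [3, 5, None],
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
}

# An integer random_state makes the draw repeatable.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
same_again = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

# A list of dicts: one dict is picked uniformly, then sampled as above.
param_distribution_list = [
    {"max_depth": [3, 5], "min_samples_split": randint(2, 6)},
    {"max_depth": [None], "min_samples_split": randint(6, 11)},
]
mixed = list(ParameterSampler(param_distribution_list, n_iter=5, random_state=0))

for params in candidates:
    print(params)
```

Every candidate drawn from the list of dicts stays within one of the two dicts: a finite max_depth always comes with a min_samples_split below 6, and max_depth=None with one of 6 or more.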

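n_candidates : 'exhaust' or int, default='exhaust'

The number of candidate parameters to sample at the first iteration. Using 'exhaust' will sample enough candidates so that the last iteration uses as many resources as possible, based on min_resources, max_resources and factor. In this case, min_resources cannot be 'exhaust'.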

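factor : int or float, default=3

The 'halving' parameter, which determines the proportion of candidates that are selected for each subsequent iteration. For example, factor=3 means that only one third of the candidates are selected.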

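resource : 'n_samples' or str, default='n_samples'

Defines the resource that increases with each iteration. By default, the resource is the number of samples. It can also be set to any parameter of the base estimator that accepts positive integer values, e.g. 'n_iterations' or 'n_estimators' for a gradient boosting estimator. In this case max_resources cannot be 'auto' and must be set explicitly.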

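max_resources : int, default='auto'

The maximum number of resources that any candidate is allowed to use for a given iteration. By default, this is set to n_samples when resource='n_samples' (default); otherwise an error is raised.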

min_resources : {'exhaust', 'smallest'} or int, default='smallest'

The minimum amount of resource that any candidate is allowed to use for a given iteration. Equivalently, this defines the amount of resources r0 that are allocated for each candidate at the first iteration.

'smallest' is a heuristic that sets r0 to a small value:
- n_splits * 2 when resource='n_samples' for a regression problem
- n_classes * n_splits * 2 when resource='n_samples' for a classification problem
- 1 when resource != 'n_samples'

'exhaust' will set r0 such that the last iteration uses as many resources as possible. Namely, the last iteration will use the highest value smaller than max_resources that is a multiple of both min_resources and factor. In general, using 'exhaust' leads to a more accurate estimator, but is slightly more time consuming. 'exhaust' isn't available when n_candidates='exhaust'.

Note that the amount of resources used at each iteration is always a multiple of min_resources.
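The 'smallest' heuristic amounts to a few lines of arithmetic. The helper below is a hypothetical illustration of the three rules listed above, not a function provided by scikit-learn; n_classes=None is used here to mean a regression problem.

```python
def smallest_min_resources(resource, n_splits, n_classes=None):
    """Value of r0 under the 'smallest' heuristic (hypothetical helper).

    n_classes=None stands for a regression problem.
    """
    if resource != "n_samples":
        return 1
    r0 = n_splits * 2
    if n_classes is not None:
        r0 *= n_classes
    return r0


print(smallest_min_resources("n_samples", n_splits=5))               # 10
print(smallest_min_resources("n_samples", n_splits=5, n_classes=3))  # 30
print(smallest_min_resources("n_estimators", n_splits=5))            # 1
```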

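aggressive_elimination : bool, default=False

This is only relevant in cases where there aren't enough resources to reduce the remaining candidates to at most factor after the last iteration. If True, the search process will 'replay' the first iteration for as long as needed until the number of candidates is small enough. This is False by default, which means that the last iteration may evaluate more than factor candidates. See Aggressive elimination of candidates for more details.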
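To make the interplay of factor, min_resources, max_resources and aggressive_elimination concrete, here is a minimal sketch of the resulting schedule. The helper names are hypothetical, the factor is assumed to be an integer, and the code only follows the rules described in this section; it is not the library's implementation.

```python
from math import ceil


def floor_log(n, base):
    """Largest integer k with base**k <= n (integer arithmetic, n >= 1)."""
    k = 0
    while base ** (k + 1) <= n:
        k += 1
    return k


def halving_schedule(n_candidates, min_resources, max_resources, factor=3,
                     aggressive_elimination=False):
    """Return one (n_candidates, n_resources) pair per iteration."""
    # Iterations needed so that the last one evaluates at most `factor` candidates.
    n_required = 1 + floor_log(n_candidates, factor)
    # Iterations affordable when resources are multiplied by `factor` each time.
    n_possible = 1 + floor_log(max_resources // min_resources, factor)
    if aggressive_elimination:
        n_iterations = n_required
    else:
        n_iterations = min(n_required, n_possible)

    schedule = []
    for itr in range(n_iterations):
        power = itr
        if aggressive_elimination:
            # Replay the first iteration's budget until the rest fits.
            power = max(0, itr - n_required + n_possible)
        n_resources = min(min_resources * factor ** power, max_resources)
        schedule.append((n_candidates, n_resources))
        n_candidates = ceil(n_candidates / factor)
    return schedule


print(halving_schedule(6, 20, 40, factor=2))
# [(6, 20), (3, 40)]
print(halving_schedule(6, 20, 40, factor=2, aggressive_elimination=True))
# [(6, 20), (3, 20), (2, 40)]
```

In the first call the budget runs out after two iterations, so the last iteration still evaluates 3 candidates, more than factor=2. With aggressive elimination the first budget of 20 is reused once, and the final iteration evaluates only 2 candidates.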

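cv : int, cross-validation generator or an iterable, default=5

Determines the cross-validation splitting strategy. Possible inputs for cv are:

- integer, to specify the number of folds in a (Stratified)KFold,
- CV splitter,
- An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False, so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Note

Due to implementation details, the folds produced by cv must be the same across multiple calls to cv.split(). For built-in scikit-learn iterators, this can be achieved by deactivating shuffling (shuffle=False), or by setting the cv's random_state parameter to an integer.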
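The requirement on repeated cv.split() calls can be checked directly. The sketch below uses a toy dataset and a hypothetical folds helper; it shows a shuffled splitter that satisfies the requirement and one that, as an assumption about typical usage, should be avoided here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples, two balanced classes.
X = np.zeros((12, 1))
y = np.array([0, 1] * 6)


def folds(cv):
    """Materialize the test indices of every split as plain lists."""
    return [test.tolist() for _, test in cv.split(X, y)]


# Shuffling with an integer random_state: repeated split() calls agree.
stable_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print(folds(stable_cv) == folds(stable_cv))

# Shuffling with a RandomState instance: the generator advances between
# calls, so the folds generally differ. Not suitable for this search.
unstable_cv = StratifiedKFold(n_splits=3, shuffle=True,
                              random_state=np.random.RandomState(0))
print(folds(unstable_cv) == folds(unstable_cv))
```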

scoring : str, callable, or None, default=None

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Callable scorers) to evaluate the predictions on the test set. If None, the estimator's score method is used.
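As an illustration of the two accepted forms, the sketch below builds the same metric once from a predefined scorer name and once as a callable. The dataset, the estimator and the choice of macro-averaged F1 are assumptions made for the example; scoring on the training data is done only to show the calling convention.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, get_scorer, make_scorer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Form 1: a predefined scorer name, which could be passed as scoring="f1_macro".
by_name = get_scorer("f1_macro")

# Form 2: a callable with signature scorer(estimator, X, y), built here
# from a metric function with make_scorer.
by_callable = make_scorer(f1_score, average="macro")

name_value = by_name(clf, X, y)
callable_value = by_callable(clf, X, y)
print(name_value, callable_value)
```

Either object can be given as the scoring argument; both produce the same value for a given estimator and dataset.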

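refit : bool, default=True

If True, refit an estimator using the best found parameters on the whole dataset.

The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this HalvingRandomSearchCV instance.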

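error_score : 'raise' or numeric

Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error. Default is np.nan.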

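return_train_score : bool, default=True

If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However, computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.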

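random_state : int, RandomState instance or None, default=None

Pseudo random number generator state used for subsampling the dataset when resources != 'n_samples'. Also used for random uniform sampling from lists of possible values instead of scipy.stats distributions. Pass an int for reproducible output across multiple function calls. See Glossary.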

n_jobs : int or None, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbose : int

Controls the verbosity: the higher, the more messages.

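Attributes:

n_resources_ : list of int

The amount of resources used at each iteration.

n_candidates_ : list of int

The number of candidate parameters that were evaluated at each iteration.

n_remaining_candidates_ : int

The number of candidate parameters that are left after the last iteration. It corresponds to ceil(n_candidates[-1] / factor).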

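max_resources_ : int

The maximum number of resources that any candidate is allowed to use for a given iteration. Note that since the number of resources used at each iteration must be a multiple of min_resources_, the actual number of resources used at the last iteration may be smaller than max_resources_.

min_resources_ : int

The amount of resources that are allocated for each candidate at the first iteration.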

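n_iterations_ : int

The actual number of iterations that were run. This is equal to n_required_iterations_ if aggressive_elimination is True. Otherwise, this is equal to min(n_possible_iterations_, n_required_iterations_).

n_possible_iterations_ : int

The number of iterations that are possible starting with min_resources_ resources and without exceeding max_resources_.

n_required_iterations_ : int

The number of iterations that are required to end up with fewer than factor candidates at the last iteration, starting with min_resources_ resources. This will be smaller than n_possible_iterations_ when there aren't enough resources.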
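The resource and iteration attributes above can be read off a fitted search. The sketch below is one possible setup: the synthetic dataset, the decision tree and the parameter ranges are all illustrative assumptions, chosen to keep the run short.

```python
from math import ceil

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem; sizes and parameter ranges are illustrative.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_distributions = {"max_depth": randint(1, 10),
                       "min_samples_split": randint(2, 11)}

search = HalvingRandomSearchCV(DecisionTreeClassifier(random_state=0),
                               param_distributions, factor=3,
                               random_state=0).fit(X, y)

print("resources per iteration: ", search.n_resources_)
print("candidates per iteration:", search.n_candidates_)
print("iterations run / possible / required:", search.n_iterations_,
      search.n_possible_iterations_, search.n_required_iterations_)
print("remaining candidates:", search.n_remaining_candidates_)
```

The first entry of n_resources_ equals min_resources_, every entry is a multiple of it, and n_remaining_candidates_ follows from the last entry of n_candidates_ and factor.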

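cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame. It contains lots of information for analysing the results of a search. Please refer to the User guide for details.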
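A quick way to get a feel for cv_results_ is to summarize it per iteration. The sketch below stays with NumPy rather than pandas and assumes that the dict exposes 'iter' and 'n_resources' columns next to the usual 'mean_test_score'; the dataset and parameter range are illustrative.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
search = HalvingRandomSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": randint(1, 10)},  # illustrative range
    random_state=0,
).fit(X, y)

results = search.cv_results_
iters = np.asarray(results["iter"])
resources = np.asarray(results["n_resources"])
scores = np.asarray(results["mean_test_score"], dtype=float)

# One row per (candidate, iteration) pair: summarize each iteration.
for itr in np.unique(iters):
    mask = iters == itr
    print(f"iter {itr}: {mask.sum()} candidates, "
          f"n_resources={resources[mask][0]}, "
          f"best mean_test_score={np.nanmax(scores[mask]):.3f}")
```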

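best_estimator_ : estimator or dict

Estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss if specified) on the left-out data. Not available if refit=False.

best_score_ : float

Mean cross-validated score of the best_estimator.

best_params_ : dict

Parameter setting that gave the best results on the hold-out data.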

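best_index_ : int

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).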
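The relationship between best_index_, best_params_ and best_score_ can be verified on a fitted search. This is a small sketch on the iris data with an illustrative parameter range.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = HalvingRandomSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": randint(1, 6)},  # illustrative range
    random_state=0,
).fit(X, y)

i = search.best_index_
params_match = search.cv_results_["params"][i] == search.best_params_
score_match = search.cv_results_["mean_test_score"][i] == search.best_score_
print(params_match, score_match)
```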

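scorer_ : function or a dict

Scorer function used on the held-out data to choose the best parameters for the model.

n_splits_ : int

The number of cross-validation splits (folds/iterations).

refit_time_ : float

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.

multimetric_ : bool

Whether or not the scorers compute several metrics.

classes_ : ndarray of shape (n_classes,)

The class labels.

n_features_in_ : int

Number of features seen during fit.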

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes feature_names_in_ when fit.

Added in version 1.0.

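See also

HalvingGridSearchCV: Search over a grid of parameters using successive halving.

Notes

The parameters selected are those that maximize the score of the held-out data, according to the scoring parameter.

All parameter combinations scored with a NaN will share the lowest rank.

Examples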

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.experimental import enable_halving_search_cv  # noqa
>>> from sklearn.model_selection import HalvingRandomSearchCV
>>> from scipy.stats import randint
>>> import numpy as np
...
>>> X, y = load_iris(return_X_y=True)
>>> clf = RandomForestClassifier(random_state=0)
>>> np.random.seed(0)
...
>>> param_distributions = {"max_depth": [3, None],
...                        "min_samples_split": randint(2, 11)}
>>> search = HalvingRandomSearchCV(clf, param_distributions,
...                                resource='n_estimators',
...                                max_resources=10,
...                                random_state=0).fit(X, y)
>>> search.best_params_
{'max_depth': None, 'min_samples_split': 10, 'n_estimators': 9}

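decision_function(X)[source]#

Call decision_function on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports decision_function.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Returns:

y_score (ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)): Result of the decision function for X based on the estimator with the best found parameters.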

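fit(X, y=None, **params)[source]#

Run fit with all sets of parameters.

Parameters:

X (array-like, shape (n_samples, n_features)): Training vector, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape (n_samples,) or (n_samples, n_output), optional): Target relative to X for classification or regression; None for unsupervised learning.
**params (dict of string -> object): Parameters passed to the fit method of the estimator.

Returns:

self (object): Instance of fitted estimator.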

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Added in version 1.4.

Returns:

routing (MetadataRouter): A MetadataRouter encapsulating routing information.

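get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.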

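inverse_transform(X=None, Xt=None)[source]#

Call inverse_transform on the estimator with the best found parameters.

Only available if the underlying estimator implements inverse_transform and refit=True.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.
Xt (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Deprecated since version 1.5: Xt was deprecated in 1.5 and will be removed in 1.7. Use X instead.

Returns:

X ({ndarray, sparse matrix} of shape (n_samples, n_features)): Result of the inverse_transform function for Xt based on the estimator with the best found parameters.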

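predict(X)[source]#

Call predict on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred (ndarray of shape (n_samples,)): The predicted labels or values for X based on the estimator with the best found parameters.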

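predict_log_proba(X)[source]#

Call predict_log_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_log_proba.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred (ndarray of shape (n_samples,) or (n_samples, n_classes)): Predicted class log-probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.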

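predict_proba(X)[source]#

Call predict_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_proba.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred (ndarray of shape (n_samples,) or (n_samples, n_classes)): Predicted class probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.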

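score(X, y=None, **params)[source]#

Return the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the best_estimator_.score method otherwise.

Parameters:

X (array-like of shape (n_samples, n_features)): Input data, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples, n_output) or (n_samples,), default=None): Target relative to X for classification or regression; None for unsupervised learning.
**params (dict): Parameters to be passed to the underlying scorer(s).

Added in version 1.4: Only available if enable_metadata_routing=True. See the Metadata Routing User Guide for more details.

Returns:

score (float): The score defined by scoring if provided, and the best_estimator_.score method otherwise.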
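Both branches of this rule can be seen side by side. In the sketch below, one search leaves scoring unset and the other passes a scorer name; the train/test split, the estimator and the choice of balanced accuracy are assumptions made for the example.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import HalvingRandomSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
param_distributions = {"max_depth": randint(1, 6)}  # illustrative range

# scoring=None: score() falls back to best_estimator_.score (accuracy here).
default_search = HalvingRandomSearchCV(
    tree, param_distributions, random_state=0).fit(X_train, y_train)
default_score = default_search.score(X_test, y_test)
estimator_score = default_search.best_estimator_.score(X_test, y_test)

# scoring given: score() uses that scorer instead.
scored_search = HalvingRandomSearchCV(
    tree, param_distributions, scoring="balanced_accuracy",
    random_state=0).fit(X_train, y_train)
custom_score = scored_search.score(X_test, y_test)
manual_score = balanced_accuracy_score(y_test, scored_search.predict(X_test))

print(default_score, estimator_score)
print(custom_score, manual_score)
```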

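score_samples(X)[source]#

Call score_samples on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports score_samples.

Added in version 0.24.

Parameters:

X (iterable): Data to predict on. Must fulfill the input requirements of the underlying estimator.

Returns:

y_score (ndarray of shape (n_samples,)): The result of the best_estimator_.score_samples method.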

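set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.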
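For a search wrapped around a Pipeline, the <component>__<parameter> form chains through each level. The step names "scale" and "tree" below are arbitrary labels chosen for this sketch.

```python
from scipy.stats import randint
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("scale", StandardScaler()),
                 ("tree", DecisionTreeClassifier(random_state=0))])
search = HalvingRandomSearchCV(pipe, {"tree__max_depth": randint(1, 6)})

# Top-level parameter of the search object itself; set_params returns self.
returned = search.set_params(factor=2)

# Nested parameter: estimator -> pipeline step "tree" -> min_samples_leaf.
search.set_params(estimator__tree__min_samples_leaf=5)

params = search.get_params(deep=True)
print(params["factor"], params["estimator__tree__min_samples_leaf"])  # 2 5
```

Note the difference in prefixes: keys in param_distributions are relative to the estimator ("tree__max_depth"), while set_params on the search object needs the extra "estimator__" level.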

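transform(X)[source]#

Call transform on the estimator with the best found parameters.

Only available if the underlying estimator supports transform and refit=True.

Parameters:

X (indexable, length n_samples): Must fulfill the input assumptions of the underlying estimator.

Returns:

Xt ({ndarray, sparse matrix} of shape (n_samples, n_features)): X transformed in the new space based on the estimator with the best found parameters.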