pairwise_distances_chunked

sklearn.metrics.pairwise_distances_chunked(X, Y=None, *, reduce_func=None, metric='euclidean', n_jobs=None, working_memory=None, **kwds)

Generate a distance matrix chunk by chunk, with optional reduction.

In cases where not all of a pairwise distance matrix needs to be stored at once, this is used to calculate pairwise distances in working_memory-sized chunks. If reduce_func is given, it is run on each chunk and its return values are concatenated into lists, arrays or sparse matrices.

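Parameters:
X{array-like, sparse matrix} of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)

Array of pairwise distances between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric='precomputed' and (n_samples_X, n_features) otherwise.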

Y{array-like, sparse matrix} of shape (n_samples_Y, n_features), default=None

An optional second feature array. Only allowed if metric != "precomputed".

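reduce_funccallable, default=None

The function applied to each chunk of the distance matrix, reducing it to the needed values. reduce_func(D_chunk, start) is called repeatedly, where D_chunk is a contiguous vertical slice of the pairwise distance matrix, starting at row start. It should return one of: None; an array, a list, or a sparse matrix of length D_chunk.shape[0]; or a tuple of such objects. Returning None is useful for in-place operations, rather than reductions.

If None, pairwise_distances_chunked returns a generator of vertical chunks of the distance matrix.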

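metricstr or callable, default='euclidean'

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is "precomputed", X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value is recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.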

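n_jobsint, default=None

The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.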

working_memoryfloat, default=None

The sought maximum memory for temporary distance matrix chunks. When None (default), the value of sklearn.get_config()['working_memory'] is used.

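**kwdsoptional keyword parameters

Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Yields:
D_chunk{ndarray, sparse matrix}

A contiguous slice of the distance matrix, optionally processed by reduce_func.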

Examples

Without reduce_func:

>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances_chunked
>>> X = np.random.RandomState(0).rand(5, 3)
>>> D_chunk = next(pairwise_distances_chunked(X))
>>> D_chunk
array([[0.  ..., 0.29..., 0.41..., 0.19..., 0.57...],
       [0.29..., 0.  ..., 0.57..., 0.41..., 0.76...],
       [0.41..., 0.57..., 0.  ..., 0.44..., 0.90...],
       [0.19..., 0.41..., 0.44..., 0.  ..., 0.51...],
       [0.57..., 0.76..., 0.90..., 0.51..., 0.  ...]])
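The metric and **kwds parameters can be exercised the same way, without reduce_func. The following is a minimal sketch, not part of the original examples: manhattan and full_matrix are hypothetical helper names introduced here, and the sketch assumes that metric="minkowski" with p=1 is forwarded to the SciPy distance routine, where it coincides with the Manhattan distance.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(5, 3)

def manhattan(a, b):
    # Hypothetical callable metric: called once per pair of rows (1-D arrays).
    return np.abs(a - b).sum()

def full_matrix(**kwargs):
    # Hypothetical helper: stack the yielded row blocks into one (5, 5) matrix.
    return np.vstack(list(pairwise_distances_chunked(X, **kwargs)))

D_callable = full_matrix(metric=manhattan)          # metric given as a callable
D_named = full_matrix(metric="manhattan")           # metric given as a string
D_minkowski = full_matrix(metric="minkowski", p=1)  # p is passed through **kwds

print(np.allclose(D_callable, D_named), np.allclose(D_named, D_minkowski))
```

A callable metric is evaluated pair by pair in Python, so it is usually much slower than a named metric; it is shown here only to illustrate the interface.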

Retrieve all neighbors and average distance within radius r:

>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist
array([0.039..., 0.        , 0.        , 0.039..., 0.        ])
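reduce_func may also return None, in which case the generator yields None for each chunk and any useful result has to be stored by the function itself. The sketch below illustrates one way to do that, under the assumption that writing into a preallocated array is acceptable; radius, counts and count_within are hypothetical names chosen for this illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(5, 3)
radius = 0.5  # hypothetical threshold, chosen only for illustration
counts = np.zeros(X.shape[0], dtype=int)

def count_within(D_chunk, start):
    # Rows of D_chunk correspond to samples start .. start + n_rows - 1 of X.
    stop = start + D_chunk.shape[0]
    counts[start:stop] = (D_chunk < radius).sum(axis=1)
    # No return statement: the function returns None, so nothing is collected.

# The generator must be consumed for count_within to run on every chunk.
for _ in pairwise_distances_chunked(X, reduce_func=count_within):
    pass

print(counts)
```

Each entry of counts includes the sample itself, because the distance from a sample to itself is 0.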

Where r is defined per sample, we need to make use of start:

>>> r = [.2, .4, .4, .3, .1]
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r[i])
...              for i, d in enumerate(D_chunk, start)]
...     return neigh
>>> neigh = next(pairwise_distances_chunked(X, reduce_func=reduce_func))
>>> neigh
[array([0, 3]), array([0, 1]), array([2]), array([0, 3]), array([4])]
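When a second array Y is given and reduce_func returns a tuple, each yielded item is a tuple of per-chunk results that can be concatenated afterwards. The following sketch finds, for every row of X, the closest row of Y; the array Y and the helper nearest_in_Y are assumptions made for this illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(5, 3)
Y = rng.rand(4, 3)  # hypothetical second feature array

def nearest_in_Y(D_chunk, start):
    # D_chunk has shape (n_rows_in_chunk, n_samples_Y).
    idx = D_chunk.argmin(axis=1)
    dist = D_chunk[np.arange(D_chunk.shape[0]), idx]
    return idx, dist  # a tuple of arrays, each of length D_chunk.shape[0]

chunks = pairwise_distances_chunked(X, Y, reduce_func=nearest_in_Y)
idx_parts, dist_parts = zip(*chunks)

# Concatenate the per-chunk pieces so the result covers all rows of X.
nearest_idx = np.concatenate(idx_parts)
nearest_dist = np.concatenate(dist_parts)
print(nearest_idx.shape, nearest_dist.shape)
```

Concatenating after the loop keeps the code correct whether the generator produces a single chunk or many.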

Force row-by-row generation by reducing working_memory:

>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func,
...                                  working_memory=0)
>>> next(gen)
[array([0, 3])]
>>> next(gen)
[array([0, 1])]
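With a larger input, a small but nonzero working_memory splits the matrix into blocks of several rows. The sketch below assumes that working_memory is interpreted in MiB, as described for sklearn.set_config, and that the same limit can be supplied through sklearn.config_context; X_large is a hypothetical array introduced for this illustration, and the exact number of rows per chunk is not asserted.

```python
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_chunked

X_large = np.random.RandomState(0).rand(1000, 3)

# Limit passed directly to the function (assumed unit: MiB).
shapes = [D.shape for D in pairwise_distances_chunked(X_large, working_memory=1)]

# Same limit taken from the global configuration; the generator is consumed
# inside the context so that the setting is in effect while chunks are built.
with sklearn.config_context(working_memory=1):
    shapes_cfg = [D.shape for D in pairwise_distances_chunked(X_large)]

print(len(shapes), sum(n_rows for n_rows, _ in shapes))
```

Every chunk spans all 1000 columns; only the number of rows per chunk changes with the memory limit.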