KStwo Distribution

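This is the distribution of the maximum absolute difference between an empirical distribution function, computed from \(n\) samples or observations, and a comparison (or target) cumulative distribution function, which is assumed to be continuous. (The "two" in the name refers to the two-sided difference. ksone is the distribution of the positive differences, \(D_n^+\), so it concerns one-sided differences. kstwobign is the limiting distribution of the normalized maximum absolute difference \(\sqrt{n} D_n\).)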
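The relationship to kstwobign can be checked numerically. The following is a minimal sketch, not part of the original page: the sample size `n` and the evaluation points `z` are arbitrary illustrative choices, and the comparison only shows that the two CDFs are close for large `n`.

```python
import numpy as np
from scipy.stats import kstwo, kstwobign

n = 10_000                               # large sample size (illustrative choice)
z = np.array([0.5, 1.0, 1.5])            # values of the normalized statistic sqrt(n) * D_n

finite_n = kstwo.cdf(z / np.sqrt(n), n)  # Pr(D_n <= z / sqrt(n)) for finite n
limit = kstwobign.cdf(z)                 # limiting Pr(sqrt(n) * D_n <= z)
max_gap = np.max(np.abs(finite_n - limit))
print(finite_n, limit, max_gap)
```

For smaller `n` the gap between the two grows, which is why kstwo rather than kstwobign is the appropriate choice for small samples.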

Writing \(D_n = \sup_t \left|F_{empirical,n}(t)-F_{target}(t)\right|\), kstwo is the distribution of the \(D_n\) values.
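Because the empirical distribution function is a step function, the supremum in \(D_n\) is attained at one side of a jump, so it can be computed from the sorted sample alone. The sketch below uses only the Python standard library; the helper name `ks_statistic` is hypothetical, and the five sample values are the ones shown in the Examples section of this page.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Return (D_n^+, D_n^-, D_n) for `sample` against a continuous `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    # The ECDF jumps from i/n to (i+1)/n at the i-th order statistic (0-based),
    # so it is enough to inspect both sides of every jump.
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus, d_minus, max(d_plus, d_minus)

sample = [-1.59113056, -0.66335147, 0.54791569, 0.78009321, 1.27641365]
d_plus, d_minus, d_n = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
print(d_plus, d_minus, d_n)
```

The three values should agree with the `Dn+`, `Dn-` and `Dn` figures printed in the Examples section.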

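kstwo can also be used with the differences between two empirical distribution functions, for sets of observations with \(m\) and \(n\) samples respectively. Writing \(D_{m,n} = \sup_t \left|F_{1,m}(t)-F_{2,n}(t)\right|\), where \(F_{1,m}\) and \(F_{2,n}\) are the two empirical distribution functions, then \(Pr(D_{m,n} \le x) \approx Pr(D_N \le x)\) under appropriate conditions, where \(N = \frac{mn}{m+n}\) is the effective sample size (rounded to an integer when used as the shape parameter), chosen so that \(\sqrt{N} D_N\) and \(\sqrt{\frac{mn}{m+n}} D_{m,n}\) share the same limiting distribution.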
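A minimal sketch of the two-sample use follows. It assumes the effective sample size is \(mn/(m+n)\) rounded to an integer, which matches the large-sample scaling \(\sqrt{mn/(m+n)}\, D_{m,n}\); the sample sizes, the seed and the variable names are illustrative choices, not taken from the original page.

```python
import numpy as np
from scipy.stats import kstwo, ks_2samp

rng = np.random.default_rng(12345)       # fixed seed, chosen arbitrarily
m, n = 400, 600
a = rng.normal(size=m)                   # first sample
b = rng.normal(size=n)                   # second sample, same distribution

d_mn = ks_2samp(a, b).statistic          # two-sample statistic D_{m,n}
n_eff = int(round(m * n / (m + n)))      # effective sample size (assumption of this sketch)
p_approx = kstwo.sf(d_mn, n_eff)         # approximates Pr(D_{m,n} >= d_mn)
print(d_mn, n_eff, p_approx)
```

The approximation is intended for large samples; for small \(m\) and \(n\) an exact two-sample computation is preferable.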

There is one shape parameter \(n\), a positive integer, and the support is \(x\in\left[0,1\right]\).
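The behaviour at the ends of the support can be inspected directly. This is a small illustrative check, with `n = 5` and the evaluation grid chosen arbitrarily: the CDF should rise monotonically from 0 at \(x = 0\) to 1 at \(x = 1\), and the survival function should be its complement.

```python
import numpy as np
from scipy.stats import kstwo

n = 5                                    # shape parameter: the sample size
x = np.linspace(0.0, 1.0, 11)            # grid covering the support [0, 1]
cdf = kstwo.cdf(x, n)                    # Pr(D_n <= x)
sf = kstwo.sf(x, n)                      # Pr(D_n >= x), same as 1 - cdf for a continuous law
print(np.column_stack([x, cdf, sf]))
```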

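The implementation follows Simard & L'Ecuyer, combining the exact algorithms of Durbin and Pomeranz with the asymptotic estimates of Li-Chien, Pelz and Good to compute the CDF with 5-15 accurate digits.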

Examples

>>> from scipy.stats import kstwo

For a sample of size 5, show the probability that the gap \(D_n\) is at least 0, 0.5 and 1.0.

>>> kstwo.sf([0, 0.5, 1.0], 5)
array([1.   , 0.112, 0.   ])
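The middle value can be sanity-checked by simulation. Because the distribution of \(D_n\) does not depend on the continuous target distribution, the sketch below draws Uniform(0, 1) samples and compares them with the Uniform(0, 1) CDF. It uses only the standard library; the helper name `d_n_uniform`, the seed and the number of trials are arbitrary choices for illustration.

```python
import random

def d_n_uniform(n, rng):
    """Draw n Uniform(0, 1) values and return D_n against the Uniform(0, 1) CDF."""
    xs = sorted(rng.random() for _ in range(n))
    # For Uniform(0, 1) the target CDF at x is x itself.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

rng = random.Random(0)                   # fixed seed, chosen arbitrarily
n, trials, gap = 5, 20_000, 0.5
hits = sum(d_n_uniform(n, rng) >= gap for _ in range(trials))
estimate = hits / trials                 # Monte Carlo estimate of Pr(D_n >= 0.5)
print(estimate)
```

The estimate should land close to the 0.112 reported by `kstwo.sf(0.5, 5)`, up to Monte Carlo error.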

Compare a sample of size 5 drawn from a source N(0.5, 1) distribution against a target N(0, 1) CDF.

>>> import numpy as np
>>> from scipy.stats import norm
>>> n = 5
>>> gendist = norm(0.5, 1)       # Normal distribution, mean 0.5, stddev 1
>>> x = np.sort(gendist.rvs(size=n, random_state=np.random.default_rng()))
>>> x
array([-1.59113056, -0.66335147,  0.54791569,  0.78009321,  1.27641365])
>>> target = norm(0, 1)
>>> cdfs = target.cdf(x)
>>> cdfs
array([0.0557901 , 0.25355274, 0.7081251 , 0.78233199, 0.89909533])
>>> # Construct the Empirical CDF and the K-S statistics (Dn+, Dn-, Dn)
>>> ecdfs = np.arange(n+1, dtype=float)/n
>>> cols = np.column_stack([x, ecdfs[1:], cdfs, cdfs - ecdfs[:n], ecdfs[1:] - cdfs])
>>> np.set_printoptions(precision=3)
>>> cols
array([[-1.591,  0.2  ,  0.056,  0.056,  0.144],
       [-0.663,  0.4  ,  0.254,  0.054,  0.146],
       [ 0.548,  0.6  ,  0.708,  0.308, -0.108],
       [ 0.78 ,  0.8  ,  0.782,  0.182,  0.018],
       [ 1.276,  1.   ,  0.899,  0.099,  0.101]])
>>> gaps = cols[:, -2:]
>>> Dnpm = np.max(gaps, axis=0)
>>> Dn = np.max(Dnpm)
>>> iminus, iplus = np.argmax(gaps, axis=0)
>>> print('Dn- = %f (at x=%.2f)' % (Dnpm[0], x[iminus]))
Dn- = 0.308125 (at x=0.55)
>>> print('Dn+ = %f (at x=%.2f)' % (Dnpm[1], x[iplus]))
Dn+ = 0.146447 (at x=-0.66)
>>> print('Dn  = %f' % (Dn))
Dn  = 0.308125

>>> probs = kstwo.sf(Dn, n)
>>> print(chr(10).join(['For a sample of size %d drawn from a N(0, 1) distribution:' % n,
...      ' Kolmogorov-Smirnov 2-sided n=%d: Prob(Dn >= %f) = %.4f' % (n, Dn, probs)]))
For a sample of size 5 drawn from a N(0, 1) distribution:
 Kolmogorov-Smirnov 2-sided n=5: Prob(Dn >= 0.308125) = 0.6319
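The same probability can be cross-checked against `scipy.stats.kstest`. The sketch below reuses the sample values printed above and assumes that `kstest`, with its default settings, evaluates the exact two-sided distribution for a sample this small; if a given SciPy version chooses a different method, the two p-values may differ slightly.

```python
import numpy as np
from scipy.stats import kstest, kstwo, norm

# The sample printed earlier in this example.
x = np.array([-1.59113056, -0.66335147, 0.54791569, 0.78009321, 1.27641365])

res = kstest(x, norm(0, 1).cdf)                  # one-sample test, two-sided by default
p_from_kstwo = kstwo.sf(res.statistic, len(x))   # Pr(D_n >= observed statistic)
print(res.statistic, res.pvalue, p_from_kstwo)
```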

Plot the Empirical CDF against the target N(0, 1) CDF.

>>> import matplotlib.pyplot as plt
>>> plt.step(np.concatenate([[-3], x]), ecdfs, where='post', label='Empirical CDF')
>>> x3 = np.linspace(-3, 3, 100)
>>> plt.plot(x3, target.cdf(x3), label='CDF for N(0, 1)')
>>> plt.ylim([0, 1]); plt.grid(True); plt.legend();
>>> plt.vlines([x[iminus]], ecdfs[iminus], cdfs[iminus], color='r', linestyle='solid', lw=4)
>>> plt.vlines([x[iplus]], cdfs[iplus], ecdfs[iplus+1], color='m', linestyle='solid', lw=4)
>>> plt.annotate('Dn-', xy=(x[iminus], (ecdfs[iminus]+ cdfs[iminus])/2),
...              xytext=(x[iminus]+1, (ecdfs[iminus]+ cdfs[iminus])/2 - 0.02),
...              arrowprops=dict(facecolor='white', edgecolor='r', shrink=0.05), size=15, color='r');
>>> plt.annotate('Dn+', xy=(x[iplus], (ecdfs[iplus+1]+ cdfs[iplus])/2),
...             xytext=(x[iplus]-2, (ecdfs[iplus+1]+ cdfs[iplus])/2 - 0.02),
...             arrowprops=dict(facecolor='white', edgecolor='m', shrink=0.05), size=15, color='m');
>>> plt.show()

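References

"Kolmogorov-Smirnov test", Wikipedia, https://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
Durbin, J. "The Probability that the Sample Distribution Function Lies Between Two Parallel Straight Lines." Annals of Mathematical Statistics, 39 (1968), 398-411.
Pomeranz, J. "Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487)." Communications of the ACM, 17(12), (1974), 703-704.
Li-Chien, C. "On the Exact Distribution of the Statistics of A. N. Kolmogorov and Their Asymptotic Expansion." Acta Mathematica Sinica, 6, (1956), 55-81.
Pelz, W., Good, I. J. "Approximating the Lower Tail-areas of the Kolmogorov-Smirnov One-sample Statistic." Journal of the Royal Statistical Society, Series B, 38(2), (1976), 152-156.
Simard, R., L'Ecuyer, P. "Computing the Two-Sided Kolmogorov-Smirnov Distribution." Journal of Statistical Software, Vol. 39, (2011), 11.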

Implementation: scipy.stats.kstwo
