>>> from env_helper import info; info()
Page last updated: 2023-07-09 19:06:41
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-10-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

7.7. Pandas Aggregation

Once a rolling, expanding, or ewm object has been created, several methods are available for aggregating the data.
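
As a minimal sketch of the three window objects named above, the following builds each one from a small hand-made Series and calls a single aggregation method on it. The values, `window=2`, and `alpha=0.5` are illustrative assumptions and do not come from this section.

```python
import pandas as pd

# Small deterministic series so the results can be checked by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

rolling_sum = s.rolling(window=2, min_periods=1).sum()  # fixed-size window
expanding_sum = s.expanding(min_periods=1).sum()        # every row seen so far
ewm_mean = s.ewm(alpha=0.5).mean()                      # exponentially weighted

print(rolling_sum.tolist())    # [1.0, 3.0, 5.0, 7.0]
print(expanding_sum.tolist())  # [1.0, 3.0, 6.0, 10.0]
print(ewm_mean.round(4).tolist())
```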

7.7.1. Applying Aggregations on a DataFrame

Let's create a DataFrame and apply an aggregation to it.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('1/1/2019', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>>
>>> print(df)
>>> print("=======================================")
>>> # Rolling window over 3 rows; min_periods=1 yields a value from the first row on
>>> r = df.rolling(window=3, min_periods=1)
>>> print(r)
                   A         B         C         D
2019-01-01  0.409221  0.366230  0.811195 -0.625772
2019-01-02 -0.797305 -0.167884 -0.747666  0.319435
2019-01-03 -0.224666 -0.755982 -0.018736  0.017640
2019-01-04 -0.541295 -0.878131  0.437026  0.268451
2019-01-05 -0.307103 -1.058285  2.348750 -0.399490
2019-01-06  0.448267 -0.487242 -1.544957  1.734672
2019-01-07 -1.226280 -1.205490 -0.555893 -0.950993
2019-01-08  0.018771 -0.388378  1.314139 -0.727108
2019-01-09 -0.722677  0.254524 -0.502587 -0.254569
2019-01-10  1.530029  0.839409  0.587075 -0.306621
=======================================
Rolling [window=3,min_periods=1,center=False,axis=0,method=single]
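
To illustrate what `min_periods=1` changes, here is a small sketch that compares it with leaving `min_periods` unset, in which case an integer window requires a full window of observations. The input values are made up for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

strict = s.rolling(window=3).sum()                  # min_periods defaults to the window size
lenient = s.rolling(window=3, min_periods=1).sum()  # partial windows are allowed

print(strict.tolist())   # [nan, nan, 6.0, 9.0, 12.0]
print(lenient.tolist())  # [1.0, 3.0, 6.0, 9.0, 12.0]
```

With `min_periods=1`, the first rows are aggregated over whatever data is available instead of producing NaN.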

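You can aggregate either by passing a function to the Rolling object built from the whole DataFrame, or by first selecting a column with the standard get-item syntax.

Applying an aggregation on the whole DataFrame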

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('1/1/2000', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>> print(df)
>>>
>>> r = df.rolling(window=3, min_periods=1)
>>> # Aggregate every column; the second table in the output is the rolling sum
>>> print(r.aggregate(np.sum))
                   A         B         C         D
2000-01-01 -1.191295 -2.159337  1.657109 -0.782601
2000-01-02 -0.419242  0.399465  0.381738 -0.212911
2000-01-03  1.052838  0.561114 -1.704994  0.503670
2000-01-04  0.493795  1.457303 -1.945349  0.488218
2000-01-05  0.056389 -0.168218  0.021157  1.903208
2000-01-06 -0.922407 -0.041264 -0.390733  0.184032
2000-01-07  2.167636 -1.502741 -0.854621  0.237260
2000-01-08 -1.574361 -0.982496  1.451461  1.313502
2000-01-09 -0.055895  2.102627  1.120571  0.622188
2000-01-10  0.240397  0.637192  0.429423  0.654151
                   A         B         C         D
2000-01-01 -1.191295 -2.159337  1.657109 -0.782601
2000-01-02 -1.610538 -1.759872  2.038847 -0.995512
2000-01-03 -0.557699 -1.198758  0.333853 -0.491843
2000-01-04  1.127391  2.417882 -3.268604  0.778976
2000-01-05  1.603022  1.850199 -3.629186  2.895096
2000-01-06 -0.372223  1.247821 -2.314925  2.575458
2000-01-07  1.301618 -1.712223 -1.224197  2.324500
2000-01-08 -0.329132 -2.526501  0.206107  1.734793
2000-01-09  0.537380 -0.382610  1.717411  2.172949
2000-01-10 -1.389859  1.757323  3.001455  2.589841
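
The same whole-frame aggregation can also be requested by name. The sketch below assumes a small deterministic frame (not the random data used above) and compares `aggregate('sum')` with the dedicated `sum()` method of the Rolling object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8, dtype=float).reshape(4, 2),
                  index=pd.date_range('2000-01-01', periods=4),
                  columns=['A', 'B'])
r = df.rolling(window=3, min_periods=1)

by_name = r.aggregate('sum')  # aggregation given as a string name
by_method = r.sum()           # dedicated method on the Rolling object

print(by_name)
print(by_name.equals(by_method))
```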

7.7.2. Applying an Aggregation on a Single Column of a DataFrame

Example code:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('1/1/2000', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("====================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Select column 'A' on the Rolling object, then aggregate; the result is a Series
>>> print(r['A'].aggregate(np.sum))
                   A         B         C         D
2000-01-01  0.899614  1.904443  0.620118 -1.196649
2000-01-02 -0.816152  1.095829  0.526991  0.077817
2000-01-03 -0.728594  0.907431  1.122774  1.202263
2000-01-04 -0.339719  1.229053  0.833520 -1.121457
2000-01-05 -0.066104 -0.177040 -1.296439  0.868574
2000-01-06 -1.024873 -0.415325 -2.467596  0.027099
2000-01-07  0.388763 -0.720107  0.586934  0.599713
2000-01-08  1.114513 -0.775295  1.241403 -1.623362
2000-01-09 -0.864455  1.063592  1.433591  0.119835
2000-01-10 -0.304665  2.047228  0.151332 -1.078892
====================================
2000-01-01    0.899614
2000-01-02    0.083462
2000-01-03   -0.645132
2000-01-04   -1.884465
2000-01-05   -1.134418
2000-01-06   -1.430696
2000-01-07   -0.702214
2000-01-08    0.478404
2000-01-09    0.638822
2000-01-10   -0.054606
Freq: D, Name: A, dtype: float64
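
As a quick check of how column selection interacts with the window, this sketch compares selecting the column on the Rolling object with selecting it on the DataFrame before rolling. The data and the use of the string `'sum'` are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [10.0, 20.0, 30.0, 40.0]},
                  index=pd.date_range('2000-01-01', periods=4))

# Select the column on the Rolling object ...
from_rolling = df.rolling(window=3, min_periods=1)['A'].aggregate('sum')
# ... or select the column first and roll over the resulting Series.
from_series = df['A'].rolling(window=3, min_periods=1).aggregate('sum')

print(from_rolling.tolist())  # [1.0, 3.0, 6.0, 9.0]
print(from_rolling.equals(from_series))
```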

Applying an aggregation on multiple columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('1/1/2018', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Select columns 'A' and 'B' with a list; the result is a DataFrame
>>> print(r[['A', 'B']].aggregate(np.sum))
                   A         B         C         D
2018-01-01  2.768425 -0.497260 -0.430518 -0.390839
2018-01-02 -1.017491 -0.478303 -2.269947  2.662471
2018-01-03  2.135067 -0.363699  1.756624  0.529155
2018-01-04  1.865168 -1.485002  1.045523  0.323405
2018-01-05  1.021525  1.022261 -0.209354 -1.515144
2018-01-06  0.655466  0.436306  1.572498 -0.631222
2018-01-07  1.756663 -0.963968 -0.040532  0.142463
2018-01-08 -0.526707  1.167008  0.881830 -0.394477
2018-01-09 -2.439931  0.353109  0.653103 -1.738840
2018-01-10  0.554504  1.012518  1.751990  0.428687
==========================================
                   A         B
2018-01-01  2.768425 -0.497260
2018-01-02  1.750934 -0.975563
2018-01-03  3.886001 -1.339262
2018-01-04  2.982743 -2.327003
2018-01-05  5.021760 -0.826440
2018-01-06  3.542159 -0.026435
2018-01-07  3.433654  0.494599
2018-01-08  1.885422  0.639346
2018-01-09 -1.209975  0.556149
2018-01-10 -2.412134  2.532636
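
The shape of the result depends on how the columns are selected. A minimal sketch, using a made-up three-column frame, of the difference between a string key and a list key:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [4.0, 5.0, 6.0],
                   'C': [7.0, 8.0, 9.0]})
r = df.rolling(window=2, min_periods=1)

as_series = r['A'].aggregate('sum')        # string key -> Series
as_frame = r[['A']].aggregate('sum')       # one-element list -> DataFrame
two_cols = r[['A', 'B']].aggregate('sum')  # column 'C' is left out

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(two_cols.columns))    # ['A', 'B']
```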

Applying multiple functions on a single column of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('2019/01/01', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>> print(df)
>>>
>>> print("==========================================")
>>>
>>> r = df.rolling(window=3, min_periods=1)
>>> # A list of functions produces one result column per function
>>> print(r['A'].aggregate([np.sum, np.mean]))
                   A         B         C         D
2019-01-01  1.289568 -0.150813  0.819723  0.333205
2019-01-02 -0.777984 -1.418614  0.166166 -0.173467
2019-01-03 -1.478802 -0.522707  1.214267 -0.376789
2019-01-04  0.213583 -0.434050 -0.669066  0.875161
2019-01-05 -0.505610  0.403382  0.449065  1.161292
2019-01-06  1.255886  0.138499 -0.669481  0.009081
2019-01-07  1.504702  0.710545  0.102133  0.123100
2019-01-08  0.616911 -0.559758  0.399398  0.690950
2019-01-09  0.781351  0.925963 -0.634495  0.734642
2019-01-10  0.066432  0.540052  1.074223  0.479327
==========================================
                 sum      mean
2019-01-01  1.289568  1.289568
2019-01-02  0.511584  0.255792
2019-01-03 -0.967218 -0.322406
2019-01-04 -2.043203 -0.681068
2019-01-05 -1.770829 -0.590276
2019-01-06  0.963859  0.321286
2019-01-07  2.254979  0.751660
2019-01-08  3.377500  1.125833
2019-01-09  2.902964  0.967655
2019-01-10  1.464693  0.488231
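
The result columns are labelled with the function names. The sketch below shows this on a standalone Series with three aggregations given as strings; the Series, the window size, and the extra `'max'` are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], name='A')
r = s.rolling(window=2, min_periods=1)

# One output column per requested function, in the order given.
stats = r.aggregate(['sum', 'mean', 'max'])

print(list(stats.columns))    # ['sum', 'mean', 'max']
print(stats['max'].tolist())  # [1.0, 2.0, 3.0, 4.0]
```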

7.7.3. Applying Multiple Functions on Multiple Columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
>>>       index = pd.date_range('2020/01/01', periods=10),
>>>       columns = ['A', 'B', 'C', 'D'])
>>>
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Each selected column gets one sub-column per function (two-level column header)
>>> print(r[['A', 'B']].aggregate([np.sum, np.mean]))
                   A         B         C         D
2020-01-01 -0.738247 -1.631835  1.505123  1.898867
2020-01-02 -1.382268  1.860687 -0.500002  0.652996
2020-01-03 -0.001419  1.867492  0.789780 -1.060373
2020-01-04 -0.926137 -1.108752 -1.351586 -1.110365
2020-01-05 -0.834301 -1.775205  0.517520  0.620481
2020-01-06 -0.080035 -0.110765  0.078427 -1.494267
2020-01-07 -1.323718 -1.042942  0.913473  0.627442
2020-01-08 -0.830897 -0.415440 -0.119706 -1.384088
2020-01-09  0.701403  0.205893  0.331462 -0.102985
2020-01-10  0.802444  0.592147  0.417783  0.742442
==========================================
                   A                   B
                 sum      mean       sum      mean
2020-01-01 -0.738247 -0.738247 -1.631835 -1.631835
2020-01-02 -2.120515 -1.060257  0.228851  0.114426
2020-01-03 -2.121934 -0.707311  2.096343  0.698781
2020-01-04 -2.309824 -0.769941  2.619426  0.873142
2020-01-05 -1.761857 -0.587286 -1.016466 -0.338822
2020-01-06 -1.840473 -0.613491 -2.994722 -0.998241
2020-01-07 -2.238054 -0.746018 -2.928912 -0.976304
2020-01-08 -2.234650 -0.744883 -1.569147 -0.523049
2020-01-09 -1.453212 -0.484404 -1.252490 -0.417497
2020-01-10  0.672950  0.224317  0.382600  0.127533
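
The two-level header in this output is a column MultiIndex. As a hedged sketch on made-up data, one way to pick a single (column, function) pair or to flatten the header is shown here; the `'A_sum'` naming scheme is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
result = df.rolling(window=2, min_periods=1)[['A', 'B']].aggregate(['sum', 'mean'])

# Pick one (column, function) pair with a tuple key.
a_mean = result[('A', 'mean')]
print(a_mean.tolist())  # [1.0, 1.5, 2.5]

# Flatten the two header levels into single labels such as 'A_sum'.
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in result.columns]
print(list(flat.columns))  # ['A_sum', 'A_mean', 'B_sum', 'B_mean']
```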

Applying different functions to different columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(3, 4),
>>>       index = pd.date_range('2020/01/01', periods=3),
>>>       columns = ['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Map column name to function; columns not listed ('C', 'D') are omitted from the result
>>> print(r.aggregate({'A': np.sum, 'B': np.mean}))
                   A         B         C         D
2020-01-01 -0.760202 -0.315296 -1.161479  0.742776
2020-01-02  0.536496 -2.534675  0.472423 -1.866547
2020-01-03  1.186027 -0.767979 -1.401393 -0.319669
==========================================
                   A         B
2020-01-01 -0.760202 -0.315296
2020-01-02 -0.223706 -1.424986
2020-01-03  0.962321 -1.205983
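
The dictionary form can also map a column to a list of functions. The following is a sketch under assumed data, using string function names rather than the NumPy callables shown above. Each dictionary value is a list, so the result has a two-level column header.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [3.0, 1.0, 2.0],
                   'C': [0.0, 0.0, 0.0]},
                  index=pd.date_range('2020-01-01', periods=3))
r = df.rolling(window=3, min_periods=1)

# 'A' gets two aggregations, 'B' gets one; 'C' is not listed and is dropped.
result = r.aggregate({'A': ['sum', 'mean'], 'B': ['max']})

print(result)
print(list(result.columns))  # [('A', 'sum'), ('A', 'mean'), ('B', 'max')]
```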