6.7. Pandas Aggregation

Once a rolling, expanding, or ewm object has been created, several methods are available for aggregating the data it covers.
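
As a minimal sketch of the three window objects named above, the snippet below builds each one on a small hand-made Series and applies one aggregation to it. The Series values, the window size of 2, and alpha=0.5 are illustrative assumptions, not values taken from this tutorial.

```python
import pandas as pd

# Small, hand-made data so the results can be checked by eye (assumed values).
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: a fixed-size window that slides over the data.
rolling_sum = s.rolling(window=2, min_periods=1).sum()

# Expanding: the window grows from the first row up to the current row.
expanding_sum = s.expanding().sum()

# ewm: exponentially weighted window; with adjust=False each result is
# alpha * current value + (1 - alpha) * previous result.
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(rolling_sum.tolist())    # [1.0, 3.0, 5.0, 7.0]
print(expanding_sum.tolist())  # [1.0, 3.0, 6.0, 10.0]
print(ewm_mean.tolist())       # [1.0, 1.5, 2.25, 3.125]
```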

Applying Aggregations to a DataFrame

Let us create a DataFrame and apply aggregations to it.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # 10 days x 4 columns of standard-normal noise, indexed by date
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('1/1/2019', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>>
>>> print(df)
>>> print("=======================================")
>>> # 3-row rolling window; min_periods=1 lets the first rows use partial windows
>>> r = df.rolling(window=3, min_periods=1)
>>> print(r)
                   A         B         C         D
2019-01-01 -0.297161  0.083323 -0.974163 -2.315540
2019-01-02 -1.109668 -0.187732  1.702003 -0.634991
2019-01-03  0.028106  4.080762 -0.353134  0.529624
2019-01-04 -0.766226  0.140721 -1.683656 -0.177402
2019-01-05  0.865192  0.400481 -1.123899  1.497569
2019-01-06  1.202430  0.149648 -0.443906 -0.849716
2019-01-07 -0.222641  1.438695  0.559307  0.180533
2019-01-08 -0.007848  1.071640 -1.223527 -0.313666
2019-01-09 -0.202040 -0.167778 -0.065897  1.358541
2019-01-10 -0.328471  0.260195 -0.757174  0.544447
=======================================
Rolling [window=3,min_periods=1,center=False,axis=0]
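
The object printed above only describes the window; no values are computed until an aggregation method is called on it. The sketch below, on an assumed four-element Series, shows what min_periods=1 changes: without it, a fixed window of 3 yields NaN until three observations are available.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # assumed toy data

# For a fixed integer window, min_periods defaults to the window size,
# so the first two results are NaN.
strict = s.rolling(window=3).sum()

# min_periods=1 lets the partial windows at the start produce a value.
relaxed = s.rolling(window=3, min_periods=1).sum()

print(strict.tolist())   # [nan, nan, 6.0, 9.0]
print(relaxed.tolist())  # [1.0, 3.0, 6.0, 9.0]
```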

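An aggregation can be applied by passing a function to the whole DataFrame, or to a column selected with the standard item-access (square bracket) syntax.

Applying an aggregation to the whole DataFrame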

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('1/1/2000', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>> print(df)
>>>
>>> r = df.rolling(window=3, min_periods=1)
>>> # Rolling sum over every column; printed below the original data
>>> print(r.aggregate(np.sum))
                   A         B         C         D
2000-01-01 -1.057510 -1.319452  1.438169  0.354097
2000-01-02 -1.665754  0.732910  0.895425 -0.170069
2000-01-03 -0.425298 -1.728492  0.614785 -1.327529
2000-01-04  1.641398  1.054154 -0.307831  1.634812
2000-01-05  0.692571 -1.762662  0.959677 -0.404613
2000-01-06  0.601088  0.391329 -0.338605  0.366265
2000-01-07 -2.333054 -1.024928 -2.117509 -2.236876
2000-01-08  0.441563  0.084132 -0.210743 -0.425921
2000-01-09  0.291801 -0.401348 -0.854690 -0.492878
2000-01-10 -1.457577  0.634589  0.853480  0.617184
                   A         B         C         D
2000-01-01 -1.057510 -1.319452  1.438169  0.354097
2000-01-02 -2.723264 -0.586543  2.333593  0.184028
2000-01-03 -3.148562 -2.315034  2.948378 -1.143501
2000-01-04 -0.449654  0.058572  1.202379  0.137214
2000-01-05  1.908671 -2.436999  1.266631 -0.097330
2000-01-06  2.935057 -0.317179  0.313241  1.596464
2000-01-07 -1.039395 -2.396261 -1.496437 -2.275224
2000-01-08 -1.290403 -0.549467 -2.666857 -2.296533
2000-01-09 -1.599690 -1.342143 -3.182942 -3.155675
2000-01-10 -0.724213  0.317374 -0.211953 -0.301615
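
Because the random data above changes on every run, the rolling sums are hard to verify by eye. As a hedged sketch on assumed fixed values, the following checks that aggregate with a sum over the whole DataFrame matches calling sum() on the rolling object directly. The string name "sum" is used here in place of np.sum; aggregate accepts function names given as strings as well as callables.

```python
import pandas as pd

# Fixed toy values (an assumption for illustration) instead of random data.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2000-01-01", periods=4),
)

r = df.rolling(window=3, min_periods=1)

via_aggregate = r.aggregate("sum")  # aggregation named by string
via_method = r.sum()                # dedicated method on the rolling object

print(via_aggregate)
print(via_aggregate.equals(via_method))
```

With these inputs, column A should come out as 1, 3, 6, 9: each row is the sum of itself and up to two preceding rows.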

Applying an aggregation to a single column of a DataFrame

Example code:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('1/1/2000', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("====================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Select column A from the rolling object, then take the rolling sum
>>> print(r['A'].aggregate(np.sum))
                   A         B         C         D
2000-01-01 -0.143118 -1.240426  0.436630 -0.444383
2000-01-02  0.628114 -1.783910 -0.175060 -0.343566
2000-01-03 -1.109600  1.086021  1.626791  0.800103
2000-01-04  0.793569 -0.472492  1.034773 -0.601865
2000-01-05  0.191128 -2.123454 -0.177607 -0.947208
2000-01-06 -0.218268  1.726078  0.074371 -0.828348
2000-01-07 -1.723376  1.690724 -0.086129  1.057527
2000-01-08 -0.240308  0.695771 -0.076730  1.058622
2000-01-09  0.519189 -0.543348 -0.912696 -1.193401
2000-01-10  0.381399  1.135461  1.238793  1.627239
====================================
2000-01-01   -0.143118
2000-01-02    0.484995
2000-01-03   -0.624605
2000-01-04    0.312082
2000-01-05   -0.124903
2000-01-06    0.766429
2000-01-07   -1.750516
2000-01-08   -2.181952
2000-01-09   -1.444495
2000-01-10    0.660280
Freq: D, Name: A, dtype: float64
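
Selecting one column from the rolling object yields a Series that keeps the column name, as the "Name: A" line in the output above indicates. A minimal sketch on assumed fixed data, which also compares the result with selecting the column before the window is created:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2000-01-01", periods=4),
)  # assumed toy data

r = df.rolling(window=3, min_periods=1)

# Select the column on the rolling object, then aggregate.
a_sum = r["A"].aggregate("sum")

# Same computation, selecting the column before building the window.
a_sum_alt = df["A"].rolling(window=3, min_periods=1).sum()

print(type(a_sum).__name__, a_sum.name)  # Series A
print(a_sum.tolist())                    # [1.0, 3.0, 6.0, 9.0]
```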

Applying an aggregation to multiple columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('1/1/2018', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # A list of column names selects several columns at once
>>> print(r[['A', 'B']].aggregate(np.sum))
                   A         B         C         D
2018-01-01  1.161324  0.882373  0.525064 -0.004068
2018-01-02  0.565859  0.457815  0.879255 -0.750593
2018-01-03 -1.139034 -0.927946  0.186392  0.015014
2018-01-04 -1.227686  0.395136 -0.064800 -0.157918
2018-01-05  1.086757 -0.214619  0.134422  0.504477
2018-01-06 -0.845733  0.557141  0.312587 -1.226506
2018-01-07  1.228012 -0.210966 -1.160787 -0.724112
2018-01-08  0.213551  1.456926  1.207446 -1.483803
2018-01-09 -0.398389  0.798960 -0.194087  0.175509
2018-01-10  0.168422 -0.723523 -1.462455 -0.605016
==========================================
                   A         B
2018-01-01  1.161324  0.882373
2018-01-02  1.727184  1.340188
2018-01-03  0.588150  0.412242
2018-01-04 -1.800861 -0.074996
2018-01-05 -1.279963 -0.747429
2018-01-06 -0.986662  0.737658
2018-01-07  1.469036  0.131556
2018-01-08  0.595829  1.803101
2018-01-09  1.043173  2.044921
2018-01-10 -0.016416  1.532363

Applying multiple functions to a single column of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('2019/01/01', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>> print(df)
>>>
>>> print("==========================================")
>>>
>>> r = df.rolling(window=3, min_periods=1)
>>> # A list of functions produces one output column per function
>>> print(r['A'].aggregate([np.sum, np.mean]))
                   A         B         C         D
2019-01-01 -1.417711  0.509442 -1.495424  1.027103
2019-01-02  0.962894  0.455878  0.785055  0.163292
2019-01-03  1.275789  1.077626  0.092726 -0.273991
2019-01-04 -0.574358  1.312393  0.138201  0.017391
2019-01-05  0.694693  0.406216 -0.589011 -1.944672
2019-01-06  0.939143 -0.987033 -2.131934  0.867440
2019-01-07 -1.286568 -1.014230  0.419372 -0.247521
2019-01-08 -0.252256 -0.431310 -1.350533  0.463370
2019-01-09  0.051010  0.063909  0.029560 -0.139782
2019-01-10  1.941964 -1.049217  1.630443 -1.600684
==========================================
                 sum      mean
2019-01-01 -1.417711 -1.417711
2019-01-02 -0.454818 -0.227409
2019-01-03  0.820971  0.273657
2019-01-04  1.664324  0.554775
2019-01-05  1.396123  0.465374
2019-01-06  1.059477  0.353159
2019-01-07  0.347268  0.115756
2019-01-08 -0.599681 -0.199894
2019-01-09 -1.487814 -0.495938
2019-01-10  1.740718  0.580239
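
When a list of functions is passed, the result has one column per function, labelled with the function name, as in the sum and mean columns above. The sketch below reproduces that shape on assumed fixed data, using string names for the functions.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2019-01-01", periods=4),
)  # assumed toy data

r = df.rolling(window=3, min_periods=1)

# A list of functions gives a DataFrame with one column per function.
stats = r["A"].aggregate(["sum", "mean"])

print(list(stats.columns))     # ['sum', 'mean']
print(stats["sum"].tolist())   # [1.0, 3.0, 6.0, 9.0]
print(stats["mean"].tolist())  # [1.0, 1.5, 2.0, 3.0]
```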

Applying multiple functions to multiple columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10, 4),
...       index=pd.date_range('2020/01/01', periods=10),
...       columns=['A', 'B', 'C', 'D'])
>>>
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # Every listed function is applied to every selected column
>>> print(r[['A', 'B']].aggregate([np.sum, np.mean]))
                   A         B         C         D
2020-01-01 -0.690567  0.793077  1.584981 -0.359210
2020-01-02  1.213710 -1.628300 -1.503879  1.389749
2020-01-03 -1.522211  0.376837  2.072661 -2.662566
2020-01-04 -0.974572 -0.190313  0.746054 -0.062266
2020-01-05  0.893743  0.771649  0.100636  0.857923
2020-01-06 -1.924732 -0.263355 -0.358141 -0.361869
2020-01-07 -1.387412  0.106727  1.869404 -1.408780
2020-01-08 -1.762411  0.945474 -1.610881 -0.181529
2020-01-09 -0.640737  1.468332 -0.735777  1.154540
2020-01-10  0.353086 -0.107727 -0.514243  0.787290
==========================================
                   A                   B
                 sum      mean       sum      mean
2020-01-01 -0.690567 -0.690567  0.793077  0.793077
2020-01-02  0.523144  0.261572 -0.835223 -0.417611
2020-01-03 -0.999068 -0.333023 -0.458386 -0.152795
2020-01-04 -1.283074 -0.427691 -1.441775 -0.480592
2020-01-05 -1.603041 -0.534347  0.958174  0.319391
2020-01-06 -2.005562 -0.668521  0.317981  0.105994
2020-01-07 -2.418401 -0.806134  0.615021  0.205007
2020-01-08 -5.074555 -1.691518  0.788846  0.262949
2020-01-09 -3.790560 -1.263520  2.520533  0.840178
2020-01-10 -2.050062 -0.683354  2.306079  0.768693
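
The result above has two-level column labels: the original column name on top and the function name below it. As a sketch on assumed fixed data, individual pieces can be pulled out with a tuple key or by selecting the top level first:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2020-01-01", periods=4),
)  # assumed toy data

r = df.rolling(window=3, min_periods=1)
result = r[["A", "B"]].aggregate(["sum", "mean"])

# Columns are a MultiIndex of (column name, function name) pairs.
print(result.columns.tolist())
# [('A', 'sum'), ('A', 'mean'), ('B', 'sum'), ('B', 'mean')]

a_sum = result[("A", "sum")]  # one Series, addressed by tuple
b_stats = result["B"]         # DataFrame with 'sum' and 'mean' columns

print(a_sum.tolist())            # [1.0, 3.0, 6.0, 9.0]
print(b_stats["mean"].tolist())  # [10.0, 15.0, 20.0, 30.0]
```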

Applying different functions to different columns of a DataFrame

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(3, 4),
...       index=pd.date_range('2020/01/01', periods=3),
...       columns=['A', 'B', 'C', 'D'])
>>> print(df)
>>> print("==========================================")
>>> r = df.rolling(window=3, min_periods=1)
>>> # The dictionary maps each column name to the function applied to it
>>> print(r.aggregate({'A': np.sum, 'B': np.mean}))
                   A         B         C         D
2020-01-01 -1.015361 -2.742839 -0.223676 -1.516577
2020-01-02 -0.769286  1.455454 -1.774750  0.454370
2020-01-03 -0.139466  0.166692  0.272592 -0.273266
==========================================
                   A         B
2020-01-01 -1.015361 -2.742839
2020-01-02 -1.784648 -0.643692
2020-01-03 -1.924113 -0.373564
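
With a dictionary, only the columns named as keys appear in the result, which is why C and D are absent from the output above. A minimal sketch on assumed fixed data, with a column C added only to show that it is left out:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0],
        "B": [10.0, 20.0, 30.0],
        "C": [100.0, 200.0, 300.0],
    },
    index=pd.date_range("2020-01-01", periods=3),
)  # assumed toy data

r = df.rolling(window=3, min_periods=1)

# Rolling sum for A, rolling mean for B; C is not listed, so it is dropped.
result = r.aggregate({"A": "sum", "B": "mean"})

print(list(result.columns))  # ['A', 'B']
print(result["A"].tolist())  # [1.0, 3.0, 6.0]
print(result["B"].tolist())  # [10.0, 15.0, 20.0]
```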