>>> from env_helper import info; info()
Page last updated: 2024-01-17 14:51:34
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

7.9. Pandas GroupBy

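Any groupby operation involves one or more of the following steps on the original object:

Splitting the object into groups
Applying a function to each group
Combining the results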

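In many cases we split the data into sets and apply a function to each subset. The apply step can do any of the following:

Aggregation - compute a summary statistic for each group
Transformation - perform a group-specific operation
Filtration - discard groups that fail some condition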
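
As a rough illustration of how the three kinds of apply step differ, the sketch below compares the size of each result with the size of the input. It reuses the IPL sample data from this section; the variable names are chosen here for illustration only.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

points = df.groupby('Team')['Points']

aggregated = points.sum()                # aggregation: one value per group
transformed = points.transform('mean')   # transformation: one value per original row
filtered = df.groupby('Team').filter(lambda g: len(g) >= 3)  # filtration: a subset of rows

print(len(df), len(aggregated), len(transformed), len(filtered))
# prints: 12 5 12 7
```

Aggregation shrinks the data to one entry per group, transformation keeps the original length, and filtration keeps only the rows of the groups that pass the test.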

Let's create a DataFrame object and perform each of these operations on it:

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
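
To make the split, apply, and combine steps concrete, here is a minimal sketch that performs them by hand on the DataFrame above and then checks the result against groupby. This is not how pandas implements grouping internally; the names pieces, totals, and manual are illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Split: one sub-DataFrame per distinct Year value.
pieces = {year: df[df['Year'] == year] for year in sorted(df['Year'].unique())}

# Apply: reduce each piece to a single number.
totals = {year: piece['Points'].sum() for year, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name='Points')

# The same three steps in a single groupby expression.
builtin = df.groupby('Year')['Points'].sum()

print(manual.tolist())   # [3181, 3078, 1450, 1478]
print(builtin.tolist())  # [3181, 3078, 1450, 1478]
```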

7.9.1. Splitting data into groups

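A pandas object can be split along either of its axes. There are several ways to specify the split, for example:

obj.groupby('key')
obj.groupby(['key1', 'key2'])
obj.groupby(key, axis=1)

Note that the axis=1 form is deprecated as of pandas 2.1.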

Now let's see how grouping applies to a DataFrame object.

Example

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df.groupby('Team'))
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe30915ab10>

Viewing the groups

>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df.groupby('Team').groups)
{'Devils': [2, 3], 'Kings': [4, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10], 'kings': [5]}
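
Group keys are compared exactly, which is why 'kings' and 'Kings' end up in separate groups above. If the lowercase spelling is just a data-entry inconsistency (an assumption; the source does not say), one possible fix is to group by a normalized copy of the column instead of the column name. The sketch below passes a Series as the grouping key; the name normalized is illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Title-case the team names so that 'kings' and 'Kings' share one key.
# The original column in df is left unchanged.
normalized = df['Team'].str.title()

# groupby accepts a Series aligned with df as the grouping key.
sizes = df.groupby(normalized).size()
print(sizes.to_dict())
# {'Devils': 2, 'Kings': 4, 'Riders': 4, 'Royals': 2}
```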

Example

Grouping by multiple columns:

>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> print (df.groupby(['Team','Year']).groups)
{('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], ('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], ('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], ('Royals', 2014): [9], ('Royals', 2015): [10], ('kings', 2015): [5]}
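
When grouping by several columns, each group key is a tuple, and an aggregated result is indexed by a two-level MultiIndex. The following sketch shows two ways to look up values in such a result; the name by_team_year is illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

by_team_year = df.groupby(['Team', 'Year'])['Points'].sum()

# A full (Team, Year) tuple selects a single value.
riders_2016 = by_team_year.loc[('Riders', 2016)]
print(riders_2016)  # 694

# A partial key on the first level selects every year for that team.
kings_by_year = by_team_year.loc['Kings']
print(kings_by_year.to_dict())  # {2014: 741, 2016: 756, 2017: 788}
```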

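7.9.2. Iterating through groups

A groupby object can be iterated over, much like the result of itertools.groupby. Each iteration yields the group name together with that group's rows.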

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>>
>>> for name,group in grouped:
>>>     print (name)
>>>     print (group)
2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690

By default, the name yielded for each group is the key value that defines it (here, the year).
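
Iteration is handy when each group needs custom processing. As a small sketch, the loop below records the team with the most points in each year; the dictionary name best is illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

best = {}
for year, group in df.groupby('Year'):
    # idxmax returns the index label of the row with the highest Points.
    top_row = group.loc[group['Points'].idxmax()]
    best[year] = top_row['Team']

print(best)
# {2014: 'Riders', 2015: 'kings', 2016: 'Kings', 2017: 'Kings'}
```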

7.9.3. Selecting a group

The get_group() method selects a single group by its key, as in the following example:

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> print (grouped.get_group(2014))
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
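
For a single key, get_group() returns the same rows as a boolean-mask selection on the grouping column, and it raises KeyError when no group has the requested key. The sketch below checks both behaviours; the year 2020 is simply a value that does not occur in the sample data.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

# Compare get_group with an ordinary boolean-mask selection.
same = grouped.get_group(2014).equals(df[df['Year'] == 2014])
print(same)

# Asking for a key that does not exist raises KeyError.
try:
    grouped.get_group(2020)
    missing = False
except KeyError:
    missing = True
print(missing)
```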

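Aggregation

An aggregation function returns a single aggregated value for each group. Once the groupby object has been created, several aggregation operations can be performed on the grouped data.

A common approach is to aggregate through the aggregate method or its equivalent shorthand, agg: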

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> print (grouped['Points'].agg(np.mean))
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
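
Besides a NumPy function, agg also accepts the name of a built-in reduction as a string, and the common reductions are available directly as methods on the groupby object. The sketch below shows that these two spellings give the same per-year means as the output above.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

by_name = grouped['Points'].agg('mean')   # reduction given by name
by_method = grouped['Points'].mean()      # reduction called as a method

print(by_name.tolist())    # [795.25, 769.5, 725.0, 739.0]
print(by_method.tolist())  # [795.25, 769.5, 725.0, 739.0]
```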

Another way to see the size of each group is to apply NumPy's size function through agg:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> grouped = df.groupby('Team')
>>> print (grouped.agg(np.size))
        Rank  Year  Points
Team
Devils     2     2       2
Kings      3     3       3
Riders     4     4       4
Royals     2     2       2
kings      1     1       1
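
The table above repeats the same count in every column. If only the group sizes are needed, the groupby object's own size() method returns them once, as a single Series. A minimal sketch:

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# size() counts the rows in each group and returns one Series indexed by Team.
sizes = df.groupby('Team').size()
print(sizes.to_dict())
# {'Devils': 2, 'Kings': 3, 'Riders': 4, 'Royals': 2, 'kings': 1}
```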

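7.9.4. Applying multiple aggregation functions at once

With a grouped Series, you can also pass a list or dict of functions to aggregate with, which produces a DataFrame as output: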

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> agg = grouped['Points'].agg([np.sum, np.mean, np.std])
>>> print (agg)
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
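
When different columns need different reductions, pandas also supports named aggregation, where each keyword names an output column and maps it to an (input column, function) pair. The sketch below is one possible summary; the output column names total_points, best_rank, and seasons are made up for this example.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

summary = df.groupby('Team').agg(
    total_points=('Points', 'sum'),   # sum of Points per team
    best_rank=('Rank', 'min'),        # lowest (best) rank reached
    seasons=('Year', 'nunique'),      # number of distinct years played
)
print(summary)
```

Each keyword becomes a column of the result, so the output has one row per team and the three columns named above.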

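7.9.5. Transformation

A transformation on a group or column returns an object with the same index as the object being grouped. The function passed to transform should therefore return a result of the same size as the group chunk it receives.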

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> score = lambda x: (x - x.mean()) / x.std()*10
>>> print (grouped.transform(score))
         Rank       Year     Points
0  -15.000000 -11.618950  12.843272
1    5.000000  -3.872983   3.020286
2   -7.071068  -7.071068   7.071068
3    7.071068   7.071068  -7.071068
4   11.547005 -10.910895  -8.608621
5         NaN        NaN        NaN
6   -5.773503   2.182179  -2.360428
7   -5.773503   8.728716  10.969049
8    5.000000   3.872983  -7.705963
9    7.071068  -7.071068  -7.071068
10  -7.071068   7.071068   7.071068
11   5.000000  11.618950  -8.157595
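
Row 5 is all NaN because 'kings' is a group with a single row, and the sample standard deviation of one value is undefined, so the division in score yields NaN. The sketch below shows this on the Points column and also shows transform broadcasting a group statistic back to every row; the names team_mean, team_std, and diff are illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

points = df.groupby('Team')['Points']

# Each row receives the mean of its own team, so the result has 12 entries.
team_mean = points.transform('mean')
diff = df['Points'] - team_mean

# The one-row 'kings' group has no defined sample standard deviation.
team_std = points.transform('std')

print(len(team_mean))               # 12
print(team_mean.loc[5])             # 812.0
print(pd.isna(team_std.loc[5]))     # True
```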

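7.9.6. Filtration

Filtration tests each group against a defined criterion and returns the subset of the data whose groups pass. The filter() function is used for this.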

>>> import pandas as pd
>>> import numpy as np
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
>>>
>>> print (filtered)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
11  Riders     2  2017     690

The filter condition above returns the teams that have participated in the IPL three or more times.
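
The same selection can also be written as a boolean mask built with transform, and the criterion passed to filter() can test group values rather than group size. The sketch below shows both; the cutoff of 765 points is an arbitrary value chosen for illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Teams with at least three rows, expressed as a row-wise boolean mask.
# 'count' tallies non-null values; nothing is missing here, so it equals the group size.
mask = df.groupby('Team')['Points'].transform('count') >= 3
print(df[mask].index.tolist())
# [0, 1, 4, 6, 7, 8, 11]

# A criterion on group values: keep teams whose average Points exceed 765.
high_avg = df.groupby('Team').filter(lambda g: g['Points'].mean() > 765)
print(high_avg['Team'].unique().tolist())
# ['Devils', 'kings']
```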