>>> from env_helper import info; info()
Page last updated: 2024-01-17 14:51:34
Runtime environment:
Linux distribution: Debian GNU/Linux 12 (bookworm)
OS kernel: Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
Python version: 3.11.2
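7.9. Pandas GroupBy
Any groupby operation involves one or more of the following operations on the original object:
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some function to each subset. The apply step can do any of the following:
Aggregation: computing a summary statistic
Transformation: performing some group-specific operation
Filtration: discarding data under some condition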
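To make the three steps concrete, the following minimal sketch performs split, apply, and combine by hand on a small made-up frame and compares the result with a single groupby call. The variable names (pieces, means, manual, auto) are illustrative only and do not come from the tutorial.

```python
import pandas as pd

# A small illustrative frame (a subset of the tutorial's columns).
df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders', 'Devils'],
    'Points': [876, 863, 789, 673],
})

# Split: one sub-DataFrame per distinct Team value.
pieces = {team: df[df['Team'] == team] for team in df['Team'].unique()}

# Apply: compute the mean of Points within each piece.
means = {team: piece['Points'].mean() for team, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name='Points').sort_index()

# The same computation expressed as a single groupby call.
auto = df.groupby('Team')['Points'].mean()

print(manual.to_dict() == auto.to_dict())
```

The manual version is only meant to show what groupby does conceptually; in practice the single call is shorter and faster.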
Let us create a DataFrame object and perform all of these operations on it:
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print(df)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
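7.9.1. Splitting Data into Groups
pandas objects can be split on any of their axes. There are multiple ways to split an object, such as:
obj.groupby('key')
obj.groupby(['key1', 'key2'])
obj.groupby(key, axis=1)  (deprecated since pandas 2.1; group the transposed object instead)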
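Besides column names, the key passed to groupby can also be a Series of the same length as the object. The sketch below, on a small made-up frame, contrasts grouping by the raw Team column with grouping by its lowercased values, which merges spellings such as 'Kings' and 'kings'. The names by_name and by_lower are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Kings', 'kings', 'Kings', 'Riders'],
    'Points': [741, 812, 756, 876],
})

# Group by a column name: 'Kings' and 'kings' are distinct keys.
by_name = df.groupby('Team')['Points'].sum()

# Group by a derived Series of the same length: the lowercased team name.
by_lower = df.groupby(df['Team'].str.lower())['Points'].sum()

print(by_name)
print(by_lower)
```

Whether the two spellings should count as one team is a data-cleaning decision; the tutorial's own examples keep them separate.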
Now let us see how grouping applies to a DataFrame object.
Example
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print(df.groupby('Team'))
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe30915ab10>
To view the groups, use the groups attribute, which maps each group key to the row labels in that group:
>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print(df.groupby('Team').groups)
{'Devils': [2, 3], 'Kings': [4, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10], 'kings': [5]}
Example
Grouping by multiple columns:
>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> print(df.groupby(['Team', 'Year']).groups)
{('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], ('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], ('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], ('Royals', 2014): [9], ('Royals', 2015): [10], ('kings', 2015): [5]}
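7.9.2. Iterating through Groups
With a groupby object in hand, you can iterate over it much as you would over the result of itertools.groupby: each iteration yields a (name, group) pair.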
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>>
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
By default, the name yielded for each group is the value of its grouping key (here, the year).
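When grouping by more than one column, the name yielded during iteration is a tuple with one entry per key. The following sketch uses a small made-up frame and collects the names in a list called names, which is an illustrative variable rather than part of the tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils'],
    'Year': [2014, 2014, 2015],
    'Points': [876, 789, 673],
})

names = []
# Each iteration yields (name, group); with two keys, name is a 2-tuple.
for name, group in df.groupby(['Team', 'Year']):
    names.append(name)
    print(name, len(group))
```

Groups are visited in sorted key order by default, so the Devils group comes before the Riders group here.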
7.9.3. Selecting a Group
Use the get_group() method to select a single group. See the following example:
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> print(grouped.get_group(2014))
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
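Two details of get_group() are worth a short sketch: with several grouping keys the group is addressed by a tuple, and asking for a key that does not exist raises KeyError. The frame and the names riders_2014 and missing below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils'],
    'Year': [2014, 2014, 2015],
    'Points': [876, 789, 673],
})

# With several grouping keys, a group is selected by a tuple of key values.
grouped = df.groupby(['Team', 'Year'])
riders_2014 = grouped.get_group(('Riders', 2014))
print(riders_2014)

# Asking for a key that does not exist raises KeyError.
try:
    grouped.get_group(('Royals', 2014))
    missing = False
except KeyError:
    missing = True
print(missing)
```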
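Aggregation
An aggregation function returns a single aggregated value for each group. Once the groupby object has been created, several aggregation operations can be performed on the grouped data.
A common approach is to aggregate with the aggregate method or its equivalent alias, agg: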
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> # Pass the function by name; recent pandas versions warn when given np.mean.
>>> print(grouped['Points'].agg('mean'))
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
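Another way to see the size of each group is to pass NumPy's size function to agg(), which reports the number of rows per group for every column: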
>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> grouped = df.groupby('Team')
>>> print(grouped.agg(np.size))
        Rank  Year  Points
Team
Devils     2     2       2
Kings      3     3       3
Riders     4     4       4
Royals     2     2       2
kings      1     1       1
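The groupby object also has its own size() and count() methods, and the two differ once missing values are involved. The sketch below uses a small made-up frame with one NaN to show the difference; the names sizes and counts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils'],
    'Points': [876.0, np.nan, 673.0],
})
grouped = df.groupby('Team')

# size(): number of rows per group, missing values included.
sizes = grouped.size()

# count(): number of non-null values per group in the selected column.
counts = grouped['Points'].count()

print(sizes)
print(counts)
```

Here the Riders group has two rows but only one non-null Points value, so the two results disagree for that team.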
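7.9.4. Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list or dict of functions to aggregate with, producing a DataFrame as output: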
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> # One output column per function; std is NaN for a single-row group.
>>> agg = grouped['Points'].agg(['sum', 'mean', 'std'])
>>> print(agg)
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
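On a grouped DataFrame, a dict can assign different functions to different columns, and keyword-style named aggregation lets you choose the output column names. The following is a minimal sketch on a made-up four-row frame; total_points and best_rank are illustrative output names, not names from the tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Rank': [1, 2, 2, 3],
    'Points': [876, 789, 863, 673],
})

# Dict form: column name -> list of functions; the result has two-level columns.
summary = df.groupby('Team').agg({'Points': ['sum', 'mean'], 'Rank': ['min']})

# Named aggregation: output_name=(input_column, function).
named = df.groupby('Team').agg(
    total_points=('Points', 'sum'),
    best_rank=('Rank', 'min'),
)

print(summary)
print(named)
```

Named aggregation tends to be easier to work with downstream because the result has flat, self-describing column names.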
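7.9.5. Transformations
A transformation on a group or a column returns an object that is indexed the same as, and is the same size as, the object being grouped. The transform function must therefore return a result the same size as the group chunk it receives.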
>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> # Standardize each column within its team, then scale by 10.
>>> score = lambda x: (x - x.mean()) / x.std() * 10
>>> print(grouped.transform(score))
         Rank       Year     Points
0  -15.000000 -11.618950  12.843272
1    5.000000  -3.872983   3.020286
2   -7.071068  -7.071068   7.071068
3    7.071068   7.071068  -7.071068
4   11.547005 -10.910895  -8.608621
5         NaN        NaN        NaN
6   -5.773503   2.182179  -2.360428
7   -5.773503   8.728716  10.969049
8    5.000000   3.872983  -7.705963
9    7.071068  -7.071068  -7.071068
10  -7.071068   7.071068   7.071068
11   5.000000  11.618950  -8.157595
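Because a transform result lines up with the original rows, it can be assigned straight back as a new column. The sketch below broadcasts each team's mean Points to every row of that team; the frame and the column names TeamMean and Diff are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# transform('mean') returns one value per original row, not one per group.
df['TeamMean'] = df.groupby('Team')['Points'].transform('mean')

# The aligned result makes per-row comparisons against the group easy.
df['Diff'] = df['Points'] - df['TeamMean']

print(df)
```

An aggregation such as agg('mean') would instead return one row per team, which could not be assigned to the original frame without a merge.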
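7.9.6. Filtration
Filtration filters the data on a defined criterion and returns a subset of the data. The filter() function is used for this: it calls the supplied function on each group and keeps the group's rows when the function returns True.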
>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...                      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> # Keep only the teams with at least three rows (named to avoid shadowing the built-in filter).
>>> filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
>>>
>>> print(filtered)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
11  Riders     2  2017     690
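The filter condition above returns the teams that have participated in the IPL three or more times. Note that the lowercase 'kings' row forms its own one-row group and is therefore dropped.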
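The predicate passed to filter() can test any property of the group, not only its length. As a minimal sketch on a made-up frame, the following keeps the teams whose mean Points exceeds 800; the threshold and the name strong are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# The function receives each group as a DataFrame and returns a single bool.
strong = df.groupby('Team').filter(lambda g: g['Points'].mean() > 800)

print(strong)
```

All rows of a passing group are returned with their original index labels, and all rows of a failing group are dropped.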