目录

上一个主题

6.8. Pandas缺失数据

下一个主题

6.10. Pandas合并/连接


>>> from env_helper import info; info()
待更新

6.9. Pandas分组(GroupBy)

任何分组(groupby)操作都涉及原始对象的以下操作之一。它们是 -

分割对象
应用一个函数
结合的结果

在许多情况下,我们将数据分成多个集合,并在每个子集上应用一些函数。在应用函数中,可以执行以下操作 -

聚合 - 计算汇总统计
转换 - 执行一些特定于组的操作
过滤 - 在某些情况下丢弃数据

下面来看看创建一个DataFrame对象并对其执行所有操作 -

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017

将数据拆分成组

Pandas对象可以分成任何对象。有多种方式来拆分对象,如 -

obj.groupby(‘key’)
obj.groupby([‘key1’,’key2’])
obj.groupby(key,axis=1)

现在来看看如何将分组对象应用于DataFrame对象

示例

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df.groupby('Team'))
<pandas.core.groupby.DataFrameGroupBy object at 0x7f0a68536ac8>

查看分组

>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],           'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> print (df.groupby('Team').groups)
{'kings': Int64Index([5], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64')}

示例

按多列分组 -

>>> import pandas as pd
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> print (df.groupby(['Team','Year']).groups)
{('Devils', 2014): Int64Index([2], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64')}

迭代遍历分组

使用groupby对象,可以遍历类似itertools.obj的对象。

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>>
>>> for name,group in grouped:
>>>     print (name)
>>>     print (group)
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017

默认情况下,groupby对象具有与分组名相同的标签名称。

选择一个分组

使用get_group()方法,可以选择一个组。参考以下示例代码 -

>>> import pandas as pd
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> print (grouped.get_group(2014))
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014

聚合

聚合函数为每个组返回单个聚合值。当创建了分组(group by)对象,就可以对分组数据执行多个聚合操作。

一个比较常用的是通过聚合或等效的agg方法聚合 -

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Year')
>>> print (grouped['Points'].agg(np.mean))
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64

另一种查看每个分组的大小的方法是应用size()函数 -

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> grouped = df.groupby('Team')
>>> print (grouped.agg(np.size))
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1

一次应用多个聚合函数

通过分组系列,还可以传递函数的列表或字典来进行聚合,并生成DataFrame作为输出 -

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> agg = grouped['Points'].agg([np.sum, np.mean, np.std])
>>> print (agg)
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN

转换

分组或列上的转换返回索引大小与被分组的索引相同的对象。因此,转换应该返回与组块大小相同的结果。

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>>
>>> grouped = df.groupby('Team')
>>> score = lambda x: (x - x.mean()) / x.std()*10
>>> print (grouped.transform(score))
       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950

过滤

过滤根据定义的标准过滤数据并返回数据的子集。filter()函数用于过滤数据。

>>> import pandas as pd
>>> import numpy as np
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
>>>          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
>>>          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
>>>          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
>>>          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> filter = df.groupby('Team').filter(lambda x: len(x) >= 3)
>>>
>>> print (filter)
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017

在上述过滤条件下,要求返回三次以上参加IPL的队伍。