>>> from env_helper import info; info()
Page last updated: 2023-04-15 21:30:30
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-7-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

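5.2. Pandas Quick Start

This is a quick-start tutorial for Pandas, aimed mainly at new users. It is written for readers who prefer a short, fast introduction; interested readers can work through the other tutorial articles to learn more advanced usage step by step.

First, we assume that you have Pandas installed.

To check that Pandas is available in your working environment, import the relevant packages as follows: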

>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt

Object creation

Create a Series by passing a list of values, letting Pandas create a default integer index:

>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
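
As a small illustration of the output above: np.nan is a floating-point value, so a list that contains it is stored as float64, and the default index is simply the positions 0 through n-1. The sketch below checks both points; it is only an illustration, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# A list containing np.nan is upcast to float64, because NaN is a float.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.dtype)

# Without an explicit index, Pandas assigns the integer positions 0..n-1.
print(list(s.index))

# isna() flags the missing entry; count() skips it.
print(int(s.isna().sum()), int(s.count()))
```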

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns. Start with the datetime index:

>>> dates = pd.date_range('20210101', periods=7)
>>> dates
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07'],
              dtype='datetime64[ns]', freq='D')

The DataFrame is then created as follows:

>>> df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2021-01-01  0.606979  1.609177 -0.843393 -0.380063
2021-01-02  1.797330  0.141385 -2.511412  1.790808
2021-01-03  1.021336 -0.170673  1.401057 -2.861379
2021-01-04  0.775513 -0.181982  0.840247 -1.671743
2021-01-05  0.773924 -1.086668  0.068625 -1.547599
2021-01-06  0.773845 -2.245264 -0.482183 -1.479018
2021-01-07 -0.365987 -0.987623  0.105982  1.071964
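
Because np.random.randn draws new numbers on every run, the values you see will differ from those printed here. If reproducible output is wanted, one option is a seeded generator. The sketch below assumes NumPy's np.random.default_rng; the helper name make_frame and the seed 0 are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=7)

def make_frame(seed):
    # A seeded generator returns the same "random" numbers on every run.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((7, 4)),
                        index=dates, columns=list("ABCD"))

df_a = make_frame(0)
df_b = make_frame(0)

print(df_a.shape)           # (7, 4)
print(df_a.equals(df_b))    # True: same seed, same data
```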

Create a DataFrame by passing a dict of objects that can be converted to Series-like objects:

>>> df2 = pd.DataFrame({ 'A' : 1.,
...                      'B' : pd.Timestamp('20170102'),
...                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
...                      'D' : np.array([3] * 4,dtype='int32'),
...                      'E' : pd.Categorical(["test","train","test","train"]),
...                      'F' : 'foo' })
>>> df2
     A          B    C  D      E    F
0  1.0 2017-01-02  1.0  3   test  foo
1  1.0 2017-01-02  1.0  3  train  foo
2  1.0 2017-01-02  1.0  3   test  foo
3  1.0 2017-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:

>>> df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
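
One detail of the dict constructor that is easy to miss: scalar values such as 1. and 'foo' are repeated on every row, while the row count comes from the array-like entries. The following is a reduced sketch of the same idea (only four of the six columns are kept, to keep it short).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                                   # scalar, repeated on every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),   # sets the 4-row index
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",                                                 # scalar, repeated on every row
})

print(len(df2))                         # 4
print(df2["A"].tolist())                # [1.0, 1.0, 1.0, 1.0]
print(df2["C"].dtype, df2["D"].dtype)   # float32 int32
```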

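If you use tools such as IPython or Jupyter Notebook, Tab completion for column names (as well as public attributes) is enabled automatically.

Viewing data

View the top and bottom rows of the DataFrame: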

>>> df.head()
                   A         B         C         D
2021-01-01  0.606979  1.609177 -0.843393 -0.380063
2021-01-02  1.797330  0.141385 -2.511412  1.790808
2021-01-03  1.021336 -0.170673  1.401057 -2.861379
2021-01-04  0.775513 -0.181982  0.840247 -1.671743
2021-01-05  0.773924 -1.086668  0.068625 -1.547599
>>> df.tail(3)
                   A         B         C         D
2021-01-05  0.773924 -1.086668  0.068625 -1.547599
2021-01-06  0.773845 -2.245264 -0.482183 -1.479018
2021-01-07 -0.365987 -0.987623  0.105982  1.071964
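
A short sketch of the row counts involved: head() returns the first 5 rows when called without an argument, and tail(3) returns the last 3. The frame below uses np.arange instead of random numbers so that the result is the same on every run; that substitution is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=7)
df = pd.DataFrame(np.arange(28).reshape(7, 4), index=dates, columns=list("ABCD"))

print(len(df.head()))              # 5: the default number of rows
print(len(df.tail(3)))             # 3
print(df.tail(3).index[0].date())  # 2021-01-05: first of the last three days
```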

Display the index, the columns, and the underlying NumPy data:

>>> print("index is :" )
>>> df.index
index is :
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07'],
              dtype='datetime64[ns]', freq='D')
>>> print("columns is :" )
>>> df.columns
columns is :
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> print("values is :" )
>>> df.values
values is :
array([[ 0.6069793 ,  1.60917727, -0.84339272, -0.38006313],
       [ 1.79732993,  0.14138518, -2.51141225,  1.79080822],
       [ 1.02133645, -0.17067348,  1.40105708, -2.86137874],
       [ 0.77551279, -0.18198206,  0.8402473 , -1.67174332],
       [ 0.7739237 , -1.08666813,  0.06862547, -1.54759886],
       [ 0.77384463, -2.24526442, -0.48218348, -1.47901752],
       [-0.36598721, -0.98762342,  0.10598163,  1.07196402]])
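
The pandas documentation recommends DataFrame.to_numpy() over the .values attribute in new code. For a frame like this one, where every column is float64, both return the same array. A minimal sketch, using np.arange data as a stand-in for the random values:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=7)
df = pd.DataFrame(np.arange(28, dtype="float64").reshape(7, 4),
                  index=dates, columns=list("ABCD"))

# to_numpy() returns the data only; index and column labels are dropped.
arr = df.to_numpy()
print(type(arr).__name__, arr.shape)     # ndarray (7, 4)
print(np.array_equal(arr, df.values))    # True
```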

describe() shows a quick statistical summary of the data:

>>> df.describe()
              A         B         C         D
count  7.000000  7.000000  7.000000  7.000000
mean   0.768991 -0.417378 -0.203011 -0.725290
std    0.637139  1.200674  1.268032  1.652422
min   -0.365987 -2.245264 -2.511412 -2.861379
25%    0.690412 -1.037146 -0.662788 -1.609671
50%    0.773924 -0.181982  0.068625 -1.479018
75%    0.898425 -0.014644  0.473114  0.345950
max    1.797330  1.609177  1.401057  1.790808
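
The rows of the summary correspond to ordinary column-wise statistics; for example, the "50%" row is the median. The sketch below cross-checks two of them against df.mean() and df.median(). The seeded generator is an assumption used only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 4)), columns=list("ABCD"))

summary = df.describe()
print(list(summary.index))

# Each summary row matches the corresponding column-wise statistic.
print(np.allclose(summary.loc["mean"], df.mean()))     # True
print(np.allclose(summary.loc["50%"], df.median()))    # True
```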

Transpose the data:

>>> dates3 = pd.date_range('20170101', periods=6)
>>> df3 = pd.DataFrame(np.random.randn(6,4), index=dates3, columns=list('ABCD'))
>>> df3.T
   2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
A    1.476265   -1.322840    0.422917    0.485588    0.075828    0.070545
B   -0.523253   -1.598288    0.877139    0.607529   -0.224069    0.351965
C   -0.477934   -1.195992   -0.081699    0.256771   -2.131678    1.267888
D   -0.091211   -0.135799    0.633322   -0.606045   -1.197480   -0.335860
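
Transposing swaps the two axes: the column labels become the index and the dates become the columns. A brief sketch with deterministic np.arange data (an assumption for the sake of repeatable output):

```python
import numpy as np
import pandas as pd

dates3 = pd.date_range("20170101", periods=6)
df3 = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4),
                   index=dates3, columns=list("ABCD"))

t = df3.T
print(df3.shape, t.shape)    # (6, 4) (4, 6)
print(list(t.index))         # ['A', 'B', 'C', 'D']

# The same cell is reached with the row and column labels swapped.
print(t.loc["A", dates3[0]] == df3.loc[dates3[0], "A"])    # True
```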

Sort by an axis (here, the column labels in descending order):

>>> df3.sort_index(axis=1, ascending=False)
                   D         C         B         A
2017-01-01 -0.091211 -0.477934 -0.523253  1.476265
2017-01-02 -0.135799 -1.195992 -1.598288 -1.322840
2017-01-03  0.633322 -0.081699  0.877139  0.422917
2017-01-04 -0.606045  0.256771  0.607529  0.485588
2017-01-05 -1.197480 -2.131678 -0.224069  0.075828
2017-01-06 -0.335860  1.267888  0.351965  0.070545

Sort by values:

>>> df3.sort_values(by='B')
                   A         B         C         D
2017-01-02 -1.322840 -1.598288 -1.195992 -0.135799
2017-01-01  1.476265 -0.523253 -0.477934 -0.091211
2017-01-05  0.075828 -0.224069 -2.131678 -1.197480
2017-01-06  0.070545  0.351965  1.267888 -0.335860
2017-01-04  0.485588  0.607529  0.256771 -0.606045
2017-01-03  0.422917  0.877139 -0.081699  0.633322
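
Both sort methods return a new object and leave the original frame untouched. The sketch below shows this on a small hand-written frame; the values are made up for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, -1.0, 0.0]},
                   index=pd.date_range("20170101", periods=3))

by_cols = df3.sort_index(axis=1, ascending=False)   # order the column labels
by_b = df3.sort_values(by="B")                      # order the rows by column B

print(list(by_cols.columns))   # ['B', 'A']
print(by_b["B"].tolist())      # [-1.0, 0.0, 0.5]
print(list(df3.columns))       # ['A', 'B']: the original is unchanged
```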

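Selection

Note: although standard Python/NumPy expressions for selecting and setting are intuitive and handy for interactive work, for production code the optimized Pandas data access methods .at, .iat, .loc, and .iloc are recommended. (The older .ix indexer was deprecated and then removed in pandas 1.0, so it should no longer be used.)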
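
The difference between label-based access (.loc, .at) and position-based access (.iloc, .iat) is easiest to see when the labels are integers that do not match the positions. The Series below, with labels 10, 20, and 30, is a made-up example for that purpose.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

print(s.loc[10])               # 'a': looked up by label
print(s.iloc[0])               # 'a': looked up by position
print(s.loc[10:20].tolist())   # ['a', 'b']: label slices include both endpoints
print(s.iloc[0:1].tolist())    # ['a']: position slices exclude the end
```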

Getting

Selecting a single column yields a Series; df['A'] is equivalent to df.A:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> df['A']
2017-01-01   -0.440279
2017-01-02    0.275738
2017-01-03    0.683135
2017-01-04    0.042306
2017-01-05   -0.240564
2017-01-06   -0.630516
Freq: D, Name: A, dtype: float64

Selecting with the [] operator and a slice picks out rows:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[0:3])
>>>
>>> print("========= select by date ========")
>>>
>>> print(df['20170102':'20170103'])
                   A         B         C         D
2017-01-01 -0.382540 -0.182804 -0.190991 -0.135017
2017-01-02  0.956760  0.356210 -0.671786  0.227365
2017-01-03  0.074734 -1.146267 -0.307118 -0.473373
========= select by date ========
                   A         B         C         D
2017-01-02  0.956760  0.356210 -0.671786  0.227365
2017-01-03  0.074734 -1.146267 -0.307118 -0.473373
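
Note the difference between the two slices above: the positional slice 0:3 excludes its end, while the date slice includes both endpoints. A small sketch with deterministic data (np.arange is used here as a stand-in for the random values):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df[0:3]                      # rows 0, 1, 2
by_label = df["20170102":"20170103"]       # both dates are included

print(len(by_position), len(by_label))     # 3 2
print(by_label.index[-1].date())           # 2017-01-03
```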

Selection by label

Get a cross section using a label:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[dates[0]])
A   -1.031819
B   -0.006138
C    1.757311
D   -1.900689
Name: 2017-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[:,['A','B']])
                   A         B
2017-01-01  0.164280  0.276552
2017-01-02 -0.421966  0.221478
2017-01-03  1.281613  0.041556
2017-01-04 -0.088802 -0.257067
2017-01-05 -0.049226 -1.957657
2017-01-06  0.162513  1.648133

Label slicing includes both endpoints:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc['20170102':'20170104',['A','B']])
                   A         B
2017-01-02  0.830484 -0.518361
2017-01-03  0.169390  0.817312
2017-01-04  0.399126 -1.292619

Reduce the dimensions of the returned object (a single row label returns a Series instead of a DataFrame):

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc['20170102',['A','B']])
A    0.529362
B   -1.100597
Name: 2017-01-02 00:00:00, dtype: float64
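
The shape of what .loc returns depends on what is passed in: a slice or list on both axes gives a DataFrame, while a single row label gives a Series named after that row. A sketch with np.arange data (an assumption, to keep the output stable):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

rows_and_cols = df.loc["20170102":"20170104", ["A", "B"]]   # slice + list
one_row = df.loc["20170102", ["A", "B"]]                    # single label + list

print(type(rows_and_cols).__name__, rows_and_cols.shape)    # DataFrame (3, 2)
print(type(one_row).__name__, one_row.shape)                # Series (2,)
print(one_row.name)                                         # the selected row label
```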

Get a scalar value:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[dates[0],'A'])
2.442415457892376

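For fast access to a scalar (equivalent to the previous method), use .at. The printed value differs from the previous example only because each example rebuilds df from new random numbers: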

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.at[dates[0],'A'])
0.9653168648436002
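
To see the equivalence directly, both accessors can be applied to the same frame. The seeded generator below is an assumption made so that the frame is identical on every run.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

via_loc = df.loc[dates[0], "A"]   # general label-based lookup
via_at = df.at[dates[0], "A"]     # scalar-only lookup

print(via_loc == via_at)          # True
```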

Selection by position

Select by position by passing an integer:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[3])
A   -0.650924
B    0.138809
C    2.545768
D   -0.536561
Name: 2017-01-04 00:00:00, dtype: float64

Select by integer slices, similar to NumPy/Python:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[3:5,0:2])
                   A         B
2017-01-04 -0.388126 -1.531424
2017-01-05 -0.729532 -0.705215

Select by lists of integer positions, similar to the NumPy/Python style:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[[1,2,4],[0,2]])
                   A         C
2017-01-02 -0.747364 -0.594414
2017-01-03 -0.793403 -1.793447
2017-01-05 -0.662157 -0.417067

Slice rows explicitly:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[1:3,:])
                   A         B         C         D
2017-01-02 -0.530230 -0.773371  1.413877 -0.970272
2017-01-03  0.431505 -0.587018 -0.898735  1.663966

Slice columns explicitly:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[:,1:3])
                   B         C
2017-01-01 -1.453717  1.052346
2017-01-02  0.276483 -0.327941
2017-01-03  0.686838  0.864911
2017-01-04  0.268747 -0.135483
2017-01-05  0.167914 -1.809039
2017-01-06 -1.915393  0.402191

Get a value explicitly:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[1,1])
1.3062897937902163

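For fast access to a scalar (equivalent to the previous method), use .iat. Again, the printed value differs only because df is rebuilt from new random numbers: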

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iat[1,1])
-1.353471076642419
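
The positional accessors can be checked against each other on one fixed frame. With np.arange data (an illustrative substitute for the random values), the cell at row 1, column 1 is always 5:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

block = df.iloc[3:5, 0:2]                   # rows 3-4 and columns 0-1; ends excluded
print(block.shape)                          # (2, 2)
print(list(block.columns))                  # ['A', 'B']

print(df.iloc[1, 1], df.iat[1, 1])          # 5 5
print(df.iat[1, 1] == df.at[dates[1], "B"]) # True: same cell, by position and by label
```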

Boolean indexing

Use the values of a single column to select rows:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[df.A > 0])
                   A         B         C         D
2017-01-02  0.209709  0.557072  0.576529  0.474762
2017-01-03  1.263861  1.128960 -0.738488  1.382371
2017-01-04  0.511101 -1.109398 -0.289064 -0.371348

Select values from a DataFrame where a boolean condition is met; entries that fail the condition are shown as NaN:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[df > 0])
                   A         B         C         D
2017-01-01       NaN  0.029491       NaN       NaN
2017-01-02       NaN       NaN       NaN  0.906864
2017-01-03       NaN  2.048842  0.449479       NaN
2017-01-04  2.213507  0.725957       NaN  0.602421
2017-01-05  0.743683  0.575747  0.368332       NaN
2017-01-06  1.876558       NaN  1.223221  0.477795
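
The two forms of boolean indexing behave differently: a boolean Series drops whole rows, while a boolean DataFrame keeps the shape and masks individual cells. The sketch below uses a small hand-written frame and also shows that the second form gives the same result as DataFrame.where.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

rows = df[df["A"] > 0]     # boolean Series: keeps rows where A > 0
masked = df[df > 0]        # boolean DataFrame: same shape, failing cells become NaN

print(len(rows))                             # 2
print(masked.shape)                          # (3, 2)
print(int(masked.isna().sum().sum()))        # 2 masked cells
print(masked.equals(df.where(df > 0)))       # True
```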

Use the isin() method for filtering:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one','two','three','four','three']
>>>
>>> print(df2)
>>>
>>> print("============= start to filter =============== ")
>>>
>>> print(df2[df2['E'].isin(['two','four'])])
                   A         B         C         D      E
2017-01-01 -1.551054  0.074099  2.836054 -1.328567    one
2017-01-02 -1.324657  2.074419 -0.793078  0.159044    one
2017-01-03  0.522375 -0.834793 -1.485328  1.420185    two
2017-01-04  0.459851  0.421881  0.907386  0.110532  three
2017-01-05 -0.192005  1.595409 -0.988345 -0.215954   four
2017-01-06  0.563005  0.241732  1.096168  0.665604  three
============= start to filter ===============
                   A         B         C         D     E
2017-01-03  0.522375 -0.834793 -1.485328  1.420185   two
2017-01-05 -0.192005  1.595409 -0.988345 -0.215954  four
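
isin() returns a boolean Series, so it can be stored, reused, and inverted with ~ to select the rows that do not match. A minimal sketch that keeps only the E column from the example above:

```python
import pandas as pd

df2 = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]},
                   index=pd.date_range("20170101", periods=6))

mask = df2["E"].isin(["two", "four"])
print(mask.tolist())               # [False, False, True, False, True, False]
print(df2[mask]["E"].tolist())     # ['two', 'four']
print(df2[~mask]["E"].tolist())    # ['one', 'one', 'three', 'three']
```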