>>> from env_helper import info; info()
Page last updated: 2023-04-15 21:30:30
Runtime environment:
Linux distribution: Debian GNU/Linux 12 (bookworm)
OS kernel: Linux-6.1.0-7-amd64-x86_64-with-glibc2.36
Python version: 3.11.2
5.2. Pandas Quick Start¶
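This is a quick introduction to Pandas, aimed mainly at new users. It is written for readers who want something short and to the point; those who are interested can work through the other tutorial articles to learn more complex usage step by step.
First, we assume that Pandas is already installed.
To check that Pandas is available in your working environment, import the relevant packages as follows: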
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
Object creation¶
Create a Series by passing a list of values, letting Pandas create a default integer index:
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
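The dtype in the output above is float64 even though most of the values were written as integers. A minimal sketch of why, assuming the cause is the np.nan entry (NaN is a floating-point value, so the whole Series is stored as floats); the names s_with_nan and s_ints are illustrative only:

```python
import numpy as np
import pandas as pd

# A list that contains np.nan is stored as floating point.
s_with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])
# The same kind of list without a missing value keeps an integer dtype.
s_ints = pd.Series([1, 3, 5])

print(s_with_nan.dtype)                            # float64
print(pd.api.types.is_integer_dtype(s_ints.dtype)) # True
print(s_with_nan.isna().tolist())                  # [False, False, False, True, False, False]
```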
Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20210101', periods=7)
>>> dates
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07'],
dtype='datetime64[ns]', freq='D')
The DataFrame itself is created as follows:
>>> df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=list('ABCD'))
>>> df
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.606979 | 1.609177 | -0.843393 | -0.380063 |
2021-01-02 | 1.797330 | 0.141385 | -2.511412 | 1.790808 |
2021-01-03 | 1.021336 | -0.170673 | 1.401057 | -2.861379 |
2021-01-04 | 0.775513 | -0.181982 | 0.840247 | -1.671743 |
2021-01-05 | 0.773924 | -1.086668 | 0.068625 | -1.547599 |
2021-01-06 | 0.773845 | -2.245264 | -0.482183 | -1.479018 |
2021-01-07 | -0.365987 | -0.987623 | 0.105982 | 1.071964 |
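The values in this table come from np.random.randn, so they will differ on every run. A minimal sketch, assuming reproducible output is wanted, is to draw from a seeded NumPy generator instead. The seed 0 and the names rng, df_seeded, and df_again are illustrative choices, not part of the original tutorial:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210101', periods=7)

# A seeded generator yields the same numbers on every run (seed 0 is arbitrary).
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((7, 4)), index=dates, columns=list('ABCD'))

# Re-creating the generator with the same seed reproduces the same frame.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((7, 4)), index=dates, columns=list('ABCD'))

print(df_seeded.shape)             # (7, 4)
print(df_seeded.equals(df_again))  # True
```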
Create a DataFrame by passing a dict of objects that can be converted to Series-like structures:
>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20170102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})
>>> df2
 | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2017-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2017-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2017-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2017-01-02 | 1.0 | 3 | train | foo |
The columns of the resulting DataFrame have different dtypes:
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
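If you are using IPython, Jupyter Notebook, or a similar tool, tab completion for column names (as well as public attributes) is enabled automatically.
Viewing data¶
View the top and bottom rows of the DataFrame: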
>>> df.head()
 | A | B | C | D |
---|---|---|---|---|
2021-01-01 | 0.606979 | 1.609177 | -0.843393 | -0.380063 |
2021-01-02 | 1.797330 | 0.141385 | -2.511412 | 1.790808 |
2021-01-03 | 1.021336 | -0.170673 | 1.401057 | -2.861379 |
2021-01-04 | 0.775513 | -0.181982 | 0.840247 | -1.671743 |
2021-01-05 | 0.773924 | -1.086668 | 0.068625 | -1.547599 |
>>> df.tail(3)
 | A | B | C | D |
---|---|---|---|---|
2021-01-05 | 0.773924 | -1.086668 | 0.068625 | -1.547599 |
2021-01-06 | 0.773845 | -2.245264 | -0.482183 | -1.479018 |
2021-01-07 | -0.365987 | -0.987623 | 0.105982 | 1.071964 |
Display the index, the columns, and the underlying NumPy data:
>>> print("index is :" )
>>> df.index
index is :
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07'],
dtype='datetime64[ns]', freq='D')
>>> print("columns is :" )
>>> df.columns
columns is :
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> print("values is :" )
>>> df.values
values is :
array([[ 0.6069793 , 1.60917727, -0.84339272, -0.38006313],
[ 1.79732993, 0.14138518, -2.51141225, 1.79080822],
[ 1.02133645, -0.17067348, 1.40105708, -2.86137874],
[ 0.77551279, -0.18198206, 0.8402473 , -1.67174332],
[ 0.7739237 , -1.08666813, 0.06862547, -1.54759886],
[ 0.77384463, -2.24526442, -0.48218348, -1.47901752],
[-0.36598721, -0.98762342, 0.10598163, 1.07196402]])
describe() shows a quick statistical summary of the data:
>>> df.describe()
 | A | B | C | D |
---|---|---|---|---|
count | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
mean | 0.768991 | -0.417378 | -0.203011 | -0.725290 |
std | 0.637139 | 1.200674 | 1.268032 | 1.652422 |
min | -0.365987 | -2.245264 | -2.511412 | -2.861379 |
25% | 0.690412 | -1.037146 | -0.662788 | -1.609671 |
50% | 0.773924 | -0.181982 | 0.068625 | -1.479018 |
75% | 0.898425 | -0.014644 | 0.473114 | 0.345950 |
max | 1.797330 | 1.609177 | 1.401057 | 1.790808 |
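Each row of the summary corresponds to an ordinary reduction over a column. The sketch below checks this on a small hand-written frame so the numbers are easy to verify; df_small and summary are illustrative names, not part of the original tutorial:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                         'B': [10.0, 20.0, 30.0, 40.0]})
summary = df_small.describe()

# Rows of the summary can be looked up by label like any other DataFrame.
print(summary.loc['mean', 'A'])   # 2.5
print(summary.loc['50%', 'B'])    # 25.0 (the median)

# The 'std' row agrees with calling .std() on the column directly.
print(np.isclose(summary.loc['std', 'A'], df_small['A'].std()))  # True
```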
Transpose the data:
>>> dates3 = pd.date_range('20170101', periods=6)
>>> df3 = pd.DataFrame(np.random.randn(6,4), index=dates3, columns=list('ABCD'))
>>> df3.T
 | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 | 2017-01-06 |
---|---|---|---|---|---|---|
A | 1.476265 | -1.322840 | 0.422917 | 0.485588 | 0.075828 | 0.070545 |
B | -0.523253 | -1.598288 | 0.877139 | 0.607529 | -0.224069 | 0.351965 |
C | -0.477934 | -1.195992 | -0.081699 | 0.256771 | -2.131678 | 1.267888 |
D | -0.091211 | -0.135799 | 0.633322 | -0.606045 | -1.197480 | -0.335860 |
Sort by an axis (here the column labels, in descending order):
>>> df3.sort_index(axis=1, ascending=False)
 | D | C | B | A |
---|---|---|---|---|
2017-01-01 | -0.091211 | -0.477934 | -0.523253 | 1.476265 |
2017-01-02 | -0.135799 | -1.195992 | -1.598288 | -1.322840 |
2017-01-03 | 0.633322 | -0.081699 | 0.877139 | 0.422917 |
2017-01-04 | -0.606045 | 0.256771 | 0.607529 | 0.485588 |
2017-01-05 | -1.197480 | -2.131678 | -0.224069 | 0.075828 |
2017-01-06 | -0.335860 | 1.267888 | 0.351965 | 0.070545 |
Sort by the values of a column:
>>> df3.sort_values(by='B')
 | A | B | C | D |
---|---|---|---|---|
2017-01-02 | -1.322840 | -1.598288 | -1.195992 | -0.135799 |
2017-01-01 | 1.476265 | -0.523253 | -0.477934 | -0.091211 |
2017-01-05 | 0.075828 | -0.224069 | -2.131678 | -1.197480 |
2017-01-06 | 0.070545 | 0.351965 | 1.267888 | -0.335860 |
2017-01-04 | 0.485588 | 0.607529 | 0.256771 | -0.606045 |
2017-01-03 | 0.422917 | 0.877139 | -0.081699 | 0.633322 |
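Because the frames above hold random numbers, the effect of sorting can be hard to check by eye. A small sketch with fixed values, where df_sort and the result names are illustrative only, shows the difference between sorting by labels and sorting by values:

```python
import pandas as pd

df_sort = pd.DataFrame({'B': [3, 1, 2], 'A': [30, 10, 20]}, index=['z', 'x', 'y'])

by_columns = df_sort.sort_index(axis=1)                       # order columns by label
by_rows = df_sort.sort_index()                                # order rows by label
by_value_desc = df_sort.sort_values(by='B', ascending=False)  # order rows by column B

print(list(by_columns.columns))    # ['A', 'B']
print(list(by_rows.index))         # ['x', 'y', 'z']
print(list(by_value_desc.index))   # ['z', 'y', 'x']

# Both methods return new objects; the original frame keeps its order.
print(list(df_sort.index))         # ['z', 'x', 'y']
```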
Selection¶
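Note: although the standard Python/NumPy expressions for selecting and setting are intuitive and convenient for interactive work, for production code we recommend the optimized Pandas data access methods .at, .iat, .loc, and .iloc. The older .ix indexer was deprecated and then removed in pandas 1.0, so it is not available in current versions.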
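As a rough orientation before the detailed sections, the sketch below contrasts label-based access (.loc, .at) with position-based access (.iloc, .iat). The frame df_acc and its integer labels 10, 20, 30 are made up for illustration; they are chosen so that labels and positions cannot be confused:

```python
import pandas as pd

df_acc = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[10, 20, 30])

row_by_label = df_acc.loc[20]      # the row whose index label is 20
row_by_position = df_acc.iloc[1]   # the second row, whatever its label is

print(row_by_label.equals(row_by_position))  # True

# .at and .iat are the scalar-only counterparts of .loc and .iloc.
print(df_acc.at[30, 'A'])   # 3
print(df_acc.iat[0, 1])     # 4
```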
Getting¶
Selecting a single column yields a Series; this is equivalent to df.A:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> df['A']
2017-01-01 -0.440279
2017-01-02 0.275738
2017-01-03 0.683135
2017-01-04 0.042306
2017-01-05 -0.240564
2017-01-06 -0.630516
Freq: D, Name: A, dtype: float64
Selecting with the [] operator and a slice selects rows:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[0:3])
>>>
>>> print("========= select by date ========")
>>>
>>> print(df['20170102':'20170103'])
A B C D
2017-01-01 -0.382540 -0.182804 -0.190991 -0.135017
2017-01-02 0.956760 0.356210 -0.671786 0.227365
2017-01-03 0.074734 -1.146267 -0.307118 -0.473373
========= select by date ========
A B C D
2017-01-02 0.956760 0.356210 -0.671786 0.227365
2017-01-03 0.074734 -1.146267 -0.307118 -0.473373
Selection by label¶
Get a cross-section using a label:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[dates[0]])
A -1.031819
B -0.006138
C 1.757311
D -1.900689
Name: 2017-01-01 00:00:00, dtype: float64
Select on multiple axes by label:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[:,['A','B']])
A B
2017-01-01 0.164280 0.276552
2017-01-02 -0.421966 0.221478
2017-01-03 1.281613 0.041556
2017-01-04 -0.088802 -0.257067
2017-01-05 -0.049226 -1.957657
2017-01-06 0.162513 1.648133
Label slicing includes both endpoints:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc['20170102':'20170104',['A','B']])
A B
2017-01-02 0.830484 -0.518361
2017-01-03 0.169390 0.817312
2017-01-04 0.399126 -1.292619
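The inclusive endpoints are the main difference from ordinary Python slicing. A minimal sketch, using a hypothetical frame df_slice with known values, compares a label slice with a positional slice:

```python
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df_slice = pd.DataFrame({'A': range(6)}, index=dates)

by_label = df_slice.loc['20170102':'20170104']   # both endpoints are included
by_position = df_slice.iloc[1:3]                 # the end position is excluded

print(by_label['A'].tolist())      # [1, 2, 3]
print(by_position['A'].tolist())   # [1, 2]
```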
Passing a single row label reduces the dimensions of the returned object (here a Series instead of a DataFrame):
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc['20170102',['A','B']])
A 0.529362
B -1.100597
Name: 2017-01-02 00:00:00, dtype: float64
Get a scalar value:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.loc[dates[0],'A'])
2.442415457892376
Fast access to a scalar (equivalent to the previous method):
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.at[dates[0],'A'])
0.9653168648436002
Selection by position¶
Select by position using the passed integer:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[3])
A -0.650924
B 0.138809
C 2.545768
D -0.536561
Name: 2017-01-04 00:00:00, dtype: float64
Slice by integers, similar to NumPy/Python:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[3:5,0:2])
A B
2017-01-04 -0.388126 -1.531424
2017-01-05 -0.729532 -0.705215
Select using lists of integer positions, similar to NumPy/Python style:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[[1,2,4],[0,2]])
A C
2017-01-02 -0.747364 -0.594414
2017-01-03 -0.793403 -1.793447
2017-01-05 -0.662157 -0.417067
Slice rows explicitly:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[1:3,:])
A B C D
2017-01-02 -0.530230 -0.773371 1.413877 -0.970272
2017-01-03 0.431505 -0.587018 -0.898735 1.663966
Slice columns explicitly:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[:,1:3])
B C
2017-01-01 -1.453717 1.052346
2017-01-02 0.276483 -0.327941
2017-01-03 0.686838 0.864911
2017-01-04 0.268747 -0.135483
2017-01-05 0.167914 -1.809039
2017-01-06 -1.915393 0.402191
Get a single value explicitly:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iloc[1,1])
1.3062897937902163
Fast access to a scalar (equivalent to the previous method):
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df.iat[1,1])
-1.353471076642419
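The examples in this section rebuild df with fresh random numbers each time, so the printed scalars differ even where the text says two methods are equivalent. The sketch below runs all four scalar lookups on one frame to make the equivalence visible. The seeded generator and the names df_same, via_loc, via_at, via_iloc, and via_iat are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
rng = np.random.default_rng(0)  # arbitrary seed, only to keep the data fixed
df_same = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

# Second row, second column, reached four different ways.
via_loc = df_same.loc[dates[1], 'B']
via_at = df_same.at[dates[1], 'B']
via_iloc = df_same.iloc[1, 1]
via_iat = df_same.iat[1, 1]

print(via_loc == via_at == via_iloc == via_iat)  # True
```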
Boolean indexing¶
Use the values of a single column to select rows:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[df.A > 0])
A B C D
2017-01-02 0.209709 0.557072 0.576529 0.474762
2017-01-03 1.263861 1.128960 -0.738488 1.382371
2017-01-04 0.511101 -1.109398 -0.289064 -0.371348
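Conditions on several columns can be combined in the same way. This is a small sketch with made-up data (df_cond, both, and either are illustrative names); note that the element-wise operators & and | are used, and each condition needs its own parentheses:

```python
import pandas as pd

df_cond = pd.DataFrame({'A': [1, -1, 2, -2], 'B': [-5, -6, 7, 8]})

# Rows where A is positive AND B is negative.
both = df_cond[(df_cond['A'] > 0) & (df_cond['B'] < 0)]
# Rows where A is positive OR B is negative.
either = df_cond[(df_cond['A'] > 0) | (df_cond['B'] < 0)]

print(list(both.index))     # [0]
print(list(either.index))   # [0, 1, 2]
```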
Select values from a DataFrame where a boolean condition is met; cells that do not satisfy the condition become NaN:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> print(df[df > 0])
A B C D
2017-01-01 NaN 0.029491 NaN NaN
2017-01-02 NaN NaN NaN 0.906864
2017-01-03 NaN 2.048842 0.449479 NaN
2017-01-04 2.213507 0.725957 NaN 0.602421
2017-01-05 0.743683 0.575747 0.368332 NaN
2017-01-06 1.876558 NaN 1.223221 0.477795
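Indexing with a boolean DataFrame keeps the original shape and fills the non-matching cells with NaN, which gives the same result as DataFrame.where. A minimal sketch on a hypothetical two-by-two frame df_mask:

```python
import pandas as pd

df_mask = pd.DataFrame({'A': [1.0, -2.0], 'B': [-3.0, 4.0]})

masked = df_mask[df_mask > 0]                # non-matching cells become NaN
same_as_where = df_mask.where(df_mask > 0)   # explicit spelling of the same operation

print(masked.shape)                      # (2, 2)
print(masked.equals(same_as_where))      # True
print(int(masked.isna().sum().sum()))    # 2
```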
Use the isin() method for filtering:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> dates = pd.date_range('20170101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>>
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one','two','three','four','three']
>>>
>>> print(df2)
>>>
>>> print("============= start to filter =============== ")
>>>
>>> print(df2[df2['E'].isin(['two','four'])])
A B C D E
2017-01-01 -1.551054 0.074099 2.836054 -1.328567 one
2017-01-02 -1.324657 2.074419 -0.793078 0.159044 one
2017-01-03 0.522375 -0.834793 -1.485328 1.420185 two
2017-01-04 0.459851 0.421881 0.907386 0.110532 three
2017-01-05 -0.192005 1.595409 -0.988345 -0.215954 four
2017-01-06 0.563005 0.241732 1.096168 0.665604 three
============= start to filter ===============
A B C D E
2017-01-03 0.522375 -0.834793 -1.485328 1.420185 two
2017-01-05 -0.192005 1.595409 -0.988345 -0.215954 four
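isin() returns a boolean Series, so it can be stored, inspected, and inverted like any other mask. The sketch below reuses the labels from the example above on a hypothetical one-column frame df_isin:

```python
import pandas as pd

df_isin = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

# True where the value is one of the listed labels.
mask = df_isin['E'].isin(['two', 'four'])
print(mask.tolist())                   # [False, False, True, False, True, False]

# ~ inverts the mask, keeping the rows that are NOT in the list.
print(df_isin[~mask]['E'].tolist())    # ['one', 'one', 'three', 'three']
```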