Note

This notebook can be downloaded here: 00_Introduction_Panda.ipynb

Introduction to the pandas module

Code author: Emile Roux emile.roux@univ-smb.fr

RISE slideshow

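Scope

This notebook presents some key features of the pandas module (https://pandas.pydata.org/) for loading, selecting, summarizing and plotting tabular data.

Many examples and a lot of documentation on this module are available online: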

http://pandas.pydata.org/pandas-docs/stable/10min.html

http://www.python-simple.com/python-pandas/panda-intro.php

#Setup
%load_ext autoreload
%matplotlib nbagg
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a dataframe from a CSV file

More explanation can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

df = pd.read_csv('./_DATA/Note_csv.csv',delimiter=";")
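The call above reads a semicolon-separated file. As a minimal sketch of what `pd.read_csv` does with such input, the example below parses a small in-memory sample instead of a file on disk. The sample text and the name `demo` are assumptions made for illustration; only the column layout mirrors the notebook's data.

```python
import io

import pandas as pd

# Hypothetical in-memory sample that mimics a semicolon-separated file.
sample = io.StringIO(
    "section;groupe;name;ET;CC\n"
    "MM;A;ami;14.5;11.75\n"
    "MM;A;joyce;8.5;11.50\n"
    "IAI;D;florence;14.5;13.25\n"
)

# sep=";" is an alias of the delimiter=";" argument.
demo = pd.read_csv(sample, sep=";")

print(demo.shape)   # number of rows and columns
print(demo.dtypes)  # ET and CC are parsed as floating-point columns
```

When accented characters come out garbled after reading a file from disk, passing an `encoding` argument to `pd.read_csv` (for example `encoding="latin-1"`) is a common remedy. Which encoding applies to Note_csv.csv is not stated in this notebook.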

Display the dataframe

# return the beginning of the dataframe
df.head()
   section  groupe      name    ET     CC
0       MM       A       ami  14.5  11.75
1       MM       A     joyce   8.5  11.50
2       MM       C      lola   9.5  13.25
3       MM       B      irma   7.5   6.00
4      IAI       D  florence  14.5  13.25
# return the end of the dataframe
df.tail()
    section  groupe       name     ET     CC
90       MM       A      james  13.75  12.75
91      IAI       D    richard  15.25   7.00
92       MM       A    caprice  18.25  15.00
93      IAI       D         al  12.50   9.75
94       MM       B  constance   3.00   7.00

Select data in the dataframe

# get the row with index label 2
df.loc[2]
section       MM
groupe         C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object
# get name from index 2
df.name[2]
'lola'
# Slicing also works
df.name[2:6]
2        lola
3        irma
4    florence
5          vi
Name: name, dtype: object
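Slices behave differently depending on the indexer. The sketch below contrasts label-based slicing with `.loc`, which includes the end label, and position-based slicing with `.iloc`, which excludes the end position. The small frame `demo` is a stand-in built from the first names shown above, not the notebook's full dataframe.

```python
import pandas as pd

# Stand-in frame with a default integer index (0, 1, 2, ...).
demo = pd.DataFrame(
    {"name": ["ami", "joyce", "lola", "irma", "florence", "vi"]}
)

# .loc slices by label and includes both ends: labels 2, 3 and 4.
by_label = demo.loc[2:4, "name"]

# .iloc slices by position and excludes the end: positions 2 and 3.
by_position = demo.iloc[2:4]["name"]

print(list(by_label))
print(list(by_position))
```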

Get a column of the dataframe

df.name
0             ami
1           joyce
2            lola
3            irma
4        florence
5              vi
6           brian
7      antoinette
8            fred
9          gaston
10         samuel
11         arnaud
12          annie
13      roosevelt
14          sarah
15          simon
16          louis
17             an
18        jacques
19        charles
20         sigrid
21          lasse
22           king
23          marco
24        patrick
25            liv
26          diane
27           bill
28        jessica
29         gilles
         ...
65        jeannot
66        fernand
67           lise
68         ursula
69           dona
70      dominique
71         platon
72          eugen
73          pedro
74            bob
75        marquis
76        jérémie
77           karl
78       lucienne
79       timothée
80           avis
81           mari
82           rose
83         porter
84       philippe
85            vin
86       jeunesse
87       victoire
88         joseph
89          félix
90          james
91        richard
92        caprice
93             al
94      constance
Name: name, Length: 95, dtype: object

Get the number of students in each group

df.groupe.value_counts()
B    25
A    24
D    23
C    23
Name: groupe, dtype: int64

Get the proportion of students in each group

df.groupe.value_counts(normalize=True)
B    0.263158
A    0.252632
D    0.242105
C    0.242105
Name: groupe, dtype: float64
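Note that `value_counts` orders its result by decreasing frequency, not by label, which is why B appears first above. A minimal sketch on a made-up series (the values in `groups` are hypothetical) shows the default order and how `sort_index` restores alphabetical order:

```python
import pandas as pd

# Hypothetical group labels, used only to illustrate the ordering.
groups = pd.Series(["B", "A", "B", "C", "B", "A"], name="groupe")

counts = groups.value_counts()                 # most frequent label first
alphabetical = counts.sort_index()             # ordered by label instead
shares = groups.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(alphabetical)
print(shares)
```

Sorting by index is useful whenever the counts are later paired with a hand-written list of labels.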

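Display the proportion of students in each group

Using the pandas plotting functions:

The pandas visualization options are described at: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

fig = plt.figure()
# value_counts() sorts by frequency (B, A, D, C); sort_index() restores A, B, C, D
# and the wedge labels are then taken from the index, so they match the values
df.groupe.value_counts(normalize=True).sort_index().plot.pie(colors=['r', 'g', 'b', 'y'], autopct='%.1f')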
plt.show()

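Using the matplotlib pie function:

# value_counts() sorts by frequency, so sort by group name to keep labels and values aligned
shares = df.groupe.value_counts(normalize=True).sort_index()
val = shares.values
labels = shares.index
explode = (0.1, 0, 0, 0)  # pull the first wedge (group A) out of the pie
fig1, ax1 = plt.subplots()
ax1.pie(val, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()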

Get the list of students in group A

df[df.groupe=="A"]
    section  groupe       name     ET     CC
0        MM       A        ami  14.50  11.75
1        MM       A      joyce   8.50  11.50
23       MM       A      marco  12.50  13.00
27       MM       A       bill  11.00  12.75
28       MM       A    jessica  16.50  12.50
37       MM       A      denis  13.25  16.00
38       MM       A      jenny  12.75  17.50
40       MM       A  christian  12.50  12.50
43       MM       A       rita  13.75   8.50
44       MM       A    orlando  14.00  15.25
48       MM       A      chant   4.50   9.00
50       MM       A        val  15.00  11.25
53       MM       A        ana  15.00  13.50
59       MM       A   clarisse  12.50  13.50
63       MM       A   isabelle  14.00   7.50
65       MM       A    jeannot  14.75  14.00
66       MM       A    fernand   8.00  10.00
75       MM       A    marquis   8.50  13.00
85       MM       A        vin  11.00  13.00
86       MM       A   jeunesse  12.00  10.50
87       MM       A   victoire  11.75  12.00
89       MM       A      félix  13.00  14.50
90       MM       A      james  13.75  12.75
92       MM       A    caprice  18.25  15.00
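The expression `df.groupe=="A"` builds a boolean mask, and indexing with that mask keeps the rows where it is True. The sketch below extends the idea to combined conditions. It uses the first five rows shown earlier as a stand-in frame; the pass threshold of 10 is a hypothetical value chosen for illustration.

```python
import pandas as pd

# Stand-in frame built from the first five rows of the notebook's data.
demo = pd.DataFrame(
    {
        "groupe": ["A", "A", "C", "B", "D"],
        "name": ["ami", "joyce", "lola", "irma", "florence"],
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5],
    }
)

# Single condition: rows of group A.
group_a = demo[demo["groupe"] == "A"]

# Two conditions joined with & (each one needs its own parentheses).
# The threshold 10 is hypothetical.
a_passed = demo[(demo["groupe"] == "A") & (demo["ET"] >= 10)]

# Membership test: rows whose group is A or B.
a_or_b = demo[demo["groupe"].isin(["A", "B"])]

print(a_passed)
print(a_or_b)
```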

Perform calculations on the data

df.ET.mean() # mean ET mark over all students
11.043010752688172
df.ET[df.groupe=="A"].mean() # mean ET mark over the students of group A
12.552083333333334
df.groupby('groupe').mean(numeric_only=True) # compute the mean of each mark for each group
               ET         CC
groupe
A       12.552083  12.531250
B        9.720000  10.093750
C       10.630435  11.913043
D       11.345238   9.076087
df.groupby('section').mean(numeric_only=True) # compute the mean of each mark for each section
                ET         CC
section
IAI      10.804688   9.786765
MM       11.168033  11.550000
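Selecting the numeric columns explicitly before aggregating avoids any dependence on how non-numeric columns such as `name` are handled. The sketch below shows this on the first five rows of the data, together with named aggregation through `agg`. The output column names `et_mean` and `n_students` are hypothetical.

```python
import pandas as pd

# Stand-in frame built from the first five rows of the notebook's data.
demo = pd.DataFrame(
    {
        "section": ["MM", "MM", "MM", "MM", "IAI"],
        "name": ["ami", "joyce", "lola", "irma", "florence"],
        "ET": [14.5, 8.5, 9.5, 7.5, 14.5],
        "CC": [11.75, 11.50, 13.25, 6.00, 13.25],
    }
)

# Mean of each mark per section, restricted to the two mark columns.
means = demo.groupby("section")[["ET", "CC"]].mean()

# Named aggregation: one output column per (input column, function) pair.
summary = demo.groupby("section").agg(
    et_mean=("ET", "mean"),
    n_students=("name", "count"),
)

print(means)
print(summary)
```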

Display the marks as histograms

# CC marks
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()
# ET marks
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()
fig = plt.figure()
df.plot.hist(alpha=.5, bins=np.arange(1,20))
plt.show()
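The argument `bins=np.arange(1,20)` passes the bin edges 1, 2, ..., 19, which gives 18 bins of width 1. As far as I know, matplotlib's histogram bins values the same way `np.histogram` does, so the counts can be inspected without drawing anything. The sketch below uses the ET marks of the first five students as sample data.

```python
import numpy as np

# ET marks of the first five students shown earlier in the notebook.
marks = np.array([14.5, 8.5, 9.5, 7.5, 14.5])

edges = np.arange(1, 20)                # 19 edges: 1, 2, ..., 19
counts, _ = np.histogram(marks, bins=edges)

print(len(counts))   # number of bins
print(counts)        # how many marks fall into each bin
```

Values outside the outermost edges are ignored by `np.histogram`, so a mark above 19 would not appear in any bin with these edges.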

Let's compute the mean of the two marks

First we need to add a new column to the dataframe

df["FinalNote"] = 0.0 # add a column filled with 0.0
df.head()
   section  groupe      name    ET     CC  FinalNote
0       MM       A       ami  14.5  11.75        0.0
1       MM       A     joyce   8.5  11.50        0.0
2       MM       C      lola   9.5  13.25        0.0
3       MM       B      irma   7.5   6.00        0.0
4      IAI       D  florence  14.5  13.25        0.0

Let's compute the mean

# Average only the two marks; df.mean(axis=1) would also include the FinalNote placeholder (0.0)
df["FinalNote"] = df[["ET", "CC"]].mean(axis=1)
# axis=1 computes the mean across the columns, giving one value per row
df.head()
   section  groupe      name    ET     CC  FinalNote
0       MM       A       ami  14.5  11.75     13.125
1       MM       A     joyce   8.5  11.50     10.000
2       MM       C      lola   9.5  13.25     11.375
3       MM       B      irma   7.5   6.00      6.750
4      IAI       D  florence  14.5  13.25     13.875
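Which columns enter a row-wise mean matters: a placeholder column of zeros that is averaged together with the marks lowers every result. The sketch below compares both computations on the first two rows of the data. The names `demo`, `diluted` and `final` are chosen for this illustration only.

```python
import pandas as pd

# First two rows of the notebook's data, plus the zero-filled placeholder.
demo = pd.DataFrame(
    {"name": ["ami", "joyce"], "ET": [14.5, 8.5], "CC": [11.75, 11.50]}
)
demo["FinalNote"] = 0.0

# Averaging all three numeric columns divides by 3 and includes the zeros.
diluted = demo[["ET", "CC", "FinalNote"]].mean(axis=1)

# Averaging only the two marks gives the intended result.
final = demo[["ET", "CC"]].mean(axis=1)
demo["FinalNote"] = final

print(diluted.tolist())
print(final.tolist())
```

For the first row, the diluted value is (14.5 + 11.75 + 0.0) / 3 = 8.75, whereas the mean of the two marks is (14.5 + 11.75) / 2 = 13.125.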
fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()

What is the overall mean?

df.FinalNote.mean()
10.812762277994366