Note

This notebook can be downloaded here: 00_Introduction_Panda.ipynb

Introduction to the pandas module

Code author: Emile Roux emile.roux@univ-smb.fr

RISE Slideshow

Scope

This notebook presents some key features of the pandas module (https://pandas.pydata.org/) for working with tabular data.

Many examples and extensive documentation for this module are available online:

http://pandas.pydata.org/pandas-docs/stable/10min.html

http://www.python-simple.com/python-pandas/panda-intro.php

#Setup
%load_ext autoreload
%matplotlib nbagg
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a DataFrame from a CSV file

More explanations can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

df = pd.read_csv('./_DATA/Note_csv.csv',delimiter=";")
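As a minimal sketch of what the `delimiter=";"` argument does, the example below parses a small semicolon-separated text held in memory with `io.StringIO`, so it runs without the `Note_csv.csv` file. The names `csv_text` and `sample` are illustrative, and the three rows are only a stand-in shaped like the notebook's data.

```python
import io

import pandas as pd

# Illustrative in-memory CSV that uses ';' as the field separator
csv_text = """section;groupe;name;ET;CC
MM;A;ami;14.5;11.75
MM;A;joyce;8.5;11.50
MM;C;lola;9.5;13.25
"""

# delimiter is an alias of sep in pandas.read_csv
sample = pd.read_csv(io.StringIO(csv_text), delimiter=";")

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the first line
print(sample["ET"].mean())   # grade columns are parsed as floats
```

Without the `delimiter` argument, `read_csv` would split on commas and return a single column holding each whole line.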

Display the DataFrame

# return the beginning of the dataframe
df.head()

	section	groupe	name	ET	CC
0	MM	A	ami	14.5	11.75
1	MM	A	joyce	8.5	11.50
2	MM	C	lola	9.5	13.25
3	MM	B	irma	7.5	6.00
4	IAI	D	florence	14.5	13.25

# return the end of the dataframe
df.tail()

	section	groupe	name	ET	CC
90	MM	A	james	13.75	12.75
91	IAI	D	richard	15.25	7.00
92	MM	A	caprice	18.25	15.00
93	IAI	D	al	12.50	9.75
94	MM	B	constance	3.00	7.00
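Besides `head()` and `tail()`, a few other calls give a quick overview of a DataFrame. The sketch below uses a small hypothetical table named `sample` rather than the notebook's file.

```python
import pandas as pd

# Hypothetical four-row table standing in for the notebook's data
sample = pd.DataFrame({
    "name": ["ami", "joyce", "lola", "irma"],
    "ET": [14.5, 8.5, 9.5, 7.5],
    "CC": [11.75, 11.5, 13.25, 6.0],
})

first_two = sample.head(2)   # head/tail accept the number of rows to return
stats = sample.describe()    # count, mean, std, min, quartiles, max of numeric columns

print(first_two)
print(sample.shape)          # (number of rows, number of columns)
print(stats)
```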

Select data in the DataFrame

# get data from index 2
df.loc[2]

section       MM
groupe         C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object

# get name from index 2
df.name[2]

'lola'

# Slicing also works
df.name[2:6]

      lola
      irma
  florence
        vi
Name: name, dtype: object
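With the default integer index, labels and positions coincide, which hides the difference between label-based and position-based selection. The sketch below makes it visible with a hypothetical table whose index labels start at 10.

```python
import pandas as pd

sample = pd.DataFrame(
    {"name": ["ami", "joyce", "lola", "irma"], "ET": [14.5, 8.5, 9.5, 7.5]},
    index=[10, 11, 12, 13],  # hypothetical non-default index labels
)

row_by_label = sample.loc[12]            # row whose index label is 12
row_by_position = sample.iloc[2]         # third row, whatever its label
name_by_label = sample.loc[12, "name"]   # single cell: row label, column name

# .loc slices include the end label; .iloc slices exclude the end position
by_label = sample.loc[11:12, "name"]
by_position = sample.iloc[1:2]["name"]

print(name_by_label)
print(list(by_label), list(by_position))
```

Bracket access such as `sample["name"]` is equivalent to `sample.name` and also works for column names that are not valid Python identifiers.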

Get a column of the DataFrame

df.name

           ami
         joyce
          lola
          irma
      florence
            vi
         brian
    antoinette
          fred
        gaston
       samuel
       arnaud
        annie
    roosevelt
        sarah
        simon
        louis
           an
      jacques
      charles
       sigrid
        lasse
         king
        marco
      patrick
          liv
        diane
         bill
      jessica
       gilles
         ...
      jeannot
      fernand
         lise
       ursula
         dona
    dominique
       platon
        eugen
        pedro
          bob
      marquis
       jérémie
         karl
     lucienne
       timothé
         avis
         mari
         rose
       porter
     philippe
          vin
     jeunesse
     victoire
       joseph
         félix
        james
      richard
      caprice
           al
    constance
Name: name, Length: 95, dtype: object

Get the number of students in each group

df.groupe.value_counts()

B    25
A    24
D    23
C    23
Name: groupe, dtype: int64

Get the proportion of students in each group

df.groupe.value_counts(normalize=True)

B    0.263158
A    0.252632
D    0.242105
C    0.242105
Name: groupe, dtype: float64
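Note that `value_counts()` orders its result by frequency, not by group name, which matters whenever the values are later paired with a hand-written list of labels. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical group assignments: three B, two A, one C
groups = pd.Series(["B", "A", "B", "C", "B", "A"], name="groupe")

counts = groups.value_counts()                  # sorted by count, largest first
by_label = groups.value_counts().sort_index()   # sorted alphabetically by group name
share = groups.value_counts(normalize=True)     # fractions that sum to 1

print(list(counts.index))
print(list(by_label.index))
print(share["B"])
```

Calling `sort_index()` is one way to get a predictable A, B, C order before plotting or comparing.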

Display the proportion of students in each group

Using the pandas plotting functions:

The pandas visualization options are described at: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

fig = plt.figure()
# value_counts() sorts by frequency, so sort by group name to match the labels
df.groupe.value_counts(normalize=True).sort_index().plot.pie(labels=['A', 'B', 'C', 'D'], colors=['r', 'g', 'b', 'y'], autopct='%.1f')
plt.show()


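Using matplotlib's pie function:

# value_counts() sorts by frequency, so sort by group name to match the labels
val = df.groupe.value_counts(normalize=True).sort_index().values
explode = (0.1, 0, 0, 0)  # offset the first slice (group A)
labels = 'A', 'B', 'C', 'D'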
fig1, ax1 = plt.subplots()
ax1.pie(val, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()


Get the list of students in group A

df[df.groupe=="A"]

	section	groupe	name	ET	CC
0	MM	A	ami	14.50	11.75
1	MM	A	joyce	8.50	11.50
23	MM	A	marco	12.50	13.00
27	MM	A	bill	11.00	12.75
28	MM	A	jessica	16.50	12.50
37	MM	A	denis	13.25	16.00
38	MM	A	jenny	12.75	17.50
40	MM	A	christian	12.50	12.50
43	MM	A	rita	13.75	8.50
44	MM	A	orlando	14.00	15.25
48	MM	A	chant	4.50	9.00
50	MM	A	val	15.00	11.25
53	MM	A	ana	15.00	13.50
59	MM	A	clarisse	12.50	13.50
63	MM	A	isabelle	14.00	7.50
65	MM	A	jeannot	14.75	14.00
66	MM	A	fernand	8.00	10.00
75	MM	A	marquis	8.50	13.00
85	MM	A	vin	11.00	13.00
86	MM	A	jeunesse	12.00	10.50
87	MM	A	victoire	11.75	12.00
89	MM	A	félix	13.00	14.50
90	MM	A	james	13.75	12.75
92	MM	A	caprice	18.25	15.00
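The expression `df.groupe=="A"` produces a boolean Series with one value per row, and indexing with it keeps the rows where the value is True. The sketch below, on a hypothetical four-row table, shows how such masks can be combined.

```python
import pandas as pd

sample = pd.DataFrame({
    "groupe": ["A", "A", "C", "B"],
    "name": ["ami", "joyce", "lola", "irma"],
    "ET": [14.5, 8.5, 9.5, 7.5],
})

mask = sample["groupe"] == "A"   # boolean Series, one value per row
in_a = sample[mask]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
a_passing = sample[(sample["groupe"] == "A") & (sample["ET"] >= 10)]

# isin() tests membership in a list of values
a_or_b = sample[sample["groupe"].isin(["A", "B"])]

print(in_a["name"].tolist())
print(a_passing["name"].tolist())
print(a_or_b["name"].tolist())
```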

Compute statistics on the data

df.ET.mean() # mean ET grade over all students

11.043010752688172

df.ET[df.groupe=="A"].mean() # mean ET grade over the students of group A

12.552083333333334

df.groupby(['groupe']).mean(numeric_only=True) # compute the mean of each grade for each group (text columns are skipped)

	ET	CC
groupe
A	12.552083	12.531250
B	9.720000	10.093750
C	10.630435	11.913043
D	11.345238	9.076087

df.groupby(['section']).mean(numeric_only=True) # compute the mean of each grade for each section (text columns are skipped)

	ET	CC
section
IAI	10.804688	9.786765
MM	11.168033	11.550000
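An alternative to relying on pandas to skip the text columns is to select the grade columns explicitly before aggregating. The sketch below does this on a hypothetical four-row table and also shows `agg`, which computes several statistics in one call.

```python
import pandas as pd

sample = pd.DataFrame({
    "groupe": ["A", "A", "B", "B"],
    "name": ["ami", "joyce", "irma", "constance"],
    "ET": [14.5, 8.5, 7.5, 3.0],
    "CC": [11.75, 11.5, 6.0, 7.0],
})

# Select the numeric columns explicitly, then take the mean within each group
means = sample.groupby("groupe")[["ET", "CC"]].mean()

# Several statistics at once for a single column
summary = sample.groupby("groupe")["ET"].agg(["mean", "min", "max", "count"])

print(means)
print(summary)
```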

Display the grades as histograms

# CC grades
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()


# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()


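# ET and CC grades together; pass ax so that pandas draws on this figure instead of opening a second one
fig, ax = plt.subplots()
df.plot.hist(ax=ax, alpha=.5, bins=np.arange(1,20))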
plt.show()
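To see what `bins=np.arange(1,20)` means without drawing anything, the counts behind a histogram can be computed directly with `np.histogram`. This is a sketch on the five ET grades shown at the top of the table; the variable names are illustrative.

```python
import numpy as np

grades = np.array([14.5, 8.5, 9.5, 7.5, 14.5])  # ET grades of the first five rows
edges = np.arange(1, 20)                         # bin edges 1, 2, ..., 19 -> 18 bins

counts, returned_edges = np.histogram(grades, bins=edges)

# Each bin is [edge, next_edge); only the last bin also includes its right edge
for left, n in zip(returned_edges[:-1], counts):
    if n:
        print(f"[{left}, {left + 1}): {n}")
```

With these edges, a grade below 1 or above 19 falls outside every bin and is not counted, so `np.arange(0, 21)` may be a safer choice for grades on a 0 to 20 scale.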


Let's compute the mean of the two grades

We first need to add a new column to the DataFrame

df["FinalNote"] = 0.0 # add a column filled with 0.0

df.head()

	section	groupe	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	0.0
1	MM	A	joyce	8.5	11.50	0.0
2	MM	C	lola	9.5	13.25	0.0
3	MM	B	irma	7.5	6.00	0.0
4	IAI	D	florence	14.5	13.25	0.0

Let's compute the mean

df["FinalNote"] = df[["ET", "CC"]].mean(axis=1)
# average only the two grade columns, so that FinalNote itself is not included in the mean
# axis=1 computes the mean across the columns of each row; axis=0 would compute it down each column

df.head()

	section	groupe	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	13.125
1	MM	A	joyce	8.5	11.50	10.000
2	MM	C	lola	9.5	13.25	11.375
3	MM	B	irma	7.5	6.00	6.750
4	IAI	D	florence	14.5	13.25	13.875
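The effect of the `axis` argument, and of missing grades, can be checked on a small hypothetical table. In this sketch `np.nan` stands for a missing ET grade; by default pandas skips it when averaging.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "name": ["ami", "joyce", "lola"],
    "ET": [14.5, 8.5, np.nan],   # np.nan marks a missing grade (hypothetical)
    "CC": [11.75, 11.5, 13.25],
})

grades = sample[["ET", "CC"]]        # keep only the numeric grade columns
per_student = grades.mean(axis=1)    # one value per row
per_exam = grades.mean(axis=0)       # one value per column

# NaN is skipped by default, so the third mean is just the CC grade;
# skipna=False returns NaN as soon as one grade is missing
strict = grades.mean(axis=1, skipna=False)

print(per_student.tolist())
print(per_exam["ET"])
print(strict.isna().tolist())
```

Whether a student with a missing grade should get the other grade as final mark or no mark at all is a grading decision; `skipna` only makes the choice explicit.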

fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()


What is the overall mean?

df.FinalNote.mean()

10.812762277994366