In [1]: import pandas as pd
Data used for this tutorial:
  • 本教程使用存储为CSV的泰坦尼克号数据集。数据由以下数据列组成:

    • PassengerID:每个乘客的ID。

    • 生还:表示乘客是否生还。 0 是的而且 1 不是的。

    • Pclass:3个票类中的一个:类 1 ,班级 2 和班级 3

    • 姓名:乘客姓名。

    • 性别:乘客性别。

    • 年龄:乘客的年龄,以年为单位。

    • SibSp:船上兄弟姐妹或配偶的数量。

    • Parch:船上父母或孩子的人数。

    • 车票:旅客车票号码。

    • 票价:标明票价。

    • 客舱:客舱人数。

    • 已装船:装船港。

    To raw data

如何读写表格数据?#

../../_images/02_io_readwrite.svg
  • 我想分析泰坦尼克号的乘客数据,这些数据以CSV文件的形式提供。

    In [2]: titanic = pd.read_csv("data/titanic.csv")
    

    Pandas提供了 read_csv() 函数将存储为CSV文件的数据读取到PANAS中 DataFrame 。Pandas支持多种不同的文件格式或开箱即用的数据源(csv、EXCEL、SQL、JSON、PARQUET、…),每个字符都带有前缀 read_*

确保在读取数据后始终对数据进行检查。在显示 DataFrame ,默认情况下将显示第一行和最后5行:

In [3]: titanic
Out[3]: 
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN        S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

[891 rows x 12 columns]
  • 我想看看Pandas数据框的前8行。

    In [4]: titanic.head(8)
    Out[4]: 
       PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
    0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
    1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
    2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
    3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
    4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
    5            6         0       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
    6            7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
    7            8         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S
    

    的前N行 DataFrame ,请使用 head() 方法,并将所需的行数(在本例中为8行)作为参数。

备注

而是对最后N行感兴趣?Pandas还提供了一个 tail() 方法。例如, titanic.tail(10) 将返回DataFrame的最后10行。

可以通过请求Pandas来检查Pandas如何解释每种列数据类型 dtypes 属性:

In [5]: titanic.dtypes
Out[5]: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

对于每一列,都登记了所使用的数据类型。此文件中的数据类型 DataFrame 是整数 (int64 ),漂浮 (float64 )和字符串 (object )。

备注

当请求 dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series ,而方法(需要方括号) do 有一些关于 DataFrame/Series 中介绍的 first tutorial

  • 我的同事要求将泰坦尼克号的数据作为电子表格。

    In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    Input In [6], in <cell line: 1>()
    ----> 1 titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
    
    File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/generic.py:2237, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)
       2224 from pandas.io.formats.excel import ExcelFormatter
       2226 formatter = ExcelFormatter(
       2227     df,
       2228     na_rep=na_rep,
       (...)
       2235     inf_rep=inf_rep,
       2236 )
    -> 2237 formatter.write(
       2238     excel_writer,
       2239     sheet_name=sheet_name,
       2240     startrow=startrow,
       2241     startcol=startcol,
       2242     freeze_panes=freeze_panes,
       2243     engine=engine,
       2244     storage_options=storage_options,
       2245 )
    
    File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/formats/excel.py:896, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)
        892     need_save = False
        893 else:
        894     # error: Cannot instantiate abstract class 'ExcelWriter' with abstract
        895     # attributes 'engine', 'save', 'supported_extensions' and 'write_cells'
    --> 896     writer = ExcelWriter(  # type: ignore[abstract]
        897         writer, engine=engine, storage_options=storage_options
        898     )
        899     need_save = True
        901 try:
    
    File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_openpyxl.py:55, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)
         42 def __init__(
         43     self,
         44     path: FilePath | WriteExcelBuffer | ExcelWriter,
       (...)
         53 ) -> None:
         54     # Use the openpyxl module as the Excel writer.
    ---> 55     from openpyxl.workbook import Workbook
         57     engine_kwargs = combine_kwargs(engine_kwargs, kwargs)
         59     super().__init__(
         60         path,
         61         mode=mode,
       (...)
         64         engine_kwargs=engine_kwargs,
         65     )
    
    ModuleNotFoundError: No module named 'openpyxl'
    

    鉴于 read_* 函数用于向Pandas读取数据, to_* 方法用于存储数据。这个 to_excel() 方法将数据存储为Excel文件。在这里的示例中, sheet_name 被命名为 旅客 而不是默认设置 图纸1 。通过设置 index=False 行索引标签不会保存在电子表格中。

等价的读取函数 read_excel() 将数据重新加载到 DataFrame

In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/util/_decorators.py:317, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    311 if len(args) > num_allow_args:
    312     warnings.warn(
    313         msg.format(arguments=arguments),
    314         FutureWarning,
    315         stacklevel=stacklevel,
    316     )
--> 317 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:458, in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    456 if not isinstance(io, ExcelFile):
    457     should_close = True
--> 458     io = ExcelFile(io, storage_options=storage_options, engine=engine)
    459 elif engine and engine != io.engine:
    460     raise ValueError(
    461         "Engine should not be specified when passing "
    462         "an ExcelFile - ExcelFile already has the engine set"
    463     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1482, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options)
   1480     ext = "xls"
   1481 else:
-> 1482     ext = inspect_excel_format(
   1483         content_or_path=path_or_buffer, storage_options=storage_options
   1484     )
   1485     if ext is None:
   1486         raise ValueError(
   1487             "Excel file format cannot be determined, you must specify "
   1488             "an engine manually."
   1489         )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1355, in inspect_excel_format(content_or_path, storage_options)
   1352 if isinstance(content_or_path, bytes):
   1353     content_or_path = BytesIO(content_or_path)
-> 1355 with get_handle(
   1356     content_or_path, "rb", storage_options=storage_options, is_text=False
   1357 ) as handle:
   1358     stream = handle.handle
   1359     stream.seek(0)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/common.py:795, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    786         handle = open(
    787             handle,
    788             ioargs.mode,
   (...)
    791             newline="",
    792         )
    793     else:
    794         # Binary mode
--> 795         handle = open(handle, ioargs.mode)
    796     handles.append(handle)
    798 # Convert BytesIO or file objects passed with an encoding

FileNotFoundError: [Errno 2] No such file or directory: 'titanic.xlsx'
In [8]: titanic.head()
Out[8]: 
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
  • 我对一份技术摘要很感兴趣 DataFrame

    In [9]: titanic.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  891 non-null    int64  
     1   Survived     891 non-null    int64  
     2   Pclass       891 non-null    int64  
     3   Name         891 non-null    object 
     4   Sex          891 non-null    object 
     5   Age          714 non-null    float64
     6   SibSp        891 non-null    int64  
     7   Parch        891 non-null    int64  
     8   Ticket       891 non-null    object 
     9   Fare         891 non-null    float64
     10  Cabin        204 non-null    object 
     11  Embarked     889 non-null    object 
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB
    

    该方法 info() 提供有关 DataFrame ,让我们更详细地解释一下输出:

    • 它确实是一个 DataFrame

    • 共有891个条目,即约891行。

    • 每行都有一个行标签(也称为 index ),取值范围从0到890。

    • 该表有12列。大多数列的每一行都有一个值(所有891个值都是 non-null )。某些列确实缺少值并且小于891 non-null 价值。

    • 栏目 NameSexCabinEmbarked 由文本数据(字符串,又名 object )。其他列是数字数据,其中一些是整数(又名 integer )和其他是实数(又名 float )。

    • 数据类型(字符、整数、…)在不同的列中,通过列出 dtypes

    • 还提供了用于保存DataFrame的大致RAM大小。

REMEMBER

  • 支持从多种不同的文件格式或数据源将数据导入Pandas read_* 功能。

  • 从Pandas身上导出数据由不同的公司提供 to_* 方法。

  • 这个 head/tail /info``方法和 ``dtypes 属性可以方便地进行第一次检查。

To user guide

有关Pandas的输入和输出可能性的完整概述,请参阅用户指南部分 reader and writer functions