In [1]: import pandas as pd

Data used for this tutorial:

Titanic data
本教程使用存储为CSV的泰坦尼克号数据集。数据由以下数据列组成：
- PassengerID：每个乘客的ID。
- 生还：表示乘客是否生还。 0 是的而且 1 不是的。
- Pclass：3个票类中的一个：类 1 ，班级 2 和班级 3 。
- 姓名：乘客姓名。
- 性别：乘客性别。
- 年龄：乘客的年龄，以年为单位。
- SibSp：船上兄弟姐妹或配偶的数量。
- Parch：船上父母或孩子的人数。
- 车票：旅客车票号码。
- 票价：标明票价。
- 客舱：客舱人数。
- 已装船：装船港。
To raw data

如何读写表格数据？#

我想分析泰坦尼克号的乘客数据，这些数据以CSV文件的形式提供。
```
In [2]: titanic = pd.read_csv("data/titanic.csv")
```
Pandas提供了 read_csv() 函数将存储为CSV文件的数据读取到PANAS中 DataFrame 。Pandas支持多种不同的文件格式或开箱即用的数据源(csv、EXCEL、SQL、JSON、PARQUET、…)，每个字符都带有前缀 read_* 。

确保在读取数据后始终对数据进行检查。在显示 DataFrame ，默认情况下将显示第一行和最后5行：

In [3]: titanic
Out[3]: 
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN        S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

[891 rows x 12 columns]

我想看看Pandas数据框的前8行。

In [4]: titanic.head(8)
Out[4]: 
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
          1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
          2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
          3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
          4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
          5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
          6         0       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN        Q
          7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463  51.8625   E46        S
          8         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909  21.0750   NaN        S

的前N行 DataFrame ，请使用 head() 方法，并将所需的行数(在本例中为8行)作为参数。

备注

而是对最后N行感兴趣？Pandas还提供了一个 tail() 方法。例如, titanic.tail(10) 将返回DataFrame的最后10行。

可以通过请求Pandas来检查Pandas如何解释每种列数据类型 dtypes 属性：

In [5]: titanic.dtypes
Out[5]: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

对于每一列，都登记了所使用的数据类型。此文件中的数据类型 DataFrame 是整数 (int64 )，漂浮 (float64 )和字符串 (object )。

备注

当请求 dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series ，而方法(需要方括号) do 有一些关于 DataFrame/Series 中介绍的 first tutorial 。

我的同事要求将泰坦尼克号的数据作为电子表格。

In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/generic.py:2237, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options)
   2224 from pandas.io.formats.excel import ExcelFormatter
   2226 formatter = ExcelFormatter(
   2227     df,
   2228     na_rep=na_rep,
   (...)
   2235     inf_rep=inf_rep,
   2236 )
-> 2237 formatter.write(
   2238     excel_writer,
   2239     sheet_name=sheet_name,
   2240     startrow=startrow,
   2241     startcol=startcol,
   2242     freeze_panes=freeze_panes,
   2243     engine=engine,
   2244     storage_options=storage_options,
   2245 )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/formats/excel.py:896, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options)
    892     need_save = False
    893 else:
    894     # error: Cannot instantiate abstract class 'ExcelWriter' with abstract
    895     # attributes 'engine', 'save', 'supported_extensions' and 'write_cells'
--> 896     writer = ExcelWriter(  # type: ignore[abstract]
    897         writer, engine=engine, storage_options=storage_options
    898     )
    899     need_save = True
    901 try:

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_openpyxl.py:55, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs)
     42 def __init__(
     43     self,
     44     path: FilePath | WriteExcelBuffer | ExcelWriter,
   (...)
     53 ) -> None:
     54     # Use the openpyxl module as the Excel writer.
---> 55     from openpyxl.workbook import Workbook
     57     engine_kwargs = combine_kwargs(engine_kwargs, kwargs)
     59     super().__init__(
     60         path,
     61         mode=mode,
   (...)
     64         engine_kwargs=engine_kwargs,
     65     )

ModuleNotFoundError: No module named 'openpyxl'

鉴于 read_* 函数用于向Pandas读取数据， to_* 方法用于存储数据。这个 to_excel() 方法将数据存储为Excel文件。在这里的示例中， sheet_name 被命名为旅客而不是默认设置 图纸1 。通过设置 index=False 行索引标签不会保存在电子表格中。

等价的读取函数 read_excel() 将数据重新加载到 DataFrame ：

In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/util/_decorators.py:317, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    311 if len(args) > num_allow_args:
    312     warnings.warn(
    313         msg.format(arguments=arguments),
    314         FutureWarning,
    315         stacklevel=stacklevel,
    316     )
--> 317 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:458, in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    456 if not isinstance(io, ExcelFile):
    457     should_close = True
--> 458     io = ExcelFile(io, storage_options=storage_options, engine=engine)
    459 elif engine and engine != io.engine:
    460     raise ValueError(
    461         "Engine should not be specified when passing "
    462         "an ExcelFile - ExcelFile already has the engine set"
    463     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1482, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options)
   1480     ext = "xls"
   1481 else:
-> 1482     ext = inspect_excel_format(
   1483         content_or_path=path_or_buffer, storage_options=storage_options
   1484     )
   1485     if ext is None:
   1486         raise ValueError(
   1487             "Excel file format cannot be determined, you must specify "
   1488             "an engine manually."
   1489         )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1355, in inspect_excel_format(content_or_path, storage_options)
   1352 if isinstance(content_or_path, bytes):
   1353     content_or_path = BytesIO(content_or_path)
-> 1355 with get_handle(
   1356     content_or_path, "rb", storage_options=storage_options, is_text=False
   1357 ) as handle:
   1358     stream = handle.handle
   1359     stream.seek(0)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/common.py:795, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    786         handle = open(
    787             handle,
    788             ioargs.mode,
   (...)
    791             newline="",
    792         )
    793     else:
    794         # Binary mode
--> 795         handle = open(handle, ioargs.mode)
    796     handles.append(handle)
    798 # Convert BytesIO or file objects passed with an encoding

FileNotFoundError: [Errno 2] No such file or directory: 'titanic.xlsx'

In [8]: titanic.head()
Out[8]: 
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

我对一份技术摘要很感兴趣 DataFrame

In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

该方法 info() 提供有关 DataFrame ，让我们更详细地解释一下输出：

它确实是一个 DataFrame 。
共有891个条目，即约891行。
每行都有一个行标签(也称为 index )，取值范围从0到890。
该表有12列。大多数列的每一行都有一个值(所有891个值都是 non-null )。某些列确实缺少值并且小于891 non-null 价值。
栏目 Name ， Sex ， Cabin 和 Embarked 由文本数据(字符串，又名 object )。其他列是数字数据，其中一些是整数(又名 integer )和其他是实数(又名 float )。
数据类型(字符、整数、…)在不同的列中，通过列出 dtypes 。
还提供了用于保存DataFrame的大致RAM大小。

REMEMBER

支持从多种不同的文件格式或数据源将数据导入Pandas read_* 功能。
从Pandas身上导出数据由不同的公司提供 to_* 方法。
这个 head/tail /info``方法和 ``dtypes 属性可以方便地进行第一次检查。

To user guide

有关Pandas的输入和输出可能性的完整概述，请参阅用户指南部分 reader and writer functions 。

Pandas处理的是什么样的数据？

如何选择 DataFrame ？