In [1]: import pandas as pd
- Titanic data
本教程使用存储为CSV的泰坦尼克号数据集。数据由以下数据列组成:
PassengerID:每个乘客的ID。
生还:表示乘客是否生还。
0
是的而且1
不是的。Pclass:3个票类中的一个:类
1
,班级2
和班级3
。姓名:乘客姓名。
性别:乘客性别。
年龄:乘客的年龄,以年为单位。
SibSp:船上兄弟姐妹或配偶的数量。
Parch:船上父母或孩子的人数。
车票:旅客车票号码。
票价:标明票价。
客舱:客舱人数。
已装船:装船港。
如何读写表格数据?#
我想分析泰坦尼克号的乘客数据,这些数据以CSV文件的形式提供。
In [2]: titanic = pd.read_csv("data/titanic.csv")
Pandas提供了
read_csv()
函数将存储为CSV文件的数据读取到PANAS中DataFrame
。Pandas支持多种不同的文件格式或开箱即用的数据源(csv、EXCEL、SQL、JSON、PARQUET、…),每个字符都带有前缀read_*
。
确保在读取数据后始终对数据进行检查。在显示 DataFrame
,默认情况下将显示第一行和最后5行:
In [3]: titanic
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
.. ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
[891 rows x 12 columns]
我想看看Pandas数据框的前8行。
In [4]: titanic.head(8) Out[4]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
的前N行
DataFrame
,请使用head()
方法,并将所需的行数(在本例中为8行)作为参数。
备注
而是对最后N行感兴趣?Pandas还提供了一个 tail()
方法。例如, titanic.tail(10)
将返回DataFrame的最后10行。
可以通过请求Pandas来检查Pandas如何解释每种列数据类型 dtypes
属性:
In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
对于每一列,都登记了所使用的数据类型。此文件中的数据类型 DataFrame
是整数 (int64
),漂浮 (float64
)和字符串 (object
)。
备注
当请求 dtypes
, no brackets are used! dtypes
is an attribute of a DataFrame
and Series
. Attributes of DataFrame
or Series
do not need brackets. Attributes represent a characteristic of a DataFrame
/Series
,而方法(需要方括号) do 有一些关于 DataFrame
/Series
中介绍的 first tutorial 。
我的同事要求将泰坦尼克号的数据作为电子表格。
In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False) --------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Input In [6], in <cell line: 1>() ----> 1 titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False) File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/generic.py:2237, in NDFrame.to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options) 2224 from pandas.io.formats.excel import ExcelFormatter 2226 formatter = ExcelFormatter( 2227 df, 2228 na_rep=na_rep, (...) 2235 inf_rep=inf_rep, 2236 ) -> 2237 formatter.write( 2238 excel_writer, 2239 sheet_name=sheet_name, 2240 startrow=startrow, 2241 startcol=startcol, 2242 freeze_panes=freeze_panes, 2243 engine=engine, 2244 storage_options=storage_options, 2245 ) File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/formats/excel.py:896, in ExcelFormatter.write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options) 892 need_save = False 893 else: 894 # error: Cannot instantiate abstract class 'ExcelWriter' with abstract 895 # attributes 'engine', 'save', 'supported_extensions' and 'write_cells' --> 896 writer = ExcelWriter( # type: ignore[abstract] 897 writer, engine=engine, storage_options=storage_options 898 ) 899 need_save = True 901 try: File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_openpyxl.py:55, in OpenpyxlWriter.__init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs) 42 def __init__( 43 self, 44 path: FilePath | WriteExcelBuffer | ExcelWriter, (...) 53 ) -> None: 54 # Use the openpyxl module as the Excel writer. ---> 55 from openpyxl.workbook import Workbook 57 engine_kwargs = combine_kwargs(engine_kwargs, kwargs) 59 super().__init__( 60 path, 61 mode=mode, (...) 64 engine_kwargs=engine_kwargs, 65 ) ModuleNotFoundError: No module named 'openpyxl'
鉴于
read_*
函数用于向Pandas读取数据,to_*
方法用于存储数据。这个to_excel()
方法将数据存储为Excel文件。在这里的示例中,sheet_name
被命名为 旅客 而不是默认设置 图纸1 。通过设置index=False
行索引标签不会保存在电子表格中。
等价的读取函数 read_excel()
将数据重新加载到 DataFrame
:
In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/util/_decorators.py:317, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
311 if len(args) > num_allow_args:
312 warnings.warn(
313 msg.format(arguments=arguments),
314 FutureWarning,
315 stacklevel=stacklevel,
316 )
--> 317 return func(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:458, in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
456 if not isinstance(io, ExcelFile):
457 should_close = True
--> 458 io = ExcelFile(io, storage_options=storage_options, engine=engine)
459 elif engine and engine != io.engine:
460 raise ValueError(
461 "Engine should not be specified when passing "
462 "an ExcelFile - ExcelFile already has the engine set"
463 )
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1482, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options)
1480 ext = "xls"
1481 else:
-> 1482 ext = inspect_excel_format(
1483 content_or_path=path_or_buffer, storage_options=storage_options
1484 )
1485 if ext is None:
1486 raise ValueError(
1487 "Excel file format cannot be determined, you must specify "
1488 "an engine manually."
1489 )
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/excel/_base.py:1355, in inspect_excel_format(content_or_path, storage_options)
1352 if isinstance(content_or_path, bytes):
1353 content_or_path = BytesIO(content_or_path)
-> 1355 with get_handle(
1356 content_or_path, "rb", storage_options=storage_options, is_text=False
1357 ) as handle:
1358 stream = handle.handle
1359 stream.seek(0)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/common.py:795, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
786 handle = open(
787 handle,
788 ioargs.mode,
(...)
791 newline="",
792 )
793 else:
794 # Binary mode
--> 795 handle = open(handle, ioargs.mode)
796 handles.append(handle)
798 # Convert BytesIO or file objects passed with an encoding
FileNotFoundError: [Errno 2] No such file or directory: 'titanic.xlsx'
In [8]: titanic.head()
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
我对一份技术摘要很感兴趣
DataFrame
In [9]: titanic.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
该方法
info()
提供有关DataFrame
,让我们更详细地解释一下输出:它确实是一个
DataFrame
。共有891个条目,即约891行。
每行都有一个行标签(也称为
index
),取值范围从0到890。该表有12列。大多数列的每一行都有一个值(所有891个值都是
non-null
)。某些列确实缺少值并且小于891non-null
价值。栏目
Name
,Sex
,Cabin
和Embarked
由文本数据(字符串,又名object
)。其他列是数字数据,其中一些是整数(又名integer
)和其他是实数(又名float
)。数据类型(字符、整数、…)在不同的列中,通过列出
dtypes
。还提供了用于保存DataFrame的大致RAM大小。
REMEMBER
支持从多种不同的文件格式或数据源将数据导入Pandas
read_*
功能。从Pandas身上导出数据由不同的公司提供
to_*
方法。这个
head
/tail
/info``方法和 ``dtypes
属性可以方便地进行第一次检查。
有关Pandas的输入和输出可能性的完整概述,请参阅用户指南部分 reader and writer functions 。