IO工具(文本、CSV、HDF5等)#

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv(). Below is a table containing available readers and writers.

Format type | Data description      | Reader         | Writer
------------|------------------------|----------------|-----------------
text        | CSV                    | read_csv       | to_csv
text        | Fixed-width text file  | read_fwf       |
text        | JSON                   | read_json      | to_json
text        | HTML                   | read_html      | to_html
text        | LaTeX                  |                | Styler.to_latex
text        | XML                    | read_xml       | to_xml
text        | Local clipboard        | read_clipboard | to_clipboard
binary      | MS Excel               | read_excel     | to_excel
binary      | OpenDocument           | read_excel     |
binary      | HDF5 Format            | read_hdf       | to_hdf
binary      | Feather Format         | read_feather   | to_feather
binary      | Parquet Format         | read_parquet   | to_parquet
binary      | ORC Format             | read_orc       |
binary      | Stata                  | read_stata     | to_stata
binary      | SAS                    | read_sas       |
binary      | SPSS                   | read_spss      |
binary      | Python Pickle Format   | read_pickle    | to_pickle
SQL         | SQL                    | read_sql       | to_sql
SQL         | Google BigQuery        | read_gbq       | to_gbq

Here is an informal performance comparison for some of these IO methods.

备注

When using the StringIO class, make sure you import it with from io import StringIO for Python 3.

CSV & text files#

The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced strategies.

解析选项#

read_csv() 接受以下常见参数:

基本信息#

filepath_or_buffer : various

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).

sep : str, defaults to ',' for read_csv(), '\t' for read_table()

Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.
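As a minimal sketch (the sample data is ours, not from the original text), a multi-character regex separator can be combined with the Python parsing engine:

import pandas as pd
from io import StringIO

# Fields separated by ';' with a variable amount of surrounding whitespace.
data = "a ;b;  c\n1 ; 2;3\n4;5 ; 6"

# A regex separator is only supported by the Python parsing engine.
df = pd.read_csv(StringIO(data), sep=r"\s*;\s*", engine="python")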

delimiter : str, default None

Alternative argument name for sep.

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the delimiter. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

列和索引位置和名称#

header : int or list of ints, default 'infer'

用作列名和数据开始的行号。默认行为是推断列名:如果没有传递任何名称,则行为与 header=0 和列名是从文件的第一行推断出来的,如果列名是显式传递的,则行为与 header=None 。显式传递 header=0 能够替换现有名称。

标题可以是为列上的多索引指定行位置的整数的列表,例如 [0,1,3] 。未指定的中间行将被跳过(例如,本例中的2被跳过)。请注意,如果出现以下情况,此参数将忽略注释行和空行 skip_blank_lines=True ,因此Header=0表示数据的第一行,而不是文件的第一行。

names : array-like, default None

要使用的列名列表。如果文件不包含标题行,则应显式传递 header=None 。不允许此列表中存在重复项。

index_col : int, str, sequence of int/str, or False, optional, default None

要用作 DataFrame 以字符串名或列索引的形式给出。如果给定一个int/str序列,则使用多重索引。

注: index_col=False 可以用来迫使大Pandas not 使用第一列作为索引,例如,当您有一个格式错误的文件时,每行末尾都有分隔符。

的默认值 None 让Pandas猜一猜。如果列标题行中的字段数等于数据文件正文中的字段数,则使用默认索引。如果它更大,则将第一列用作索引,以便正文中的剩余字段数等于标题中的字段数。

标题后的第一行用于确定将进入索引的列数。如果后续行包含的列少于第一行,则使用 NaN

这可以通过以下方式避免 usecols 。这可确保按原样处理列,并忽略尾随数据。

usecols : list-like or callable, default None

返回列的子集。如果是类似列表的元素,则所有元素必须是按位置排列的(即文档列的整数索引),或者是与用户在 names 或从文档标题行中推断。如果 names ,则不考虑文档标题行。例如,一个有效的类似列表的 usecols 参数将为 [0, 1, 2]['foo', 'bar', 'baz']

元素顺序被忽略,因此 usecols=[0, 1] 是否与 [1, 0] 。从以下位置实例化DataFrame data 保留元素顺序的使用 pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] 对于中的列 ['foo', 'bar'] 订购或 pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]['bar', 'foo'] 秩序。

如果可调用,则将根据列名计算可调用函数,返回可调用函数计算结果为True的名称:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [4]: pd.read_csv(StringIO(data))
Out[4]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]: 
  col1  col3
0    a     1
1    a     2
2    c     3

在使用c引擎时,使用此参数可以大大缩短解析时间并降低内存使用量。在决定删除哪些列之前,Python引擎首先加载数据。

squeeze : boolean, default False

如果解析的数据只包含一列,则返回一个 Series

Deprecated since version 1.4.0: Append .squeeze("columns") to the call to the reader function to squeeze the data.

prefix : str, default None

没有标题时添加到列号的前缀,例如‘X’表示X0、X1、...

Deprecated since version 1.4.0: Use a list comprehension on the DataFrame's columns after calling read_csv.

In [6]: data = "col1,col2,col3\na,b,1"

In [7]: df = pd.read_csv(StringIO(data))

In [8]: df.columns = [f"pre_{col}" for col in df.columns]

In [9]: df
Out[9]: 
  pre_col1 pre_col2  pre_col3
0        a        b         1
mangle_dupe_cols : boolean, default True

重复的列将被指定为‘X’、‘X.1’...‘X.N’,而不是‘X’...‘X’。正在传递 False 如果列中存在重复名称,将导致数据被覆盖。

通用解析配置#

dtype : type name or dict of column -> type, default None

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with suitable na_values settings to preserve and not interpret dtype.

New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.

engine : {'c', 'python', 'pyarrow'}

Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.

New in version 1.4.0: The 'pyarrow' engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.

转换器:词典,默认 None词典,默认

用于转换某些列中的值的函数的字典。键可以是整数,也可以是列标签。

TRUE_VALUES:列表,默认 None列表,默认

要考虑为 True

FALSE_VALUES:列表,默认 None列表,默认

要考虑为 False

SkipInitialspace:布尔型,默认 False布尔值,默认

跳过分隔符后面的空格。

skiprows : list-like or integer, default None

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

如果可调用,则将根据行索引计算可调用函数,如果应跳过该行,则返回True,否则返回False:

In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [11]: pd.read_csv(StringIO(data))
Out[11]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[12]: 
  col1 col2  col3
0    a    b     2
skipfooter : int, default 0

文件底部要跳过的行数(引擎=‘c’不支持)。

nrows : int, default None

要读取的文件行数。对于读取大文件片段非常有用。

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)

memory_map : boolean, default False

如果为以下对象提供文件路径 filepath_or_buffer ,将文件对象直接映射到内存,并从那里直接访问数据。使用此选项可以提高性能,因为不再有任何I/O开销。

NA和缺失数据处理#

NA_VALUES:标量、字符串、类似列表或DICT,默认 None标量、字符串、类似列表或DICT,默认

要识别为NA/NaN的其他字符串。如果DCT通过,则为特定的每列NA值。看见 na values const 以获取默认情况下解释为NaN的值的列表。

Keep_Default_NA:布尔值,默认 True布尔值,默认

在解析数据时是否包括默认的NaN值。取决于是否 na_values 传入时,行为如下:

  • 如果 keep_default_naTrue ,以及 na_values 是指定的, na_values 被附加到用于解析的默认NaN值。

  • 如果 keep_default_naTrue ,以及 na_values 未指定,则只使用默认的NaN值进行解析。

  • 如果 keep_default_naFalse ,以及 na_values ,则仅指定NaN值。 na_values 用于解析。

  • 如果 keep_default_naFalse ,以及 na_values 未指定,则不会将任何字符串解析为NaN。

请注意,如果 na_filter 被作为 False ,即 keep_default_nana_values 参数将被忽略。

NA_FILTER:布尔值,默认 True布尔值,默认

检测缺少的值标记(空字符串和NA_VALUES的值)。在没有任何NAS的数据中,传递 na_filter=False 可以提高读取大文件的性能。

Verbose:布尔值,默认 False布尔值,默认

表示放置在非数字列中的NA值的数量。

skip_blank_lines : boolean, default True

如果 True 跳过空行,而不是解释为NaN值。

日期时间处理#

parse_dates : boolean or list of ints or names or list of lists or dict, default False
  • If True -> try parsing the index.

  • If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

  • If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

  • If {'foo': [1, 3]} -> parse columns 1, 3 as date and call the result 'foo'. A fast-path exists for iso8601-formatted dates.

INFER_DATETIME_FORMAT:布尔值,默认 False布尔值,默认

如果 True 并且为列启用了parse_date,则尝试推断日期时间格式以加快处理速度。

KEEP_DATE_COL:布尔值,默认 False布尔值,默认

如果 True Parse_date指定合并多个列,然后保留原始列。

Date_parser:函数,默认 None函数,默认

用于将字符串列序列转换为DateTime实例数组的函数。默认情况下使用 dateutil.parser.parser 来进行转换。Pandas将尝试以三种不同的方式调用date_parser,如果发生异常,则前进到下一种方式:1)传递一个或多个数组(由parse_date定义)作为参数;2)将parse_date定义的列中的字符串值连接到单个数组中并传递;3)使用一个或多个字符串(对应于parse_date定义的列)作为参数,为每一行调用date_parser一次。

dayfirst : boolean, default False

DD/MM格式日期、国际格式和欧洲格式。

cache_dates布尔值,默认为True

如果为True,则使用唯一的转换日期缓存来应用DateTime转换。在分析重复的日期字符串时,尤其是具有时区偏移量的字符串时,可能会显著提高速度。

0.25.0 新版功能.

迭代法#

迭代器:布尔值,默认 False布尔值,默认

返回 TextFileReader 对象进行迭代或获取块 get_chunk()

块大小:int,默认 None整型,默认

返回 TextFileReader 迭代的对象。看见 iterating and chunking 下面。

报价、压缩和文件格式#

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None, dict}, default 'infer'

For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, zip, xz, or zstandard if filepath_or_buffer is path-like ending in '.gz', '.bz2', '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, or zstandard.ZstdDecompressor. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

在 1.1.0 版更改: 扩展DICT选项以支持 gzipbz2

在 1.2.0 版更改: 以前的版本将‘gzip’的词典条目转发到 gzip.open
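The following is a small sketch (the file name and frame are ours, not from the original text) showing the compression dict being passed through on write and inferred on read:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Extra keys in the dict are forwarded to gzip.GzipFile; this mirrors the
# reproducible-archive example above.
df.to_csv("data.csv.gz", compression={"method": "gzip", "compresslevel": 1, "mtime": 1})

# With the default compression='infer', the '.gz' suffix selects gzip on read.
roundtrip = pd.read_csv("data.csv.gz", index_col=0)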

thousands : str, default None

Thousands separator.

decimal : str, default '.'

Character to recognize as decimal point. E.g. use ',' for European data.
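A minimal sketch (the sample data is ours) for European-style numbers, combining decimal with a non-default sep:

import pandas as pd
from io import StringIO

# ';' as field separator and ',' as decimal point, as is common in European data.
data = "a;b\n1,5;2,7\n3,1;4,2"

df = pd.read_csv(StringIO(data), sep=";", decimal=",")
# Both columns are parsed as float64 (1.5/3.1 and 2.7/4.2).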

float_precision字符串,默认为无

指定C引擎应对浮点值使用哪个转换器。这些选项包括 None 对于普通的转换器, high 对于高精度转换器,以及 round_trip 用于往返转换器。

行终止符:字符串(长度1),默认 None字符串(长度1),默认

用于将文件拆分为行的字符。仅对C解析器有效。

quotechar : str (length 1)

用于表示引用项的开始和结束的字符。引用的项目可以包括分隔符,它将被忽略。

quoting : int or csv.QUOTE_* instance, default 0

控制每个字段的报价行为 csv.QUOTE_* 常量。使用以下选项之一 QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2)或 QUOTE_NONE (3)。

双等分:布尔值,默认 True布尔值,默认

什么时候 quotechar 已指定,并且 quoting 不是 QUOTE_NONE ,指示是否解释连续的两个 quotechar 元素 内部 作为单个字段的字段 quotechar 元素。

escapechar : str (length 1), default None

引号为时用于转义分隔符的单字符字符串 QUOTE_NONE

comment : str, default None

指示不应分析行的其余部分。如果在一行的开头找到该行,则该行将被完全忽略。此参数必须是单个字符。像空行一样(只要 skip_blank_lines=True ),则参数将忽略完全注释的行 header 但不是通过 skiprows 。例如,如果 comment='#' ,正在使用解析‘#Empty\na,b,c\n 1,2,3’ header=0 将导致‘a,b,c’被视为标头。

编码:字符串,默认 None字符串,默认

读/写时用于UTF的编码(例如 'utf-8' )。 List of Python standard encodings

dialect : str or csv.Dialect instance, default None

如果提供,此参数将覆盖以下参数的值(默认或非默认): delimiterdoublequoteescapecharskipinitialspacequotechar ,以及 quoting 。如果需要重写值,将发出ParserWarning。看见 csv.Dialect 文档以了解更多详细信息。

错误处理#

ERROR_BAD_LINES:布尔值,可选,默认 None布尔值、可选、默认

默认情况下,具有过多字段的行(例如,带有过多逗号的CSV行)将导致引发异常,并且不会 DataFrame 将会被归还。如果 False ,那么这些“坏行”将从 DataFrame 这是退还的。看见 bad lines 下面。

Deprecated since version 1.3.0: Use the on_bad_lines parameter to specify behavior upon encountering a bad line instead.

WARN_BAD_LINES:布尔值,可选,默认 None布尔值、可选、默认

如果Error_Bad_Lines为 False ,并且WARN_BAD_LINES为 True ,将为每个“坏行”输出一条警告。

Deprecated since version 1.3.0: Use the on_bad_lines parameter to specify behavior upon encountering a bad line instead.

on_bad_lines('error', 'warn', 'skip'), default 'error'

指定遇到坏行(具有太多字段的行)时应执行的操作。允许的值包括:

  • “Error”,在遇到错误行时引发ParserError。

  • ‘warn’,在遇到坏行时打印警告并跳过该行。

  • “跳过”,跳过错误的行,在遇到错误行时不进行提示或警告。

1.3.0 新版功能.

指定列数据类型#

您可以指定整个数据类型 DataFrame 或单独的列:

In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]: 
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [20]: df.dtypes
Out[20]: 
a      int64
b     object
c    float64
d      Int64
dtype: object

幸运的是,Pandas提供了不止一种方法来确保您的专栏只包含一个 dtype 。如果您不熟悉这些概念,您可以看到 here 以了解有关dtype的更多信息,以及 here 要了解更多关于 object 在Pandas身上的转化。

例如,您可以使用 converters 论证 read_csv()

In [21]: data = "col_1\n1\n2\n'A'\n4.22"

In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})

In [23]: df
Out[23]: 
  col_1
0     1
1     2
2   'A'
3  4.22

In [24]: df["col_1"].apply(type).value_counts()
Out[24]: 
<class 'str'>    4
Name: col_1, dtype: int64

或者,您可以使用 to_numeric() 函数在读入数据后强制数据类型,

In [25]: df2 = pd.read_csv(StringIO(data))

In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")

In [27]: df2
Out[27]: 
   col_1
0   1.00
1   2.00
2    NaN
3   4.22

In [28]: df2["col_1"].apply(type).value_counts()
Out[28]: 
<class 'float'>    4
Name: col_1, dtype: int64

它会将所有有效的解析转换为浮点数,将无效的解析保留为 NaN

最终,如何处理包含混合数据类型的列中的读取取决于您的特定需求。在上面的情况下,如果您想 NaN 排除数据异常,然后 to_numeric() 可能是你最好的选择。但是,如果希望强制所有数据,而不考虑类型,则使用 converters 论证 read_csv() 当然值得一试。

备注

在某些情况下,读入包含混合数据类型的列的异常数据将导致数据集不一致。如果您依赖PANDA来推断列的数据类型,则解析引擎将为不同的数据块推断数据类型,而不是一次性推断整个数据集的数据类型。因此,您可以得到具有混合数据类型的列。例如,

In [29]: col_1 = list(range(500000)) + ["a", "b"] + list(range(500000))

In [30]: df = pd.DataFrame({"col_1": col_1})

In [31]: df.to_csv("foo.csv")

In [32]: mixed_df = pd.read_csv("foo.csv")

In [33]: mixed_df["col_1"].apply(type).value_counts()
Out[33]: 
<class 'int'>    737858
<class 'str'>    262144
Name: col_1, dtype: int64

In [34]: mixed_df["col_1"].dtype
Out[34]: dtype('O')

将导致 mixed_df 包含一个 int 用于列的某些块的dtype,以及 str 对于其他类型,则是由于来自所读入的数据的混合数据类型。请务必注意,整个列将标记为 dtypeobject ,用于具有混合数据类型的列。

指定类别数据类型#

Categorical 可以通过指定以下内容直接解析列 dtype='category'dtype=CategoricalDtype(categories, ordered)

In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [36]: pd.read_csv(StringIO(data))
Out[36]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [37]: pd.read_csv(StringIO(data)).dtypes
Out[37]: 
col1    object
col2    object
col3     int64
dtype: object

In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
Out[38]: 
col1    category
col2    category
col3    category
dtype: object

单个列可以被解析为 Categorical 使用DICT规范:

In [39]: pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes
Out[39]: 
col1    category
col2      object
col3       int64
dtype: object

指定 dtype='category' 将导致无序的 Categorical 谁的 categories 是在数据中观察到的唯一值。要更好地控制类别和顺序,请创建一个 CategoricalDtype ,并将其传递给该专栏的 dtype

In [40]: from pandas.api.types import CategoricalDtype

In [41]: dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True)

In [42]: pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes
Out[42]: 
col1    category
col2      object
col3       int64
dtype: object

在使用时 dtype=CategoricalDtype ,之外的“意外”值 dtype.categories 被视为缺失值。

In [43]: dtype = CategoricalDtype(["a", "b", "d"])  # No 'c'

In [44]: pd.read_csv(StringIO(data), dtype={"col1": dtype}).col1
Out[44]: 
0      a
1      a
2    NaN
Name: col1, dtype: category
Categories (3, object): ['a', 'b', 'd']

这与 Categorical.set_categories()

备注

使用 dtype='category' ,生成的类别将始终被解析为字符串(对象数据类型)。如果类别是数值类别,则可以使用 to_numeric() 函数,或在适当的情况下,另一个转换器,如 to_datetime()

什么时候 dtype 是一种 CategoricalDtype 同质的 categories (所有数字、所有日期时间等),则自动完成转换。

In [45]: df = pd.read_csv(StringIO(data), dtype="category")

In [46]: df.dtypes
Out[46]: 
col1    category
col2    category
col3    category
dtype: object

In [47]: df["col3"]
Out[47]: 
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, object): ['1', '2', '3']

In [48]: df["col3"].cat.categories = pd.to_numeric(df["col3"].cat.categories)

In [49]: df["col3"]
Out[49]: 
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]

命名和使用列#

处理列名#

文件可能有也可能没有标题行。Pandas假设第一行应该用作列名:

In [50]: data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"

In [51]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [52]: pd.read_csv(StringIO(data))
Out[52]: 
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

通过指定 names 结合以下内容进行论证 header 您可以指定要使用的其他名称以及是否丢弃标题行(如果有):

In [53]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9

In [54]: pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=0)
Out[54]: 
   foo  bar  baz
0    1    2    3
1    4    5    6
2    7    8    9

In [55]: pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=None)
Out[55]: 
  foo bar baz
0   a   b   c
1   1   2   3
2   4   5   6
3   7   8   9

如果标题不在第一行中,则将行号传递给 header 。这将跳过前面的行:

In [56]: data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9"

In [57]: pd.read_csv(StringIO(data), header=1)
Out[57]: 
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

备注

默认行为是推断列名:如果没有传递任何名称,则行为与 header=0 和列名是从文件的第一个非空行推断出来的,如果列名是显式传递的,则行为与 header=None

重复名称解析#

如果文件或头文件包含重复名称,默认情况下,PANDA会区分它们,以防止覆盖数据:

In [58]: data = "a,b,a\n0,1,2\n3,4,5"

In [59]: pd.read_csv(StringIO(data))
Out[59]: 
   a  b  a.1
0  0  1    2
1  3  4    5

不再有重复数据,因为 mangle_dupe_cols=True 默认情况下,它会将一系列重复的列‘X’,...,‘X’修改为‘X’,‘X.1’,...,‘X.N’。如果 mangle_dupe_cols=False ,则可能会出现重复数据:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
Out[3]:
   a  b  a
0  2  1  2
1  5  4  5

To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if mangle_dupe_cols != True:

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
...
ValueError: Setting mangle_dupe_cols=False is not supported yet

筛选列 (usecols )#

这个 usecols 参数允许您选择文件中列的任何子集,使用列名、位置编号或可调用:

In [60]: data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz"

In [61]: pd.read_csv(StringIO(data))
Out[61]: 
   a  b  c    d
0  1  2  3  foo
1  4  5  6  bar
2  7  8  9  baz

In [62]: pd.read_csv(StringIO(data), usecols=["b", "d"])
Out[62]: 
   b    d
0  2  foo
1  5  bar
2  8  baz

In [63]: pd.read_csv(StringIO(data), usecols=[0, 2, 3])
Out[63]: 
   a  c    d
0  1  3  foo
1  4  6  bar
2  7  9  baz

In [64]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["A", "C"])
Out[64]: 
   a  c
0  1  3
1  4  6
2  7  9

这个 usecols 参数还可用于指定在最终结果中不使用哪些列:

In [65]: pd.read_csv(StringIO(data), usecols=lambda x: x not in ["a", "c"])
Out[65]: 
   b    d
0  2  foo
1  5  bar
2  8  baz

在本例中,Callable指定我们从输出中排除“a”和“c”列。

注释和空行#

忽略行注释和空行#

如果 comment 参数,则完全注释的行将被忽略。默认情况下,完全空白行也将被忽略。

In [66]: data = "\na,b,c\n  \n# commented line\n1,2,3\n\n4,5,6"

In [67]: print(data)

a,b,c
  
# commented line
1,2,3

4,5,6

In [68]: pd.read_csv(StringIO(data), comment="#")
Out[68]: 
   a  b  c
0  1  2  3
1  4  5  6

如果 skip_blank_lines=False ,那么 read_csv 不会忽略空行:

In [69]: data = "a,b,c\n\n1,2,3\n\n\n4,5,6"

In [70]: pd.read_csv(StringIO(data), skip_blank_lines=False)
Out[70]: 
     a    b    c
0  NaN  NaN  NaN
1  1.0  2.0  3.0
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  4.0  5.0  6.0

警告

忽略行的存在可能会造成涉及行号的歧义;参数 header 使用行号(忽略注释/空行),而 skiprows 使用行号(包括注释/空行):

In [71]: data = "#comment\na,b,c\nA,B,C\n1,2,3"

In [72]: pd.read_csv(StringIO(data), comment="#", header=1)
Out[72]: 
   A  B  C
0  1  2  3

In [73]: data = "A,B,C\n#comment\na,b,c\n1,2,3"

In [74]: pd.read_csv(StringIO(data), comment="#", skiprows=2)
Out[74]: 
   a  b  c
0  1  2  3

如果两者都有 headerskiprows 是指定的, header 将相对于 skiprows 。例如:

In [75]: data = (
   ....:     "# empty\n"
   ....:     "# second empty line\n"
   ....:     "# third emptyline\n"
   ....:     "X,Y,Z\n"
   ....:     "1,2,3\n"
   ....:     "A,B,C\n"
   ....:     "1,2.,4.\n"
   ....:     "5.,NaN,10.0\n"
   ....: )
   ....: 

In [76]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0


In [77]: pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1)
Out[77]: 
     A    B     C
0  1.0  2.0   4.0
1  5.0  NaN  10.0

评论#

有时,文件中可能包含注释或元数据:

In [78]: print(open("tmp.csv").read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome

默认情况下,解析器在输出中包括注释:

In [79]: df = pd.read_csv("tmp.csv")

In [80]: df
Out[80]: 
         ID    level                        category
0  Patient1   123000           x # really unpleasant
1  Patient2    23000  y # wouldn't take his medicine
2  Patient3  1234018                     z # awesome

我们可以使用 comment 关键词:

In [81]: df = pd.read_csv("tmp.csv", comment="#")

In [82]: df
Out[82]: 
         ID    level category
0  Patient1   123000       x 
1  Patient2    23000       y 
2  Patient3  1234018       z 

处理Unicode数据#

这个 encoding 参数应用于编码的Unicode数据,这将导致字节字符串在结果中被解码为Unicode:

In [83]: from io import BytesIO

In [84]: data = b"word,length\n" b"Tr\xc3\xa4umen,7\n" b"Gr\xc3\xbc\xc3\x9fe,5"

In [85]: data = data.decode("utf8").encode("latin-1")

In [86]: df = pd.read_csv(BytesIO(data), encoding="latin-1")

In [87]: df
Out[87]: 
      word  length
0  Träumen       7
1    Grüße       5

In [88]: df["word"][1]
Out[88]: 'Grüße'

有些将所有字符编码为多个字节的格式,如UTF-16,如果不指定编码,根本无法正确解析。 Full list of Python standard encodings

索引列和尾部分隔符#

如果文件的数据列比列名多一列,则第一列将用作 DataFrame 的行名:

In [89]: data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

In [90]: pd.read_csv(StringIO(data))
Out[90]: 
        a    b     c
4   apple  bat   5.7
8  orange  cow  10.0
In [91]: data = "index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"

In [92]: pd.read_csv(StringIO(data), index_col=0)
Out[92]: 
            a    b     c
index                   
4       apple  bat   5.7
8      orange  cow  10.0

通常,您可以使用 index_col 选项。

在某些例外情况下,准备的文件在每个数据行的末尾都带有分隔符,这会使解析器感到困惑。若要显式禁用索引列推断并丢弃最后一列,请传递 index_col=False

In [93]: data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

In [94]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [95]: pd.read_csv(StringIO(data))
Out[95]: 
        a    b   c
4   apple  bat NaN
8  orange  cow NaN

In [96]: pd.read_csv(StringIO(data), index_col=False)
Out[96]: 
   a       b    c
0  4   apple  bat
1  8  orange  cow

如果正在使用 usecols 选项,则 index_col 规范基于该子集,而不是原始数据。

In [97]: data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

In [98]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,

In [99]: pd.read_csv(StringIO(data), usecols=["b", "c"])
Out[99]: 
     b   c
4  bat NaN
8  cow NaN

In [100]: pd.read_csv(StringIO(data), usecols=["b", "c"], index_col=0)
Out[100]: 
     b   c
4  bat NaN
8  cow NaN

日期处理#

指定日期列#

为了更好地便于使用日期时间数据, read_csv() 使用关键字参数 parse_datesdate_parser 允许用户指定各种列和日期/时间格式以将输入文本数据转换为 datetime 对象。

最简单的情况是直接传进来 parse_dates=True

# Use a column as an index, and parse it as dates.
In [101]: df = pd.read_csv("foo.csv", index_col=0, parse_dates=True)

In [102]: df
Out[102]: 
            A  B  C
date               
2009-01-01  a  1  2
2009-01-02  b  3  4
2009-01-03  c  4  5

# These are Python datetime objects
In [103]: df.index
Out[103]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)

通常情况下,我们可能希望分别存储日期和时间数据,或者分别存储各种日期字段。这个 parse_dates 关键字可用于指定要从中解析日期和/或时间的列的组合。

您可以指定列列表列表以 parse_dates ,则生成的日期列将优先于输出(以便不影响现有的列顺序),并且新的列名将是组件列名的串联:

In [104]: print(open("tmp.csv").read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900

In [105]: df = pd.read_csv("tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]])

In [106]: df
Out[106]: 
                  1_2                 1_3     0     4
0 1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00  KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00  KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00  KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00  KORD -0.59

默认情况下,分析器移除组件日期列,但您可以选择通过 keep_date_col 关键词:

In [107]: df = pd.read_csv(
   .....:     "tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]], keep_date_col=True
   .....: )
   .....: 

In [108]: df
Out[108]: 
                  1_2                 1_3     0         1          2          3     4
0 1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  19990127   19:00:00   18:56:00  0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  19990127   20:00:00   19:56:00  0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00  KORD  19990127   21:00:00   20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00  KORD  19990127   21:00:00   21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00  KORD  19990127   22:00:00   21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00  KORD  19990127   23:00:00   22:56:00 -0.59

请注意,如果您希望将多个列合并为一个日期列,则必须使用嵌套列表。换句话说, parse_dates=[1, 2] 指示第二列和第三列应分别分析为单独的日期列,而 parse_dates=[[1, 2]] 意味着应该将这两列解析为一列。

您还可以使用DICT指定自定义名称列:

In [109]: date_spec = {"nominal": [1, 2], "actual": [1, 3]}

In [110]: df = pd.read_csv("tmp.csv", header=None, parse_dates=date_spec)

In [111]: df
Out[111]: 
              nominal              actual     0     4
0 1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00  KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00  KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00  KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00  KORD -0.59

重要的是要记住,如果要将多个文本列解析为单个日期列,则会在数据前面添加一个新列。这个 index_col 规范基于这组新列,而不是原始数据列:

In [112]: date_spec = {"nominal": [1, 2], "actual": [1, 3]}

In [113]: df = pd.read_csv(
   .....:     "tmp.csv", header=None, parse_dates=date_spec, index_col=0
   .....: )  # index is the nominal column
   .....: 

In [114]: df
Out[114]: 
                                 actual     0     4
nominal                                            
1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  0.81
1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  0.01
1999-01-27 21:00:00 1999-01-27 20:56:00  KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00  KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00  KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00  KORD -0.59

备注

如果列或索引包含无法分析的日期,则整个列或索引将原封不动地作为对象数据类型返回。对于非标准日期时间解析,请使用 to_datetime() 之后 pd.read_csv
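A short sketch of that approach (the sample data and the '%d/%m/%Y' format are ours, not from the original text):

import pandas as pd
from io import StringIO

data = "date,value\n31/12/2009,1\n30/06/2010,2"

# Read the column as plain strings, then convert with an explicit format.
df = pd.read_csv(StringIO(data))
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")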

备注

read_csv has a fast path for parsing datetime strings in iso8601 format, e.g. "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange for your data to store datetimes in this format, load times will be significantly faster; ~20x has been observed.

日期解析函数#

最后,解析器允许您指定一个定制 date_parser 函数以充分利用日期解析API的灵活性:

In [115]: df = pd.read_csv(
   .....:     "tmp.csv", header=None, parse_dates=date_spec, date_parser=pd.to_datetime
   .....: )
   .....: 

In [116]: df
Out[116]: 
              nominal              actual     0     4
0 1999-01-27 19:00:00 1999-01-27 18:56:00  KORD  0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00  KORD  0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00  KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00  KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00  KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00  KORD -0.59

Pandas们会试着呼叫 date_parser 以三种不同的方式发挥作用。如果引发异常,则尝试下一个异常:

  1. date_parser 首先使用一个或多个数组作为参数进行调用,如使用 parse_dates (例如, date_parser(['2013', '2013'], ['1', '2']) )。

  2. 如果#1失败了, date_parser 在所有列按行级联成单个数组的情况下调用(例如, date_parser(['2013 1', '2013 2']) )。

请注意,在性能方面,您应该按顺序尝试以下日期解析方法:

  1. 尝试使用以下命令来推断格式 infer_datetime_format=True (见下文部分)。

  2. 如果您知道格式,请使用 pd.to_datetime()date_parser=lambda x: pd.to_datetime(x, format=...)

  3. 如果您有真正的非标准格式,请使用自定义 date_parser 功能。为了获得最佳性能,应该将其矢量化,即它应该接受数组作为参数。

解析具有混合时区的CSV#

Pandas本身不能表示具有混合时区的列或索引。如果您的CSV文件包含混合时区的列,则默认结果将是带有字符串的对象-dtype列,即使是 parse_dates

In [117]: content = """\
   .....: a
   .....: 2000-01-01T00:00:00+05:00
   .....: 2000-01-01T00:00:00+06:00"""
   .....: 

In [118]: df = pd.read_csv(StringIO(content), parse_dates=["a"])

In [119]: df["a"]
Out[119]: 
0    2000-01-01 00:00:00+05:00
1    2000-01-01 00:00:00+06:00
Name: a, dtype: object

若要将混合时区值解析为DateTime列,请传递部分应用的 to_datetime() 使用 utc=True 作为 date_parser

In [120]: df = pd.read_csv(
   .....:     StringIO(content),
   .....:     parse_dates=["a"],
   .....:     date_parser=lambda col: pd.to_datetime(col, utc=True),
   .....: )
   .....: 

In [121]: df["a"]
Out[121]: 
0   1999-12-31 19:00:00+00:00
1   1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]

推断日期时间格式#

如果你有 parse_dates 为某些或所有列启用,并且日期时间字符串的格式都相同,则可以通过设置 infer_datetime_format=True 。如果设置,Pandas将尝试猜测日期时间字符串的格式,然后使用更快的方法解析字符串。观察到了5-10倍的解析速度。如果无法猜测格式,或者猜测的格式无法正确解析整列字符串列,则PANDA将退回到通常的解析。所以总的来说, infer_datetime_format 如果启用,应该不会有任何负面后果。

Here are some examples of datetime strings that can be guessed (all representing December 30th, 2011 at 00:00:00):

  • "20111230"

  • "2011/12/30"

  • "20111230 00:00:00"

  • "12/30/2011 00:00:00"

  • "30/Dec/2011 00:00:00"

  • "December 30, 2011, 00:00:00"

Note that infer_datetime_format is sensitive to dayfirst. With dayfirst=True, it will guess "01/12/2011" to be December 1st. With dayfirst=False (the default) it will guess "01/12/2011" to be January 12th.

# Try to infer the format for the index column
In [122]: df = pd.read_csv(
   .....:     "foo.csv",
   .....:     index_col=0,
   .....:     parse_dates=True,
   .....:     infer_datetime_format=True,
   .....: )
   .....: 

In [123]: df
Out[123]: 
            A  B  C
date               
2009-01-01  a  1  2
2009-01-02  b  3  4
2009-01-03  c  4  5

国际日期格式#

虽然美国的日期格式往往是MM/DD/YYYY,但许多国际格式使用的是DD/MM/YYYY。为方便起见,请使用 dayfirst 提供关键字:

In [124]: print(open("tmp.csv").read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c

In [125]: pd.read_csv("tmp.csv", parse_dates=[0])
Out[125]: 
        date  value cat
0 2000-01-06      5   a
1 2000-02-06     10   b
2 2000-03-06     15   c

In [126]: pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0])
Out[126]: 
        date  value cat
0 2000-06-01      5   a
1 2000-06-02     10   b
2 2000-06-03     15   c

将CSV写入二进制文件对象#

1.2.0 新版功能.

df.to_csv(..., mode="wb") 允许将CSV写入文件对象打开的二进制模式。在大多数情况下,没有必要指定 mode 因为Pandas将自动检测文件对象是以文本还是二进制模式打开的。

In [127]: import io

In [128]: data = pd.DataFrame([0, 1, 2])

In [129]: buffer = io.BytesIO()

In [130]: data.to_csv(buffer, encoding="utf-8", compression="gzip")

指定浮点转换的方法#

该参数 float_precision 可以指定,以便在使用C引擎进行分析期间使用特定的浮点转换器。选项包括普通转换器、高精度转换器和往返转换器(确保在写入文件后往返设置值)。例如:

In [131]: val = "0.3066101993807095471566981359501369297504425048828125"

In [132]: data = "a,b,c\n1,2,{0}".format(val)

In [133]: abs(
   .....:     pd.read_csv(
   .....:         StringIO(data),
   .....:         engine="c",
   .....:         float_precision=None,
   .....:     )["c"][0] - float(val)
   .....: )
   .....: 
Out[133]: 5.551115123125783e-17

In [134]: abs(
   .....:     pd.read_csv(
   .....:         StringIO(data),
   .....:         engine="c",
   .....:         float_precision="high",
   .....:     )["c"][0] - float(val)
   .....: )
   .....: 
Out[134]: 5.551115123125783e-17

In [135]: abs(
   .....:     pd.read_csv(StringIO(data), engine="c", float_precision="round_trip")["c"][0]
   .....:     - float(val)
   .....: )
   .....: 
Out[135]: 0.0

千个分隔符#

对于使用千位分隔符写入的大数字,可以将 thousands 关键字设置为长度为1的字符串,以便正确解析整数:

默认情况下,带有千位分隔符的数字将被解析为字符串:

In [136]: print(open("tmp.csv").read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [137]: df = pd.read_csv("tmp.csv", sep="|")

In [138]: df
Out[138]: 
         ID      level category
0  Patient1    123,000        x
1  Patient2     23,000        y
2  Patient3  1,234,018        z

In [139]: df.level.dtype
Out[139]: dtype('O')

这个 thousands 关键字允许正确解析整数:

In [140]: print(open("tmp.csv").read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z

In [141]: df = pd.read_csv("tmp.csv", sep="|", thousands=",")

In [142]: df
Out[142]: 
         ID    level category
0  Patient1   123000        x
1  Patient2    23000        y
2  Patient3  1234018        z

In [143]: df.level.dtype
Out[143]: dtype('int64')

NA值#

控制哪些值被解析为缺失值(由表示 NaN )中指定一个字符串 na_values 。如果指定字符串列表,则其中的所有值都被视为缺少值。如果您指定一个数字(a float ,就像 5.0 或一个 integer 喜欢 5 ),则相应的等价值也将隐含缺失值(在这种情况下有效 [5.0, 5] 都被认为是 NaN )。

若要完全覆盖识别为缺少的缺省值,请指定 keep_default_na=False

默认设置 NaN 公认的值为 ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']

让我们看几个例子:

pd.read_csv("path_to_file.csv", na_values=[5])

In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string will first be interpreted as a numerical 5, then as a NaN.

pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=[""])

在上面,只有空字段将被识别为 NaN

pd.read_csv("path_to_file.csv", keep_default_na=False, na_values=["NA", "0"])

Above, both NA and 0 as strings are NaN.

pd.read_csv("path_to_file.csv", na_values=["Nope"])

缺省值,除了字符串 "Nope" 都被认为是 NaN

无穷大#

inf-like values will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). These will ignore the case of the value, meaning Inf will also be parsed as np.inf.
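A small sketch (the sample data is ours) of how such values come back:

import numpy as np
import pandas as pd
from io import StringIO

data = "a\ninf\n-Inf\nINF"

df = pd.read_csv(StringIO(data))
# The column is float64 containing np.inf, -np.inf, np.inf.
assert np.isinf(df["a"]).all()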

回归系列#

使用 squeeze 关键字,解析器将返回带有单列的输出 Series

1.4.0 版后已移除: 用户应追加 .squeeze("columns") 返回的DataFrame read_csv 取而代之的是。

In [144]: print(open("tmp.csv").read())
level
Patient1,123000
Patient2,23000
Patient3,1234018

In [145]: output = pd.read_csv("tmp.csv", squeeze=True)

In [146]: output
Out[146]: 
Patient1     123000
Patient2      23000
Patient3    1234018
Name: level, dtype: int64

In [147]: type(output)
Out[147]: pandas.core.series.Series

布尔值#

共同的价值观 TrueFalseTRUE ,以及 FALSE 都被认为是布尔值。有时,您可能希望将其他值识别为布尔值。为此,请使用 true_valuesfalse_values 选项如下:

In [148]: data = "a,b,c\n1,Yes,2\n3,No,4"

In [149]: print(data)
a,b,c
1,Yes,2
3,No,4

In [150]: pd.read_csv(StringIO(data))
Out[150]: 
   a    b  c
0  1  Yes  2
1  3   No  4

In [151]: pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"])
Out[151]: 
   a      b  c
0  1   True  2
1  3  False  4

处理“坏”行#

某些文件可能包含格式错误的行,这些行的字段太少或太多。字段太少的行将在尾随字段中填充NA值。默认情况下,字段太多的行将引发错误:

In [152]: data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

In [153]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Input In [153], in <cell line: 1>()
----> 1 pd.read_csv(StringIO(data))

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/util/_decorators.py:317, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    311 if len(args) > num_allow_args:
    312     warnings.warn(
    313         msg.format(arguments=arguments),
    314         FutureWarning,
    315         stacklevel=stacklevel,
    316     )
--> 317 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/parsers/readers.py:927, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    912 kwds_defaults = _refine_defaults_read(
    913     dialect,
    914     delimiter,
   (...)
    923     defaults={"delimiter": ","},
    924 )
    925 kwds.update(kwds_defaults)
--> 927 return _read(filepath_or_buffer, kwds)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/parsers/readers.py:588, in _read(filepath_or_buffer, kwds)
    585     return parser
    587 with parser:
--> 588     return parser.read(nrows)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/parsers/readers.py:1750, in TextFileReader.read(self, nrows)
   1743 nrows = validate_integer("nrows", nrows)
   1744 try:
   1745     # error: "ParserBase" has no attribute "read"
   1746     (
   1747         index,
   1748         columns,
   1749         col_dict,
-> 1750     ) = self._engine.read(  # type: ignore[attr-defined]
   1751         nrows
   1752     )
   1753 except Exception:
   1754     self.close()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/parsers/c_parser_wrapper.py:227, in CParserWrapper.read(self, nrows)
    225 try:
    226     if self.low_memory:
--> 227         chunks = self._reader.read_low_memory(nrows)
    228         # destructive to chunks
    229         data = _concatenate_chunks(chunks)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/parsers.pyx:807, in pandas._libs.parsers.TextReader.read_low_memory()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/parsers.pyx:865, in pandas._libs.parsers.TextReader._read_rows()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/parsers.pyx:851, in pandas._libs.parsers.TextReader._tokenize_rows()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/parsers.pyx:1970, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

您可以选择跳过错误的行:

In [29]: pd.read_csv(StringIO(data), on_bad_lines="warn")
Skipping line 3: expected 3 fields, saw 4

Out[29]:
   a  b   c
0  1  2   3
1  8  9  10

Or pass a callable function to handle the bad line if engine="python". The bad line will be a list of strings that was split by the sep:

In [29]: external_list = []

In [30]: def bad_lines_func(line):
    ...:     external_list.append(line)
    ...:     return line[-3:]

In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
Out[31]:
   a  b   c
0  1  2   3
1  5  6   7
2  8  9  10

In [32]: external_list
Out[32]: [4, 5, 6, 7]

New in version 1.4.0.

您也可以使用 usecols 参数以消除出现在某些行中但不出现在其他行中的无关列数据:

In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])

 Out[33]:
    a  b   c
 0  1  2   3
 1  4  5   6
 2  8  9  10

如果要保留所有数据,包括包含过多字段的行,可以指定足够数量的 names 。这可确保字段不足的行被填充 NaN

In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])

Out[34]:
    a  b   c  d
 0  1  2   3  NaN
 1  4  5   6  7
 2  8  9  10  NaN

方言#

这个 dialect 关键字在指定文件格式方面提供了更大的灵活性。默认情况下,它使用Excel方言,但您可以指定方言名称或 csv.Dialect 实例。

假设您有不带引号的数据:

In [154]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f

默认情况下, read_csv 使用Excel方言并将双引号视为引号字符,这会导致在找到右双引号之前找到换行符时失败。

我们可以使用以下工具来绕过此问题 dialect

In [155]: import csv

In [156]: dia = csv.excel()

In [157]: dia.quoting = csv.QUOTE_NONE

In [158]: pd.read_csv(StringIO(data), dialect=dia)
Out[158]: 
       label1 label2 label3
index1     "a      c      e
index2      b      d      f

所有方言选项都可以通过关键字参数单独指定:

In [159]: data = "a,b,c~1,2,3~4,5,6"

In [160]: pd.read_csv(StringIO(data), lineterminator="~")
Out[160]: 
   a  b  c
0  1  2  3
1  4  5  6

另一个常见的方言选项是 skipinitialspace ,跳过分隔符后面的任何空格:

In [161]: data = "a, b, c\n1, 2, 3\n4, 5, 6"

In [162]: print(data)
a, b, c
1, 2, 3
4, 5, 6

In [163]: pd.read_csv(StringIO(data), skipinitialspace=True)
Out[163]: 
   a  b  c
0  1  2  3
1  4  5  6

解析器会尽一切努力“做正确的事情”,并且不会变得脆弱。类型推断是一件非常重要的事情。如果可以在不更改内容的情况下将列强制为整型数据类型,则解析器会这样做。任何非数字列都将作为对象数据类型传入,就像其他Pandas对象一样。

引号和转义字符#

可以通过多种方式处理嵌入字段中的引号(和其他转义字符)。一种方法是使用反斜杠;若要正确解析此数据,应将 escapechar 选项:

In [164]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

In [165]: print(data)
a,b
"hello, \"Bob\", nice to see you",5

In [166]: pd.read_csv(StringIO(data), escapechar="\\")
Out[166]: 
                               a  b
0  hello, "Bob", nice to see you  5

具有固定宽度列的文件#

而当 read_csv() 读取分隔数据,则 read_fwf() 函数处理具有已知和固定列宽的数据文件。的函数参数 read_fwf 大体上与 read_csv 使用两个额外的参数,并使用 delimiter 参数:

  • colspecs :将每行的固定宽度字段的范围作为半开区间(即[从、到[))的对(元组)的列表。字符串值‘INFER’可用于指示解析器尝试从数据的前100行检测列规范。默认行为(如果未指定)是推断。

  • widths :如果间隔是连续的,则可以使用的字段宽度列表可以用来代替‘colspecs’。

  • delimiter :在固定宽度文件中视为填充字符的字符。如果不是空格(例如,‘~’),则可用于指定字段的填充字符。

考虑一个典型的固定宽度数据文件:

In [167]: print(open("bar.csv").read())
id8141    360.242940   149.910199   11950.7
id1594    444.953632   166.985655   11788.4
id1849    364.136849   183.628767   11806.2
id1230    413.836124   184.375703   11916.8
id1948    502.953953   173.237159   12468.3

为了将此文件解析为 DataFrame ,我们只需要将列规范提供给 read_fwf 函数和文件名:

# Column specifications are a list of half-intervals
In [168]: colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]

In [169]: df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0)

In [170]: df
Out[170]: 
                 1           2        3
0                                      
id8141  360.242940  149.910199  11950.7
id1594  444.953632  166.985655  11788.4
id1849  364.136849  183.628767  11806.2
id1230  413.836124  184.375703  11916.8
id1948  502.953953  173.237159  12468.3

Note how the parser automatically picks column names X.<column number> when header=None is specified. Alternatively, you can supply just the column widths for contiguous columns:

# Widths are a list of integers
In [171]: widths = [6, 14, 13, 10]

In [172]: df = pd.read_fwf("bar.csv", widths=widths, header=None)

In [173]: df
Out[173]: 
        0           1           2        3
0  id8141  360.242940  149.910199  11950.7
1  id1594  444.953632  166.985655  11788.4
2  id1849  364.136849  183.628767  11806.2
3  id1230  413.836124  184.375703  11916.8
4  id1948  502.953953  173.237159  12468.3

解析器将处理列周围的额外空格,因此在文件中的列之间有额外的分隔是可以的。

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file. It can do so only in cases when the columns are aligned and correctly separated by the provided delimiter (the default delimiter is whitespace).

In [174]: df = pd.read_fwf("bar.csv", header=None, index_col=0)

In [175]: df
Out[175]: 
                 1           2        3
0                                      
id8141  360.242940  149.910199  11950.7
id1594  444.953632  166.985655  11788.4
id1849  364.136849  183.628767  11806.2
id1230  413.836124  184.375703  11916.8
id1948  502.953953  173.237159  12468.3

read_fwf 支持 dtype 用于指定已分析列的类型不同于推断类型的参数。

In [176]: pd.read_fwf("bar.csv", header=None, index_col=0).dtypes
Out[176]: 
1    float64
2    float64
3    float64
dtype: object

In [177]: pd.read_fwf("bar.csv", header=None, dtype={2: "object"}).dtypes
Out[177]: 
0     object
1    float64
2     object
3    float64
dtype: object

索引#

具有“隐式”索引列的文件#

假设文件标题中的条目比数据列数少一个:

In [178]: print(open("foo.csv").read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5

In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:

In [179]: pd.read_csv("foo.csv")
Out[179]: 
          A  B  C
20090101  a  1  2
20090102  b  3  4
20090103  c  4  5

请注意,不会自动解析日期。在这种情况下,您需要像以前一样执行以下操作:

In [180]: df = pd.read_csv("foo.csv", parse_dates=True)

In [181]: df.index
Out[181]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)

读取索引时使用 MultiIndex#

假设您有两列索引的数据:

In [182]: print(open("data/mindex_ex.csv").read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2

这个 index_col 参数为 read_csv 可以采用列号列表将多个列转换为 MultiIndex 对于返回对象的索引:

In [183]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])

In [184]: df
Out[184]: 
             zit   xit
year indiv            
1977 A      1.20  0.60
     B      1.50  0.50
     C      1.70  0.80
1978 A      0.20  0.06
     B      0.70  0.20
     C      0.80  0.30
     D      0.90  0.50
     E      1.40  0.90
1979 C      0.20  0.15
     D      0.14  0.05
     E      0.50  0.15
     F      1.20  0.50
     G      3.40  1.90
     H      5.40  2.70
     I      6.40  1.20

In [185]: df.loc[1978]
Out[185]: 
       zit   xit
indiv           
A      0.2  0.06
B      0.7  0.20
C      0.8  0.30
D      0.9  0.50
E      1.4  0.90

使用读取列 MultiIndex#

By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns. Specifying non-consecutive rows will skip the intervening rows.

In [186]: from pandas._testing import makeCustomDataframe as mkdf

In [187]: df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)

In [188]: df.to_csv("mi.csv")

In [189]: print(open("mi.csv").read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2


In [190]: pd.read_csv("mi.csv", header=[0, 1, 2, 3], index_col=[0, 1])
Out[190]: 
C0              C_l0_g0 C_l0_g1 C_l0_g2
C1              C_l1_g0 C_l1_g1 C_l1_g2
C2              C_l2_g0 C_l2_g1 C_l2_g2
C3              C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                             
R_l0_g0 R_l1_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R4C0    R4C1    R4C2

read_csv 还能够解释一种更常见的多列索引格式。

In [191]: print(open("mi2.csv").read())
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12

In [192]: pd.read_csv("mi2.csv", header=[0, 1], index_col=0)
Out[192]: 
     a         b   c    
     q  r  s   t   u   v
one  1  2  3   4   5   6
two  7  8  9  10  11  12

注意:如果一个 index_col 未指定(例如,您没有索引,或使用 df.to_csv(..., index=False) ,然后是任何 names 列上的索引将是 lost

自动“嗅探”分隔符#

read_csv 能够推断分隔的(不一定是逗号分隔的)文件,因为Pandas使用 csv.Sniffer CSV模块的类。为此,您必须指定 sep=None

In [193]: print(open("tmp2.sv").read())
:0:1:2:3
0:0.4691122999071863:-0.2828633443286633:-1.5090585031735124:-1.1356323710171934
1:1.2121120250208506:-0.17321464905330858:0.11920871129693428:-1.0442359662799567
2:-0.8618489633477999:-2.1045692188948086:-0.4949292740687813:1.071803807037338
3:0.7215551622443669:-0.7067711336300845:-1.0395749851146963:0.27185988554282986
4:-0.42497232978883753:0.567020349793672:0.27623201927771873:-1.0874006912859915
5:-0.6736897080883706:0.1136484096888855:-1.4784265524372235:0.5249876671147047
6:0.4047052186802365:0.5770459859204836:-1.7150020161146375:-1.0392684835147725
7:-0.3706468582364464:-1.1578922506419993:-1.344311812731667:0.8448851414248841
8:1.0757697837155533:-0.10904997528022223:1.6435630703622064:-1.4693879595399115
9:0.35702056413309086:-0.6746001037299882:-1.776903716971867:-0.9689138124473498


In [194]: pd.read_csv("tmp2.sv", sep=None, engine="python")
Out[194]: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
5           5 -0.673690  0.113648 -1.478427  0.524988
6           6  0.404705  0.577046 -1.715002 -1.039268
7           7 -0.370647 -1.157892 -1.344312  0.844885
8           8  1.075770 -0.109050  1.643563 -1.469388
9           9  0.357021 -0.674600 -1.776904 -0.968914

读取多个文件以创建单个DataFrame#

It's best to use concat() to combine multiple files. See the cookbook for an example, and the sketch below.
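A minimal sketch under the assumption that the files share the same columns (the data_*.csv pattern is hypothetical):

import glob
import pandas as pd

# Read each file separately, then combine them into a single DataFrame.
files = sorted(glob.glob("data_*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)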

逐个块遍历文件#

假设您希望延迟地迭代一个(可能非常大的)文件,而不是将整个文件读取到内存中,如下所示:

In [195]: print(open("tmp.sv").read())
|0|1|2|3
0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934
1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567
2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338
3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986
4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915
5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047
6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725
7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841
8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115
9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498


In [196]: table = pd.read_csv("tmp.sv", sep="|")

In [197]: table
Out[197]: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
5           5 -0.673690  0.113648 -1.478427  0.524988
6           6  0.404705  0.577046 -1.715002 -1.039268
7           7 -0.370647 -1.157892 -1.344312  0.844885
8           8  1.075770 -0.109050  1.643563 -1.469388
9           9  0.357021 -0.674600 -1.776904 -0.968914

通过指定一个 chunksizeread_csv ,则返回值将是以下类型的可迭代对象 TextFileReader

In [198]: with pd.read_csv("tmp.sv", sep="|", chunksize=4) as reader:
   .....:     reader
   .....:     for chunk in reader:
   .....:         print(chunk)
   .....: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
4           4 -0.424972  0.567020  0.276232 -1.087401
5           5 -0.673690  0.113648 -1.478427  0.524988
6           6  0.404705  0.577046 -1.715002 -1.039268
7           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0        1         2         3
8           8  1.075770 -0.10905  1.643563 -1.469388
9           9  0.357021 -0.67460 -1.776904 -0.968914

在 1.2 版更改: read_csv/json/sas 循环访问文件时返回上下文管理器。

指定 iterator=True 还将返回 TextFileReader 对象:

In [199]: with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
   .....:     reader.get_chunk(5)
   .....: 

指定解析器引擎#

pandas currently supports three engines: the C engine, the Python engine, and an experimental pyarrow engine (which requires the pyarrow package). In general, the pyarrow engine is fastest on larger workloads and is equivalent in speed to the C engine on most other workloads. The Python engine tends to be slower than the pyarrow and C engines on most workloads. However, the pyarrow engine is much less robust than the C engine, which in turn lacks a few features compared to the Python engine.

在可能的情况下,Pandas使用C解析器(指定为 engine='c' ),但如果指定了不支持C语言的选项,它可能会回退到Python。

Currently, options unsupported by the C and pyarrow engines include:

  • sep 不是单个字符(例如正则表达式分隔符)

  • skipfooter

  • sep=None with delim_whitespace=False

指定上述任一选项都会产生 ParserWarning 除非使用以下方法显式选择了python引擎 engine='python'
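As an illustration (the sample data is ours, not from the original text), one of the options listed above falling back to the Python engine:

import pandas as pd
from io import StringIO

data = "a,b\n1,2\n3,4\ntotal,2 rows"

# skipfooter is unsupported by the C engine: this emits a ParserWarning and
# silently falls back to the Python engine.
df = pd.read_csv(StringIO(data), skipfooter=1)

# Selecting the engine explicitly avoids the warning.
df = pd.read_csv(StringIO(data), skipfooter=1, engine="python")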

Options that are unsupported by the pyarrow engine and not covered by the list above include:

  • float_precision

  • chunksize

  • comment

  • nrows

  • thousands

  • memory_map

  • dialect

  • warn_bad_lines

  • error_bad_lines

  • on_bad_lines

  • delim_whitespace

  • quoting

  • lineterminator

  • converters

  • decimal

  • iterator

  • dayfirst

  • infer_datetime_format

  • verbose

  • skipinitialspace

  • low_memory

使用指定这些选项 engine='pyarrow' 将引发一个 ValueError

读/写远程文件#

您可以通过传入URL来读取或写入PANDA的许多IO函数的远程文件--下面的示例显示了如何读取CSV文件:

df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t")

1.3.0 新版功能.

通过将标头键值映射的字典传递给 storage_options 关键字参数如下:

headers = {"User-Agent": "pandas"}
df = pd.read_csv(
    "https://download.bls.gov/pub/time.series/cu/cu.item",
    sep="\t",
    storage_options=headers
)

所有非本地文件或HTTP的URL都由 fsspec ,以及它的各种文件系统实现(包括Amazon S3、Google Cloud、SSH、FTP、WebHDFS...)。其中一些实现将需要安装其他包,例如,S3URL需要 s3fs 库:

df = pd.read_json("s3://pandas-test/adatafile.json")

在处理远程存储系统时,您可能需要在特殊位置使用环境变量或配置文件进行额外配置。例如,要访问S3存储桶中的数据,您需要使用 S3Fs documentation 。对于几个存储后端也是如此,您应该遵循以下链接 fsimpl1 对于内置于 fsspecfsimpl2 对于那些没有包括在主要 fsspec 分配。

您还可以将参数直接传递给后端驱动程序。例如,如果您这样做 not 拥有S3凭据,您仍然可以通过指定匿名连接来访问公共数据,例如

1.2.0 新版功能.

pd.read_csv(
    "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
    "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
    storage_options={"anon": True},
)

fsspec 还允许使用复杂的URL,用于访问压缩存档中的数据、文件的本地缓存等。要本地缓存上面的示例,您需要将调用修改为

pd.read_csv(
    "simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
    "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
    storage_options={"s3": {"anon": True}},
)

其中,我们指定“anon”参数用于实现的“S3”部分,而不是用于缓存实现。请注意,这只在会话期间缓存到临时目录,但您也可以指定永久存储。

写出数据#

写入CSV格式#

这个 SeriesDataFrame 对象有一个实例方法 to_csv 它允许将对象的内容存储为逗号分隔值文件。该函数接受多个参数。只有第一个是必需的。

  • path_or_buf: A string path to the file to write or a file object. If a file object it must be opened with newline=''

  • sep :输出文件的字段分隔符(默认为“,”)

  • na_rep :缺失值的字符串表示形式(默认为‘’)

  • float_format :浮点数的格式字符串

  • columns :要写入的列(默认为无)

  • header :是否写出列名(默认为True)

  • index :是否写入行(索引)名称(默认为True)

  • index_label :索引列的列标签(如果需要)。如果为None(默认),则为 headerindex 为True,则使用索引名称。(如果符合以下条件,则应提供序列 DataFrame 使用多重索引)。

  • mode :Python写入模式,默认为‘w’

  • encoding :一个字符串,表示在内容为非ASCII时使用的编码,对于3之前的Python版本

  • lineterminator :表示行尾的字符序列(默认 os.linesep )

  • quoting :将报价规则设置为CSV模块(默认为csv.QUOTE_MINIMAL)。请注意,如果您设置了 float_format 然后将浮点数转换为字符串,csv.QUOTE_NONNUMERIC将它们视为非数字

  • quotechar :用于引号字段的字符(默认为‘“’)

  • doublequote :控制报价 quotechar 在字段中(默认为True)

  • escapechar :用于转义的字符 sepquotechar 适当时(默认为无)

  • chunksize :一次写入的行数

  • date_format :DateTime对象的格式字符串
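A small sketch of a typical call (the file name and option values are ours, not prescribed by the original text):

import pandas as pd

df = pd.DataFrame({"a": [1.5, None], "b": ["x", "y"]})

# Write without the row index, with two-decimal floats and an explicit NA marker.
df.to_csv("out.csv", sep=";", index=False, float_format="%.2f", na_rep="NA")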

编写格式化字符串#

The DataFrame object has an instance method to_string which allows control over the string representation of the object. All arguments are optional (see the sketch after this list):

  • buf 默认无,例如StringIO对象

  • columns 默认无,写入哪些列

  • col_space 默认无,即每列的最小宽度。

  • na_rep 默认设置 NaN ,NA值的表示

  • formatters 默认无,函数的字典(按列),每个函数接受单个参数并返回格式化字符串

  • float_format 默认None,接受单个(浮点)参数并返回格式化字符串的函数;将应用于 DataFrame

  • sparsify 默认True,设置为False DataFrame 使用分层索引来打印每行的每个多索引键。

  • index_names 默认为True,将打印索引的名称

  • index 默认为True,将打印索引(即行标签)

  • header 默认为True,将打印列标签

  • justify 默认设置 left ,将打印左对齐或右对齐的列标题

这个 Series 对象还具有一个 to_string 方法,但只使用 bufna_repfloat_format 争论。也有一个 length 参数,如果设置为 True ,还将输出系列的长度。
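A brief sketch (the frame and formatting choices are ours) of to_string with a couple of these arguments:

import pandas as pd

df = pd.DataFrame({"a": [1.23456, 2.5], "b": ["x", "y"]})

# Render the frame as text with two-decimal floats and without the row index.
print(df.to_string(float_format=lambda x: f"{x:.2f}", index=False))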

JSON#

读写 JSON 格式化文件和字符串。

编写JSON#

A SeriesDataFrame 可以转换为有效的JSON字符串。使用 to_json 使用可选参数:

  • path_or_buf :写入输出的路径名或缓冲区,可以是 None 在这种情况下,将返回JSON字符串

  • orient :

    Series:
    • default is index

    • allowed values are {split, records, index}

    DataFrame:
    • default is columns

    • allowed values are {split, records, index, columns, values, table}

    The format of the JSON string:

    split

    dict like {index -> [index], columns -> [columns], data -> [values]}

    records

    list like [{column -> value}, ... , {column -> value}]

    index

    dict like {index -> {column -> value}}

    columns

    dict like {column -> {index -> value}}

    values

    just the values array

    table

    adhering to the JSON Table Schema

  • date_format :字符串,日期转换类型,‘EPOCH’表示时间戳,‘ISO’表示ISO8601。

  • double_precision :对浮点值进行编码时使用的小数位数,默认为10。

  • force_ascii :强制编码字符串为ASCII,默认为True。

  • date_unit :要编码的时间单位,控制时间戳和ISO8601精度。“s”、“ms”、“us”或“ns”中的一个分别表示秒、毫秒、微秒和纳秒。默认为“ms”。

  • default_handler :如果对象无法以其他方式转换为适合JSON的格式,则调用的处理程序。接受单个参数,该参数是要转换的对象,并返回可序列化的对象。

  • lines :如果 records Orient,然后将每行的每条记录写入json。

注意事项 NaN 的, NaT 的和 None 将被转换为 nulldatetime 对象将基于 date_formatdate_unit 参数。

In [200]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))

In [201]: json = dfj.to_json()

In [202]: json
Out[202]: '{"A":{"0":-1.2945235903,"1":0.2766617129,"2":-0.0139597524,"3":-0.0061535699,"4":0.8957173022},"B":{"0":0.4137381054,"1":-0.472034511,"2":-0.3625429925,"3":-0.923060654,"4":0.8052440254}}'

定向选项#

生成的JSON文件/字符串的格式有许多不同的选项。请考虑以下几点 DataFrameSeries

In [203]: dfjo = pd.DataFrame(
   .....:     dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
   .....:     columns=list("ABC"),
   .....:     index=list("xyz"),
   .....: )
   .....: 

In [204]: dfjo
Out[204]: 
   A  B  C
x  1  4  7
y  2  5  8
z  3  6  9

In [205]: sjo = pd.Series(dict(x=15, y=16, z=17), name="D")

In [206]: sjo
Out[206]: 
x    15
y    16
z    17
Name: D, dtype: int64

列定向 (默认设置为 DataFrame )将数据序列化为嵌套的JSON对象,列标签充当主索引:

In [207]: dfjo.to_json(orient="columns")
Out[207]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

# Not available for Series

面向索引 (默认设置为 Series )类似于以列为导向,但索引标签现在是主要的:

In [208]: dfjo.to_json(orient="index")
Out[208]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

In [209]: sjo.to_json(orient="index")
Out[209]: '{"x":15,"y":16,"z":17}'

面向记录 将数据序列化为列->值记录的JSON数组,不包括索引标签。这对传球很有用 DataFrame 数据到绘图库,例如JAVASCRIPT库 d3.js

In [210]: dfjo.to_json(orient="records")
Out[210]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'

In [211]: sjo.to_json(orient="records")
Out[211]: '[15,16,17]'

价值导向 是一个基本选项,它仅序列化为值的嵌套JSON数组,不包括列和索引标签:

In [212]: dfjo.to_json(orient="values")
Out[212]: '[[1,4,7],[2,5,8],[3,6,9]]'

# Not available for Series

面向拆分 序列化为包含值、索引和列的单独条目的JSON对象。名称也包括在 Series

In [213]: dfjo.to_json(orient="split")
Out[213]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'

In [214]: sjo.to_json(orient="split")
Out[214]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'

面向表格的 序列化到JSON Table Schema 允许保存元数据,包括但不限于数据类型和索引名。

备注

任何编码为JSON对象的Orient选项在往返序列化过程中都不会保留索引和列标签的顺序。如果希望保留标签顺序,请使用 split 选项,因为它使用有序的容器。

日期处理#

以ISO日期格式书写:

In [215]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))

In [216]: dfd["date"] = pd.Timestamp("20130101")

In [217]: dfd = dfd.sort_index(axis=1, ascending=False)

In [218]: json = dfd.to_json(date_format="iso")

In [219]: json
Out[219]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'

以ISO日期格式书写,以微秒为单位:

In [220]: json = dfd.to_json(date_format="iso", date_unit="us")

In [221]: json
Out[221]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'

纪元时间戳(秒):

In [222]: json = dfd.to_json(date_format="epoch", date_unit="s")

In [223]: json
Out[223]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'

写入具有日期索引和日期列的文件:

In [224]: dfj2 = dfj.copy()

In [225]: dfj2["date"] = pd.Timestamp("20130101")

In [226]: dfj2["ints"] = list(range(5))

In [227]: dfj2["bools"] = True

In [228]: dfj2.index = pd.date_range("20130101", periods=5)

In [229]: dfj2.to_json("test.json")

In [230]: with open("test.json") as fh:
   .....:     print(fh.read())
   .....: 
{"A":{"1356998400000":-1.2945235903,"1357084800000":0.2766617129,"1357171200000":-0.0139597524,"1357257600000":-0.0061535699,"1357344000000":0.8957173022},"B":{"1356998400000":0.4137381054,"1357084800000":-0.472034511,"1357171200000":-0.3625429925,"1357257600000":-0.923060654,"1357344000000":0.8052440254},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}

后备行为#

如果JSON序列化程序不能直接处理容器内容,它将以以下方式后退:

  • 如果不支持该数据类型(例如 np.complex_ ),然后 default_handler 如果提供,将为每个值调用,否则引发异常。

  • 如果对象不受支持,它将尝试以下操作:

    • 检查对象是否定义了 toDict 方法并调用它。一个 toDict 方法应返回一个 dict 然后,它将被JSON序列化。

    • 调用 default_handler 如果提供的话。

    • 将对象转换为 dict 通过遍历其内容。但是,这通常会失败,并显示 OverflowError 或者给出意想不到的结果。

通常,对于不受支持的对象或数据类型,最佳方法是提供 default_handler 。例如:

>>> DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises
RuntimeError: Unhandled numpy dtype 15

可以通过指定简单的 default_handler

In [231]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
Out[231]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'

阅读JSON#

将JSON字符串读取为Pandas对象可能需要许多参数。如果未提供 typ 或其为 None ,解析器将尝试解析为 DataFrame 。要显式强制按 Series 解析,请传递 typ=series 。
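例如,下面的小示意将一个简单的JSON对象强制解析为 Series (数据仅为演示):

pd.read_json('{"a": 1, "b": 2}', typ="series")
# a    1
# b    2
# dtype: int64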

  • filepath_or_buffer :一个 有效的 JSON字符串或文件句柄/StringIO。字符串可以是URL,有效的URL方案包括http、ftp、s3和file。对于file URL,需要主机,例如本地文件可以是 file://localhost/path/to/table.json

  • typ :要恢复的对象类型(序列或帧),默认‘Frame’

  • orient

    系列:
    • 缺省值为 index

    • 允许的值为 {split, records, index}

    DataFrame
    • 缺省值为 columns

    • 允许的值为 {split, records, index, columns, values, table}

    JSON字符串的格式

    split

    类似字典 {index -> [index], columns -> [columns], data -> [values]}

    records

    类似列表 [{column -> value}, ... , {column -> value}]

    index

    类似字典 {index -> {column -> value}}

    columns

    类似字典 {column -> {index -> value}}

    values

    仅值数组

    table

    遵循 JSON Table Schema

  • dtype :如果为 True ,则推断数据类型;如果给定列名到数据类型的字典,则使用这些数据类型;如果为 False ,则完全不推断数据类型。默认为 True ,仅应用于数据。

  • convert_axes :布尔值,尝试将轴转换为适当的数据类型,默认为 True 。

  • convert_dates :要解析为日期的列的列表;如果为 True ,则尝试解析类似日期的列,默认为 True 。

  • keep_default_dates :布尔值,默认 True 。如果解析日期,则解析默认的类似日期的列。

  • numpy :直接解码为NumPy数组。默认为 False ;仅支持数值数据,但标签可以是非数值的。另请注意,如果 numpy=True ,则每条记录的JSON顺序 必须 相同。

  • precise_float :布尔值,默认 False 。设置后,在将字符串解码为双精度值时使用更高精度的函数(strtod)。默认( False )使用快速但精度较低的内置函数。

  • date_unit :字符串,检测是否转换日期的时间戳单位。默认为无。默认情况下,将检测时间戳精度,如果这不是所需的,则传递‘s’、‘ms’、‘us’或‘ns’之一,以分别将时间戳精度强制为秒、毫秒、微秒或纳秒。

  • lines :将文件作为每行一个json对象读取。

  • encoding :用于解码py3字节的编码。

  • chunksize :与以下各项结合使用 lines=True ,返回读入的JsonReader chunksize 每次迭代的行数。

如果JSON不可解析,解析器将引发 ValueError/TypeError/AssertionError 。

如果在编码为JSON时使用了非默认的 orient ,请确保在此处传递相同的选项,以便解码产生合理的结果;有关概述,请参见 Orient Options 。
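下面是一个简单的示意:以 records orient 编码后,再用相同的 orient 解码(数据仅为演示):

data = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
json_records = data.to_json(orient="records")   # '[{"a":1,"b":"x"},{"a":2,"b":"y"}]'
pd.read_json(json_records, orient="records")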

数据转换#

convert_axes=True 、 dtype=True 和 convert_dates=True 的默认设置将尝试把轴和所有数据解析为适当的类型,包括日期。如果需要覆盖特定的数据类型,请向 dtype 传递一个字典。只有当您需要在轴中保留类似字符串的数字(例如 '1'、'2')时,才应将 convert_axes 设置为 False 。

备注

如果 convert_dates=True ,并且数据和/或列标签看起来像日期,则大整数值可能会被转换为日期。确切的阈值取决于所指定的 date_unit 。“类似日期”表示列标签符合以下条件之一(本列表之后附有一个示意):

  • 它的结尾是 '_at'

  • 它的结尾是 '_time'

  • 它的开头是 'timestamp'

  • 它是 'modified'

  • 它是 'date'
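作为示意,下面的整数列 modified 符合上述条件之一,因此在默认设置下预期会被解析为日期(数据仅为假设的演示):

data = '{"modified":{"0":1356998400000},"value":{"0":10}}'
pd.read_json(data).dtypes
# modified 列预期为 datetime64[ns],value 列仍为整数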

警告

在读取JSON数据时,自动强制转换为数据类型有一些特殊之处:

  • 可以按与序列化不同的顺序重新构造索引,即不能保证返回的顺序与序列化前相同

  • 原本为 float 数据的列,如果可以安全地完成转换,将被转换为 integer ,例如一列值为 1. 的数据

  • 布尔列在重构时将转换为 integer

因此,有时您可能希望通过 dtype 关键字参数指定特定的数据类型。

读取JSON字符串:

In [232]: pd.read_json(json)
Out[232]: 
        date         B         A
0 2013-01-01  2.565646 -1.206412
1 2013-01-01  1.340309  1.431256
2 2013-01-01 -0.226169 -1.170299
3 2013-01-01  0.813850  0.410835
4 2013-01-01 -0.827317  0.132003

正在从文件中读取:

In [233]: pd.read_json("test.json")
Out[233]: 
                   A         B       date  ints  bools
2013-01-01 -1.294524  0.413738 2013-01-01     0   True
2013-01-02  0.276662 -0.472035 2013-01-01     1   True
2013-01-03 -0.013960 -0.362543 2013-01-01     2   True
2013-01-04 -0.006154 -0.923061 2013-01-01     3   True
2013-01-05  0.895717  0.805244 2013-01-01     4   True

不转换任何数据(但仍转换轴和日期):

In [234]: pd.read_json("test.json", dtype=object).dtypes
Out[234]: 
A        object
B        object
date     object
ints     object
bools    object
dtype: object

指定要转换的数据类型:

In [235]: pd.read_json("test.json", dtype={"A": "float32", "bools": "int8"}).dtypes
Out[235]: 
A               float32
B               float64
date     datetime64[ns]
ints              int64
bools              int8
dtype: object

保留字符串索引:

In [236]: si = pd.DataFrame(
   .....:     np.zeros((4, 4)), columns=list(range(4)), index=[str(i) for i in range(4)]
   .....: )
   .....: 

In [237]: si
Out[237]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

In [238]: si.index
Out[238]: Index(['0', '1', '2', '3'], dtype='object')

In [239]: si.columns
Out[239]: Int64Index([0, 1, 2, 3], dtype='int64')

In [240]: json = si.to_json()

In [241]: sij = pd.read_json(json, convert_axes=False)

In [242]: sij
Out[242]: 
   0  1  2  3
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
3  0  0  0  0

In [243]: sij.index
Out[243]: Index(['0', '1', '2', '3'], dtype='object')

In [244]: sij.columns
Out[244]: Index(['0', '1', '2', '3'], dtype='object')

以纳秒为单位的日期需要以纳秒为单位回读:

In [245]: json = dfj2.to_json(date_unit="ns")

# Try to parse timestamps as milliseconds -> Won't Work
In [246]: dfju = pd.read_json(json, date_unit="ms")

In [247]: dfju
Out[247]: 
                            A         B                 date  ints  bools
1356998400000000000 -1.294524  0.413738  1356998400000000000     0   True
1357084800000000000  0.276662 -0.472035  1356998400000000000     1   True
1357171200000000000 -0.013960 -0.362543  1356998400000000000     2   True
1357257600000000000 -0.006154 -0.923061  1356998400000000000     3   True
1357344000000000000  0.895717  0.805244  1356998400000000000     4   True

# Let pandas detect the correct precision
In [248]: dfju = pd.read_json(json)

In [249]: dfju
Out[249]: 
                   A         B       date  ints  bools
2013-01-01 -1.294524  0.413738 2013-01-01     0   True
2013-01-02  0.276662 -0.472035 2013-01-01     1   True
2013-01-03 -0.013960 -0.362543 2013-01-01     2   True
2013-01-04 -0.006154 -0.923061 2013-01-01     3   True
2013-01-05  0.895717  0.805244 2013-01-01     4   True

# Or specify that all timestamps are in nanoseconds
In [250]: dfju = pd.read_json(json, date_unit="ns")

In [251]: dfju
Out[251]: 
                   A         B       date  ints  bools
2013-01-01 -1.294524  0.413738 2013-01-01     0   True
2013-01-02  0.276662 -0.472035 2013-01-01     1   True
2013-01-03 -0.013960 -0.362543 2013-01-01     2   True
2013-01-04 -0.006154 -0.923061 2013-01-01     3   True
2013-01-05  0.895717  0.805244 2013-01-01     4   True

Numpy参数#

备注

此参数从1.0.0版起已弃用,并将引发 FutureWarning

这仅支持数字数据。索引和列标签可以是非数字的,例如字符串、日期等。

如果将 numpy=True 传递给 read_json ,它将尝试在反序列化过程中嗅探适当的数据类型,然后直接解码为NumPy数组,从而绕过对中间Python对象的需要。

如果您要反序列化大量数字数据,这可以提供加速:

In [252]: randfloats = np.random.uniform(-100, 1000, 10000)

In [253]: randfloats.shape = (1000, 10)

In [254]: dffloats = pd.DataFrame(randfloats, columns=list("ABCDEFGHIJ"))

In [255]: jsonfloats = dffloats.to_json()
In [256]: %timeit pd.read_json(jsonfloats)
5.7 ms +- 16.3 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [257]: %timeit pd.read_json(jsonfloats, numpy=True)
4.31 ms +- 18 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

对于较小的数据集,加速效果不太明显:

In [258]: jsonfloats = dffloats.head(100).to_json()
In [259]: %timeit pd.read_json(jsonfloats)
3.39 ms +- 2.79 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [260]: %timeit pd.read_json(jsonfloats, numpy=True)
3.07 ms +- 12.6 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

警告

直接NumPy解码进行了许多假设,如果不满足这些假设,可能会失败或产生意外输出:

  • 数据是数字的。

  • 数据是统一的。数据类型是从解码的第一个值嗅探出来的。如果不满足此条件,可能会引发 ValueError ,或者产生错误的输出。

  • 标签是有序的。标签只从第一个容器中读取,并假定后续每一行/列都以相同的顺序编码。如果数据是用 to_json 编码的,则应满足此要求;但如果JSON来自其他来源,情况可能并非如此。

正规化#

Pandas提供了一个实用函数,可以接受一个字典或字典列表,并将这种半结构化数据 规范化 为平面表。

In [261]: data = [
   .....:     {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
   .....:     {"name": {"given": "Mark", "family": "Regner"}},
   .....:     {"id": 2, "name": "Faye Raker"},
   .....: ]
   .....: 

In [262]: pd.json_normalize(data)
Out[262]: 
    id name.first name.last name.given name.family        name
0  1.0     Coleen      Volk        NaN         NaN         NaN
1  NaN        NaN       NaN       Mark      Regner         NaN
2  2.0        NaN       NaN        NaN         NaN  Faye Raker
In [263]: data = [
   .....:     {
   .....:         "state": "Florida",
   .....:         "shortname": "FL",
   .....:         "info": {"governor": "Rick Scott"},
   .....:         "county": [
   .....:             {"name": "Dade", "population": 12345},
   .....:             {"name": "Broward", "population": 40000},
   .....:             {"name": "Palm Beach", "population": 60000},
   .....:         ],
   .....:     },
   .....:     {
   .....:         "state": "Ohio",
   .....:         "shortname": "OH",
   .....:         "info": {"governor": "John Kasich"},
   .....:         "county": [
   .....:             {"name": "Summit", "population": 1234},
   .....:             {"name": "Cuyahoga", "population": 1337},
   .....:         ],
   .....:     },
   .....: ]
   .....: 

In [264]: pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]])
Out[264]: 
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

max_level 参数提供了对在哪一层级结束规范化的更多控制。当 max_level=1 时,下面的代码片段只会规范化到所提供字典的第一个嵌套级别。

In [265]: data = [
   .....:     {
   .....:         "CreatedBy": {"Name": "User001"},
   .....:         "Lookup": {
   .....:             "TextField": "Some text",
   .....:             "UserField": {"Id": "ID001", "Name": "Name001"},
   .....:         },
   .....:         "Image": {"a": "b"},
   .....:     }
   .....: ]
   .....: 

In [266]: pd.json_normalize(data, max_level=1)
Out[266]: 
  CreatedBy.Name Lookup.TextField                    Lookup.UserField Image.a
0        User001        Some text  {'Id': 'ID001', 'Name': 'Name001'}       b

行分隔符json#

Pandas能够读写在使用Hadoop或Spark的数据处理管道中常见的行分隔JSON文件。

对于行分隔的JSON文件,Pandas还可以返回一个迭代器,每次迭代读入 chunksize 行。这对于大型文件或从流中读取非常有用。

In [267]: jsonl = """
   .....:     {"a": 1, "b": 2}
   .....:     {"a": 3, "b": 4}
   .....: """
   .....: 

In [268]: df = pd.read_json(jsonl, lines=True)

In [269]: df
Out[269]: 
   a  b
0  1  2
1  3  4

In [270]: df.to_json(orient="records", lines=True)
Out[270]: '{"a":1,"b":2}\n{"a":3,"b":4}\n'

# reader is an iterator that returns ``chunksize`` lines each iteration
In [271]: with pd.read_json(StringIO(jsonl), lines=True, chunksize=1) as reader:
   .....:     reader
   .....:     for chunk in reader:
   .....:         print(chunk)
   .....: 
Empty DataFrame
Columns: []
Index: []
   a  b
0  1  2
   a  b
1  3  4

表模式#

Table Schema 是将表格数据集描述为JSON对象的规范。JSON包括有关字段名、类型和其他属性的信息。您可以使用orient table 构建一个包含 schema 和 data 两个字段的JSON字符串。

In [272]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2, 3],
   .....:         "B": ["a", "b", "c"],
   .....:         "C": pd.date_range("2016-01-01", freq="d", periods=3),
   .....:     },
   .....:     index=pd.Index(range(3), name="idx"),
   .....: )
   .....: 

In [273]: df
Out[273]: 
     A  B          C
idx                 
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

In [274]: df.to_json(orient="table", date_format="iso")
Out[274]: '{"schema":{"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"1.4.0"},"data":[{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'

schema 字段包含 fields 键,后者本身包含列名到类型对的列表,其中包括 Index 或 MultiIndex (有关类型列表,请参阅下文)。如果(多)索引是唯一的, schema 字段还包含一个 primaryKey 字段。

第二个字段 data 包含以 records orient 序列化的数据。其中包括索引,并且所有日期时间都按表模式规范的要求采用ISO 8601格式。

支持的类型的完整列表在表架构规范中描述。下表显示了Pandas类型的映射:

Pandas类型

表架构(Table Schema)类型

int64

integer

float64

number

bool

boolean

datetime64 [ns]

datetime

timedelta64 [ns]

duration

categorical

any

object

str

关于所生成的表模式,请注意以下几点:

  • schema 对象包含一个 pandas_version 字段。它包含模式的Pandas方言的版本,并且会随每次修订而递增。

  • 序列化时,所有日期都转换为UTC。即使是不带时区的值,也被视为偏移量为0的UTC。

    In [275]: from pandas.io.json import build_table_schema
    
    In [276]: s = pd.Series(pd.date_range("2016", periods=4))
    
    In [277]: build_table_schema(s)
    Out[277]: 
    {'fields': [{'name': 'index', 'type': 'integer'},
      {'name': 'values', 'type': 'datetime'}],
     'primaryKey': ['index'],
     'pandas_version': '1.4.0'}
    
  • 带有时区的日期时间(在序列化之前)包括一个附加字段 tz ,其值为时区名称(例如 'US/Central' )。

    In [278]: s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central"))
    
    In [279]: build_table_schema(s_tz)
    Out[279]: 
    {'fields': [{'name': 'index', 'type': 'integer'},
      {'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
     'primaryKey': ['index'],
     'pandas_version': '1.4.0'}
    
  • 周期(Period)在序列化之前会转换为时间戳,因此具有与转换为UTC相同的行为。此外,周期还会包含一个附加字段 freq ,表示该周期的频率,例如 'A-DEC' 。

    In [280]: s_per = pd.Series(1, index=pd.period_range("2016", freq="A-DEC", periods=4))
    
    In [281]: build_table_schema(s_per)
    Out[281]: 
    {'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
      {'name': 'values', 'type': 'integer'}],
     'primaryKey': ['index'],
     'pandas_version': '1.4.0'}
    
  • 分类(Categoricals)使用 any 类型,并用 enum 约束列出可能值的集合。此外,还会包含一个 ordered 字段:

    In [282]: s_cat = pd.Series(pd.Categorical(["a", "b", "a"]))
    
    In [283]: build_table_schema(s_cat)
    Out[283]: 
    {'fields': [{'name': 'index', 'type': 'integer'},
      {'name': 'values',
       'type': 'any',
       'constraints': {'enum': ['a', 'b']},
       'ordered': False}],
     'primaryKey': ['index'],
     'pandas_version': '1.4.0'}
    
  • 如果索引是唯一的,则会包含一个 primaryKey 字段,其中含有标签数组:

    In [284]: s_dupe = pd.Series([1, 2], index=[1, 1])
    
    In [285]: build_table_schema(s_dupe)
    Out[285]: 
    {'fields': [{'name': 'index', 'type': 'integer'},
      {'name': 'values', 'type': 'integer'}],
     'pandas_version': '1.4.0'}
    
  • 对于多重索引, primaryKey 的行为相同,只是此时 primaryKey 是一个数组:

    In [286]: s_multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)]))
    
    In [287]: build_table_schema(s_multi)
    Out[287]: 
    {'fields': [{'name': 'level_0', 'type': 'string'},
      {'name': 'level_1', 'type': 'integer'},
      {'name': 'values', 'type': 'integer'}],
     'primaryKey': FrozenList(['level_0', 'level_1']),
     'pandas_version': '1.4.0'}
    
  • 默认命名大致遵循以下规则:

    • 对于Series,使用 object.name 。如果为None,则名称为 values 。

    • DataFrames ,则使用列名的字符串化版本

    • 对于 Index (非 MultiIndex ),使用 index.name ;如果其为None,则回退到 index 。

    • 对于 MultiIndex ,使用 mi.names 。如果任何级别没有名称,则使用 level_<i> 。

read_json 也接受 orient='table' 作为参数。这允许以可往返的方式保留元数据,例如数据类型和索引名。

In [288]: df = pd.DataFrame(
   .....:     {
   .....:         "foo": [1, 2, 3, 4],
   .....:         "bar": ["a", "b", "c", "d"],
   .....:         "baz": pd.date_range("2018-01-01", freq="d", periods=4),
   .....:         "qux": pd.Categorical(["a", "b", "c", "c"]),
   .....:     },
   .....:     index=pd.Index(range(4), name="idx"),
   .....: )
   .....: 

In [289]: df
Out[289]: 
     foo bar        baz qux
idx                        
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

In [290]: df.dtypes
Out[290]: 
foo             int64
bar            object
baz    datetime64[ns]
qux          category
dtype: object

In [291]: df.to_json("test.json", orient="table")

In [292]: new_df = pd.read_json("test.json", orient="table")

In [293]: new_df
Out[293]: 
     foo bar        baz qux
idx                        
0      1   a 2018-01-01   a
1      2   b 2018-01-02   b
2      3   c 2018-01-03   c
3      4   d 2018-01-04   c

In [294]: new_df.dtypes
Out[294]: 
foo             int64
bar            object
baz    datetime64[ns]
qux          category
dtype: object

请注意,作为 Index 名称的字面字符串‘index’不能往返, MultiIndex 中任何以 'level_' 开头的名称也不能。 DataFrame.to_json() 默认使用这些名称来表示缺失值,后续读取时无法区分其意图。

In [295]: df.index.name = "index"

In [296]: df.to_json("test.json", orient="table")

In [297]: new_df = pd.read_json("test.json", orient="table")

In [298]: print(new_df.index.name)
None

在使用 orient='table' 和用户定义的 ExtensionArray 时,生成的架构将在相应的 fields 元素中包含一个附加的 extDtype 键。这个额外的键不是标准的,但它支持扩展类型的JSON往返(例如 read_json(df.to_json(orient="table"), orient="table") )。

extDtype 键包含扩展类型的名称;如果您已正确注册了 ExtensionDtype ,Pandas将使用该名称在注册表中执行查找,并将序列化数据重新转换为您的自定义数据类型。
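下面是一个示意,借助内置已注册的可空整数扩展类型 Int64 演示这种往返(仅作说明,注释中的结果为预期输出):

ext_df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})
json_ext = ext_df.to_json(orient="table")      # schema 中相应字段预期带有 "extDtype": "Int64"
pd.read_json(json_ext, orient="table").dtypes  # a 列预期恢复为 Int64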

HTML#

读取HTML内容#

警告

我们 强烈建议 您阅读下面的 HTML Table Parsing gotchas ,其中介绍了有关BeautifulSoup4/html5lib/lxml解析器的问题。

顶级的 read_html() 函数可以接受HTML字符串/文件/URL,并将HTML表解析为Pandas DataFrames 的列表。让我们来看几个例子。

备注

read_html 返回一个由 DataFrame 对象组成的 list ,即使HTML内容中只包含一个表。
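一个最小的示意(此处的HTML字符串只是假设的例子):即使只有一个表,也需要从返回的列表中取出 DataFrame。

html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
dfs = pd.read_html(html)   # 返回只含一个 DataFrame 的列表
df = dfs[0]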

读取不带选项的URL:

In [299]: url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"

In [300]: dfs = pd.read_html(url)

In [301]: dfs
Out[301]: 
[                         Bank NameBank           CityCity StateSt  CertCert              Acquiring InstitutionAI Closing DateClosing  FundFund
 0                    Almena State Bank             Almena      KS     15426                          Equity Bank    October 23, 2020     10538
 1           First City Bank of Florida  Fort Walton Beach      FL     16748            United Fidelity Bank, fsb    October 16, 2020     10537
 2                 The First State Bank      Barboursville      WV     14361                       MVB Bank, Inc.       April 3, 2020     10536
 3                   Ericson State Bank            Ericson      NE     18265           Farmers and Merchants Bank   February 14, 2020     10535
 4     City National Bank of New Jersey             Newark      NJ     21111                      Industrial Bank    November 1, 2019     10534
 ..                                 ...                ...     ...       ...                                  ...                 ...       ...
 558                 Superior Bank, FSB           Hinsdale      IL     32646                Superior Federal, FSB       July 27, 2001      6004
 559                Malta National Bank              Malta      OH      6629                    North Valley Bank         May 3, 2001      4648
 560    First Alliance Bank & Trust Co.         Manchester      NH     34264  Southern New Hampshire Bank & Trust    February 2, 2001      4647
 561  National State Bank of Metropolis         Metropolis      IL      3815              Banterra Bank of Marion   December 14, 2000      4646
 562                   Bank of Honolulu           Honolulu      HI     21029                   Bank of the Orient    October 13, 2000      4645
 
 [563 rows x 7 columns]]

备注

上述URL中的数据每周一都会更改,因此上面的结果数据和下面的数据可能会略有不同。

从上面的URL读入文件的内容并将其传递给 read_html 作为字符串:

In [302]: with open(file_path, "r") as f:
   .....:     dfs = pd.read_html(f.read())
   .....: 

In [303]: dfs
Out[303]: 
[                                    Bank Name          City  ST   CERT                Acquiring Institution       Closing Date       Updated Date
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  WI  35386                North Shore Bank, FSB       May 31, 2013       May 31, 2013
 1                        Central Arizona Bank    Scottsdale  AZ  34527                   Western State Bank       May 14, 2013       May 20, 2013
 2                                Sunrise Bank      Valdosta  GA  58185                         Synovus Bank       May 10, 2013       May 21, 2013
 3                       Pisgah Community Bank     Asheville  NC  58701                   Capital Bank, N.A.       May 10, 2013       May 14, 2013
 4                         Douglas County Bank  Douglasville  GA  21649                  Hamilton State Bank     April 26, 2013       May 16, 2013
 ..                                        ...           ...  ..    ...                                  ...                ...                ...
 501                        Superior Bank, FSB      Hinsdale  IL  32646                Superior Federal, FSB      July 27, 2001       June 5, 2012
 502                       Malta National Bank         Malta  OH   6629                    North Valley Bank        May 3, 2001  November 18, 2002
 503           First Alliance Bank & Trust Co.    Manchester  NH  34264  Southern New Hampshire Bank & Trust   February 2, 2001  February 18, 2003
 504         National State Bank of Metropolis    Metropolis  IL   3815              Banterra Bank of Marion  December 14, 2000     March 17, 2005
 505                          Bank of Honolulu      Honolulu  HI  21029                   Bank of the Orient   October 13, 2000     March 17, 2005
 
 [506 rows x 7 columns]]

您甚至可以传入一个 StringIO 如果您有这样的意愿:

In [304]: with open(file_path, "r") as f:
   .....:     sio = StringIO(f.read())
   .....: 

In [305]: dfs = pd.read_html(sio)

In [306]: dfs
Out[306]: 
[                                    Bank Name          City  ST   CERT                Acquiring Institution       Closing Date       Updated Date
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  WI  35386                North Shore Bank, FSB       May 31, 2013       May 31, 2013
 1                        Central Arizona Bank    Scottsdale  AZ  34527                   Western State Bank       May 14, 2013       May 20, 2013
 2                                Sunrise Bank      Valdosta  GA  58185                         Synovus Bank       May 10, 2013       May 21, 2013
 3                       Pisgah Community Bank     Asheville  NC  58701                   Capital Bank, N.A.       May 10, 2013       May 14, 2013
 4                         Douglas County Bank  Douglasville  GA  21649                  Hamilton State Bank     April 26, 2013       May 16, 2013
 ..                                        ...           ...  ..    ...                                  ...                ...                ...
 501                        Superior Bank, FSB      Hinsdale  IL  32646                Superior Federal, FSB      July 27, 2001       June 5, 2012
 502                       Malta National Bank         Malta  OH   6629                    North Valley Bank        May 3, 2001  November 18, 2002
 503           First Alliance Bank & Trust Co.    Manchester  NH  34264  Southern New Hampshire Bank & Trust   February 2, 2001  February 18, 2003
 504         National State Bank of Metropolis    Metropolis  IL   3815              Banterra Bank of Marion  December 14, 2000     March 17, 2005
 505                          Bank of Honolulu      Honolulu  HI  21029                   Bank of the Orient   October 13, 2000     March 17, 2005
 
 [506 rows x 7 columns]]

备注

下面的示例不是由IPython求值器运行的,因为如此多的网络访问函数会降低文档构建的速度。如果您发现错误或示例无法运行,请毫不犹豫地在 pandas GitHub issues page 上报告。

阅读URL并匹配包含特定文本的表格:

match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)

指定标题行(默认情况下,位于 <thead> 中的 <th> 或 <td> 元素用于构成列索引;如果 <thead> 中包含多行,则创建多重索引);如果显式指定,则标题行取自数据减去已解析的标题元素( <th> 元素)。

dfs = pd.read_html(url, header=0)

指定索引列:

dfs = pd.read_html(url, index_col=0)

指定要跳过的行数:

dfs = pd.read_html(url, skiprows=0)

使用列表指定要跳过的行数 (range 同样有效):

dfs = pd.read_html(url, skiprows=range(2))

指定一个HTML属性:

dfs1 = pd.read_html(url, attrs={"id": "table"})
dfs2 = pd.read_html(url, attrs={"class": "sortable"})
print(np.array_equal(dfs1[0], dfs2[0]))  # Should be True

指定应转换为NaN的值:

dfs = pd.read_html(url, na_values=["No Acquirer"])

指定是否保留默认的NaN值集:

dfs = pd.read_html(url, keep_default_na=False)

指定列的转换器。这对于具有前导零的数字文本数据很有用。默认情况下,数值列转换为数值类型,前导零丢失。为了避免这种情况,我们可以将这些列转换为字符串。

url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
dfs = pd.read_html(
    url_mcc,
    match="Telekom Albania",
    header=0,
    converters={"MNC": str},
)

使用以上几种方法的某种组合:

dfs = pd.read_html(url, match="Metcalf Bank", index_col=0)

读取Pandas的 to_html 输出(会有一些浮点精度损失):

df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)

如果 lxml 是您提供的唯一解析器,其后端将在解析失败时引发错误。如果您只有一个解析器,可以只提供一个字符串;但如果函数需要一个字符串序列,那么传递一个只含一个字符串的列表被认为是良好的做法。您可以使用:

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"])

或者你可以通过 flavor='lxml' 不列出清单:

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml")

但是,如果您安装了bs4和html5lib,并传递 None 或 ['lxml', 'bs4'] ,那么解析很可能会成功。请注意, 一旦解析成功,该函数将立即返回 。

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])

写入HTML文件#

DataFrame 对象有一个实例方法 to_html ,它将 DataFrame 的内容呈现为HTML表。函数参数与上文所述的 to_string 方法相同。

备注

为简洁起见,此处并未列出 DataFrame.to_html 的所有可能选项。有关全部选项,请参见 to_html() 。

In [307]: df = pd.DataFrame(np.random.randn(2, 2))

In [308]: df
Out[308]: 
          0         1
0 -0.184744  0.496971
1 -0.856240  1.857977

In [309]: print(df.to_html())  # raw html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

HTML:

0 1
0 -0.184744 0.496971
1 -0.856240 1.857977

这个 columns 参数将限制显示的列:

In [310]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
    </tr>
  </tbody>
</table>

HTML:

0
0 -0.184744
1 -0.856240

float_format 接受一个Python可调用对象,用于控制浮点值的精度:

In [311]: print(df.to_html(float_format="{0:.10f}".format))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.1847438576</td>
      <td>0.4969711327</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.8562396763</td>
      <td>1.8579766508</td>
    </tr>
  </tbody>
</table>

HTML:

0 1
0 -0.1847438576 0.4969711327
1 -0.8562396763 1.8579766508

bold_rows 默认情况下会将行标签设置为粗体,但您可以将其关闭:

In [312]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <td>1</td>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>
0 1
0 -0.184744 0.496971
1 -0.856240 1.857977

classes 参数提供了为生成的HTML表指定CSS类的能力。请注意,这些类是 附加 到现有的 'dataframe' 类之后的。

In [313]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

这个 render_links 参数提供了将超链接添加到包含URL的单元格的功能。

In [314]: url_df = pd.DataFrame(
   .....:     {
   .....:         "name": ["Python", "pandas"],
   .....:         "url": ["https://www.python.org/", "https://pandas.pydata.org"],
   .....:     }
   .....: )
   .....: 

In [315]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>name</th>
      <th>url</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Python</td>
      <td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
    </tr>
    <tr>
      <th>1</th>
      <td>pandas</td>
      <td><a href="https://pandas.pydata.org" target="_blank">https://pandas.pydata.org</a></td>
    </tr>
  </tbody>
</table>

HTML:

name url
0 Python https://www.python.org/
1 pandas https://pandas.pydata.org

最后, escape 参数允许您控制是否在生成的HTML中转义“<”、“>”和“&”字符(默认为 True )。因此,要获得不转义这些字符的HTML,请传递 escape=False 。

In [316]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})

转义后:

In [317]: print(df.to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&amp;</td>
      <td>-0.474063</td>
    </tr>
    <tr>
      <th>1</th>
      <td>&lt;</td>
      <td>-0.230305</td>
    </tr>
    <tr>
      <th>2</th>
      <td>&gt;</td>
      <td>-0.400654</td>
    </tr>
  </tbody>
</table>
a b
0 & -0.474063
1 < -0.230305
2 > -0.400654

未转义:

In [318]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&</td>
      <td>-0.474063</td>
    </tr>
    <tr>
      <th>1</th>
      <td><</td>
      <td>-0.230305</td>
    </tr>
    <tr>
      <th>2</th>
      <td>></td>
      <td>-0.400654</td>
    </tr>
  </tbody>
</table>
a b
0 & -0.474063
1 < -0.230305
2 > -0.400654

备注

某些浏览器在呈现前两个HTML表格时可能不会显示出差异。

HTML表格解析Gotchas#

在顶级Pandas IO函数 read_html 中,围绕用于解析HTML表的库存在一些版本问题。

关于 lxml 的问题:

  • 优势

    • lxml 是非常快的。

    • lxml 需要Cython才能正确安装。

  • 缺点

    • 除非提供给它的是 严格有效的标记 ,否则 lxml 不会 对其解析结果做出任何保证。

    • 有鉴于此,我们选择允许您(用户)使用 lxml 后端,但 如果 lxml 无法解析,该后端将改用 html5lib 。

    • 因此, 强烈推荐 同时安装 BeautifulSoup4 和 html5lib ,这样即使 lxml 失败,您仍将获得有效的结果(前提是其他所有内容都有效)。

关于以 lxml 作为后端的 BeautifulSoup4 的问题:

  • 上述问题在这里也适用,因为 BeautifulSoup4 本质上只是一个解析器后端的包装器。

关于以 html5lib 作为后端的 BeautifulSoup4 的问题:

  • 优势

    • html5lib 比 lxml 宽容得多,因此能以更合理的方式处理现实中的标记,而不是简单地(例如)在不通知您的情况下删除某个元素。

    • html5lib 会自动从无效标记生成有效的HTML5标记。这对于解析HTML表非常重要,因为它保证了文档的有效性。然而,这并 不 意味着结果就是“正确的”,因为修复标记的过程并没有唯一的定义。

    • html5lib 是纯Python,除了它自己的安装之外,不需要额外的构建步骤。

  • 缺点

    • 使用 html5lib 的最大缺点是它非常慢。然而,考虑到Web上的许多表都不大,解析算法的运行时间并不重要。瓶颈更可能出现在通过网络从URL读取原始文本的过程中,即IO(输入-输出)。对于非常大的表,情况可能并非如此。

Latex#

1.3.0 新版功能.

目前没有从LaTeX读取的方法,只有输出方法。

写入LaTeX文件#

备注

DataFrame 和Styler对象当前都有一个 to_latex 方法。我们建议使用 Styler.to_latex() 而不是 DataFrame.to_latex() ,因为前者在条件样式方面具有更大的灵活性,而后者在未来可能会被弃用。

请查看 Styler.to_latex 的文档,其中给出了条件样式的示例,并解释了其关键字参数的作用。

对于简单的应用程序,以下模式就足够了。

In [319]: df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["c", "d"])

In [320]: print(df.style.to_latex())
\begin{tabular}{lrr}
 & c & d \\
a & 1 & 2 \\
b & 3 & 4 \\
\end{tabular}

若要在输出前格式化值,请链式调用 Styler.format 方法。

In [321]: print(df.style.format("€ {}").to_latex())
\begin{tabular}{lrr}
 & c & d \\
a & € 1 & € 2 \\
b & € 3 & € 4 \\
\end{tabular}

XML#

读取XML#

1.3.0 新版功能.

顶级的 read_xml() 函数可以接受XML字符串/文件/URL,并将节点和属性解析为Pandas DataFrame 。

备注

由于不存在标准的XML结构(其设计类型可以有多种变化), read_xml 最适合处理较扁平、较浅的XML。如果XML文档嵌套得很深,请使用 stylesheet 功能将XML转换为更扁平的版本。

让我们来看几个例子。

读取一个XML字符串:

In [322]: xml = """<?xml version="1.0" encoding="UTF-8"?>
   .....: <bookstore>
   .....:   <book category="cooking">
   .....:     <title lang="en">Everyday Italian</title>
   .....:     <author>Giada De Laurentiis</author>
   .....:     <year>2005</year>
   .....:     <price>30.00</price>
   .....:   </book>
   .....:   <book category="children">
   .....:     <title lang="en">Harry Potter</title>
   .....:     <author>J K. Rowling</author>
   .....:     <year>2005</year>
   .....:     <price>29.99</price>
   .....:   </book>
   .....:   <book category="web">
   .....:     <title lang="en">Learning XML</title>
   .....:     <author>Erik T. Ray</author>
   .....:     <year>2003</year>
   .....:     <price>39.95</price>
   .....:   </book>
   .....: </bookstore>"""
   .....: 

In [323]: df = pd.read_xml(xml)

In [324]: df
Out[324]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

读取不带选项的URL:

In [325]: df = pd.read_xml("https://www.w3schools.com/xml/books.xml")

In [326]: df
Out[326]: 
   category              title                  author  year  price      cover
0   cooking   Everyday Italian     Giada De Laurentiis  2005  30.00       None
1  children       Harry Potter            J K. Rowling  2005  29.99       None
2       web  XQuery Kick Start  Vaidyanathan Nagarajan  2003  49.99       None
3       web       Learning XML             Erik T. Ray  2003  39.95  paperback

读入“books.xml”文件的内容并将其传递给 read_xml 作为字符串:

In [327]: with open(file_path, "r") as f:
   .....:     df = pd.read_xml(f.read())
   .....: 

In [328]: df
Out[328]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

将“books.xml”的内容读入为 StringIOBytesIO 并将其传递给 read_xml

In [329]: with open(file_path, "r") as f:
   .....:     sio = StringIO(f.read())
   .....: 

In [330]: df = pd.read_xml(sio)

In [331]: df
Out[331]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95
In [332]: with open(file_path, "rb") as f:
   .....:     bio = BytesIO(f.read())
   .....: 

In [333]: df = pd.read_xml(bio)

In [334]: df
Out[334]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

甚至可以从AWS S3存储桶中读取XML,如Python Software Foundation的IRS990表格:

In [335]: df = pd.read_xml(
   .....:     "s3://irs-form-990/201923199349319487_public.xml",
   .....:     xpath=".//irs:Form990PartVIISectionAGrp",
   .....:     namespaces={"irs": "http://www.irs.gov/efile"}
   .....: )
   .....: 
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:139, in import_optional_dependency(name, extra, errors, min_version)
    138 try:
--> 139     module = importlib.import_module(name)
    140 except ImportError:

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1004, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'fsspec'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Input In [335], in <cell line: 1>()
----> 1 df = pd.read_xml(
      2     "s3://irs-form-990/201923199349319487_public.xml",
      3     xpath=".//irs:Form990PartVIISectionAGrp",
      4     namespaces={"irs": "http://www.irs.gov/efile"}
      5 )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/util/_decorators.py:317, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    311 if len(args) > num_allow_args:
    312     warnings.warn(
    313         msg.format(arguments=arguments),
    314         FutureWarning,
    315         stacklevel=stacklevel,
    316     )
--> 317 return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/xml.py:1208, in read_xml(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, dtype, converters, parse_dates, encoding, parser, stylesheet, iterparse, compression, storage_options)
    957 @deprecate_nonkeyword_arguments(
    958     version=None, allowed_args=["path_or_buffer"], stacklevel=2
    959 )
   (...)
    980     storage_options: StorageOptions = None,
    981 ) -> DataFrame:
    982     r"""
    983     Read XML document into a ``DataFrame`` object.
    984 
   (...)
   1205     2  triangle      180    3.0
   1206     """
-> 1208     return _parse(
   1209         path_or_buffer=path_or_buffer,
   1210         xpath=xpath,
   1211         namespaces=namespaces,
   1212         elems_only=elems_only,
   1213         attrs_only=attrs_only,
   1214         names=names,
   1215         dtype=dtype,
   1216         converters=converters,
   1217         parse_dates=parse_dates,
   1218         encoding=encoding,
   1219         parser=parser,
   1220         stylesheet=stylesheet,
   1221         iterparse=iterparse,
   1222         compression=compression,
   1223         storage_options=storage_options,
   1224     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/xml.py:946, in _parse(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, dtype, converters, parse_dates, encoding, parser, stylesheet, iterparse, compression, storage_options, **kwargs)
    943 else:
    944     raise ValueError("Values for parser can only be lxml or etree.")
--> 946 data_dicts = p.parse_data()
    948 return _data_to_frame(
    949     data=data_dicts,
    950     dtype=dtype,
   (...)
    953     **kwargs,
    954 )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/xml.py:527, in _LxmlFrameParser.parse_data(self)
    524 from lxml.etree import XML
    526 if self.iterparse is None:
--> 527     self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
    529     if self.stylesheet:
    530         self.xsl_doc = XML(self._parse_doc(self.stylesheet))

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/xml.py:739, in _LxmlFrameParser._parse_doc(self, raw_doc)
    731 def _parse_doc(self, raw_doc) -> bytes:
    732     from lxml.etree import (
    733         XMLParser,
    734         fromstring,
    735         parse,
    736         tostring,
    737     )
--> 739     handle_data = get_data_from_filepath(
    740         filepath_or_buffer=raw_doc,
    741         encoding=self.encoding,
    742         compression=self.compression,
    743         storage_options=self.storage_options,
    744     )
    746     with preprocess_data(handle_data) as xml_data:
    747         curr_parser = XMLParser(encoding=self.encoding)

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/xml.py:808, in get_data_from_filepath(filepath_or_buffer, encoding, compression, storage_options)
    797     filepath_or_buffer = stringify_path(filepath_or_buffer)
    799 if (
    800     isinstance(filepath_or_buffer, str)
    801     and not filepath_or_buffer.startswith(("<?xml", "<"))
   (...)
    806     or file_exists(filepath_or_buffer)
    807 ):
--> 808     with get_handle(
    809         filepath_or_buffer,
    810         "r",
    811         encoding=encoding,
    812         compression=compression,
    813         storage_options=storage_options,
    814     ) as handle_obj:
    815         filepath_or_buffer = (
    816             # error: Incompatible types in assignment (expression has type
    817             # "Union[str, IO[str]]", variable has type "Union[Union[str,
   (...)
    821             else handle_obj.handle
    822         )
    824 return filepath_or_buffer

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/common.py:667, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    664     codecs.lookup_error(errors)
    666 # open URLs
--> 667 ioargs = _get_filepath_or_buffer(
    668     path_or_buf,
    669     encoding=encoding,
    670     compression=compression,
    671     mode=mode,
    672     storage_options=storage_options,
    673 )
    675 handle = ioargs.filepath_or_buffer
    676 handles: list[BaseBuffer]

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/common.py:361, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    359 if filepath_or_buffer.startswith("s3n://"):
    360     filepath_or_buffer = filepath_or_buffer.replace("s3n://", "s3://")
--> 361 fsspec = import_optional_dependency("fsspec")
    363 # If botocore is installed we fallback to reading with anon=True
    364 # to allow reads from public buckets
    365 err_types_to_retry_with_anon: list[Any] = []

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:142, in import_optional_dependency(name, extra, errors, min_version)
    140 except ImportError:
    141     if errors == "raise":
--> 142         raise ImportError(msg)
    143     else:
    144         return None

ImportError: Missing optional dependency 'fsspec'.  Use pip or conda to install fsspec.

In [336]: df
Out[336]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

使用 lxml 作为默认设置 parser ,您可以访问扩展了Python的ElementTree API的功能齐全的XML库。一个强大的工具是能够使用更具表现力的XPath选择性或有条件地查询节点:

In [337]: df = pd.read_xml(file_path, xpath="//book[year=2005]")

In [338]: df
Out[338]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99

仅指定要分析的元素或属性:

In [339]: df = pd.read_xml(file_path, elems_only=True)

In [340]: df
Out[340]: 
              title               author  year  price
0  Everyday Italian  Giada De Laurentiis  2005  30.00
1      Harry Potter         J K. Rowling  2005  29.99
2      Learning XML          Erik T. Ray  2003  39.95
In [341]: df = pd.read_xml(file_path, attrs_only=True)

In [342]: df
Out[342]: 
   category
0   cooking
1  children
2       web

XML文档可以有带前缀的命名空间和不带前缀的默认命名空间,这两种命名空间都用特殊属性表示 xmlns 。为了通过名字空间上下文下的节点进行解析, xpath 必须引用前缀。

例如,下面的XML包含带有前缀的命名空间, doc ,和URI位于 https://example.com 。为了解析 doc:row 节点, namespaces 必须使用。

In [343]: xml = """<?xml version='1.0' encoding='utf-8'?>
   .....: <doc:data xmlns:doc="https://example.com">
   .....:   <doc:row>
   .....:     <doc:shape>square</doc:shape>
   .....:     <doc:degrees>360</doc:degrees>
   .....:     <doc:sides>4.0</doc:sides>
   .....:   </doc:row>
   .....:   <doc:row>
   .....:     <doc:shape>circle</doc:shape>
   .....:     <doc:degrees>360</doc:degrees>
   .....:     <doc:sides/>
   .....:   </doc:row>
   .....:   <doc:row>
   .....:     <doc:shape>triangle</doc:shape>
   .....:     <doc:degrees>180</doc:degrees>
   .....:     <doc:sides>3.0</doc:sides>
   .....:   </doc:row>
   .....: </doc:data>"""
   .....: 

In [344]: df = pd.read_xml(xml,
   .....:                  xpath="//doc:row",
   .....:                  namespaces={"doc": "https://example.com"})
   .....: 

In [345]: df
Out[345]: 
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

类似地,XML文档可以有一个不带前缀的默认命名空间。如果不分配临时前缀,将不会返回任何节点并引发 ValueError 。但只要把 任意 临时名称分配给正确的URI,就可以按节点进行解析。

In [346]: xml = """<?xml version='1.0' encoding='utf-8'?>
   .....: <data xmlns="https://example.com">
   .....:  <row>
   .....:    <shape>square</shape>
   .....:    <degrees>360</degrees>
   .....:    <sides>4.0</sides>
   .....:  </row>
   .....:  <row>
   .....:    <shape>circle</shape>
   .....:    <degrees>360</degrees>
   .....:    <sides/>
   .....:  </row>
   .....:  <row>
   .....:    <shape>triangle</shape>
   .....:    <degrees>180</degrees>
   .....:    <sides>3.0</sides>
   .....:  </row>
   .....: </data>"""
   .....: 

In [347]: df = pd.read_xml(xml,
   .....:                  xpath="//pandas:row",
   .....:                  namespaces={"pandas": "https://example.com"})
   .....: 

In [348]: df
Out[348]: 
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

但是,如果XPath没有引用节点名称,例如Default, /* ,那么 namespaces 不是必需的。

使用 lxml 作为解析器,您可以使用XSLT脚本来扁平化嵌套的XML文档,该脚本同样可以是字符串/文件/URL类型。作为背景, XSLT 是一种以特殊XML文件编写的专用语言,可借助XSLT处理器将原始XML文档转换为其他XML、HTML,甚至文本(CSV、JSON等)。

例如,考虑下面这个带有一定嵌套结构的芝加哥“L”轨道交通客流数据,其中station和rides元素将数据封装在各自的部分中。使用下面的XSLT, lxml 可以将原始的嵌套文档转换为更扁平的输出(如下所示),以便更轻松地解析为 DataFrame :

In [349]: xml = """<?xml version='1.0' encoding='utf-8'?>
   .....:  <response>
   .....:   <row>
   .....:     <station id="40850" name="Library"/>
   .....:     <month>2020-09-01T00:00:00</month>
   .....:     <rides>
   .....:       <avg_weekday_rides>864.2</avg_weekday_rides>
   .....:       <avg_saturday_rides>534</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>417.2</avg_sunday_holiday_rides>
   .....:     </rides>
   .....:   </row>
   .....:   <row>
   .....:     <station id="41700" name="Washington/Wabash"/>
   .....:     <month>2020-09-01T00:00:00</month>
   .....:     <rides>
   .....:       <avg_weekday_rides>2707.4</avg_weekday_rides>
   .....:       <avg_saturday_rides>1909.8</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>1438.6</avg_sunday_holiday_rides>
   .....:     </rides>
   .....:   </row>
   .....:   <row>
   .....:     <station id="40380" name="Clark/Lake"/>
   .....:     <month>2020-09-01T00:00:00</month>
   .....:     <rides>
   .....:       <avg_weekday_rides>2949.6</avg_weekday_rides>
   .....:       <avg_saturday_rides>1657</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>1453.8</avg_sunday_holiday_rides>
   .....:     </rides>
   .....:   </row>
   .....:  </response>"""
   .....: 

In [350]: xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   .....:    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
   .....:    <xsl:strip-space elements="*"/>
   .....:    <xsl:template match="/response">
   .....:       <xsl:copy>
   .....:         <xsl:apply-templates select="row"/>
   .....:       </xsl:copy>
   .....:    </xsl:template>
   .....:    <xsl:template match="row">
   .....:       <xsl:copy>
   .....:         <station_id><xsl:value-of select="station/@id"/></station_id>
   .....:         <station_name><xsl:value-of select="station/@name"/></station_name>
   .....:         <xsl:copy-of select="month|rides/*"/>
   .....:       </xsl:copy>
   .....:    </xsl:template>
   .....:  </xsl:stylesheet>"""
   .....: 

In [351]: output = """<?xml version='1.0' encoding='utf-8'?>
   .....:  <response>
   .....:    <row>
   .....:       <station_id>40850</station_id>
   .....:       <station_name>Library</station_name>
   .....:       <month>2020-09-01T00:00:00</month>
   .....:       <avg_weekday_rides>864.2</avg_weekday_rides>
   .....:       <avg_saturday_rides>534</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>417.2</avg_sunday_holiday_rides>
   .....:    </row>
   .....:    <row>
   .....:       <station_id>41700</station_id>
   .....:       <station_name>Washington/Wabash</station_name>
   .....:       <month>2020-09-01T00:00:00</month>
   .....:       <avg_weekday_rides>2707.4</avg_weekday_rides>
   .....:       <avg_saturday_rides>1909.8</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>1438.6</avg_sunday_holiday_rides>
   .....:    </row>
   .....:    <row>
   .....:       <station_id>40380</station_id>
   .....:       <station_name>Clark/Lake</station_name>
   .....:       <month>2020-09-01T00:00:00</month>
   .....:       <avg_weekday_rides>2949.6</avg_weekday_rides>
   .....:       <avg_saturday_rides>1657</avg_saturday_rides>
   .....:       <avg_sunday_holiday_rides>1453.8</avg_sunday_holiday_rides>
   .....:    </row>
   .....:  </response>"""
   .....: 

In [352]: df = pd.read_xml(xml, stylesheet=xsl)

In [353]: df
Out[353]: 
   station_id       station_name                month  avg_weekday_rides  avg_saturday_rides  avg_sunday_holiday_rides
0       40850            Library  2020-09-01T00:00:00              864.2               534.0                     417.2
1       41700  Washington/Wabash  2020-09-01T00:00:00             2707.4              1909.8                    1438.6
2       40380         Clark/Lake  2020-09-01T00:00:00             2949.6              1657.0                    1453.8

对于从数百兆字节到数千兆字节的非常大的XML文件, pandas.read_xml() 支持使用 lxml 的 iterparse 和 etree 的 iterparse 来解析此类大文件。这是一种迭代遍历XML树并提取特定元素和属性的内存高效方法,不会将整棵树保存在内存中。

1.5.0 新版功能.

要使用此功能,必须将物理XML文件路径传递给 read_xml 并使用 iterparse 参数。文件不应被压缩或指向在线资源,而应存储在本地磁盘上。此外, iterparse 应该是一个字典,其中键是文档中的重复节点(它们将成为行),值是作为重复节点后代(即子节点、孙节点)的任何元素或属性的列表。由于此方法不使用XPath,后代之间不需要彼此共享相同的关系。下面显示了读取维基百科非常大的(12 GB+)最新文章数据转储的示例。

In [1]: df = pd.read_xml(
...         "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...         iterparse = {"page": ["title", "ns", "id"]}
...     )
...     df
Out[2]:
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

编写XML#

1.3.0 新版功能.

DataFrame 对象有一个实例方法 to_xml ,它将 DataFrame 的内容呈现为XML文档。

备注

此方法不支持XML的特殊属性,包括DTD、CDATA、XSD架构、处理指令、注释等。仅支持根级别的命名空间。然而, stylesheet 允许在初始输出后更改设计。

让我们来看几个例子。

编写不带选项的XML:

In [354]: geom_df = pd.DataFrame(
   .....:     {
   .....:         "shape": ["square", "circle", "triangle"],
   .....:         "degrees": [360, 360, 180],
   .....:         "sides": [4, np.nan, 3],
   .....:     }
   .....: )
   .....: 

In [355]: print(geom_df.to_xml())
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

使用新的根名和行名编写一个XML:

In [356]: print(geom_df.to_xml(root_name="geometry", row_name="objects"))
<?xml version='1.0' encoding='utf-8'?>
<geometry>
  <objects>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </objects>
  <objects>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </objects>
  <objects>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </objects>
</geometry>

编写以属性为中心的XML:

In [357]: print(geom_df.to_xml(attr_cols=geom_df.columns.tolist()))
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row index="0" shape="square" degrees="360" sides="4.0"/>
  <row index="1" shape="circle" degrees="360"/>
  <row index="2" shape="triangle" degrees="180" sides="3.0"/>
</data>

混合编写元素和属性:

In [358]: print(
   .....:     geom_df.to_xml(
   .....:         index=False,
   .....:         attr_cols=['shape'],
   .....:         elem_cols=['degrees', 'sides'])
   .....: )
   .....: 
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row shape="square">
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row shape="circle">
    <degrees>360</degrees>
    <sides/>
  </row>
  <row shape="triangle">
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

任何具有层次化列的 DataFrames 都将被展平为XML元素名,各级别之间以下划线分隔:

In [359]: ext_geom_df = pd.DataFrame(
   .....:     {
   .....:         "type": ["polygon", "other", "polygon"],
   .....:         "shape": ["square", "circle", "triangle"],
   .....:         "degrees": [360, 360, 180],
   .....:         "sides": [4, np.nan, 3],
   .....:     }
   .....: )
   .....: 

In [360]: pvt_df = ext_geom_df.pivot_table(index='shape',
   .....:                                  columns='type',
   .....:                                  values=['degrees', 'sides'],
   .....:                                  aggfunc='sum')
   .....: 

In [361]: pvt_df
Out[361]: 
         degrees         sides        
type       other polygon other polygon
shape                                 
circle     360.0     NaN   0.0     NaN
square       NaN   360.0   NaN     4.0
triangle     NaN   180.0   NaN     3.0

In [362]: print(pvt_df.to_xml())
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <shape>circle</shape>
    <degrees_other>360.0</degrees_other>
    <degrees_polygon/>
    <sides_other>0.0</sides_other>
    <sides_polygon/>
  </row>
  <row>
    <shape>square</shape>
    <degrees_other/>
    <degrees_polygon>360.0</degrees_polygon>
    <sides_other/>
    <sides_polygon>4.0</sides_polygon>
  </row>
  <row>
    <shape>triangle</shape>
    <degrees_other/>
    <degrees_polygon>180.0</degrees_polygon>
    <sides_other/>
    <sides_polygon>3.0</sides_polygon>
  </row>
</data>

使用默认命名空间编写一个XML:

In [363]: print(geom_df.to_xml(namespaces={"": "https://example.com"}))
<?xml version='1.0' encoding='utf-8'?>
<data xmlns="https://example.com">
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

编写一个带有命名空间前缀的XML:

In [364]: print(
   .....:     geom_df.to_xml(namespaces={"doc": "https://example.com"},
   .....:                    prefix="doc")
   .....: )
   .....: 
<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:index>0</doc:index>
    <doc:shape>square</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides>4.0</doc:sides>
  </doc:row>
  <doc:row>
    <doc:index>1</doc:index>
    <doc:shape>circle</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides/>
  </doc:row>
  <doc:row>
    <doc:index>2</doc:index>
    <doc:shape>triangle</doc:shape>
    <doc:degrees>180</doc:degrees>
    <doc:sides>3.0</doc:sides>
  </doc:row>
</doc:data>

编写一个不带声明或漂亮字体的XML:

In [365]: print(
   .....:     geom_df.to_xml(xml_declaration=False,
   .....:                    pretty_print=False)
   .....: )
   .....: 
<data><row><index>0</index><shape>square</shape><degrees>360</degrees><sides>4.0</sides></row><row><index>1</index><shape>circle</shape><degrees>360</degrees><sides/></row><row><index>2</index><shape>triangle</shape><degrees>180</degrees><sides>3.0</sides></row></data>

编写一个XML并使用样式表进行转换:

In [366]: xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   .....:    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
   .....:    <xsl:strip-space elements="*"/>
   .....:    <xsl:template match="/data">
   .....:      <geometry>
   .....:        <xsl:apply-templates select="row"/>
   .....:      </geometry>
   .....:    </xsl:template>
   .....:    <xsl:template match="row">
   .....:      <object index="{index}">
   .....:        <xsl:if test="shape!='circle'">
   .....:            <xsl:attribute name="type">polygon</xsl:attribute>
   .....:        </xsl:if>
   .....:        <xsl:copy-of select="shape"/>
   .....:        <property>
   .....:            <xsl:copy-of select="degrees|sides"/>
   .....:        </property>
   .....:      </object>
   .....:    </xsl:template>
   .....:  </xsl:stylesheet>"""
   .....: 

In [367]: print(geom_df.to_xml(stylesheet=xsl))
<?xml version="1.0"?>
<geometry>
  <object index="0" type="polygon">
    <shape>square</shape>
    <property>
      <degrees>360</degrees>
      <sides>4.0</sides>
    </property>
  </object>
  <object index="1">
    <shape>circle</shape>
    <property>
      <degrees>360</degrees>
      <sides/>
    </property>
  </object>
  <object index="2" type="polygon">
    <shape>triangle</shape>
    <property>
      <degrees>180</degrees>
      <sides>3.0</sides>
    </property>
  </object>
</geometry>

XML结束语#

  • 所有XML文档都遵循 W3C specifications 。 etree 和 lxml 两种解析器都无法解析任何格式错误或不遵循XML语法规则的标记文档。请注意,除非遵循XHTML规范,否则HTML不是XML文档。然而,其他流行的标记类型,包括KML、XAML、RSS、MusicML、MathML,都是兼容的 XML schemas 。

  • 出于上述原因,如果您的应用程序在Pandas操作之前构建XML,请使用适当的DOM库(例如 etree 和 lxml )来构建必要的文档,而不是通过字符串拼接或正则表达式调整。请始终记住,XML是一种带有标记规则的 特殊 文本文件。

  • 对于非常大的XML文件(从几百MB到GB),XPath和XSLT可能成为内存密集型操作。确保有足够的可用RAM来读写大型的XML文件(大约是文本大小的5倍)。

  • 因为XSLT是一种编程语言,所以请谨慎使用它,因为这样的脚本可能会在您的环境中带来安全风险,并可能运行大型或无限的递归操作。在完全运行之前,始终在小片段上测试脚本。

  • 这个 etree 解析器支持两者的所有功能 read_xmlto_xml 除了复杂的XPath和任何XSLT。虽然功能有限,但 etree 仍然是一个可靠且有能力的解析器和树构建器。它的表现可能会落后于 lxml 在一定程度上适用于较大的文件,但对于小到中等大小的文件相对不明显。

Excel files#

The read_excel() method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003 (.xls) files can be read using xlrd. Binary Excel (.xlsb) files can be read using pyxlsb. The to_excel() instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for some advanced strategies.

Warning

The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.

Before pandas 1.3.0, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. pandas will now default to using the openpyxl engine.

It is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using ``xlrd`` to read ``.xlsx`` files. This is no longer supported, switch to using openpyxl instead.

Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl engine instead.
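
As a minimal illustration (assuming openpyxl is installed; the file name is only a placeholder), the engine can also be passed explicitly so there is no ambiguity about which reader is used:

# Explicitly request the openpyxl engine for an .xlsx file
pd.read_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="openpyxl")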

Reading Excel files#

In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which sheet to parse.

# Returns a DataFrame
pd.read_excel("path_to_file.xls", sheet_name="Sheet1")

ExcelFile class#

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets, as the file is read into memory only once.

xlsx = pd.ExcelFile("path_to_file.xls")
df = pd.read_excel(xlsx, "Sheet1")

The ExcelFile class can also be used as a context manager.

with pd.ExcelFile("path_to_file.xls") as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")

The sheet_names property will generate a list of the sheet names in the file.
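
For instance, a small sketch of inspecting the sheet names before deciding what to parse (the names shown are placeholders):

with pd.ExcelFile("path_to_file.xls") as xls:
    print(xls.sheet_names)  # e.g. ['Sheet1', 'Sheet2']
    df = pd.read_excel(xls, xls.sheet_names[0])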

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)

Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to read_excel with no loss in performance.

# using the ExcelFile class
data = {}
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"])

# equivalent using the read_excel function
data = pd.read_excel(
    "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"]
)

ExcelFile can also be called with a xlrd.book.Book object as a parameter. This allows the user to control how the excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook() with on_demand=True.

import xlrd

xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")

Specifying sheets#

Note

The second argument is sheet_name, not to be confused with ExcelFile.sheet_names.

Note

An ExcelFile's attribute sheet_names provides access to a list of sheets.

  • The argument sheet_name allows specifying the sheet or sheets to read.

  • The default value for sheet_name is 0, indicating to read the first sheet

  • Pass a string to refer to the name of a particular sheet in the workbook.

  • Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.

  • Pass a list of either strings or integers, to return a dictionary of specified sheets.

  • Pass a None to return a dictionary of all available sheets.

# Returns a DataFrame
pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"])

Using the sheet index:

# Returns a DataFrame
pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"])

Using all default values:

# Returns a DataFrame
pd.read_excel("path_to_file.xls")

Using None to get all sheets:

# Returns a dictionary of DataFrames
pd.read_excel("path_to_file.xls", sheet_name=None)

Using a list to get multiple sheets:

# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3])

read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of sheet positions, or None to read in all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.

Reading a MultiIndex#

read_excel can read a MultiIndex index, by passing a list of columns to index_col, and a MultiIndex column by passing a list of rows to header. If either the index or columns have serialized level names those will be read in as well, by specifying the rows/columns that make up the levels.

For example, to read in a MultiIndex index without names:

In [368]: df = pd.DataFrame(
   .....:     {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
   .....:     index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]),
   .....: )
   .....: 

In [369]: df.to_excel("path_to_file.xlsx")

In [370]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])

In [371]: df
Out[371]: 
     a  b
a c  1  5
  d  2  6
b c  3  7
  d  4  8

If the index has level names, they will be parsed as well, using the same parameters.

In [372]: df.index = df.index.set_names(["lvl1", "lvl2"])

In [373]: df.to_excel("path_to_file.xlsx")

In [374]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1])

In [375]: df
Out[375]: 
           a  b
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col and header:

In [376]: df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"])

In [377]: df.to_excel("path_to_file.xlsx")

In [378]: df = pd.read_excel("path_to_file.xlsx", index_col=[0, 1], header=[0, 1])

In [379]: df
Out[379]: 
c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

Missing values in columns specified in index_col will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values, use set_index after reading the data instead of index_col.
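
A minimal sketch of that alternative, assuming the index level columns are named lvl1 and lvl2 as in the example above:

# Read the level columns as ordinary columns, then build the index explicitly;
# this avoids the forward filling applied when using index_col
df = pd.read_excel("path_to_file.xlsx")
df = df.set_index(["lvl1", "lvl2"])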

Parsing specific columns#

It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read in those columns. read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.

Changed in version 1.0.0.

Passing in an integer for usecols will no longer work. Please pass in a list of ints from 0 to usecols inclusive instead.

You can specify a comma-delimited set of Excel columns and ranges as a string:

pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E")

If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.

pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3])

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].

If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either by the user in names or inferred from the document header row(s). Those strings define which columns will be parsed:

pd.read_excel("path_to_file.xls", "Sheet1", usecols=["foo", "bar"])

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].

If usecols is callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.

pd.read_excel("path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha())

Parsing dates#

Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel file. But if you have a column of strings that look like dates (but are not actually formatted as dates in excel), you can use the parse_dates keyword to parse those strings to datetimes:

pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"])

Cell converters#

It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column to boolean:

pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool})

This option handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype:

def cfun(x):
    return int(x) if x else -1


pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun})

Dtype specifications#

As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.

pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str})

Writing Excel files#

Writing Excel files to disk#

To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as to_csv described above, the first argument being the name of the excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example:

df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")

Files with a .xls extension will be written using xlwt, and those with a .xlsx extension will be written using xlsxwriter (if available) or openpyxl.

The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed in the second row instead of the first. You can place it in the first row by setting the merge_cells option in to_excel() to False:

df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False)

In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.

with pd.ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

Writing Excel files to memory#

pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.

from io import BytesIO

bio = BytesIO()

# By setting the 'engine' in the ExcelWriter constructor.
writer = pd.ExcelWriter(bio, engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1")

# Save the workbook
writer.save()

# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()

Note

engine is optional but recommended. Setting the engine determines the version of workbook produced. Setting engine='xlrd' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl' or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced.

Excel writer engines#

Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing to .xls files.

pandas chooses an Excel writer via two methods:

  1. the engine keyword argument

  2. the filename extension (via the default specified in config options)

By default, pandas uses XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have multiple engines installed, you can set the default engine through setting the config options io.excel.xlsx.writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not available.

To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are:

  • openpyxl: version 2.4 or higher is required

  • xlsxwriter

  • xlwt

# By setting the 'engine' in the DataFrame 'to_excel()' methods.
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter")

# By setting the 'engine' in the ExcelWriter constructor.
writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter")

# Or via pandas configuration.
from pandas import options  # noqa: E402

options.io.excel.xlsx.writer = "xlsxwriter"

df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")

Style and formatting#

The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the DataFrame's to_excel method.

  • float_format : Format string for floating point numbers (default None).

  • freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default None). See the sketch after this list.
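
A minimal sketch combining both options (the file name is only a placeholder):

df.to_excel(
    "path_to_file.xlsx",
    sheet_name="Sheet1",
    float_format="%.2f",  # two decimal places for floats
    freeze_panes=(1, 1),  # keep the header row and first column visible
)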

Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created with the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here: https://xlsxwriter.readthedocs.io/working_with_pandas.html

OpenDocument Spreadsheets#

New in version 0.25.

The read_excel() method can also read OpenDocument spreadsheets using the odfpy module. The semantics and features for reading OpenDocument spreadsheets match what can be done for Excel files using engine='odf'.

# Returns a DataFrame
pd.read_excel("path_to_file.ods", engine="odf")

Note

Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.

Binary Excel (.xlsb) files#

New in version 1.0.0.

The read_excel() method can also read binary Excel files using the pyxlsb module. The semantics and features for reading binary Excel files mostly match what can be done for Excel files using engine='pyxlsb'. pyxlsb does not recognize datetime types in files and will return floats instead.

# Returns a DataFrame
pd.read_excel("path_to_file.xlsb", engine="pyxlsb")

Note

Currently pandas only supports reading binary Excel files. Writing is not implemented.

Clipboard#

A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard buffer and passes them to the read_csv method. For instance, you can copy the following text to the clipboard (CTRL-C on many operating systems):

  A B C
x 1 4 p
y 2 5 q
z 3 6 r

And then import the data directly to a DataFrame by calling:

>>> clipdf = pd.read_clipboard()
>>> clipdf
  A B C
x 1 4 p
y 2 5 q
z 3 6 r

The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. Following which you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Here we illustrate writing a DataFrame into the clipboard and reading it back.

>>> df = pd.DataFrame(
...     {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"]
... )

>>> df
  A B C
x 1 4 p
y 2 5 q
z 3 6 r
>>> df.to_clipboard()
>>> pd.read_clipboard()
  A B C
x 1 4 p
y 2 5 q
z 3 6 r

We can see that we got the same content back, which we had earlier written to the clipboard.

Note

You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.

Pickling#

All pandas objects are equipped with to_pickle methods which use Python's cPickle module to save data structures to disk using the pickle format.

In [380]: df
Out[380]: 
c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

In [381]: df.to_pickle("foo.pkl")

The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any other pickled object) from file:

In [382]: pd.read_pickle("foo.pkl")
Out[382]: 
c1         a   
c2         b  d
lvl1 lvl2      
a    c     1  5
     d     2  6
b    c     3  7
     d     4  8

Warning

Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html

Warning

read_pickle() is only guaranteed to be backwards compatible to pandas version 0.20.3

Compressed pickle files#

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types of gzip, bz2, xz, zstd are supported for reading and writing. The zip file format only supports reading, and must contain only one data file to be read.

The compression type can be an explicit parameter or be inferred from the file extension. If 'infer', then use gzip, bz2, zip, xz, zstd if the filename ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', respectively.

The compression parameter can also be a dict in order to pass options to the compression protocol. It must have a 'method' key set to the name of the compression protocol, which must be one of {'zip', 'gzip', 'bz2', 'xz', 'zstd'}. All other key-value pairs are passed to the underlying compression library.

In [383]: df = pd.DataFrame(
   .....:     {
   .....:         "A": np.random.randn(1000),
   .....:         "B": "foo",
   .....:         "C": pd.date_range("20130101", periods=1000, freq="s"),
   .....:     }
   .....: )
   .....: 

In [384]: df
Out[384]: 
            A    B                   C
0   -0.288267  foo 2013-01-01 00:00:00
1   -0.084905  foo 2013-01-01 00:00:01
2    0.004772  foo 2013-01-01 00:00:02
3    1.382989  foo 2013-01-01 00:00:03
4    0.343635  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.220893  foo 2013-01-01 00:16:35
996  0.492996  foo 2013-01-01 00:16:36
997 -0.461625  foo 2013-01-01 00:16:37
998  1.361779  foo 2013-01-01 00:16:38
999 -1.197988  foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

Using an explicit compression type:

In [385]: df.to_pickle("data.pkl.compress", compression="gzip")

In [386]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [387]: rt
Out[387]: 
            A    B                   C
0   -0.288267  foo 2013-01-01 00:00:00
1   -0.084905  foo 2013-01-01 00:00:01
2    0.004772  foo 2013-01-01 00:00:02
3    1.382989  foo 2013-01-01 00:00:03
4    0.343635  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.220893  foo 2013-01-01 00:16:35
996  0.492996  foo 2013-01-01 00:16:36
997 -0.461625  foo 2013-01-01 00:16:37
998  1.361779  foo 2013-01-01 00:16:38
999 -1.197988  foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

Inferring compression type from the extension:

In [388]: df.to_pickle("data.pkl.xz", compression="infer")

In [389]: rt = pd.read_pickle("data.pkl.xz", compression="infer")

In [390]: rt
Out[390]: 
            A    B                   C
0   -0.288267  foo 2013-01-01 00:00:00
1   -0.084905  foo 2013-01-01 00:00:01
2    0.004772  foo 2013-01-01 00:00:02
3    1.382989  foo 2013-01-01 00:00:03
4    0.343635  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.220893  foo 2013-01-01 00:16:35
996  0.492996  foo 2013-01-01 00:16:36
997 -0.461625  foo 2013-01-01 00:16:37
998  1.361779  foo 2013-01-01 00:16:38
999 -1.197988  foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

The default is to 'infer':

In [391]: df.to_pickle("data.pkl.gz")

In [392]: rt = pd.read_pickle("data.pkl.gz")

In [393]: rt
Out[393]: 
            A    B                   C
0   -0.288267  foo 2013-01-01 00:00:00
1   -0.084905  foo 2013-01-01 00:00:01
2    0.004772  foo 2013-01-01 00:00:02
3    1.382989  foo 2013-01-01 00:00:03
4    0.343635  foo 2013-01-01 00:00:04
..        ...  ...                 ...
995 -0.220893  foo 2013-01-01 00:16:35
996  0.492996  foo 2013-01-01 00:16:36
997 -0.461625  foo 2013-01-01 00:16:37
998  1.361779  foo 2013-01-01 00:16:38
999 -1.197988  foo 2013-01-01 00:16:39

[1000 rows x 3 columns]

In [394]: df["A"].to_pickle("s1.pkl.bz2")

In [395]: rt = pd.read_pickle("s1.pkl.bz2")

In [396]: rt
Out[396]: 
0     -0.288267
1     -0.084905
2      0.004772
3      1.382989
4      0.343635
         ...   
995   -0.220893
996    0.492996
997   -0.461625
998    1.361779
999   -1.197988
Name: A, Length: 1000, dtype: float64

Passing options to the compression protocol in order to speed up compression:

In [397]: df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1})

msgpack#

pandas support for msgpack has been removed in version 1.0.0. It is recommended to use pickle instead.

Alternatively, you can also use the Arrow IPC serialization format for on-the-wire transmission of pandas objects. For documentation on pyarrow, see here.
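
A minimal sketch of such a round trip with pyarrow (assuming pyarrow is installed; the data is illustrative only):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize the DataFrame to the Arrow IPC stream format in memory
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Deserialize the bytes back into a DataFrame
roundtrip = pa.ipc.open_stream(buf).read_pandas()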

HDF5 (PyTables)#

HDFStore is a dict-like object which reads and writes pandas using the high performance HDF5 format using the excellent PyTables library. See the cookbook for some advanced strategies

Warning

pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more information.

In [398]: store = pd.HDFStore("store.h5")

In [399]: print(store)

Objects can be written to the file just like adding key-value pairs to a dict:

In [400]: index = pd.date_range("1/1/2000", periods=8)

In [401]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [402]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# store.put('s', s) is an equivalent method
In [403]: store["s"] = s

In [404]: store["df"] = df

In [405]: store

In a current or later Python session, you can retrieve stored objects:

# store.get('df') is an equivalent method
In [406]: store["df"]

# dotted (attribute) access provides get as well
In [407]: store.df

Deletion of the object specified by the key:

# store.remove('df') is an equivalent method
In [408]: del store["df"]

In [409]: store

Closing a Store and using a context manager:

In [410]: store.close()

In [411]: store

In [412]: store.is_open

# Working with, and automatically closing the store using a context manager
In [413]: with pd.HDFStore("store.h5") as store:
   .....:     store.keys()
   .....: 

Read/write API#

HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.

In [414]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})

In [415]: df_tl.to_hdf("store_tl.h5", "table", append=True)

In [416]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])

By default, HDFStore will not drop rows that are all missing. This behavior can be changed by setting dropna=True.

In [417]: df_with_missing = pd.DataFrame(
   .....:     {
   .....:         "col1": [0, np.nan, 2],
   .....:         "col2": [1, np.nan, np.nan],
   .....:     }
   .....: )
   .....: 

In [418]: df_with_missing
Out[418]: 
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

In [419]: df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")

In [420]: pd.read_hdf("file.h5", "df_with_missing")

In [421]: df_with_missing.to_hdf(
   .....:     "file.h5", "df_with_missing", format="table", mode="w", dropna=True
   .....: )
   .....: 

In [422]: pd.read_hdf("file.h5", "df_with_missing")

Fixed format#

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores. This format is specified by default when using put or to_hdf or by format='fixed' or format='f'.
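
As a small sketch, the same default can also be requested explicitly (the file and key names are placeholders):

# 'fixed' is the default format for put / to_hdf
df.to_hdf("store_fixed.h5", "df", format="fixed")

# a fixed-format store must be read back in its entirety
pd.read_hdf("store_fixed.h5", "df")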

Warning

A fixed format will raise a TypeError if you try to retrieve using a where:

>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", "df")
>>> pd.read_hdf("test_fixed.h5", "df", where="index>5")
TypeError: cannot pass a where specification when reading a fixed format.
           this store must be selected in its entirety

Table format#

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.

This format can also be set as an option, pd.set_option('io.hdf.default_format','table'), to enable put/append/to_hdf to store in the table format by default.
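
A minimal sketch of setting that option (and restoring the built-in default afterwards):

# Make put/append/to_hdf write 'table'-format stores by default
pd.set_option("io.hdf.default_format", "table")

df.to_hdf("store_default_table.h5", "df")  # stored as a queryable table

# Restore the built-in default
pd.set_option("io.hdf.default_format", None)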

In [423]: store = pd.HDFStore("store.h5")

In [424]: df1 = df[0:4]

In [425]: df2 = df[4:]

# append data (creates a table automatically)
In [426]: store.append("df", df1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [426], in <cell line: 1>()
----> 1 store.append("df", df1)

NameError: name 'store' is not defined

In [427]: store.append("df", df2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [427], in <cell line: 1>()
----> 1 store.append("df", df2)

NameError: name 'store' is not defined

In [428]: store
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [428], in <cell line: 1>()
----> 1 store

NameError: name 'store' is not defined

# select the entire object
In [429]: store.select("df")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [429], in <cell line: 1>()
----> 1 store.select("df")

NameError: name 'store' is not defined

# the type of stored data
In [430]: store.root.df._v_attrs.pandas_type
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [430], in <cell line: 1>()
----> 1 store.root.df._v_attrs.pandas_type

NameError: name 'store' is not defined

备注

也可以通过向 put 操作传递 format='table' 或 format='t' 来创建 table。
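例如,下面是一个示意(假设 store 是已打开的 HDFStore;df_as_table 仅为示例键名):

store.put("df_as_table", df, format="table")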

分层键#

可以将存储区的键指定为字符串。这些键可以采用类似分层路径名的格式(例如 foo/bar/bah),这将生成子存储的层次结构(用 PyTables 的话来说就是 Groups)。键可以不带前导 '/' 指定,并且 始终 是绝对路径(例如,'foo' 指的是 '/foo')。删除操作会删除子存储中的所有内容以及其 下面 的一切,因此请务必 小心。

In [431]: store.put("foo/bar/bah", df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [431], in <cell line: 1>()
----> 1 store.put("foo/bar/bah", df)

NameError: name 'store' is not defined

In [432]: store.append("food/orange", df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [432], in <cell line: 1>()
----> 1 store.append("food/orange", df)

NameError: name 'store' is not defined

In [433]: store.append("food/apple", df)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [433], in <cell line: 1>()
----> 1 store.append("food/apple", df)

NameError: name 'store' is not defined

In [434]: store
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [434], in <cell line: 1>()
----> 1 store

NameError: name 'store' is not defined

# a list of keys are returned
In [435]: store.keys()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [435], in <cell line: 1>()
----> 1 store.keys()

NameError: name 'store' is not defined

# remove all nodes under this level
In [436]: store.remove("food")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [436], in <cell line: 1>()
----> 1 store.remove("food")

NameError: name 'store' is not defined

In [437]: store
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [437], in <cell line: 1>()
----> 1 store

NameError: name 'store' is not defined

可以使用 walk 方法遍历组层次结构,它会为每个组生成一个元组,其中包含组路径、子组以及组内容的相对键。

In [438]: for (path, subgroups, subkeys) in store.walk():
   .....:     for subgroup in subgroups:
   .....:         print("GROUP: {}/{}".format(path, subgroup))
   .....:     for subkey in subkeys:
   .....:         key = "/".join([path, subkey])
   .....:         print("KEY: {}".format(key))
   .....:         print(store.get(key))
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [438], in <cell line: 1>()
----> 1 for (path, subgroups, subkeys) in store.walk():
      2     for subgroup in subgroups:
      3         print("GROUP: {}/{}".format(path, subgroup))

NameError: name 'store' is not defined

警告

分层键不能像上文针对存储在根节点下的项所描述的那样,通过点(属性)访问方式来检索。

In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'

# you can directly access the actual PyTables node but using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
  children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]

相反,应使用显式的基于字符串的键:

In [439]: store["foo/bar/bah"]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [439], in <cell line: 1>()
----> 1 store["foo/bar/bah"]

NameError: name 'store' is not defined

存储类型#

在表中存储混合类型#

支持存储混合数据类型的数据。字符串以固定宽度存储,宽度为追加时该列中字符串的最大长度。后续若尝试追加更长的字符串,将引发 ValueError。

向 append 传递 min_itemsize={'values': size} 作为参数,将为字符串列设置一个更大的最小宽度。目前支持存储 floats, strings, ints, bools, datetime64。对于字符串列,向 append 传递 nan_rep = 'nan' 可以更改磁盘上缺失值的默认表示(它会与 np.nan 相互转换),默认值为 nan。

In [440]: df_mixed = pd.DataFrame(
   .....:     {
   .....:         "A": np.random.randn(8),
   .....:         "B": np.random.randn(8),
   .....:         "C": np.array(np.random.randn(8), dtype="float32"),
   .....:         "string": "string",
   .....:         "int": 1,
   .....:         "bool": True,
   .....:         "datetime64": pd.Timestamp("20010102"),
   .....:     },
   .....:     index=list(range(8)),
   .....: )
   .....: 

In [441]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan

In [442]: store.append("df_mixed", df_mixed, min_itemsize={"values": 50})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [442], in <cell line: 1>()
----> 1 store.append("df_mixed", df_mixed, min_itemsize={"values": 50})

NameError: name 'store' is not defined

In [443]: df_mixed1 = store.select("df_mixed")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [443], in <cell line: 1>()
----> 1 df_mixed1 = store.select("df_mixed")

NameError: name 'store' is not defined

In [444]: df_mixed1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [444], in <cell line: 1>()
----> 1 df_mixed1

NameError: name 'df_mixed1' is not defined

In [445]: df_mixed1.dtypes.value_counts()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [445], in <cell line: 1>()
----> 1 df_mixed1.dtypes.value_counts()

NameError: name 'df_mixed1' is not defined

# we have provided a minimum string column size
In [446]: store.root.df_mixed.table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [446], in <cell line: 1>()
----> 1 store.root.df_mixed.table

NameError: name 'store' is not defined

存储多索引数据帧#

将多索引 DataFrame 存储为表,与存储/选择同类索引的 DataFrame 非常相似。

In [447]: index = pd.MultiIndex(
   .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
   .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   .....:     names=["foo", "bar"],
   .....: )
   .....: 

In [448]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

In [449]: df_mi
Out[449]: 
                  A         B         C
foo bar                                
foo one    0.667450  0.169405 -1.358046
    two   -0.105563  0.492195  0.076693
    three  0.213685 -0.285283 -1.210529
bar one   -1.408386  0.941577 -0.342447
    two    0.222031  0.052607  2.093214
baz two    1.064908  1.778161 -0.913867
    three -0.030004 -0.399846 -1.234765
qux one    0.081323 -0.268494  0.168016
    two   -0.898283 -0.218499  1.408028
    three -1.267828 -0.689263  0.520995

In [450]: store.append("df_mi", df_mi)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [450], in <cell line: 1>()
----> 1 store.append("df_mi", df_mi)

NameError: name 'store' is not defined

In [451]: store.select("df_mi")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [451], in <cell line: 1>()
----> 1 store.select("df_mi")

NameError: name 'store' is not defined

# the levels are automatically included as data columns
In [452]: store.select("df_mi", "foo=bar")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [452], in <cell line: 1>()
----> 1 store.select("df_mi", "foo=bar")

NameError: name 'store' is not defined

备注

这个 index 关键字是保留的,不能用作级别名称。

查询#

查询表#

selectdelete 操作有一个可选的标准,可以指定该标准只选择/删除数据的一个子集。这允许您拥有一个非常大的磁盘上的表,并且只检索数据的一部分。

查询在底层使用 Term 类构建,以布尔表达式的形式给出。

  • indexcolumns 是受支持的索引器 DataFrames

  • 如果 data_columns ,则可以将它们用作附加索引器。

  • 多索引中的级别名称,具有默认名称 level_0level_1 ,…如未提供,请填写。

有效的比较运算符包括:

=, ==, !=, >, >=, <, <=

有效的布尔表达式与以下各项结合使用:

  • | :或

  • & :和

  • () :用于分组

这些规则类似于布尔表达式在Pandas中用于索引的方式。

备注

  • = will be automatically expanded to the comparison operator ==

  • ~ 是NOT运算符,但只能在非常有限的情况下使用

  • 如果传递了表达式的列表/元组,它们将通过 & 组合起来

以下是有效的表达式:

  • 'index >= date'

  • "columns = ['A', 'D']"

  • "columns in ['A', 'D']"

  • 'columns = A'

  • 'columns == A'

  • "~(columns = ['A', 'B'])"

  • 'index > df.index[3] & string = "bar"'

  • '(index > df.index[3] & index <= df.index[6]) | string = "bar"'

  • "ts >= Timestamp('2012-02-01')"

  • "major_axis>=20130101"

这个 indexers 位于子表达式的左侧:

columns, major_axis, ts

子表达式的右侧(在比较运算符后)可以是:

  • 将被求值的函数,例如 Timestamp('2012-02-01')

  • 字符串,例如 "bar"

  • 类似日期的,例如 20130101 ,或 "20130101"

  • 列表,例如 "['A', 'B']"

  • 在本地名称空间中定义的变量,例如 date

备注

建议不要通过将字符串内插到查询表达式中来将字符串传递给查询。只需将感兴趣的字符串赋给一个变量,并在表达式中使用该变量。例如,执行以下操作

string = "HolyMoly'"
store.select("df", "index == string")

而不是这个

string = "HolyMoly'"
store.select('df', f'index == {string}')

后者将 无法 工作,并会引发 SyntaxError。请注意,string 变量中包含一个单引号。

如果您 必须 进行内插,请使用 '%r' 格式说明符

store.select("df", "index == %r" % string)

它会为 string 加上引号。

以下是一些例子:

In [453]: dfq = pd.DataFrame(
   .....:     np.random.randn(10, 4),
   .....:     columns=list("ABCD"),
   .....:     index=pd.date_range("20130101", periods=10),
   .....: )
   .....: 

In [454]: store.append("dfq", dfq, format="table", data_columns=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [454], in <cell line: 1>()
----> 1 store.append("dfq", dfq, format="table", data_columns=True)

NameError: name 'store' is not defined

使用布尔表达式,并进行内联函数求值。

In [455]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [455], in <cell line: 1>()
----> 1 store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")

NameError: name 'store' is not defined

使用内联列引用。

In [456]: store.select("dfq", where="A>0 or C>0")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [456], in <cell line: 1>()
----> 1 store.select("dfq", where="A>0 or C>0")

NameError: name 'store' is not defined

可以提供 columns 关键字来选择要返回的列的列表,这等效于传递 'columns=list_of_columns_to_filter':

In [457]: store.select("df", "columns=['A', 'B']")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [457], in <cell line: 1>()
----> 1 store.select("df", "columns=['A', 'B']")

NameError: name 'store' is not defined

startstop 可以指定参数来限制总搜索空间。这些是根据表中的总行数计算的。

备注

如果查询表达式含有未知的变量引用,select 将引发 ValueError。这通常意味着您试图对一个 不是 数据列的列进行选择。

如果查询表达式无效,select 将引发 SyntaxError。

查询 timedelta64[ns]#

您可以对 timedelta64[ns] 类型进行存储和查询。查询条件可以按 <float>(<unit>) 的格式指定,其中 float 可以带符号(也可以是小数),unit 可以是表示 timedelta 的 D,s,ms,us,ns。下面是一个例子:

In [458]: from datetime import timedelta

In [459]: dftd = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Timestamp("20130101"),
   .....:         "B": [
   .....:             pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
   .....:             for i in range(10)
   .....:         ],
   .....:     }
   .....: )
   .....: 

In [460]: dftd["C"] = dftd["A"] - dftd["B"]

In [461]: dftd
Out[461]: 
           A                   B                  C
0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50

In [462]: store.append("dftd", dftd, data_columns=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [462], in <cell line: 1>()
----> 1 store.append("dftd", dftd, data_columns=True)

NameError: name 'store' is not defined

In [463]: store.select("dftd", "C<'-3.5D'")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [463], in <cell line: 1>()
----> 1 store.select("dftd", "C<'-3.5D'")

NameError: name 'store' is not defined

查询多索引#

从 MultiIndex 中进行选择,可以通过使用级别的名称来实现。

In [464]: df_mi.index.names
Out[464]: FrozenList(['foo', 'bar'])

In [465]: store.select("df_mi", "foo=baz and bar=two")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [465], in <cell line: 1>()
----> 1 store.select("df_mi", "foo=baz and bar=two")

NameError: name 'store' is not defined

如果 MultiIndex 各级别的名称为 None,则可以通过 level_n 关键字自动访问这些级别,其中 n 是您想要从中选择的 MultiIndex 级别。

In [466]: index = pd.MultiIndex(
   .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
   .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   .....: )
   .....: 

In [467]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

In [468]: df_mi_2
Out[468]: 
                  A         B         C
foo one    0.856838  1.491776  0.001283
    two    0.701816 -1.097917  0.102588
    three  0.661740  0.443531  0.559313
bar one   -0.459055 -1.222598 -0.455304
    two   -0.781163  0.826204 -0.530057
baz two    0.296135  1.366810  1.073372
    three -0.994957  0.755314  2.119746
qux one   -2.628174 -0.089460 -0.133636
    two    0.337920 -0.634027  0.421107
    three  0.604303  1.053434  1.109090

In [469]: store.append("df_mi_2", df_mi_2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [469], in <cell line: 1>()
----> 1 store.append("df_mi_2", df_mi_2)

NameError: name 'store' is not defined

# the levels are automatically included as data columns with keyword level_n
In [470]: store.select("df_mi_2", "level_0=foo and level_1=two")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [470], in <cell line: 1>()
----> 1 store.select("df_mi_2", "level_0=foo and level_1=two")

NameError: name 'store' is not defined

索引#

在数据已经写入表之后(即在 append/put 操作之后),可以使用 create_table_index 创建/修改表的索引。强烈 建议创建表索引。当您使用 select 并以带索引的维度作为 where 条件时,这会大大加快查询速度。

备注

索引会在可索引列和您指定的任何数据列上自动创建。可以通过向 append 传递 index=False 来关闭此行为。

# we have automagically already created an index (in the first section)
In [471]: i = store.root.df.table.cols.index.index
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [471], in <cell line: 1>()
----> 1 i = store.root.df.table.cols.index.index

NameError: name 'store' is not defined

In [472]: i.optlevel, i.kind
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [472], in <cell line: 1>()
----> 1 i.optlevel, i.kind

AttributeError: 'DatetimeIndex' object has no attribute 'optlevel'

# change an index by passing new parameters
In [473]: store.create_table_index("df", optlevel=9, kind="full")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [473], in <cell line: 1>()
----> 1 store.create_table_index("df", optlevel=9, kind="full")

NameError: name 'store' is not defined

In [474]: i = store.root.df.table.cols.index.index
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [474], in <cell line: 1>()
----> 1 i = store.root.df.table.cols.index.index

NameError: name 'store' is not defined

In [475]: i.optlevel, i.kind
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [475], in <cell line: 1>()
----> 1 i.optlevel, i.kind

AttributeError: 'DatetimeIndex' object has no attribute 'optlevel'

通常,在将大量数据追加到存储区时,关闭每个追加的索引创建,然后在末尾重新创建索引是很有用的。

In [476]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [477]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [478]: st = pd.HDFStore("appends.h5", mode="w")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:139, in import_optional_dependency(name, extra, errors, min_version)
    138 try:
--> 139     module = importlib.import_module(name)
    140 except ImportError:

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1004, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Input In [478], in <cell line: 1>()
----> 1 st = pd.HDFStore("appends.h5", mode="w")

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/pytables.py:573, in HDFStore.__init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    570 if "format" in kwargs:
    571     raise ValueError("format is not a defined argument for HDFStore")
--> 573 tables = import_optional_dependency("tables")
    575 if complib is not None and complib not in tables.filters.all_complibs:
    576     raise ValueError(
    577         f"complib only supports {tables.filters.all_complibs} compression."
    578     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:142, in import_optional_dependency(name, extra, errors, min_version)
    140 except ImportError:
    141     if errors == "raise":
--> 142         raise ImportError(msg)
    143     else:
    144         return None

ImportError: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.

In [479]: st.append("df", df_1, data_columns=["B"], index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [479], in <cell line: 1>()
----> 1 st.append("df", df_1, data_columns=["B"], index=False)

NameError: name 'st' is not defined

In [480]: st.append("df", df_2, data_columns=["B"], index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [480], in <cell line: 1>()
----> 1 st.append("df", df_2, data_columns=["B"], index=False)

NameError: name 'st' is not defined

In [481]: st.get_storer("df").table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [481], in <cell line: 1>()
----> 1 st.get_storer("df").table

NameError: name 'st' is not defined

然后在完成追加时创建索引。

In [482]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [482], in <cell line: 1>()
----> 1 st.create_table_index("df", columns=["B"], optlevel=9, kind="full")

NameError: name 'st' is not defined

In [483]: st.get_storer("df").table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [483], in <cell line: 1>()
----> 1 st.get_storer("df").table

NameError: name 'st' is not defined

In [484]: st.close()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [484], in <cell line: 1>()
----> 1 st.close()

NameError: name 'st' is not defined

有关如何在现有存储上创建完全排序索引(CSI)的信息,请参见 here。

通过数据列进行查询#

您可以指定(并建立索引)某些希望用于查询的列(除了始终可以查询的 indexable 列之外)。例如,假设您想在磁盘上执行这一常见操作,并只返回与查询匹配的帧。您可以指定 data_columns = True,强制将所有列都作为 data_columns。

In [485]: df_dc = df.copy()

In [486]: df_dc["string"] = "foo"

In [487]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan

In [488]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"

In [489]: df_dc["string2"] = "cool"

In [490]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0

In [491]: df_dc
Out[491]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool
2000-01-04  0.632369 -1.249657  0.975593    foo    cool
2000-01-05  1.060617 -0.143682  0.218423    NaN    cool
2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
2000-01-07 -0.539452 -0.771133  0.023751    foo    cool
2000-01-08  0.649464 -1.736427  0.197288    bar    cool

# on-disk operations
In [492]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [492], in <cell line: 1>()
----> 1 store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])

NameError: name 'store' is not defined

In [493]: store.select("df_dc", where="B > 0")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [493], in <cell line: 1>()
----> 1 store.select("df_dc", where="B > 0")

NameError: name 'store' is not defined

# getting creative
In [494]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [494], in <cell line: 1>()
----> 1 store.select("df_dc", "B > 0 & C > 0 & string == foo")

NameError: name 'store' is not defined

# this is in-memory version of this type of selection
In [495]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
Out[495]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool

# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [496]: store.root.df_dc.table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [496], in <cell line: 1>()
----> 1 store.root.df_dc.table

NameError: name 'store' is not defined

将许多列设为 data columns 会带来一定的性能损失,因此应由用户自行指定这些列。此外,在第一次 append/put 操作之后,您不能再更改数据列(也不能更改可索引列)(当然,您可以简单地读入数据并创建一个新表!)。

迭代器#

您可以向 select 和 select_as_multiple 传递 iterator=True 或 chunksize=number_in_a_chunk,以返回结果的迭代器。默认情况下每个块返回 50,000 行。

In [497]: for df in store.select("df", chunksize=3):
   .....:     print(df)
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [497], in <cell line: 1>()
----> 1 for df in store.select("df", chunksize=3):
      2     print(df)

NameError: name 'store' is not defined

备注

您也可以将迭代器与 read_hdf 一起使用,它会在打开存储后,于迭代完成时自动将其关闭。

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)

请注意,chunksize 关键字作用于 源 行。因此,如果您在进行查询,chunksize 会先将表中的总行数细分,再对每块应用查询,返回的迭代器中各块的大小可能并不相等。

以下是生成查询并使用它创建大小相等的返回块的诀窍。

In [498]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})

In [499]: dfeq
Out[499]: 
   number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

In [500]: store.append("dfeq", dfeq, data_columns=["number"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [500], in <cell line: 1>()
----> 1 store.append("dfeq", dfeq, data_columns=["number"])

NameError: name 'store' is not defined

In [501]: def chunks(l, n):
   .....:     return [l[i: i + n] for i in range(0, len(l), n)]
   .....: 

In [502]: evens = [2, 4, 6, 8, 10]

In [503]: coordinates = store.select_as_coordinates("dfeq", "number=evens")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [503], in <cell line: 1>()
----> 1 coordinates = store.select_as_coordinates("dfeq", "number=evens")

NameError: name 'store' is not defined

In [504]: for c in chunks(coordinates, 2):
   .....:     print(store.select("dfeq", where=c))
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [504], in <cell line: 1>()
----> 1 for c in chunks(coordinates, 2):
      2     print(store.select("dfeq", where=c))

NameError: name 'coordinates' is not defined

高级查询#

选择单列#

若要检索单个可索引列或数据列,请使用 select_column 方法。例如,这能让您非常快速地获取索引。它返回结果的 Series,并以行号作为索引。这些方法目前不接受 where 选择器。

In [505]: store.select_column("df_dc", "index")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [505], in <cell line: 1>()
----> 1 store.select_column("df_dc", "index")

NameError: name 'store' is not defined

In [506]: store.select_column("df_dc", "string")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [506], in <cell line: 1>()
----> 1 store.select_column("df_dc", "string")

NameError: name 'store' is not defined
选择坐标#

有时您想获得查询的坐标(也就是索引位置)。这会返回结果位置的 Int64Index。这些坐标也可以传递给后续的 where 操作。

In [507]: df_coord = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [508]: store.append("df_coord", df_coord)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [508], in <cell line: 1>()
----> 1 store.append("df_coord", df_coord)

NameError: name 'store' is not defined

In [509]: c = store.select_as_coordinates("df_coord", "index > 20020101")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [509], in <cell line: 1>()
----> 1 c = store.select_as_coordinates("df_coord", "index > 20020101")

NameError: name 'store' is not defined

In [510]: c
Out[510]: 
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

In [511]: store.select("df_coord", where=c)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [511], in <cell line: 1>()
----> 1 store.select("df_coord", where=c)

NameError: name 'store' is not defined
使用WHERE掩码选择#

有时,您的查询可能需要先创建一个要选择的行的列表。通常,这样的 mask 是某个索引操作得到的 index。本示例选择 datetimeindex 中月份为 5 的行。

In [512]: df_mask = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [513]: store.append("df_mask", df_mask)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [513], in <cell line: 1>()
----> 1 store.append("df_mask", df_mask)

NameError: name 'store' is not defined

In [514]: c = store.select_column("df_mask", "index")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [514], in <cell line: 1>()
----> 1 c = store.select_column("df_mask", "index")

NameError: name 'store' is not defined

In [515]: where = c[pd.DatetimeIndex(c).month == 5].index
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/arrays/datetimes.py:2251, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2250 try:
-> 2251     values, tz_parsed = conversion.datetime_to_datetime64(data)
   2252     # If tzaware, these values represent unix timestamps, so we
   2253     #  return them as i8 to distinguish from wall times

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslibs/conversion.pyx:337, in pandas._libs.tslibs.conversion.datetime_to_datetime64()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ParserError                               Traceback (most recent call last)
Input In [515], in <cell line: 1>()
----> 1 where = c[pd.DatetimeIndex(c).month == 5].index

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/indexes/datetimes.py:328, in DatetimeIndex.__new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
    324 # - Cases checked above all return/raise before reaching here - #
    326 name = maybe_extract_name(name, data, cls)
--> 328 dtarr = DatetimeArray._from_sequence_not_strict(
    329     data,
    330     dtype=dtype,
    331     copy=copy,
    332     tz=tz,
    333     freq=freq,
    334     dayfirst=dayfirst,
    335     yearfirst=yearfirst,
    336     ambiguous=ambiguous,
    337 )
    339 subarr = cls._simple_new(dtarr, name=name)
    340 return subarr

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/arrays/datetimes.py:364, in DatetimeArray._from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    360 freq = freq if freq is not lib.no_default else None
    362 freq, freq_infer = dtl.maybe_infer_freq(freq)
--> 364 subarr, tz, inferred_freq = _sequence_to_dt64ns(
    365     data,
    366     dtype=dtype,
    367     copy=copy,
    368     tz=tz,
    369     dayfirst=dayfirst,
    370     yearfirst=yearfirst,
    371     ambiguous=ambiguous,
    372 )
    374 freq, freq_infer = dtl.validate_inferred_freq(freq, inferred_freq, freq_infer)
    375 if explicit_none:

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/arrays/datetimes.py:2120, in _sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous, allow_mixed, require_iso8601)
   2116     data = data.astype(np.int64)
   2117 else:
   2118     # data comes back here as either i8 to denote UTC timestamps
   2119     #  or M8[ns] to denote wall times
-> 2120     data, inferred_tz = objects_to_datetime64ns(
   2121         data,
   2122         dayfirst=dayfirst,
   2123         yearfirst=yearfirst,
   2124         allow_object=False,
   2125         allow_mixed=allow_mixed,
   2126         require_iso8601=require_iso8601,
   2127     )
   2128     if tz and inferred_tz:
   2129         #  two timezones: convert to intended from base UTC repr
   2130         if data.dtype == "i8":
   2131             # GH#42505
   2132             # by convention, these are _already_ UTC, e.g

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/arrays/datetimes.py:2256, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2254         return values.view("i8"), tz_parsed
   2255     except (ValueError, TypeError):
-> 2256         raise err
   2257 except OverflowError as err:
   2258     # Exception is raised when a part of date is greater than 32 bit signed int
   2259     raise OutOfBoundsDatetime("Out of bounds nanosecond timestamp") from err

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/arrays/datetimes.py:2239, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2237 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2238 try:
-> 2239     result, tz_parsed = tslib.array_to_datetime(
   2240         data.ravel("K"),
   2241         errors=errors,
   2242         utc=utc,
   2243         dayfirst=dayfirst,
   2244         yearfirst=yearfirst,
   2245         require_iso8601=require_iso8601,
   2246         allow_mixed=allow_mixed,
   2247     )
   2248     result = result.reshape(data.shape, order=order)
   2249 except ValueError as err:

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslib.pyx:380, in pandas._libs.tslib.array_to_datetime()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslib.pyx:617, in pandas._libs.tslib.array_to_datetime()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslib.pyx:761, in pandas._libs.tslib._array_to_datetime_object()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslib.pyx:752, in pandas._libs.tslib._array_to_datetime_object()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/_libs/tslibs/parsing.pyx:281, in pandas._libs.tslibs.parsing.parse_datetime_string()

File /usr/lib/python3/dist-packages/dateutil/parser/_parser.py:1374, in parse(timestr, parserinfo, **kwargs)
   1372     return parser(parserinfo).parse(timestr, **kwargs)
   1373 else:
-> 1374     return DEFAULTPARSER.parse(timestr, **kwargs)

File /usr/lib/python3/dist-packages/dateutil/parser/_parser.py:649, in parser.parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    646 res, skipped_tokens = self._parse(timestr, **kwargs)
    648 if res is None:
--> 649     raise ParserError("Unknown string format: %s", timestr)
    651 if len(res) == 0:
    652     raise ParserError("String does not contain a date: %s", timestr)

ParserError: Unknown string format: b

In [516]: store.select("df_mask", where=where)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [516], in <cell line: 1>()
----> 1 store.select("df_mask", where=where)

NameError: name 'store' is not defined
存储器对象#

如果要检查存储的对象,可以通过 get_storer 获取。您可以在程序中利用它,例如获取对象中的行数。

In [517]: store.get_storer("df_dc").nrows
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [517], in <cell line: 1>()
----> 1 store.get_storer("df_dc").nrows

NameError: name 'store' is not defined

多个表查询#

这些方法 append_to_multipleselect_as_multiple 可以一次从多个表中追加/选择。我们的想法是有一个表(称为选择器表),您可以为大多数/所有列编制索引,并执行查询。其他表是索引与选择器表的索引匹配的数据表。然后,您可以在选择器表上执行非常快速的查询,同时还可以获得大量数据。这种方法类似于拥有一个非常宽的表,但可以实现更高效的查询。

append_to_multiple 方法根据 d(一个将表名映射到您希望放入该表的列的列表的字典),将给定的单个 DataFrame 拆分成多个表。如果用 None 代替列表,则该表将包含给定 DataFrame 中其余未指定的列。参数 selector 定义哪个表是选择器表(即可以对其执行查询的表)。参数 dropna 会从输入 DataFrame 中删除行,以确保各表同步。这意味着,如果要写入的某个表中对应的行完全是 np.nan,则该行会从所有表中删除。

如果 dropna 为 False,则 由用户负责同步各个表。请记住,完全由 np.nan 组成的行不会被写入 HDFStore,因此如果您选择调用 dropna=False,某些表的行数可能会多于其他表,从而导致 select_as_multiple 无法工作,或者返回意外的结果。

In [518]: df_mt = pd.DataFrame(
   .....:     np.random.randn(8, 6),
   .....:     index=pd.date_range("1/1/2000", periods=8),
   .....:     columns=["A", "B", "C", "D", "E", "F"],
   .....: )
   .....: 

In [519]: df_mt["foo"] = "bar"

In [520]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan

# you can also create the tables individually
In [521]: store.append_to_multiple(
   .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   .....: )
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [521], in <cell line: 1>()
----> 1 store.append_to_multiple(
      2     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
      3 )

NameError: name 'store' is not defined

In [522]: store
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [522], in <cell line: 1>()
----> 1 store

NameError: name 'store' is not defined

# individual tables were created
In [523]: store.select("df1_mt")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [523], in <cell line: 1>()
----> 1 store.select("df1_mt")

NameError: name 'store' is not defined

In [524]: store.select("df2_mt")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [524], in <cell line: 1>()
----> 1 store.select("df2_mt")

NameError: name 'store' is not defined

# as a multiple
In [525]: store.select_as_multiple(
   .....:     ["df1_mt", "df2_mt"],
   .....:     where=["A>0", "B>0"],
   .....:     selector="df1_mt",
   .....: )
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [525], in <cell line: 1>()
----> 1 store.select_as_multiple(
      2     ["df1_mt", "df2_mt"],
      3     where=["A>0", "B>0"],
      4     selector="df1_mt",
      5 )

NameError: name 'store' is not defined

从表中删除#

您可以通过指定 where 有选择地从表中删除。在删除行时,需要了解 PyTables 是通过擦除行、然后 移动 后续数据来删除行的。因此,根据数据的方向,删除可能是一项非常昂贵的操作。为了获得最佳性能,值得把要删除的维度作为 indexables 中的第一个。
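例如,下面是一个假设性的示意(假设 store 中有一个以日期为索引、名为 df 的 table):

store.remove("df", where="index > 20000104")  # 删除索引在该日期之后的行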

数据在磁盘上是按照 indexables 的顺序排列的。这里有一个简单的用例:您存储面板类型的数据,日期在 major_axis 上,ID 在 minor_axis 上。数据会按如下方式交织:

  • date_1
    • id_1

    • id_2

    • id_n

  • date_2
    • id_1

    • id_n

显然,对 major_axis 的删除操作会相当快,因为只需删除一个块,然后移动后面的数据。而对 minor_axis 的删除操作则会非常昂贵。在这种情况下,使用一个选择除待删除数据之外所有数据的 where 来重写表,几乎肯定会更快。

警告

请注意,HDF5 不会自动回收 h5 文件中的空间。因此,反复删除(或移除节点)再重新添加,会使文件越来越大。

要 重新打包并清理 文件,请使用 ptrepack。

备注和注意事项#

压缩#

PyTables 允许压缩存储的数据。这适用于所有类型的存储,而不仅仅是表。有两个参数用于控制压缩:complevel 和 complib。

  • complevel 指定是否压缩数据以及压缩力度。complevel=0 和 complevel=None 禁用压缩,0<complevel<10 启用压缩。

  • complib 指定要使用的压缩库。如果未指定,则使用默认库 zlib。压缩库通常会针对压缩率或速度进行优化,结果取决于数据的类型。选择哪种压缩取决于您的具体需求和数据。支持的压缩库列表:

    • zlib: 默认的压缩库。在压缩方面是经典的,达到了很好的压缩率,但有点慢。

    • lzo: 快速压缩和解压缩。

    • bzip2: 很好的压缩率。

    • blosc: 快速压缩和解压缩。

      支持替代BLOSC压缩机:

      • blosc:blosclz This is the default compressor for blosc

      • blosc:lz4: 一种紧凑、非常受欢迎的快速压缩机。

      • blosc:lz4hc: LZ4的一个调整版本,以速度为代价产生了更好的压缩比。

      • blosc:snappy: 一种流行的压缩机,在许多地方使用。

      • blosc:zlib: 经典;比以前的要慢一些,但获得了更好的压缩比。

      • blosc:zstd: 一个非常平衡的编解码器;它提供了上面其他压缩比中最好的,并且以相当快的速度。

    如果 complib 指定为上述列表之外的库,则会引发 ValueError 异常。

备注

如果 complib 选项指定的库在您的平台上缺失,则压缩会直接回退为默认的 zlib。

对文件中的所有对象启用压缩:

store_compressed = pd.HDFStore(
    "store_compressed.h5", complevel=9, complib="blosc:blosclz"
)

或者对尚未启用压缩的存储中写入的数据进行即时压缩(这仅适用于表):

store.append("df", df, complib="zlib", complevel=5)
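作为示意,也可以在 to_hdf 中直接指定这些压缩参数(compressed_demo.h5 仅为示例文件名):

df.to_hdf("compressed_demo.h5", "df", mode="w", format="table", complevel=5, complib="blosc")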

Ptrepack#

与一开始就打开压缩相比,在表写入完成之后再进行压缩,PyTables 能提供更好的写入性能。您可以使用 PyTables 提供的实用程序 ptrepack。此外,ptrepack 还可以在事后更改压缩级别。

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5

此外,ptrepack in.h5 out.h5 会 重新打包 文件,让您可以重用之前删除所释放的空间。或者,您也可以简单地删除文件后重新写入,或使用 copy 方法。
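下面是 copy 方法的一个假设性示意(copied_demo.h5 仅为示例文件名,具体参数请以您所用版本的文档为准):

new_store = store.copy("copied_demo.h5", complevel=9, complib="blosc")  # 复制到新文件并应用压缩
new_store.close()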

注意事项#

警告

HDFStorenot-threadsafe for writing 。潜在的 PyTables 仅支持并发读取(通过线程或进程)。如果你需要阅读和写作 在同一时间 ,您需要在单个进程的单个线程中序列化这些操作。否则,您将损坏您的数据。请参阅 (GH2397 )了解更多信息。

  • 如果使用锁来管理多个进程之间的写访问,则可能需要在释放写锁之前调用 fsync()。为方便起见,您可以使用 store.flush(fsync=True) 来完成(见本列表之后的示例)。

  • 一旦创建了 table,其列(DataFrame 的列)就固定下来了;只能追加列完全相同的数据

  • 请注意,时区(例如 pytz.timezone('US/Eastern'))在不同的时区库版本之间不一定相等。因此,如果使用某个版本的时区库将数据本地化到 HDFStore 中的特定时区,又用另一个版本更新该数据,由于这些时区不被认为相等,数据将被转换为 UTC。请使用相同版本的时区库,或使用 tz_convert 配合更新后的时区定义。
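针对上面第一条所述的多进程写入场景,下面给出一个最小的示意(lock、df_shared 均为假设的名称):

import multiprocessing

lock = multiprocessing.Lock()  # 假设:用进程间锁串行化写访问
with lock:
    store.append("df_shared", df)
    store.flush(fsync=True)  # 释放锁之前确保数据已落盘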

警告

如果列名不能用作属性选择器,PyTables 将发出 NaturalNameWarning。自然 标识符只能包含字母、数字和下划线,且不能以数字开头。其他标识符不能在 where 子句中使用,而且通常也不是一个好主意。

DataTypes#

HDFStore 会将对象数据类型映射到 PyTables 基础数据类型。这意味着以下类型是已知有效的:

类型

表示缺少的值

浮点: float64, float32, float16

np.nan

整数: int64, int32, int8, uint64,uint32, uint8

布尔值

datetime64[ns]

NaT

timedelta64[ns]

NaT

分类:请参阅下一节

对象: strings

np.nan

不支持 unicode 列,尝试写入时 将会失败。

分类数据#

您可以把包含 category 数据类型的数据写入 HDFStore。查询方式与对象数组完全相同,但 category 类型的数据会以更高效的方式存储。

In [526]: dfcat = pd.DataFrame(
   .....:     {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
   .....: )
   .....: 

In [527]: dfcat
Out[527]: 
   A         B
0  a  0.477849
1  a  0.283128
2  b -2.045700
3  b -0.338206
4  c -0.423113
5  d  2.314361
6  b -0.033100
7  a -0.965461

In [528]: dfcat.dtypes
Out[528]: 
A    category
B     float64
dtype: object

In [529]: cstore = pd.HDFStore("cats.h5", mode="w")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:139, in import_optional_dependency(name, extra, errors, min_version)
    138 try:
--> 139     module = importlib.import_module(name)
    140 except ImportError:

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1004, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Input In [529], in <cell line: 1>()
----> 1 cstore = pd.HDFStore("cats.h5", mode="w")

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/pytables.py:573, in HDFStore.__init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    570 if "format" in kwargs:
    571     raise ValueError("format is not a defined argument for HDFStore")
--> 573 tables = import_optional_dependency("tables")
    575 if complib is not None and complib not in tables.filters.all_complibs:
    576     raise ValueError(
    577         f"complib only supports {tables.filters.all_complibs} compression."
    578     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:142, in import_optional_dependency(name, extra, errors, min_version)
    140 except ImportError:
    141     if errors == "raise":
--> 142         raise ImportError(msg)
    143     else:
    144         return None

ImportError: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.

In [530]: cstore.append("dfcat", dfcat, format="table", data_columns=["A"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [530], in <cell line: 1>()
----> 1 cstore.append("dfcat", dfcat, format="table", data_columns=["A"])

NameError: name 'cstore' is not defined

In [531]: result = cstore.select("dfcat", where="A in ['b', 'c']")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [531], in <cell line: 1>()
----> 1 result = cstore.select("dfcat", where="A in ['b', 'c']")

NameError: name 'cstore' is not defined

In [532]: result
Out[532]: 
metrics  b_sum  c_mean
a                     
0          2.0     0.5
1          2.0     0.5
2          2.0     0.5

In [533]: result.dtypes
Out[533]: 
metrics
b_sum     float64
c_mean    float64
dtype: object

字符串列#

min_itemsize

HDFStore 的底层实现对字符串列使用固定的列宽(itemsize)。字符串列的 itemsize 按 第一次 append 时传给 HDFStore 的该列数据的最大长度计算。后续的 append 可能会为该列引入 超过 列宽所能容纳的更长字符串,此时将引发异常(否则这些列可能被静默截断,导致信息丢失)。将来我们可能会放宽这一限制,允许按用户指定的方式截断。

在第一次创建表时传递 min_itemsize,可以预先指定特定字符串列的最小长度。min_itemsize 可以是整数,也可以是将列名映射到整数的字典。您可以用 values 作为键,使所有 可索引列 和 data_columns 都使用这一最小 itemsize。

传递 min_itemsize 字典会使其中出现的所有列自动被创建为 data_columns。

备注

如果您没有传递任何 data_columns,则 min_itemsize 将取所传入字符串长度的最大值。

In [534]: dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))

In [535]: dfs
Out[535]: 
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [536]: store.append("dfs", dfs, min_itemsize=30)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [536], in <cell line: 1>()
----> 1 store.append("dfs", dfs, min_itemsize=30)

NameError: name 'store' is not defined

In [537]: store.get_storer("dfs").table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [537], in <cell line: 1>()
----> 1 store.get_storer("dfs").table

NameError: name 'store' is not defined

# A is created as a data_column with a size of 30
# B is size is calculated
In [538]: store.append("dfs2", dfs, min_itemsize={"A": 30})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [538], in <cell line: 1>()
----> 1 store.append("dfs2", dfs, min_itemsize={"A": 30})

NameError: name 'store' is not defined

In [539]: store.get_storer("dfs2").table
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [539], in <cell line: 1>()
----> 1 store.get_storer("dfs2").table

NameError: name 'store' is not defined

nan_rep

字符串列会用 nan_rep 字符串表示法来序列化 np.nan(缺失值),默认为字符串值 nan。这可能会无意中把真正的 nan 字符串值变成缺失值。

In [540]: dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})

In [541]: dfss
Out[541]: 
     A
0  foo
1  bar
2  nan

In [542]: store.append("dfss", dfss)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [542], in <cell line: 1>()
----> 1 store.append("dfss", dfss)

NameError: name 'store' is not defined

In [543]: store.select("dfss")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [543], in <cell line: 1>()
----> 1 store.select("dfss")

NameError: name 'store' is not defined

# here you need to specify a different nan rep
In [544]: store.append("dfss2", dfss, nan_rep="_nan_")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [544], in <cell line: 1>()
----> 1 store.append("dfss2", dfss, nan_rep="_nan_")

NameError: name 'store' is not defined

In [545]: store.select("dfss2")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [545], in <cell line: 1>()
----> 1 store.select("dfss2")

NameError: name 'store' is not defined

外部兼容性#

HDFStore 以特定的格式写入 table 格式对象,以便与 pandas 对象进行无损往返。为了实现外部兼容性,HDFStore 也可以读取原生的 PyTables 格式表。

使用 rhdf5 库(Package website),可以写出一个能够轻松导入 R 的 HDFStore 对象。创建一个表格式的存储,如下所示:

In [546]: df_for_r = pd.DataFrame(
   .....:     {
   .....:         "first": np.random.rand(100),
   .....:         "second": np.random.rand(100),
   .....:         "class": np.random.randint(0, 2, (100,)),
   .....:     },
   .....:     index=range(100),
   .....: )
   .....: 

In [547]: df_for_r.head()
Out[547]: 
      first    second  class
0  0.864919  0.852910      0
1  0.030579  0.412962      1
2  0.015226  0.978410      0
3  0.498512  0.686761      0
4  0.232163  0.328185      1

In [548]: store_export = pd.HDFStore("export.h5")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:139, in import_optional_dependency(name, extra, errors, min_version)
    138 try:
--> 139     module = importlib.import_module(name)
    140 except ImportError:

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1004, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'tables'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Input In [548], in <cell line: 1>()
----> 1 store_export = pd.HDFStore("export.h5")

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/io/pytables.py:573, in HDFStore.__init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    570 if "format" in kwargs:
    571     raise ValueError("format is not a defined argument for HDFStore")
--> 573 tables = import_optional_dependency("tables")
    575 if complib is not None and complib not in tables.filters.all_complibs:
    576     raise ValueError(
    577         f"complib only supports {tables.filters.all_complibs} compression."
    578     )

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/compat/_optional.py:142, in import_optional_dependency(name, extra, errors, min_version)
    140 except ImportError:
    141     if errors == "raise":
--> 142         raise ImportError(msg)
    143     else:
    144         return None

ImportError: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.

In [549]: store_export.append("df_for_r", df_for_r, data_columns=df_dc.columns)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [549], in <cell line: 1>()
----> 1 store_export.append("df_for_r", df_for_r, data_columns=df_dc.columns)

NameError: name 'store_export' is not defined

In [550]: store_export
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [550], in <cell line: 1>()
----> 1 store_export

NameError: name 'store_export' is not defined

在R中,该文件可以读入一个 data.frame 对象使用 rhdf5 类库。下面的示例函数从值中读取相应的列名和数据值,并将它们组装成 data.frame

# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.

library(rhdf5)

loadhdf5data <- function(h5File) {

listing <- h5ls(h5File)
# Find all data nodes, values are stored in *_values and corresponding column
# titles in *_items
data_nodes <- grep("_values", listing$name)
name_nodes <- grep("_items", listing$name)
data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
columns = list()
for (idx in seq(data_paths)) {
  # NOTE: matrices returned by h5read have to be transposed to obtain
  # required Fortran order!
  data <- data.frame(t(h5read(h5File, data_paths[idx])))
  names <- t(h5read(h5File, name_paths[idx]))
  entry <- data.frame(data)
  colnames(entry) <- names
  columns <- append(columns, entry)
}

data <- data.frame(columns)

return(data)
}

现在,您可以导入 DataFrame 进入R:

> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1

备注

该 R 函数会列出整个 HDF5 文件的内容,并从所有匹配的节点组装 data.frame 对象,因此如果您在单个 HDF5 文件中存储了多个 DataFrame 对象,请仅把它作为一个起点。

性能#

  • 与 fixed 存储相比,tables 格式在写入性能上有一定损失,但好处是能够追加/删除和查询(可能是非常大量的数据)。与常规存储相比,写入时间通常更长;查询时间则可能相当快,尤其是在已建立索引的轴上。

  • 您可以向 append 传递 chunksize=<int>,指定写入块的大小(默认为 50000)。这会显著降低写入时的内存使用量(见本列表之后的示例)。

  • 您可以在第一次 append 时传递 expectedrows=<int>,设置 PyTables 预期的总行数。这将优化读/写性能。

  • 可以将重复的行写入表,但会在选择中将其过滤掉(最后一项被选中;因此,表在主、次对上是唯一的)

  • A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See Here 获取更多信息和一些解决方案。
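针对上面关于 chunksize 和 expectedrows 的两条,下面是一个示意(键名 big 以及行数均为示例):

big = pd.DataFrame(np.random.randn(100000, 2), columns=list("AB"))
store.append("big", big, chunksize=10000, expectedrows=100000)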

Feather#

Feather为数据帧提供二进制列式序列化。它旨在提高读取和写入数据帧的效率,并使跨数据分析语言共享数据变得容易。

Feather 被设计为忠实地序列化和反序列化 DataFrame,支持所有 pandas 数据类型,包括诸如 categorical 和带时区的 datetime 等扩展数据类型。

几点注意事项:

  • 该格式不会为 DataFrame 写入 Index 或 MultiIndex,如果提供了非默认索引则会引发错误。您可以用 .reset_index() 存储索引,或用 .reset_index(drop=True) 忽略它。

  • 不支持重复列名和非字符串列名称

  • 不支持对象数据类型列中的实际Python对象。这些将在尝试序列化时引发有用的错误消息。

请参阅 Full Documentation
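
A small illustration of the index caveat above (column and file names here are arbitrary): a DataFrame with a custom index is either stored by resetting the index first or written without it.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Keep the custom index by turning it into a regular column first ...
df.reset_index().to_feather("with_index.feather")

# ... or drop it entirely.
df.reset_index(drop=True).to_feather("without_index.feather")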

In [551]: df = pd.DataFrame(
   .....:     {
   .....:         "a": list("abc"),
   .....:         "b": list(range(1, 4)),
   .....:         "c": np.arange(3, 6).astype("u1"),
   .....:         "d": np.arange(4.0, 7.0, dtype="float64"),
   .....:         "e": [True, False, True],
   .....:         "f": pd.Categorical(list("abc")),
   .....:         "g": pd.date_range("20130101", periods=3),
   .....:         "h": pd.date_range("20130101", periods=3, tz="US/Eastern"),
   .....:         "i": pd.date_range("20130101", periods=3, freq="ns"),
   .....:     }
   .....: )
   .....: 

In [552]: df
Out[552]: 
   a  b  c    d      e  f          g                         h                             i
0  a  1  3  4.0   True  a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
1  b  2  4  5.0  False  b 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
2  c  3  5  6.0   True  c 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002

In [553]: df.dtypes
Out[553]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                      category
g                datetime64[ns]
h    datetime64[ns, US/Eastern]
i                datetime64[ns]
dtype: object

Write to a feather file.

In [554]: df.to_feather("example.feather")

Read from a feather file.

In [555]: result = pd.read_feather("example.feather")

In [556]: result
Out[556]: 
   a  b  c    d      e  f          g                         h                             i
0  a  1  3  4.0   True  a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
1  b  2  4  5.0  False  b 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
2  c  3  5  6.0   True  c 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002

# we preserve dtypes
In [557]: result.dtypes
Out[557]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                      category
g                datetime64[ns]
h    datetime64[ns, US/Eastern]
i                datetime64[ns]
dtype: object

Parquet#

Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.

Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.

Several caveats.

  • Duplicate column names and non-string column names are not supported.

  • The pyarrow engine always writes the index to the output, but fastparquet only writes non-default indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can force including or omitting indexes with the index argument, regardless of the underlying engine.

  • Index level names, if specified, must be strings.

  • In the pyarrow engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.

  • The pyarrow engine preserves the ordered flag of categorical dtypes with string types. fastparquet does not preserve the ordered flag.

  • Non supported types include Interval and actual Python object types. These will raise a helpful error message on an attempt at serialization. Period type is supported with pyarrow >= 0.16.0.

  • The pyarrow engine preserves extension data types such as the nullable integer and string data type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, see the extension types documentation).

You can specify an engine to direct the serialization. This can be one of pyarrow, or fastparquet, or auto. If the engine is NOT specified, then the pd.options.io.parquet.engine option is checked; if this is also auto, then pyarrow is tried, and falling back to fastparquet (a short sketch follows below).

See the documentation for pyarrow and fastparquet.
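
A minimal sketch of the engine selection described above (assumes pyarrow is installed; the file name is arbitrary):

import pandas as pd

# "auto" (the default) tries pyarrow first and falls back to fastparquet.
print(pd.options.io.parquet.engine)

# Pin the default engine so engine= does not need to be passed on each call.
pd.options.io.parquet.engine = "pyarrow"

pd.DataFrame({"a": [1, 2, 3]}).to_parquet("engine_demo.parquet")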

Note

These engines are very similar and should read/write nearly identical parquet format files. Currently pyarrow does not support timedelta data, and fastparquet>=0.1.4 supports timezone aware datetimes. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

In [558]: df = pd.DataFrame(
   .....:     {
   .....:         "a": list("abc"),
   .....:         "b": list(range(1, 4)),
   .....:         "c": np.arange(3, 6).astype("u1"),
   .....:         "d": np.arange(4.0, 7.0, dtype="float64"),
   .....:         "e": [True, False, True],
   .....:         "f": pd.date_range("20130101", periods=3),
   .....:         "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
   .....:         "h": pd.Categorical(list("abc")),
   .....:         "i": pd.Categorical(list("abc"), ordered=True),
   .....:     }
   .....: )
   .....: 

In [559]: df
Out[559]: 
   a  b  c    d      e          f                         g  h  i
0  a  1  3  4.0   True 2013-01-01 2013-01-01 00:00:00-05:00  a  a
1  b  2  4  5.0  False 2013-01-02 2013-01-02 00:00:00-05:00  b  b
2  c  3  5  6.0   True 2013-01-03 2013-01-03 00:00:00-05:00  c  c

In [560]: df.dtypes
Out[560]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                datetime64[ns]
g    datetime64[ns, US/Eastern]
h                      category
i                      category
dtype: object

Write to a parquet file.

In [561]: df.to_parquet("example_pa.parquet", engine="pyarrow")

In [562]: df.to_parquet("example_fp.parquet", engine="fastparquet")
ImportError: Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

Read from a parquet file.

In [563]: result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
ImportError: Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

In [564]: result = pd.read_parquet("example_pa.parquet", engine="pyarrow")

In [565]: result.dtypes
Out[565]: 
a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                datetime64[ns]
g    datetime64[ns, US/Eastern]
h                      category
i                      category
dtype: object

Read only certain columns of a parquet file.

In [566]: result = pd.read_parquet(
   .....:     "example_fp.parquet",
   .....:     engine="fastparquet",
   .....:     columns=["a", "b"],
   .....: )
   .....: 
ImportError: Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

In [567]: result = pd.read_parquet(
   .....:     "example_pa.parquet",
   .....:     engine="pyarrow",
   .....:     columns=["a", "b"],
   .....: )
   .....: 

In [568]: result.dtypes
Out[568]: 
a    object
b     int64
dtype: object

Handling indexes#

Serializing a DataFrame to parquet may include the implicit index as one or more columns in the output file. Thus, this code:

In [569]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [570]: df.to_parquet("test.parquet", engine="pyarrow")

creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you're using fastparquet, the index may or may not be written to the file.

This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn't exist in the target table.

If you want to omit a dataframe's indexes when writing, pass index=False to to_parquet():

In [571]: df.to_parquet("test.parquet", index=False)

This creates a parquet file with just the two expected columns, a and b. If your DataFrame has a custom index, you won't get it back when you load this file into a DataFrame.

Passing index=True will always write the index, even if that's not the underlying engine's default behaviour.
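
For instance, a minimal sketch continuing the df from the example above (the file name is arbitrary):

# Force the index into the file regardless of the engine's default behaviour.
df.to_parquet("test_with_index.parquet", engine="pyarrow", index=True)

# Reading it back restores the original index.
pd.read_parquet("test_with_index.parquet", engine="pyarrow")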

Partitioning Parquet files#

Parquet supports partitioning of data based on the values of one or more columns.

In [572]: df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})

In [573]: df.to_parquet(path="test", engine="pyarrow", partition_cols=["a"], compression=None)

The path specifies the parent directory to which data will be saved. The partition_cols are the column names by which the dataset will be partitioned. Columns are partitioned in the order they are given. The partition splits are determined by the unique values in the partition columns. The above example creates a partitioned dataset that may look like:

test
├── a=0
│   ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
│   └──  ...
└── a=1
    ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
    └── ...
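
A minimal sketch of reading that partitioned dataset back with the pyarrow engine (the partition column a is reconstructed from the directory names, typically as a categorical):

result = pd.read_parquet("test", engine="pyarrow")
result.dtypes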

ORC#

New in version 1.0.0.

Similar to the parquet format, the ORC Format is a binary columnar serialization for data frames. It is designed to make reading data frames efficient. pandas provides only a reader for the ORC format, read_orc(). This requires the pyarrow library.
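
A minimal sketch, assuming an existing ORC file named example.orc and an installed pyarrow (the file and column names are illustrative):

result = pd.read_orc("example.orc")

# Optionally read only a subset of columns.
result = pd.read_orc("example.orc", columns=["a", "b"])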


SQL queries#

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python's standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respects the Python DB-API.

See also some cookbook examples for some advanced strategies.

The key functions are:

read_sql_table(table_name, con[, schema, ...])

Read SQL database table into a DataFrame.

read_sql_query(sql, con[, index_col, ...])

Read SQL query into a DataFrame.

read_sql(sql, con[, index_col, ...])

Read SQL query or database table into a DataFrame.

DataFrame.to_sql(name, con[, schema, ...])

Write records stored in a DataFrame to a SQL database.

Note

The function read_sql() is a convenience wrapper around read_sql_table() and read_sql_query() (and for backward compatibility) and will delegate to the specific function depending on the provided input (database table name or SQL query). Table names do not need to be quoted if they have special characters.

In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database where data are stored in "memory".

To connect with SQLAlchemy you use the create_engine() function to create an engine object from database URI. You only need to create the engine once per database you are connecting to. For more information on create_engine() and the URI formatting, see the examples below and the SQLAlchemy documentation.

In [574]: from sqlalchemy import create_engine
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [574], in <cell line: 1>()
----> 1 from sqlalchemy import create_engine

ModuleNotFoundError: No module named 'sqlalchemy'

# Create your engine.
In [575]: engine = create_engine("sqlite:///:memory:")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [575], in <cell line: 1>()
----> 1 engine = create_engine("sqlite:///:memory:")

NameError: name 'create_engine' is not defined

If you want to manage your own connections you can pass one of those instead. The example below opens a connection to the database using a Python context manager that automatically closes the connection after the block has completed. See the SQLAlchemy docs for an explanation of how the database connection is handled.

with engine.connect() as conn, conn.begin():
    data = pd.read_sql_table("data", conn)

Warning

When you open a connection to a database you are also responsible for closing it. Side effects of leaving a connection open may include locking the database or other breaking behaviour.

Writing DataFrames#

Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().

id  Date        Col_1  Col_2  Col_3
26  2012-10-18  X       25.7  True
42  2012-10-19  Y      -12.4  False
63  2012-10-20  Z       5.73  True

In [576]: data
Out[576]: 
   id       Date Col_1  Col_2  Col_3
0  26 2010-10-18     X  27.50   True
1  42 2010-10-19     Y -12.50  False
2  63 2010-10-20     Z   5.73   True

In [577]: data.to_sql("data", engine)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [577], in <cell line: 1>()
----> 1 data.to_sql("data", engine)

NameError: name 'engine' is not defined

With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:

In [578]: data.to_sql("data_chunked", engine, chunksize=1000)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [578], in <cell line: 1>()
----> 1 data.to_sql("data_chunked", engine, chunksize=1000)

NameError: name 'engine' is not defined

SQL data types#

to_sql() will try to map your data to an appropriate SQL data type based on the dtype of the data. When you have columns of dtype object, pandas will try to infer the data type.

You can always override the default type by specifying the desired SQL type of any of the columns by using the dtype argument. This argument needs a dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3 fallback mode). For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:

In [579]: from sqlalchemy.types import String
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [579], in <cell line: 1>()
----> 1 from sqlalchemy.types import String

ModuleNotFoundError: No module named 'sqlalchemy'

In [580]: data.to_sql("data_dtype", engine, dtype={"Col_1": String})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [580], in <cell line: 1>()
----> 1 data.to_sql("data_dtype", engine, dtype={"Col_1": String})

NameError: name 'engine' is not defined

Note

Due to the limited support for timedelta's in the different database flavors, columns with type timedelta64 will be written as integer values as nanoseconds to the database and a warning will be raised.

Note

Columns of category dtype will be converted to the dense representation as you would get with np.asarray(categorical) (e.g. for string categories this gives an array of strings). Because of this, reading the database table back in does NOT generate a categorical.

Datetime data types#

Using SQLAlchemy, to_sql() is capable of writing datetime data that is timezone naive or timezone aware. However, the resulting data stored in the database ultimately depends on the supported data type for datetime data of the database system being used.

The following table lists supported data types for datetime data for some common databases. Other database dialects may have different data types for datetime data.

Database    SQL Datetime Types                     Timezone Support
SQLite      TEXT                                   No
MySQL       TIMESTAMP or DATETIME                  No
PostgreSQL  TIMESTAMP or TIMESTAMP WITH TIME ZONE  Yes

When writing timezone aware data to databases that do not support timezones, the data will be written as timezone naive timestamps that are in local time with respect to the timezone.

read_sql_table() is also capable of reading datetime data that is timezone aware or naive. When reading TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC.
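
A minimal sketch of writing timezone-aware data (assumes SQLAlchemy is installed and uses an in-memory SQLite engine, which stores the values as TEXT, so this only illustrates the call pattern; names are arbitrary):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

tz_df = pd.DataFrame({"ts": pd.date_range("2013-01-01", periods=3, tz="US/Eastern")})
tz_df.to_sql("tz_demo", engine, index=False)

# Read back, explicitly parsing the timestamp column.
pd.read_sql("SELECT * FROM tz_demo", engine, parse_dates=["ts"])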

Insertion method#

The parameter method controls the SQL insertion clause used. Possible values are:

  • None: Uses standard SQL INSERT clause (one per row).

  • 'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backend if the table contains many columns. For more information check the SQLAlchemy documentation

  • callable with signature (pd_table, conn, keys, data_iter): This can be used to implement a more performant insertion method based on specific backend dialect features.

Example of a callable using PostgreSQL COPY clause:

# Alternative to_sql() *method* for DBs that support COPY FROM
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join(['"{}"'.format(k) for k in keys])
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

Reading tables#

read_sql_table() will read a database table given the table name and optionally a subset of columns to read.

Note

In order to use read_sql_table(), you must have the SQLAlchemy optional dependency installed.

In [581]: pd.read_sql_table("data", engine)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [581], in <cell line: 1>()
----> 1 pd.read_sql_table("data", engine)

NameError: name 'engine' is not defined

Note

Note that pandas infers column dtypes from query outputs, and not by looking up data types in the physical database schema. For example, assume userid is an integer column in a table. Then, intuitively, select userid ... will return integer-valued series, while select cast(userid as text) ... will return object-valued (str) series. Accordingly, if the query output is empty, then all resulting columns will be returned as object-valued (since they are most general). If you foresee that your query will sometimes generate an empty result, you may want to explicitly typecast afterwards to ensure dtype integrity.
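
A minimal sketch of that explicit typecast (the table and column names are illustrative and assume an engine created as above):

result = pd.read_sql_query("SELECT userid FROM users WHERE userid > 1000", engine)

# Even if the query happened to return no rows, force the expected dtype.
result["userid"] = result["userid"].astype("int64")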

You can also specify the name of the column as the DataFrame index, and specify a subset of columns to be read.

In [582]: pd.read_sql_table("data", engine, index_col="id")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [582], in <cell line: 1>()
----> 1 pd.read_sql_table("data", engine, index_col="id")

NameError: name 'engine' is not defined

In [583]: pd.read_sql_table("data", engine, columns=["Col_1", "Col_2"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [583], in <cell line: 1>()
----> 1 pd.read_sql_table("data", engine, columns=["Col_1", "Col_2"])

NameError: name 'engine' is not defined

And you can explicitly force columns to be parsed as dates:

In [584]: pd.read_sql_table("data", engine, parse_dates=["Date"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [584], in <cell line: 1>()
----> 1 pd.read_sql_table("data", engine, parse_dates=["Date"])

NameError: name 'engine' is not defined

If needed you can explicitly specify a format string, or a dict of arguments to pass to pandas.to_datetime():

pd.read_sql_table("data", engine, parse_dates={"Date": "%Y-%m-%d"})
pd.read_sql_table(
    "data",
    engine,
    parse_dates={"Date": {"format": "%Y-%m-%d %H:%M:%S"}},
)

You can check if a table exists using has_table().
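
A minimal sketch (assumes the engine from the examples above; the table names are illustrative):

from pandas.io import sql

sql.has_table("data", engine)           # True if the table exists
sql.has_table("missing_table", engine)  # False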

Schema support#

Reading from and writing to different schemas is supported through the schema keyword in the read_sql_table() and to_sql() functions. Note however that this depends on the database flavor (sqlite does not have schemas). For example:

df.to_sql("table", engine, schema="other_schema")
pd.read_sql_table("table", engine, schema="other_schema")

Querying#

You can query using raw SQL in the read_sql_query() function. In this case you must use the SQL variant appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, which are database-agnostic.

In [585]: pd.read_sql_query("SELECT * FROM data", engine)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [585], in <cell line: 1>()
----> 1 pd.read_sql_query("SELECT * FROM data", engine)

NameError: name 'engine' is not defined

Of course, you can specify a more "complex" query.

In [586]: pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [586], in <cell line: 1>()
----> 1 pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)

NameError: name 'engine' is not defined

The read_sql_query() function supports a chunksize argument. Specifying this will return an iterator through chunks of the query result:

In [587]: df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc"))

In [588]: df.to_sql("data_chunks", engine, index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [588], in <cell line: 1>()
----> 1 df.to_sql("data_chunks", engine, index=False)

NameError: name 'engine' is not defined
In [589]: for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5):
   .....:     print(chunk)
   .....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [589], in <cell line: 1>()
----> 1 for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5):
      2     print(chunk)

NameError: name 'engine' is not defined

You can also run a plain query without creating a DataFrame with execute(). This is useful for queries that don't return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy engine or db connection object. Again, you must use the SQL syntax variant appropriate for your database.

from pandas.io import sql

sql.execute("SELECT * FROM table_name", engine)
sql.execute(
    "INSERT INTO table_name VALUES(?, ?, ?)", engine, params=[("id", 1, 12.2, True)]
)

Engine connection examples#

To connect with SQLAlchemy you use the create_engine() function to create an engine object from database URI. You only need to create the engine once per database you are connecting to.

from sqlalchemy import create_engine

engine = create_engine("postgresql://scott:tiger@localhost:5432/mydatabase")

engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo")

engine = create_engine("oracle://scott:tiger@127.0.0.1:1521/sidname")

engine = create_engine("mssql+pyodbc://mydsn")

# sqlite://<nohostname>/<path>
# where <path> is relative:
engine = create_engine("sqlite:///foo.db")

# or absolute, starting with a slash:
engine = create_engine("sqlite:////absolute/path/to/foo.db")

For more information see the examples in the SQLAlchemy documentation.

Advanced SQLAlchemy queries#

You can use SQLAlchemy constructs to describe your query.

Use sqlalchemy.text() to specify query parameters in a backend-neutral way

In [590]: import sqlalchemy as sa
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [590], in <cell line: 1>()
----> 1 import sqlalchemy as sa

ModuleNotFoundError: No module named 'sqlalchemy'

In [591]: pd.read_sql(
   .....:     sa.text("SELECT * FROM data where Col_1=:col1"), engine, params={"col1": "X"}
   .....: )
   .....: 
AttributeError: 'Series' object has no attribute 'text'

If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions

In [592]: metadata = sa.MetaData()
AttributeError: 'Series' object has no attribute 'MetaData'

In [593]: data_table = sa.Table(
   .....:     "data",
   .....:     metadata,
   .....:     sa.Column("index", sa.Integer),
   .....:     sa.Column("Date", sa.DateTime),
   .....:     sa.Column("Col_1", sa.String),
   .....:     sa.Column("Col_2", sa.Float),
   .....:     sa.Column("Col_3", sa.Boolean),
   .....: )
   .....: 
AttributeError: 'Series' object has no attribute 'Table'

In [594]: pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3 is True), engine)
AttributeError: 'Series' object has no attribute 'select'

You can combine SQLAlchemy expressions with parameters passed to read_sql() using sqlalchemy.bindparam()

In [595]: import datetime as dt

In [596]: expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam("date"))
AttributeError: 'Series' object has no attribute 'select'

In [597]: pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [597], in <cell line: 1>()
----> 1 pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)})

NameError: name 'engine' is not defined

Sqlite fallback#

The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter which respects the Python DB-API.

You can create connections like so:

import sqlite3

con = sqlite3.connect(":memory:")

And then issue the following queries:

data.to_sql("data", con)
pd.read_sql_query("SELECT * FROM data", con)

Google BigQuery#

Warning

Starting in 0.20.0, pandas has split off Google BigQuery support into the separate package pandas-gbq. You can pip install pandas-gbq to get it.

The pandas-gbq package provides functionality to read/write from Google BigQuery.

pandas integrates with this external package. If pandas-gbq is installed, you can use the pandas methods pd.read_gbq and DataFrame.to_gbq, which will call the respective functions from pandas-gbq.

Full documentation can be found here.
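
A minimal sketch (requires pandas-gbq and valid Google Cloud credentials; the project, dataset and table names below are placeholders):

import pandas as pd

# Read the result of a query into a DataFrame.
df = pd.read_gbq(
    "SELECT name, number FROM `my_dataset.my_table` LIMIT 10",
    project_id="my-project",
)

# Write a DataFrame to a BigQuery table.
df.to_gbq("my_dataset.another_table", project_id="my-project", if_exists="replace")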

Stata format#

Writing to Stata format#

The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always 115 (Stata 12).

In [598]: df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [599]: df.to_stata("stata.dta")

Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16, int32, float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to int16. nan values in floating points data types are stored as the basic missing data type (. in Stata).

Note

It is not possible to export missing data values for integer data types.

The Stata writer gracefully handles other data types including int64, bool, uint8, uint16, uint32 by casting to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of this range, the variable is cast to int16.
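
A minimal sketch of that casting behaviour (the file name is arbitrary; the resulting dtype depends on the values, as described above):

import numpy as np
import pandas as pd

small = pd.DataFrame({"x": np.array([1, 2, 3], dtype="uint8")})
small.to_stata("uint8_demo.dta")

# Values stay below 100, so the column fits Stata's byte (int8) type.
pd.read_stata("uint8_demo.dta").dtypes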

Warning

Conversion from int64 to float64 may result in a loss of precision if int64 values are larger than 2**53.

Warning

StataWriter and to_stata() only support fixed width strings containing up to 244 characters, a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError.

Reading from Stata format#

The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that can be used to read the file incrementally.

In [600]: pd.read_stata("stata.dta")
Out[600]: 
   index         A         B
0      0  0.608228  1.064810
1      1 -0.780506 -2.736887
2      2  0.143539  1.170191
3      3 -1.573076  0.075792
4      4 -1.722223 -0.774650
5      5  0.803627  0.221665
6      6  0.584637  0.147264
7      7  1.057825 -0.284136
8      8  0.912395  1.552808
9      9  0.189376 -0.109830

Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the file at a time. The StataReader object can be used as an iterator.

In [601]: with pd.read_stata("stata.dta", chunksize=3) as reader:
   .....:     for df in reader:
   .....:         print(df.shape)
   .....: 
(3, 3)
(3, 3)
(3, 3)
(1, 3)

For more fine-grained control, use iterator=True and specify chunksize with each call to read().

In [602]: with pd.read_stata("stata.dta", iterator=True) as reader:
   .....:     chunk1 = reader.read(5)
   .....:     chunk2 = reader.read(5)
   .....: 

Currently the index is retrieved as a column.

The parameter convert_categoricals indicates whether value labels should be read and used to create a Categorical variable from them. Value labels can also be retrieved by the function value_labels, which requires read() to be called before use.

The parameter convert_missing indicates whether missing value representations in Stata should be preserved. If False (the default), missing values are represented as np.nan. If True, missing values are represented using StataMissingValue objects, and columns containing missing values will have object data type.

Note

read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and 118 (Stata 14).

Note

Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all integer types and float64 for floating point data. By default, the Stata data types are preserved when importing.
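
A minimal sketch, reusing the stata.dta file written above:

# Stata dtypes preserved (the default) ...
pd.read_stata("stata.dta").dtypes

# ... versus upcast to int64/float64.
pd.read_stata("stata.dta", preserve_dtypes=False).dtypes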

Categorical data#

Categorical data can be exported to Stata data files as value labeled data. The exported data consists of the underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit equivalent to a Categorical and information about whether the variable is ordered is lost when exporting.

Warning

Stata only supports string value labels, and so str is called on the categories when exporting data. Exporting Categorical variables with non-string categories produces a warning, and can result in a loss of information if the str representations of the categories are not unique.

Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword argument convert_categoricals (True by default). The keyword argument order_categoricals (True by default) determines whether imported Categorical variables are ordered.

Note

When importing categorical data, the values of the variables in the Stata data file are not preserved since Categorical variables always use integer data types between -1 and n-1 where n is the number of categories. If the original values in the Stata data file are required, these can be imported by setting convert_categoricals=False, which will import original data (but not the variable labels). The original values can be matched to the imported categorical data since there is a simple mapping between the original Stata data values and the category codes of imported Categorical variables: missing values are assigned code -1, and the smallest original value is assigned 0, the second smallest is assigned 1 and so on until the largest original value is assigned the code n-1.

Note

Stata supports partially labeled series. These series have value labels for some but not all data values. Importing a partially labeled series will produce a Categorical with string categories for the values that are labeled and numeric categories for values with no label.

SAS formats#

The top-level function read_sas() can read (but not write) SAS XPORT (.xpt) and (since v0.18.0) SAS7BDAT (.sas7bdat) format files.

SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By default the whole file is read and returned as a DataFrame.

Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader) for incrementally reading the file. The reader objects also have attributes that contain additional information about the file and its variables.

Read a SAS7BDAT file:

df = pd.read_sas("sas_data.sas7bdat")

Obtain an iterator and read an XPORT file 100,000 lines at a time:

def do_something(chunk):
    pass


with pd.read_sas("sas_xport.xpt", chunksize=100000) as rdr:
    for chunk in rdr:
        do_something(chunk)

The specification for the xport file format is available from the SAS web site.

No official documentation is available for the SAS7BDAT format.

SPSS formats#

New in version 0.25.0.

The top-level function read_spss() can read (but not write) SPSS SAV (.sav) and zsav (.zsav) format files.

SPSS files contain column names. By default the whole file is read, categorical columns are converted into pd.Categorical, and a DataFrame with all columns is returned.

Specify the usecols parameter to obtain a subset of columns. Specify convert_categoricals=False to avoid converting categorical columns into pd.Categorical.

Read an SPSS file:

df = pd.read_spss("spss_data.sav")

Extract a subset of columns contained in usecols from an SPSS file and avoid converting categorical columns into pd.Categorical:

df = pd.read_spss(
    "spss_data.sav",
    usecols=["foo", "bar"],
    convert_categoricals=False,
)

More information about the SAV and zsav file formats is available here.

Other file formats#

pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading and writing other file formats into and from pandas, we recommend these packages from the broader community.

NetCDF#

xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional datasets, with a focus on the netCDF file format and easy conversion to and from pandas.
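
A rough sketch of that round trip (assumes xarray and a netCDF backend such as netCDF4 or scipy are installed; names are illustrative):

import pandas as pd
import xarray as xr

df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=pd.Index([0, 1, 2], name="x"))

# DataFrame -> xarray Dataset -> netCDF file ...
df.to_xarray().to_netcdf("example.nc")

# ... and back again.
round_trip = xr.open_dataset("example.nc").to_dataframe()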

Performance considerations#

This is an informal comparison of various IO methods, using pandas 0.24.2. Timings are machine dependent and small differences should be ignored.

In [1]: sz = 1000000
In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null float64
B    1000000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.3 MB

The following test functions will be used below to compare the performance of several IO methods:

import os
import sqlite3

import numpy as np
import pandas as pd

sz = 1000000
np.random.seed(42)
df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})


def test_sql_write(df):
    if os.path.exists("test.sql"):
        os.remove("test.sql")
    sql_db = sqlite3.connect("test.sql")
    df.to_sql(name="test_table", con=sql_db)
    sql_db.close()


def test_sql_read():
    sql_db = sqlite3.connect("test.sql")
    pd.read_sql_query("select * from test_table", sql_db)
    sql_db.close()


def test_hdf_fixed_write(df):
    df.to_hdf("test_fixed.hdf", "test", mode="w")


def test_hdf_fixed_read():
    pd.read_hdf("test_fixed.hdf", "test")


def test_hdf_fixed_write_compress(df):
    df.to_hdf("test_fixed_compress.hdf", "test", mode="w", complib="blosc")


def test_hdf_fixed_read_compress():
    pd.read_hdf("test_fixed_compress.hdf", "test")


def test_hdf_table_write(df):
    df.to_hdf("test_table.hdf", "test", mode="w", format="table")


def test_hdf_table_read():
    pd.read_hdf("test_table.hdf", "test")


def test_hdf_table_write_compress(df):
    df.to_hdf(
        "test_table_compress.hdf", "test", mode="w", complib="blosc", format="table"
    )


def test_hdf_table_read_compress():
    pd.read_hdf("test_table_compress.hdf", "test")


def test_csv_write(df):
    df.to_csv("test.csv", mode="w")


def test_csv_read():
    pd.read_csv("test.csv", index_col=0)


def test_feather_write(df):
    df.to_feather("test.feather")


def test_feather_read():
    pd.read_feather("test.feather")


def test_pickle_write(df):
    df.to_pickle("test.pkl")


def test_pickle_read():
    pd.read_pickle("test.pkl")


def test_pickle_write_compress(df):
    df.to_pickle("test.pkl.compress", compression="xz")


def test_pickle_read_compress():
    pd.read_pickle("test.pkl.compress", compression="xz")


def test_parquet_write(df):
    df.to_parquet("test.parquet")


def test_parquet_read():
    pd.read_parquet("test.parquet")

When writing, the top three functions in terms of speed are test_feather_write, test_hdf_fixed_write and test_hdf_fixed_write_compress.

In [4]: %timeit test_sql_write(df)
3.29 s ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit test_hdf_fixed_write(df)
19.4 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit test_hdf_fixed_write_compress(df)
19.6 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit test_hdf_table_write(df)
449 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit test_hdf_table_write_compress(df)
448 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit test_csv_write(df)
3.66 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit test_feather_write(df)
9.75 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit test_pickle_write(df)
30.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit test_pickle_write_compress(df)
4.29 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit test_parquet_write(df)
67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When reading, the top three functions in terms of speed are test_feather_read, test_pickle_read and test_hdf_fixed_read.

In [14]: %timeit test_sql_read()
1.77 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit test_hdf_fixed_read()
19.4 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit test_hdf_fixed_read_compress()
19.5 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit test_hdf_table_read()
38.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit test_hdf_table_read_compress()
38.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [19]: %timeit test_csv_read()
452 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [20]: %timeit test_feather_read()
12.4 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit test_pickle_read()
18.4 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit test_pickle_read_compress()
915 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [23]: %timeit test_parquet_read()
24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The files test.pkl.compress, test.parquet and test.feather took the least space on disk (in bytes).

29519500 Oct 10 06:45 test.csv
16000248 Oct 10 06:45 test.feather
8281983  Oct 10 06:49 test.parquet
16000857 Oct 10 06:47 test.pkl
7552144  Oct 10 06:48 test.pkl.compress
34816000 Oct 10 06:42 test.sql
24009288 Oct 10 06:43 test_fixed.hdf
24009288 Oct 10 06:43 test_fixed_compress.hdf
24458940 Oct 10 06:44 test_table.hdf
24458940 Oct 10 06:44 test_table_compress.hdf