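Version 0.10.1 (January 22, 2013)

This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.

An undesired API breakage with functions taking the inplace option has been reverted, and deprecation warnings have been added.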

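API changes

  • Functions taking an inplace option return the calling object as before. A deprecation message has been added

  • Groupby aggregations Max/Min no longer exclude non-numeric data (GH2700)

  • Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH2640)

  • The file reader will now raise an exception when NA values are found in an explicitly specified integer column instead of converting the column to float (GH2631)

  • DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH2563)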
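Three of the changes above can be checked with a short script. This is a minimal sketch that assumes a recent pandas release still behaves as described for 0.10.1; the frame contents, the CSV text, and the variable names are invented for illustration.

```python
import io

import pandas as pd

# Groupby max/min keep non-numeric columns such as "label".
df = pd.DataFrame(
    {"key": ["a", "a", "b"], "label": ["x", "y", "z"], "value": [1, 2, 3]}
)
maxima = df.groupby("key").max()

# A column explicitly declared as integer that contains an NA value
# raises instead of being silently converted to float.
csv_text = "a,b\n1,2\n,3\n"
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"a": "int64"})
    reader_raised = False
except ValueError:
    reader_raised = True

# DatetimeIndex.unique keeps the index type, the name, and the time zone.
idx = pd.DatetimeIndex(
    ["2000-01-01", "2000-01-01", "2000-01-02"], tz="UTC", name="ts"
)
unique_idx = idx.unique()

print(maxima)
print(reader_raised)
print(unique_idx)
```

The resampling change is left out of the sketch because the exact shape of an empty resample result is not something the notes above specify.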

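New features

  • MySQL support for database (contribution from Dan Allan)

HDFStore

You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.

You can designate (and index) certain columns that you want to be able to query on a table by passing a list to data_columns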

In [1]: store = pd.HDFStore("store.h5")

In [2]: df = pd.DataFrame(
   ...:     np.random.randn(8, 3),
   ...:     index=pd.date_range("1/1/2000", periods=8),
   ...:     columns=["A", "B", "C"],
   ...: )
   ...: 

In [3]: df["string"] = "foo"

In [4]: df.loc[df.index[4:6], "string"] = np.nan

In [5]: df.loc[df.index[7:9], "string"] = "bar"

In [6]: df["string2"] = "cool"

In [7]: df
Out[7]: 
                   A         B         C string string2
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool
2000-01-04 -2.104569 -0.494929  1.071804    foo    cool
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool
2000-01-08  0.113648 -1.478427  0.524988    bar    cool

# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])

In [9]: store.select("df", "B>0 and string=='foo'")

# this is the in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

Retrieving unique values in an indexable or data column.

# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
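The replacement named in the comment above can be tried end to end. The following is only a sketch: the file name, the frame contents, and the helper unique_values are made up for the example, and the demo is skipped when the optional PyTables package (tables) is not installed.

```python
import os
import tempfile

import pandas as pd


def unique_values(path):
    """Store a small table, then read unique values through select_column."""
    df = pd.DataFrame(
        {"B": [1.0, 2.0, 3.0], "string": ["foo", "bar", "foo"]},
        index=pd.date_range("2000-01-01", periods=3),
    )
    with pd.HDFStore(path, mode="w") as store:
        # "string" must be a data column to be selectable on its own.
        store.append("df", df, data_columns=["string"])
        index_values = store.select_column("df", "index").unique()
        string_values = store.select_column("df", "string").unique()
    return index_values, string_values


try:
    import tables  # noqa: F401  (optional dependency used by HDFStore)
except ImportError:
    uniques = None  # PyTables is missing, so the demo is skipped
else:
    with tempfile.TemporaryDirectory() as tmp:
        uniques = unique_values(os.path.join(tmp, "unique_demo.h5"))

print(uniques)
```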

You can now store datetime64 in data columns

In [11]: df_mixed = df.copy()

In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")

In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan

In [14]: store.append("df_mixed", df_mixed)

In [15]: df_mixed1 = store.select("df_mixed")

In [16]: df_mixed1

In [17]: df_mixed1.dtypes.value_counts()

You can pass the columns keyword to select to filter the list of returned columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter)

In [18]: store.select("df", columns=["A", "B"])

HDFStore now serializes MultiIndex dataframes when appending tables.

In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                               ['one', 'two', 'three']],
   ....:                       codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                       names=['foo', 'bar'])
   ....:

In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....:

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.925372 -0.006438 -0.820408
    two -0.600874 -1.039266  0.824758

Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.

In [19]: df_mt = pd.DataFrame(
   ....:     np.random.randn(8, 6),
   ....:     index=pd.date_range("1/1/2000", periods=8),
   ....:     columns=["A", "B", "C", "D", "E", "F"],
   ....: )
   ....: 

In [20]: df_mt["foo"] = "bar"

# you can also create the tables individually
In [21]: store.append_to_multiple(
   ....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   ....: )
   ....: 

In [22]: store

# individual tables were created
In [23]: store.select("df1_mt")

In [24]: store.select("df2_mt")

# as a multiple
In [25]: store.select_as_multiple(
   ....:     ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   ....: )
   ....: 

Enhancements

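  • HDFStore can now read native PyTables table format tables

  • You can pass nan_rep = 'my_nan_rep' to append to change the default NaN representation on disk (which is converted to/from np.nan); this defaults to nan

  • You can pass index to append. This defaults to True. This automatically creates indices on the indexables and data columns of the table

  • You can pass chunksize=an integer to append to change the writing chunk size (default is 50000). This will significantly lower your memory usage on writing.

  • You can pass expectedrows=an integer to the first append to set the total number of rows that PyTables should expect. This will optimize read/write performance.

  • Select now supports passing start and stop to limit the selection space.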

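  • Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH2698)

  • Allow DataFrame.merge to handle combinatorial sizes too large for a 64-bit integer (GH2690)

  • Series now has unary negation (-series) and inversion (~series) operators (GH2686)

  • DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH2327)

  • Series arithmetic operators can now handle constant and ndarray input (GH2574)

  • ExcelFile now takes a kind argument to specify the file type (GH2613)

  • A faster implementation for Series.str methods (GH2602)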
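The HDFStore options listed above (nan_rep, index, chunksize, expectedrows on append, and start/stop on select) can be combined in one call sequence. This is a sketch under stated assumptions rather than a recommended configuration: the helper append_with_options, the file name, the tiny chunk size, and the data are all invented for illustration, and the demo is skipped when the optional PyTables package (tables) is not installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def append_with_options(path):
    """Append a small frame with explicit options, then read back rows 2 to 4."""
    df = pd.DataFrame(
        {"A": np.arange(10, dtype="float64"), "label": ["x"] * 10},
        index=pd.date_range("2000-01-01", periods=10),
    )
    df.loc[df.index[2], "label"] = np.nan  # written to disk as nan_rep

    with pd.HDFStore(path, mode="w") as store:
        store.append(
            "df",
            df,
            nan_rep="my_nan_rep",  # on-disk stand-in for NaN in string columns
            index=True,            # build indices on the indexables (the default)
            chunksize=5,           # write in chunks of 5 rows
            expectedrows=10,       # hint passed on the first append
        )
        # start/stop restrict the selection to a window of row positions.
        return store.select("df", start=2, stop=5)


try:
    import tables  # noqa: F401  (optional dependency used by HDFStore)
except ImportError:
    window = None  # PyTables is missing, so the demo is skipped
else:
    with tempfile.TemporaryDirectory() as tmp:
        window = append_with_options(os.path.join(tmp, "options_demo.h5"))

print(window)
```

A chunk size of 5 is used only so that the ten-row frame is written in more than one chunk; the default mentioned above is far larger.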

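Bug fixes

  • HDFStore tables can now store float32 types correctly (they cannot be mixed with float64, however)

  • Fixed Google Analytics prefix when specifying request segment (GH2713).

  • Function to reset Google Analytics token store so users can recover from improperly set up client secrets (GH2687).

  • Fixed groupby bug resulting in segfault when passing in MultiIndex (GH2706)

  • Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH2699)

  • Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH2694)

  • Fixed performance issues while aggregating boolean data (GH2692)

  • When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH2686)

  • Fixed MemoryError caused by performing a counting sort when sorting MultiIndex levels with a very large number of combinatorial values (GH2684)

  • Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH2683)

  • Corrected business day subtraction logic when the offset is more than 5 business days and the starting date is on a weekend (GH2680)

  • Fixed C file parser behavior when the file has more columns than data (GH2668)

  • Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value

  • DataFrames with numerical or datetime indices are now sorted prior to plotting (GH2609)

  • Fixed DataFrame.from_records error when passed columns, index, but empty records (GH2633)

  • Several bugs fixed for Series operations when dtype is datetime64 (GH2689, GH2629, GH2626)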
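The alignment fix for Series __setitem__ with a boolean mask (GH2686) is easiest to see when the new values arrive in a different order from the target. The sketch below assumes current pandas keeps this label-based alignment; the data and names are invented for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
mask = s > 2.0  # True for labels "c" and "d"

# The replacement values are listed in the opposite order from the target.
new_values = pd.Series([40.0, 30.0], index=["d", "c"])

# Incoming values are matched to the original Series by label, not by position.
s[mask] = new_values

print(s)
```

Because the assignment aligns on labels, "c" receives 30.0 and "d" receives 40.0 even though 40.0 comes first in new_values.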

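See the full release notes or issue tracker on GitHub for a complete list.

Contributors

A total of 17 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.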

  • Andy Hayden +

  • Anton I. Sipos +

  • Chang She

  • Christopher Whelan

  • Damien Garaud +

  • Dan Allan +

  • Dieter Vandenbussche

  • Garrett Drapala +

  • Jay Parlar +

  • Thouis (Ray) Jones +

  • Vincent Arel-Bundock +

  • Wes McKinney

  • elpres

  • herrfz +

  • jreback

  • svaksha +

  • y-p