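Version 0.10.1 (January 22, 2013)#
This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.
An undesired API breakage with functions taking the inplace
option has been reverted, and deprecation warnings have been added.
API changes#
New features#
MySQL support for database (contribution from Dan Allan)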
HDFStore#
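You may need to upgrade your existing data files. Please see the compatibility section in the main docs.
You can designate (and index) certain columns of a table that you want to be able to query, by passing a list to data_columns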
In [1]: store = pd.HDFStore("store.h5")
In [2]: df = pd.DataFrame(
...: np.random.randn(8, 3),
...: index=pd.date_range("1/1/2000", periods=8),
...: columns=["A", "B", "C"],
...: )
...:
In [3]: df["string"] = "foo"
In [4]: df.loc[df.index[4:6], "string"] = np.nan
In [5]: df.loc[df.index[7:9], "string"] = "bar"
In [6]: df["string2"] = "cool"
In [7]: df
Out[7]:
A B C string string2
2000-01-01 0.469112 -0.282863 -1.509059 foo cool
2000-01-02 -1.135632 1.212112 -0.173215 foo cool
2000-01-03 0.119209 -1.044236 -0.861849 foo cool
2000-01-04 -2.104569 -0.494929 1.071804 foo cool
2000-01-05 0.721555 -0.706771 -1.039575 NaN cool
2000-01-06 0.271860 -0.424972 0.567020 NaN cool
2000-01-07 0.276232 -1.087401 -0.673690 foo cool
2000-01-08 0.113648 -1.478427 0.524988 bar cool
# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])
In [9]: store.select("df", "B>0 and string=='foo'")
# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]:
A B C string string2
2000-01-02 -1.135632 1.212112 -0.173215 foo cool
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
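The replacement mentioned in the comment above can be sketched as follows. This is a minimal illustration, not the original example: the frame contents, the temporary file, and the helper name `unique_values` are assumptions made here, and the import guard reflects that PyTables is an optional dependency (the helper returns None when it is missing).

```python
import os
import tempfile

import numpy as np
import pandas as pd

try:
    import tables  # noqa: F401  (PyTables, optional dependency of HDFStore)
    HAVE_PYTABLES = True
except ImportError:
    HAVE_PYTABLES = False

# Small illustrative frame (hypothetical data, not from the release notes)
df = pd.DataFrame(
    {"A": np.arange(4.0), "string": ["foo", "foo", "bar", "foo"]},
    index=pd.date_range("2000-01-01", periods=4),
)


def unique_values(frame, column):
    """Append frame to a temporary table and return the unique values of
    one column via select_column, or None if PyTables is unavailable."""
    if not HAVE_PYTABLES:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.h5")
        with pd.HDFStore(path) as store:
            # "string" is declared as a data column so it can be read on its own
            store.append("df", frame, data_columns=["string"])
            return store.select_column("df", column).unique()


uniques = unique_values(df, "string")
```

Only a single column is read from disk here, which is why select_column followed by unique() can stand in for the deprecated store.unique.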
You can now store datetime64
in data columns
In [11]: df_mixed = df.copy()
In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")
In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan
In [14]: store.append("df_mixed", df_mixed)
In [15]: df_mixed1 = store.select("df_mixed")
In [16]: df_mixed1
In [17]: df_mixed1.dtypes.value_counts()
You can pass the columns
keyword to select to filter the list of returned columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter)
In [18]: store.select("df", columns=["A", "B"])
HDFStore
now serializes MultiIndex dataframes when appending tables.
In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
....: ['one', 'two', 'three']],
....: labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
....: [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
....: names=['foo', 'bar'])
....:
In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
....: columns=['A', 'B', 'C'])
....:
In [21]: df
Out[21]:
A B C
foo bar
foo one -0.116619 0.295575 -1.047704
two 1.640556 1.905836 2.772115
three 0.088787 -1.144197 -0.633372
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
baz two -0.824095 -0.337730 -0.927764
three -0.840123 0.248505 -0.109250
qux one 0.431977 -0.460710 0.336505
two -3.207595 -1.535854 0.409769
three -0.673145 -0.741113 -0.110891
In [22]: store.append('mi', df)
In [23]: store.select('mi')
Out[23]:
A B C
foo bar
foo one -0.116619 0.295575 -1.047704
two 1.640556 1.905836 2.772115
three 0.088787 -1.144197 -0.633372
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
baz two -0.824095 -0.337730 -0.927764
three -0.840123 0.248505 -0.109250
qux one 0.431977 -0.460710 0.336505
two -3.207595 -1.535854 0.409769
three -0.673145 -0.741113 -0.110891
# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
A B C
foo bar
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
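For comparison, the same level-based selection can be written against an in-memory frame. This is a sketch only: it builds the index with MultiIndex.from_arrays rather than the constructor used in the session above (whose labels keyword was renamed codes in later pandas versions), and the random values are placeholders.

```python
import numpy as np
import pandas as pd

# Same level values as the session above, built from explicit arrays
index = pd.MultiIndex.from_arrays(
    [
        ["foo", "foo", "foo", "bar", "bar", "baz", "baz", "qux", "qux", "qux"],
        ["one", "two", "three", "one", "two", "two", "three", "one", "two", "three"],
    ],
    names=["foo", "bar"],
)
df = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

# In-memory counterpart of store.select('mi', "foo='bar'"):
# keep the rows whose "foo" level equals "bar"
selected = df[df.index.get_level_values("foo") == "bar"]
```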
Multi-table creation via append_to_multiple
and selection via select_as_multiple
can create/select from multiple tables and return a combined result, by using where
on a selector table.
In [19]: df_mt = pd.DataFrame(
....: np.random.randn(8, 6),
....: index=pd.date_range("1/1/2000", periods=8),
....: columns=["A", "B", "C", "D", "E", "F"],
....: )
....:
In [20]: df_mt["foo"] = "bar"
# you can also create the tables individually
In [21]: store.append_to_multiple(
....: {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
....: )
....:
In [22]: store
# individual tables were created
In [23]: store.select("df1_mt")
In [24]: store.select("df2_mt")
# as a multiple
In [25]: store.select_as_multiple(
....: ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
....: )
....:
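The idea behind the selector table can be illustrated entirely in memory. The sketch below is an analogue written for this note, not how HDFStore implements it: the frame is split into a selector part and a remainder, the where conditions are evaluated on the selector part only, and the matching rows are then pulled from both parts and recombined. The variable names are assumptions.

```python
import numpy as np
import pandas as pd

df_mt = pd.DataFrame(
    np.random.randn(8, 6),
    index=pd.date_range("1/1/2000", periods=8),
    columns=["A", "B", "C", "D", "E", "F"],
)
df_mt["foo"] = "bar"

# Split step: the selector part holds A and B, the other part holds
# every remaining column (the role played by the None entry above)
selector_cols = ["A", "B"]
df1 = df_mt[selector_cols]
df2 = df_mt.drop(columns=selector_cols)

# Select step: evaluate the conditions on the selector part only,
# then reuse the matching row labels for both parts
rows = df1[(df1["A"] > 0) & (df1["B"] > 0)].index
combined = pd.concat([df1.loc[rows], df2.loc[rows]], axis=1)
```

Keeping the queried columns in one narrow table means the filter only has to scan that table; the wider table is read just for the rows that matched.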
Enhancements
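HDFStore now can read native PyTables table format tables
You can pass nan_rep = 'my_nan_rep' to append, to change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan.
You can pass index to append. This defaults to True. This will automatically create indices on the indexables and data columns of the table
You can pass chunksize=an integer to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
You can pass expectedrows=an integer to the first append, to set the total number of rows that PyTables should expect. This will optimize read/write performance.
Select now supports passing start and stop to limit the selection space of a selection.
Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH2698)
Allow DataFrame.merge to handle combinatorial sizes too large for 64-bit integer (GH2690)
Series now has unary negation (-series) and inversion (~series) operators (GH2686)
DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH2327)
Series arithmetic operators can now handle constant and ndarray input (GH2574)
ExcelFile now takes a kind argument to specify the file type (GH2613)
A faster implementation for Series.str methods (GH2602)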
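A few of the non-HDFStore items in this list can be tried directly. The following is a small sketch run against current pandas; the sample values and the inline CSV text are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

s = pd.Series([1, -2, 3])
flags = pd.Series([True, False, True])

# Unary negation and inversion operators on Series
neg = -s
inverted = ~flags

# Series arithmetic with a constant and with an ndarray of matching length
plus_const = s + 10
times_arr = s * np.array([2, 2, 2])

# ISO8601 (yyyy-mm-dd) dates parsed by the file reader
csv_text = io.StringIO("date,value\n2013-01-22,1\n2013-01-23,2\n")
parsed = pd.read_csv(csv_text, parse_dates=["date"])
```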
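Bug fixes
HDFStore tables can now store float32 types correctly (cannot be mixed with float64 however)
Fixed Google Analytics prefix when specifying request segment (GH2713).
Function to reset Google Analytics token store so users can recover from improperly set up client secrets (GH2687).
Fixed groupby bug resulting in segfault when passing in MultiIndex (GH2706)
Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH2699)
Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH2694)
Fixed performance issues while aggregating boolean data (GH2692)
When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH2686)
Fixed MemoryError caused by performing counting sort on sorting MultiIndex levels with a very large number of combinatorial values (GH2684)
Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH2683)
Corrected business day subtraction logic when the offset is more than 5 bdays and the starting date is on a weekend (GH2680)
Fixed C file parser behavior when the file has more columns than data (GH2668)
Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
DataFrames with numerical or datetime indices are now sorted prior to plotting (GH2609)
Fixed DataFrame.from_records error when passed columns, index, but empty records (GH2633)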
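Two of the fixes above describe behavior that is easy to check by hand. The sketch below shows what current pandas does in those situations; the dates are chosen for illustration and it does not reproduce the original bug reports.

```python
import pandas as pd

# Business day subtraction starting on a weekend with an offset above 5 bdays:
# 2013-01-19 is a Saturday; stepping back six business days passes
# Fri 18, Thu 17, Wed 16, Tue 15, Mon 14 and lands on Fri 11.
start = pd.Timestamp("2013-01-19")
six_bdays_back = start - pd.offsets.BDay(6)

# Passing a Series that already holds datetime64 values to to_datetime
# should hand back the same timestamps
stamps = pd.Series(pd.date_range("2013-01-01", periods=3))
converted = pd.to_datetime(stamps)
```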
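See the full release notes or issue tracker on GitHub for a complete list.
Contributors#
A total of 17 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.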
Andy Hayden +
Anton I. Sipos +
Chang She
Christopher Whelan
Damien Garaud +
Dan Allan +
Dieter Vandenbussche
Garrett Drapala +
Jay Parlar +
Thouis (Ray) Jones +
Vincent Arel-Bundock +
Wes McKinney
elpres
herrfz +
jreback
svaksha +
y-p