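Version 0.9.1 (November 14, 2012)#
This is a bug fix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.
New features#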
Series.sort, DataFrame.sort, and DataFrame.sort_index can now be specified in a per-column manner to support multiple sort orders (GH928)

    In [2]: df = pd.DataFrame(np.random.randint(0, 2, (6, 3)),
       ...:                   columns=['A', 'B', 'C'])

    In [3]: df.sort(['A', 'B'], ascending=[1, 0])
    Out[3]:
       A  B  C
    3  0  1  1
    4  0  1  1
    2  0  0  1
    0  1  0  0
    1  1  0  0
    5  1  0  0
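The per-column sort order above can be sketched on a current pandas installation as follows. This is a minimal illustration, not the release's own example: it assumes a pandas version in which `DataFrame.sort` has been replaced by `sort_values`, and the sample values are made up so that the result is deterministic.

```python
import pandas as pd

# Illustrative data (not taken from the release notes).
df = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 1, 0, 1], "C": [5, 6, 7, 8]})

# Sort by A ascending, then by B descending within each value of A.
# Assumption: a pandas version where sort_values is the sorting method;
# the 0.9.1 spelling was df.sort(['A', 'B'], ascending=[1, 0]).
result = df.sort_values(["A", "B"], ascending=[True, False])
print(result)
```

Each entry of `ascending` applies to the column at the same position in the list of sort keys.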
DataFrame.rank now supports additional argument values for the na_option parameter so missing values can be assigned either the largest or the smallest rank (GH1508, GH2159)

    In [1]: df = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])

    In [2]: df.loc[2:4] = np.nan

    In [3]: df.rank()
    Out[3]:
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  2.0
    2  NaN  NaN  NaN
    3  NaN  NaN  NaN
    4  NaN  NaN  NaN
    5  2.0  1.0  3.0

    [6 rows x 3 columns]

    In [4]: df.rank(na_option='top')
    Out[4]:
         A    B    C
    0  6.0  5.0  4.0
    1  4.0  6.0  5.0
    2  2.0  2.0  2.0
    3  2.0  2.0  2.0
    4  2.0  2.0  2.0
    5  5.0  4.0  6.0

    [6 rows x 3 columns]

    In [5]: df.rank(na_option='bottom')
    Out[5]:
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  2.0
    2  5.0  5.0  5.0
    3  5.0  5.0  5.0
    4  5.0  5.0  5.0
    5  2.0  1.0  3.0

    [6 rows x 3 columns]

DataFrame has new
where and mask methods to select values according to a given boolean mask (GH2109, GH2151)

DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the []). The returned DataFrame has the same number of columns as the original, but is sliced on its index.

    In [6]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

    In [7]: df
    Out[7]:
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  0.113648 -1.478427  0.524988
    2  0.404705  0.577046 -1.715002
    3 -1.039268 -0.370647 -1.157892
    4 -1.344312  0.844885  1.075770

    [5 rows x 3 columns]

    In [8]: df[df['A'] > 0]
    Out[8]:
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  0.113648 -1.478427  0.524988
    2  0.404705  0.577046 -1.715002

    [3 rows x 3 columns]

If a DataFrame is sliced with a DataFrame-based boolean condition (with the same size as the original DataFrame), then a DataFrame of the same size (index and columns) as the original is returned, with the elements that do not meet the boolean condition set to NaN. This is accomplished via the new method DataFrame.where. In addition, where takes an optional other argument for replacement.

    In [9]: df[df > 0]
    Out[9]:
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770

    [5 rows x 3 columns]

    In [10]: df.where(df > 0)
    Out[10]:
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770

    [5 rows x 3 columns]

    In [11]: df.where(df > 0, -df)
    Out[11]:
              A         B         C
    0  0.276232  1.087401  0.673690
    1  0.113648  1.478427  0.524988
    2  0.404705  0.577046  1.715002
    3  1.039268  0.370647  1.157892
    4  1.344312  0.844885  1.075770

    [5 rows x 3 columns]

Furthermore, where now aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels).

    In [12]: df2 = df.copy()

    In [13]: df2[df2[1:4] > 0] = 3

    In [14]: df2
    Out[14]:
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  3.000000 -1.478427  3.000000
    2  3.000000  3.000000 -1.715002
    3 -1.039268 -0.370647 -1.157892
    4 -1.344312  0.844885  1.075770

    [5 rows x 3 columns]

DataFrame.mask is the inverse boolean operation of where.

    In [15]: df.mask(df <= 0)
    Out[15]:
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770

    [5 rows x 3 columns]

Enable referencing of Excel columns by their column names (GH1936)
    In [16]: xl = pd.ExcelFile('data/test.xls')

    In [17]: xl.parse('Sheet1', index_col=0, parse_dates=True,
       ....:          parse_cols='A:D')

Added option to disable pandas-style tick locators and formatters using
series.plot(x_compat=True) or pandas.plot_params['x_compat'] = True (GH2205)
Existing TimeSeries methods at_time and between_time were added to DataFrame (GH2149)
DataFrame.dot can now accept ndarrays (GH2042)
DataFrame.drop now supports non-unique indexes (GH2101)
Panel.shift now supports negative periods (GH2164)
DataFrame now supports the unary ~ operator (GH2110)
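Several of the features listed above can be tried together. The following is a minimal sketch, assuming a current pandas release (where all of these methods still exist under the same names); the data and the variable names are made up for illustration and do not come from the release notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# where keeps values where the condition holds and puts NaN elsewhere.
kept = df.where(df > 0)

# mask is the inverse: it hides values where the condition holds.
masked = df.mask(df <= 0)

# The unary ~ operator inverts a boolean DataFrame,
# so mask(cond) selects the same values as where(~cond).
via_invert = df.where(~(df <= 0))

# The optional `other` argument supplies replacement values.
absolute = df.where(df > 0, -df)

# rank with na_option: missing values get the smallest or the largest rank.
with_nan = pd.DataFrame({"A": [3.0, np.nan, 1.0]})
top = with_nan.rank(na_option="top")
bottom = with_nan.rank(na_option="bottom")

# at_time and between_time on a DataFrame with a DatetimeIndex.
ts = pd.DataFrame(
    {"price": [10, 11, 12, 13]},
    index=pd.to_datetime(
        [
            "2012-11-14 09:00",
            "2012-11-14 10:00",
            "2012-11-14 11:00",
            "2012-11-14 12:00",
        ]
    ),
)
morning = ts.between_time("09:30", "11:00")  # both endpoints are included
at_ten = ts.at_time("10:00")
```

Because `mask` and `where` take complementary conditions here, `kept`, `masked`, and `via_invert` hold the same values.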
API changes#
Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window

    In [1]: prng = pd.period_range('2012Q1', periods=2, freq='Q')

    In [2]: s = pd.Series(np.random.randn(len(prng)), prng)

    In [4]: s.resample('M')
    Out[4]:
    2012-01   -1.471992
    2012-02         NaN
    2012-03         NaN
    2012-04   -0.493593
    2012-05         NaN
    2012-06         NaN
    Freq: M, dtype: float64

Period.end_time now returns the last nanosecond in the time interval (GH2124, GH2125, GH1764)
    In [18]: p = pd.Period('2012')

    In [19]: p.end_time
    Out[19]: Timestamp('2012-12-31 23:59:59.999999999')

File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)
    In [20]: import io

    In [21]: data = ('A,B,C\n'
       ....:         '00001,001,5\n'
       ....:         '00002,002,6')
       ....:

    In [22]: pd.read_csv(io.StringIO(data), converters={'A': lambda x: x.strip()})
    Out[22]:
           A  B  C
    0  00001  1  5
    1  00002  2  6

    [2 rows x 3 columns]
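The last two API changes can be checked with a short script. This is a sketch assuming a current pandas release; the variable names `end`, `data`, and `parsed` are illustrative only.

```python
import io

import pandas as pd

# Period.end_time is the last nanosecond of the period.
end = pd.Period("2012").end_time

# Column A has a custom converter, so its values stay strings and keep
# their leading zeros; column B has none and is parsed as integers.
data = "A,B,C\n00001,001,5\n00002,002,6"
parsed = pd.read_csv(io.StringIO(data), converters={"A": lambda x: x.strip()})

print(end)
print(parsed)
```

Without the converter, column A would be inferred as an integer column in the same way as column B, and the leading zeros would be lost.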
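See the full release notes or issue tracker on GitHub for a complete list.
Contributors#
A total of 11 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.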
Brenda Moon +
Chang She
Jeff Reback +
Justin C Johnson +
K.-Michael Aye
Martin Blais
Tobias Brandt +
Wes McKinney
Wouter Overmeire
timmie
y-p