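Version 0.9.1 (November 14, 2012)

This is a bug fix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.

New features

  • Series.sort, DataFrame.sort, and DataFrame.sort_index can now be specified in a per-column manner to support multiple sort orders (GH928)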

    In [2]: df = pd.DataFrame(np.random.randint(0, 2, (6, 3)),
       ...:                   columns=['A', 'B', 'C'])
    
    In [3]: df.sort(['A', 'B'], ascending=[1, 0])
    
    Out[3]:
       A  B  C
    3  0  1  1
    4  0  1  1
    2  0  0  1
    0  1  0  0
    1  1  0  0
    5  1  0  0
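
DataFrame.sort has since been removed from pandas, so the session above does not run on current releases. The sketch below assumes a recent pandas version and uses DataFrame.sort_values, which is not mentioned in these notes, to express the same per-column ordering. The data is hypothetical and fixed so the result can be checked by eye.

```python
import pandas as pd

# Hypothetical, deterministic data (the original example uses random integers).
df = pd.DataFrame({"A": [1, 0, 0, 1],
                   "B": [0, 1, 0, 1],
                   "C": [10, 20, 30, 40]})

# Sort by A ascending, then by B descending within each value of A.
result = df.sort_values(["A", "B"], ascending=[True, False])
print(result)
```

Each entry in the ascending list applies to the column at the same position in the list of sort keys.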
    
  • DataFrame.rank now supports additional argument values for the na_option parameter, so missing values can be assigned either the largest or the smallest rank (GH1508, GH2159)

    In [1]: df = pd.DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])
    
    In [2]: df.loc[2:4] = np.nan
    
    In [3]: df.rank()
    Out[3]: 
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  2.0
    2  NaN  NaN  NaN
    3  NaN  NaN  NaN
    4  NaN  NaN  NaN
    5  2.0  1.0  3.0
    
    [6 rows x 3 columns]
    
    In [4]: df.rank(na_option='top')
    Out[4]: 
         A    B    C
    0  6.0  5.0  4.0
    1  4.0  6.0  5.0
    2  2.0  2.0  2.0
    3  2.0  2.0  2.0
    4  2.0  2.0  2.0
    5  5.0  4.0  6.0
    
    [6 rows x 3 columns]
    
    In [5]: df.rank(na_option='bottom')
    Out[5]: 
         A    B    C
    0  3.0  2.0  1.0
    1  1.0  3.0  2.0
    2  5.0  5.0  5.0
    3  5.0  5.0  5.0
    4  5.0  5.0  5.0
    5  2.0  1.0  3.0
    
    [6 rows x 3 columns]
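
As a smaller, deterministic companion to the session above, the following sketch ranks a single column containing one missing value under each na_option setting. The column name and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with one missing value.
df = pd.DataFrame({"A": [3.0, np.nan, 1.0, 2.0]})

default = df.rank()                    # the missing value keeps a NaN rank
top = df.rank(na_option="top")         # the missing value gets the smallest rank
bottom = df.rank(na_option="bottom")   # the missing value gets the largest rank

print(top["A"].tolist(), bottom["A"].tolist())
```

With several missing values in one column, they tie with each other and share an averaged rank, which is why the three NaN rows above all receive 2.0 or 5.0.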
    
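  • DataFrame has new where and mask methods to select values according to a given boolean mask (GH2109, GH2151)

    DataFrame currently supports slicing via a boolean vector of the same length as the DataFrame (inside the []). The returned DataFrame has the same number of columns as the original, but is sliced on its index.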

    In [6]: df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
    
    In [7]: df
    Out[7]: 
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  0.113648 -1.478427  0.524988
    2  0.404705  0.577046 -1.715002
    3 -1.039268 -0.370647 -1.157892
    4 -1.344312  0.844885  1.075770
    
    [5 rows x 3 columns]
    
    In [8]: df[df['A'] > 0]
    Out[8]: 
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  0.113648 -1.478427  0.524988
    2  0.404705  0.577046 -1.715002
    
    [3 rows x 3 columns]
    

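    If a DataFrame is sliced with a DataFrame-based boolean condition (with the same size as the original DataFrame), then a DataFrame of the same size (index and columns) as the original is returned, with elements that do not meet the boolean condition set to NaN. This is accomplished via the new method DataFrame.where. In addition, where takes an optional other argument that supplies the replacement values.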

    In [9]: df[df > 0]
    Out[9]: 
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770
    
    [5 rows x 3 columns]
    
    In [10]: df.where(df > 0)
    Out[10]: 
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770
    
    [5 rows x 3 columns]
    
    In [11]: df.where(df > 0, -df)
    Out[11]: 
              A         B         C
    0  0.276232  1.087401  0.673690
    1  0.113648  1.478427  0.524988
    2  0.404705  0.577046  1.715002
    3  1.039268  0.370647  1.157892
    4  1.344312  0.844885  1.075770
    
    [5 rows x 3 columns]
    

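    Furthermore, where now aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels).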

    In [12]: df2 = df.copy()
    
    In [13]: df2[df2[1:4] > 0] = 3
    
    In [14]: df2
    Out[14]: 
              A         B         C
    0  0.276232 -1.087401 -0.673690
    1  3.000000 -1.478427  3.000000
    2  3.000000  3.000000 -1.715002
    3 -1.039268 -0.370647 -1.157892
    4 -1.344312  0.844885  1.075770
    
    [5 rows x 3 columns]
    

    DataFrame.mask is the inverse boolean operation of where.

    In [15]: df.mask(df <= 0)
    Out[15]: 
              A         B         C
    0  0.276232       NaN       NaN
    1  0.113648       NaN  0.524988
    2  0.404705  0.577046       NaN
    3       NaN       NaN       NaN
    4       NaN  0.844885  1.075770
    
    [5 rows x 3 columns]
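
The relationship between where, its other argument, and mask can be checked on a small fixed frame. This is a minimal sketch with hypothetical values, not part of the original notes.

```python
import pandas as pd

# Hypothetical 2x2 frame with mixed signs.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

positive = df.where(df > 0)        # entries failing the condition become NaN
absolute = df.where(df > 0, -df)   # entries failing the condition are taken from `other`
masked = df.mask(df <= 0)          # mask hides entries where the condition is True

# Setting through a boolean frame changes only the selected entries.
df2 = df.copy()
df2[df2 > 0] = 0.0

print(positive.equals(masked))
```

Because mask applies the negated condition, df.mask(df <= 0) and df.where(df > 0) produce the same frame here.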
    
  • Enable referencing of Excel columns by their column names (GH1936)

    In [16]: xl = pd.ExcelFile('data/test.xls')
    
    In [17]: xl.parse('Sheet1', index_col=0, parse_dates=True,
       ....:          parse_cols='A:D')
       ....: 
    
  • Added option to disable pandas-style tick locators and formatters using series.plot(x_compat=True) or pandas.plot_params['x_compat'] = True (GH2205)

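  • The existing time series methods at_time and between_time have been added to DataFrame (GH2149)

  • DataFrame.dot can now accept ndarrays (GH2042)

  • DataFrame.drop now supports non-unique indexes (GH2101)

  • Panel.shift now supports negative periods (GH2164)

  • DataFrame now supports the unary ~ operator (GH2110)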
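
Several of the shorter items in this list can be illustrated together. The sketch below uses hypothetical data and assumes a recent pandas version; it covers at_time and between_time on a DataFrame, DataFrame.dot with an ndarray, DataFrame.drop on a non-unique index, and the unary ~ operator. Panel is omitted because it is no longer available in current pandas.

```python
import numpy as np
import pandas as pd

# at_time / between_time on a DataFrame with an intraday DatetimeIndex.
idx = pd.date_range("2012-11-14 09:00", periods=4, freq="30min")
ts = pd.DataFrame({"x": [1, 2, 3, 4]}, index=idx)
at_ten = ts.at_time("10:00")                  # rows stamped exactly 10:00
window = ts.between_time("09:30", "10:00")    # both endpoints included by default

# DataFrame.dot with a plain ndarray (identity matrix leaves values unchanged).
frame = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
product = frame.dot(np.eye(2, dtype=int))

# DataFrame.drop when the index contains duplicate labels.
dup = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "a", "b"])
dropped = dup.drop("a")                       # removes every row labelled "a"

# Unary ~ inverts a boolean DataFrame element-wise.
flags = pd.DataFrame({"p": [True, False]})
inverted = ~flags

print(at_ten, window, product, dropped, inverted, sep="\n\n")
```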

API changes

  • Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window

    In [1]: prng = pd.period_range('2012Q1', periods=2, freq='Q')
    
    In [2]: s = pd.Series(np.random.randn(len(prng)), prng)
    
    In [4]: s.resample('M')
    Out[4]:
    2012-01   -1.471992
    2012-02         NaN
    2012-03         NaN
    2012-04   -0.493593
    2012-05         NaN
    2012-06         NaN
    Freq: M, dtype: float64
    
  • Period.end_time now returns the last nanosecond in the time interval (GH2124, GH2125, GH1764)

    In [18]: p = pd.Period('2012')
    
    In [19]: p.end_time
    Out[19]: Timestamp('2012-12-31 23:59:59.999999999')
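
One way to read "last nanosecond in the interval" is that end_time sits exactly one nanosecond before the start of the following period. A minimal sketch of that check, assuming a recent pandas version:

```python
import pandas as pd

p = pd.Period("2012")          # annual period covering calendar year 2012
end = p.end_time

# The next period starts one nanosecond after this period ends.
next_start = (p + 1).start_time
gap = next_start - end

print(end, gap)
```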
    
  • File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)

    In [20]: import io
    
    In [21]: data = ('A,B,C\n'
       ....:         '00001,001,5\n'
       ....:         '00002,002,6')
       ....: 
    
    In [22]: pd.read_csv(io.StringIO(data), converters={'A': lambda x: x.strip()})
    Out[22]: 
           A  B  C
    0  00001  1  5
    1  00002  2  6
    
    [2 rows x 3 columns]
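
To make the effect of the converter visible, the following sketch parses the same text twice, once without and once with a converter on column A. The variable names are hypothetical.

```python
import io
import pandas as pd

data = "A,B,C\n00001,001,5\n00002,002,6"

# Without a converter, column A is inferred as integers and loses its leading zeros.
plain = pd.read_csv(io.StringIO(data))

# With a converter, the strings it returns are kept as-is for column A.
kept = pd.read_csv(io.StringIO(data), converters={"A": lambda x: x.strip()})

print(plain["A"].tolist(), kept["A"].tolist())
```

Columns without a converter, such as B, still go through normal type inference in both calls.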
    

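See the full release notes or the issue tracker on GitHub for a complete list.

Contributors

A total of 11 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.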

  • Brenda Moon +

  • Chang She

  • Jeff Reback +

  • Justin C Johnson +

  • K.-Michael Aye

  • Martin Blais

  • Tobias Brandt +

  • Wes McKinney

  • Wouter Overmeire

  • timmie

  • y-p