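What's new in 0.25.0 (July 18, 2019)#

Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray

Warning

read_pickle() and read_msgpack() are only guaranteed to be backwards compatible back to pandas version 0.20.3 (GH27082)

These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

GroupBy aggregation with relabeling#

pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).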

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

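Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.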

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', np.mean),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

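Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply.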

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

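This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

GroupBy aggregation with multiple lambdas#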
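Because the output names are passed as keyword arguments, a name that is not a valid Python identifier cannot be written directly. A minimal sketch of one way around this, assuming pandas 0.25 or later: build a dict of output name to (column, aggfunc) tuples and unpack it. The output names "total weight" and "max height" are chosen here for illustration only.

```python
import pandas as pd

animals = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})

# Output names containing spaces cannot be written as keyword arguments,
# so build a dict of name -> (column, aggfunc) and unpack it with **.
named = {
    "total weight": ("weight", "sum"),
    "max height": ("height", "max"),
}
result = animals.groupby("kind").agg(**named)
print(result)
```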

You can now provide multiple lambda functions to a list-like aggregation in pandas.core.groupby.GroupBy.agg (GH26430).

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

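Previously, these raised a SpecificationError.

Better repr for MultiIndex#

Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex. (GH13480)

The repr now looks like this: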

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

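Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
   ...:            codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it is wider than options.display.width (default: 80 characters).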

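Shorter truncated repr for Series and DataFrame#

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).

This dual option still allows you to see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.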
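The interplay of the two thresholds can be checked by counting the lines of a repr. This is a small sketch, assuming pandas 0.25 or later; the helper name repr_line_count is made up for this illustration.

```python
import pandas as pd

s = pd.Series(range(100))  # longer than the default max_rows of 60


def repr_line_count(obj, min_rows):
    # Temporarily set both thresholds, then count the lines of the repr.
    with pd.option_context("display.max_rows", 60, "display.min_rows", min_rows):
        return len(repr(obj).splitlines())


short_count = repr_line_count(s, 10)    # truncated view driven by min_rows
long_count = repr_line_count(s, None)   # single-threshold behaviour
print(short_count, long_count)
```

With min_rows set to None, the truncated repr falls back to showing max_rows rows, so the second count is noticeably larger than the first.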

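JSON normalize with max_level param support#

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over the level at which normalization ends (GH23843):

For example: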

from pandas.io.json import json_normalize
data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
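To make the effect of max_level concrete, the following sketch compares the column names produced at different depths for the same record. The try/except import is an assumption made so the snippet runs on both pandas 0.25 (where the function lives in pandas.io.json) and later releases that expose it at the top level.

```python
try:
    from pandas import json_normalize          # top-level location in later releases
except ImportError:
    from pandas.io.json import json_normalize  # location in pandas 0.25

data = [{
    "CreatedBy": {"Name": "User001"},
    "Lookup": {"TextField": "Some text",
               "UserField": {"Id": "ID001", "Name": "Name001"}},
    "Image": {"a": "b"},
}]

flat_all = json_normalize(data)                 # flatten every nested level
flat_one = json_normalize(data, max_level=1)    # stop after one level of nesting
flat_none = json_normalize(data, max_level=0)   # keep top-level dicts intact

print(list(flat_all.columns))
print(list(flat_one.columns))
print(list(flat_none.columns))
```

With max_level=1, the inner Lookup.UserField dict is kept as a single cell rather than being split into Lookup.UserField.Id and Lookup.UserField.Name.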

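Series.explode to split list-like values to rows#

Series and DataFrame have gained the DataFrame.explode() methods to transform list-likes to individual rows. See section on Exploding list-like column in the docs for more information (GH16538, GH10511)

Here is a typical use case. You have comma-separated strings in a column.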

In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])
   ...: 

In [10]: df
Out[10]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations.

In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]
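The same method exists on Series. A short sketch, assuming pandas 0.25 or later, showing two details that are easy to miss: the original index label is repeated for every element, and an empty list becomes a single missing value.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [], [4]])

# Each list element gets its own row; the index label of the source row repeats.
exploded = s.explode()
print(exploded)

# Drop the repeated labels if a fresh 0..n-1 index is preferred.
renumbered = exploded.reset_index(drop=True)
```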

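Other enhancements#

DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling. (GH24867)
Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH16607)
Indexing of DataFrame and Series now accepts zerodim np.ndarray (GH24919)
Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH25017)
DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH24043)
DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data. (GH24923)
Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a string (GH25405)
DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH22484, GH24984)
DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH24994)
RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH24471)
TimedeltaIndex.intersection() now also supports the sort keyword (GH24471)
DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH13473)
Added Sparse accessor for working with a DataFrame whose values are sparse (GH25681)
RangeIndex has gained start, stop, and step attributes (GH25710)
datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH25065)
DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH6508)
merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH26136)
pandas.core.window.Rolling() supports exponential (or Poisson) window type (GH21303)
Error message for missing required imports now includes the original import error's text (GH23868)
DatetimeIndex and TimedeltaIndex now have a mean method (GH24757)
DataFrame.describe() now formats integer percentiles without decimal point (GH26660)
Added support for reading SPSS .sav files using read_spss() (GH26537)
Added new option plotting.backend to be able to select a plotting backend different than the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where <backend-module> is a library implementing the pandas plotting API (GH14130)
pandas.offsets.BusinessHour supports multiple opening hours intervals (GH15481)
read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH11499)
pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH9070)
Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH27219)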
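A few of the smaller enhancements listed above can be tried together. The sketch below assumes pandas 0.25 or later; the frame and its column names ("col a", "name") are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"col a": [1, 2, 3], "name": ["Straße", "STRASSE", "road"]})

# Backticks let query() refer to a column whose name contains a space.
selected = df.query("`col a` > 1")

# errors="raise" turns a silently ignored rename of a missing key into a KeyError.
try:
    df.rename(columns={"missing": "x"}, errors="raise")
    rename_failed = False
except KeyError:
    rename_failed = True

# casefold() is a more aggressive lower(): for example "ß" folds to "ss".
folded = df["name"].str.casefold()
print(selected, rename_failed, list(folded), sep="\n")
```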

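Backwards incompatible API changes#

Indexing with date strings with UTC offsets#

Indexing a DataFrame or Series with a DatetimeIndex using a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)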

In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [13]: df
Out[13]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
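The two slices above differ only in their wall-clock time, so it may help to see the instants involved. The sketch below uses fixed-offset Timestamp objects (an assumption made to keep it independent of the time zone database) to show why the +04:00 noon query now matches the US/Pacific midnight row, while the old query string denotes an earlier instant.

```python
import pandas as pd

index_label = pd.Timestamp("2019-01-01 00:00:00-08:00")  # the single row in df
new_query = pd.Timestamp("2019-01-01 12:00:00+04:00")    # slice start in the new example
old_query = pd.Timestamp("2019-01-01 00:00:00+04:00")    # slice start in the old example

# Converting everything to UTC makes the comparison explicit.
for ts in (index_label, new_query, old_query):
    print(ts.tz_convert("UTC"))

same_instant = new_query == index_label   # both are 08:00 UTC on 2019-01-01
old_is_earlier = old_query < index_label  # 20:00 UTC on 2018-12-31
```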

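`MultiIndex` constructed from levels and codes#

Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed, and the codes corresponding to NaN levels are reassigned as -1. (GH19387)

Previous behavior: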

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New behavior:

In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[15]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [16], in <cell line: 1>()
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/indexes/multi.py:342, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    339     result.sortorder = sortorder
    341 if verify_integrity:
--> 342     new_codes = result._verify_integrity()
    343     result._codes = new_codes
    345 result._reset_identity()

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/indexes/multi.py:415, in MultiIndex._verify_integrity(self, codes, levels)
    409     raise ValueError(
    410         f"On level {i}, code max ({level_codes.max()}) >= length of "
    411         f"level ({len(level)}). NOTE: this index is in an "
    412         "inconsistent state"
    413     )
    414 if len(level_codes) and level_codes.min() < -1:
--> 415     raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
    416 if not level.is_unique:
    417     raise ValueError(
    418         f"Level values must be unique: {list(level)} on level {i}"
    419     )

ValueError: On level 0, code value (-2) < -1
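In practice this means -1 is the only code that denotes a missing value. A brief sketch, assuming pandas 0.25 or later, contrasting an accepted -1 code with a rejected -2 code:

```python
import pandas as pd

# A code of -1 is the sentinel for a missing value and is accepted.
mi = pd.MultiIndex(levels=[[1, 2]], codes=[[0, -1]])
level_values = mi.get_level_values(0)
print(level_values)

# Any code below -1 is rejected at construction time.
try:
    pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```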

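`GroupBy.apply` on `DataFrame` evaluates first group only once#

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)

Now every group is evaluated only a single time.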

In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [18]: df
Out[18]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....: 

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [3]: df.groupby('a').apply(func)
x
y
Out[3]:
   a  b
0  x  1
1  y  2
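The single evaluation matters most when the applied function has side effects. The following sketch, assuming pandas 0.25 or later, records every call in a list; the function name total_b and the list calls are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})

calls = []  # side effect: remember which groups the function was called on


def total_b(group):
    calls.append(group.name)
    return group["b"].sum()


result = df.groupby("a").apply(total_b)
print(calls)   # one entry per group
print(result)
```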

Concatenating sparse values#

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).

In [20]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [21]: type(pd.concat([df, df]))
Out[21]: pandas.core.frame.DataFrame

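This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.

The `.str`-accessor performs stricter type checks#

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.

Previous behavior: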

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool

New behavior:

In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [23]: s
Out[23]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [24]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 s.str.startswith(b'a')

File /usr/local/lib/python3.10/dist-packages/pandas-1.5.0.dev0+697.gf9762d8f52-py3.10-linux-x86_64.egg/pandas/core/strings/accessor.py:124, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    119 if self._inferred_dtype not in allowed_types:
    120     msg = (
    121         f"Cannot use .str.{func_name} with values of "
    122         f"inferred dtype '{self._inferred_dtype}'."
    123     )
--> 124     raise TypeError(msg)
    125 return func(self, *args, **kwargs)

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
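Since Series.str.decode() remains allowed on bytes data, one way to keep using the string methods is to decode first. This is a sketch of that workaround, assuming the bytes are UTF-8 encoded:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(["a", "ba", "cba"], "S"), dtype=object)

# len() is one of the methods that still accepts bytes-only data.
lengths = s.str.len()

# Decode to str once, then the full set of string methods is available again.
decoded = s.str.decode("utf-8")
starts_with_a = decoded.str.startswith("a")
print(starts_with_a)
```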

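Categorical dtypes are preserved during GroupBy#

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas will now preserve these dtypes. (GH18502)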

In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [27]: df
Out[27]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [28]: df.dtypes
Out[28]: 
payload       int64
col        category
Length: 2, dtype: object

Previous behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New behavior:

In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)

Incompatible Index type unions#

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).

Previous behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')

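Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.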
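The difference between "compatible" and incompatible dtypes can be seen with plain Index objects. A minimal sketch, assuming pandas 0.25 or later; the variable names are illustrative, and the mixed result is compared as a set because its ordering is not the point here.

```python
import pandas as pd

ints = pd.Index([1, 2])
floats = pd.Index([1.5])
strings = pd.Index(["a"])

# Integer and float indexes are compatible: the result is a float index.
compatible = ints.union(floats)

# Integers and strings are not: the result falls back to object dtype,
# and the same elements come out whichever side the call starts from.
mixed = ints.union(strings)
mixed_reversed = strings.union(ints)

print(compatible.dtype, mixed.dtype)
```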

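`DataFrame` GroupBy ffill/bfill no longer return group labels#

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH21521)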

In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [31]: df
Out[31]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [32]: df.groupby("a").ffill()
Out[32]: 
   b
0  1
1  2

[2 rows x 1 columns]
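Code that relied on the group labels being present can add them back explicitly. The sketch below, assuming pandas 0.25 or later, uses data with actual gaps so the fill is visible, then joins the key column back on the shared row index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y", "y"],
                   "b": [1.0, np.nan, np.nan, 4.0]})

# Only the filled value columns come back; filling never crosses group borders.
filled = df.groupby("a").ffill()

# Re-attach the group key if downstream code expects it.
with_key = df[["a"]].join(filled)
print(with_key)
```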

`DataFrame` describe on an empty Categorical / object column will return top and freq#

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397)

In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [34]: df
Out[34]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0

New behavior:

In [35]: df.describe()
Out[35]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

`__str__` methods now call `__repr__` rather than vice versa#

Until now, pandas mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH26495).
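The standard Python delegation that pandas now follows can be seen without pandas at all. In this sketch the class names OnlyRepr and Both are invented for illustration:

```python
class OnlyRepr:
    """Defines __repr__ only; str() falls back to it."""

    def __repr__(self):
        return "OnlyRepr()"


class Both(OnlyRepr):
    """Adds a specific __str__, which str() then prefers."""

    def __str__(self):
        return "custom str"


print(str(OnlyRepr()), repr(OnlyRepr()))
print(str(Both()), repr(Both()))
```

A subclass of a pandas object that overrides only __str__ therefore changes what print() shows but not the repr shown at an interactive prompt.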

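Indexing an `IntervalIndex` with `Interval` objects#

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).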

In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')

The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False

In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False

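The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior: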

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
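A small sketch of the exact-match rule for get_indexer(), assuming pandas 0.25 or later. It uses a separate, non-overlapping index built with from_breaks (an assumption made for this example, since get_indexer() requires non-overlapping intervals).

```python
import pandas as pd

# Three non-overlapping intervals: (0, 1], (1, 2], (2, 3]
ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3])

# (1, 2] matches exactly; (0, 2] merely overlaps and is reported as -1.
positions = ii.get_indexer([pd.Interval(1, 2), pd.Interval(0, 2)])

# Scalar points keep their containment semantics.
point_position = ii.get_loc(1.5)

print(positions, point_position)
```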

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [40]: s = pd.Series(list('abc'), index=ii)

In [41]: s
Out[41]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'

In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior:

In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [45]: idxr
Out[45]: array([ True,  True, False])

In [46]: s[idxr]
Out[46]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [47]: s.loc[idxr]
Out[47]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align#

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).

In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [50]: s1
Out[50]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [51]: s2
Out[51]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior

In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64

New behavior

In [52]: np.power(s1, s2)
Out[52]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.

In [53]: np.power(s1, s2.array)
Out[53]: 
a      1
b     16
c    243
Length: 3, dtype: int64
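The same alignment rule can be checked against an ordinary operator. This sketch, assuming pandas 0.25 or later, uses numpy.add so the result can be compared directly with the + operator:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([3, 4, 5], index=["d", "c", "b"])

# The ufunc now aligns on labels, exactly like the + operator does.
aligned = np.add(s1, s2)
same_as_operator = aligned.equals(s1 + s2)

# Passing a plain array keeps the old position-by-position behaviour.
positional = np.add(s1, s2.to_numpy())

print(aligned)
print(positional)
```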

Categorical.argsort now places missing values at the end#

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).

In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]

New behavior

In [55]: cat.argsort()
Out[55]: array([2, 0, 1])

In [56]: cat[cat.argsort()]
Out[56]: 
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']
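The NumPy convention referred to above is that NaN sorts last. A short sketch, assuming pandas 0.25 or later, comparing the two; the float array is an illustrative stand-in with the same ordering as the categorical ('b' as 2.0, missing as NaN, 'a' as 1.0).

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", None, "a"], categories=["a", "b"], ordered=True)
cat_order = cat.argsort()

# Stand-in float array with the same relative order and a NaN in the same slot.
numpy_order = np.argsort(np.array([2.0, np.nan, 1.0]))

print(cat_order, numpy_order)
```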

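Column order is preserved when passing a list of dicts to DataFrame#

Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6 (GH27309).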

In [57]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....: 

Previous behavior:

The columns were lexicographically sorted previously,

In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

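New behavior:

The column order now matches the insertion-order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.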

In [58]: pd.DataFrame(data)
Out[58]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]
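Code that depended on the old alphabetical layout can request it explicitly. A minimal sketch, assuming pandas 0.25 or later on Python 3.6+:

```python
import pandas as pd

data = [
    {"name": "Joe", "state": "NY", "age": 18},
    {"name": "Jane", "state": "KY", "age": 19, "hobby": "Minecraft"},
    {"name": "Jean", "state": "OK", "age": 20, "finances": "good"},
]

df = pd.DataFrame(data)            # insertion order of the keys
sorted_df = df.sort_index(axis=1)  # lexicographic order, as in older versions

print(list(df.columns))
print(list(sorted_df.columns))
```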

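Increased minimum versions for dependencies#

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:

Package	Minimum Version	Required
numpy	1.13.3	X
pytz	2015.4	X
python-dateutil	2.6.1	X
bottleneck	1.2.1
numexpr	2.6.2
pytest (dev)	4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package	Minimum Version
beautifulsoup4	4.6.0
fastparquet	0.2.1
gcsfs	0.2.2
lxml	3.8.0
matplotlib	2.2.2
openpyxl	2.4.8
pyarrow	0.9.0
pymysql	0.7.1
pytables	3.4.2
scipy	0.19.0
sqlalchemy	1.1.4
xarray	0.8.2
xlrd	1.1.0
xlsxwriter	0.9.8
xlwt	1.2.0

See Dependencies and Optional dependencies for more.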

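Other API changes#

DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH24713)
Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH24653)
Timestamp.strptime() will now raise a NotImplementedError (GH25016)
Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH24011)
Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH25575)
The arg argument in pandas.core.groupby.DataFrameGroupBy.agg() has been renamed to func (GH26089)
The arg argument in pandas.core.window._Window.aggregate() has been renamed to func (GH26372)
Most pandas classes had a __bytes__ method, which was used for getting a Python 2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (GH26447)
The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH23679)
Removed support of gtk package for clipboards (GH26563)
Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH27063)
Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone aware data. (GH27008, GH7056)
ExtensionArray.argsort() places NA values at the end of the sorted array. (GH21801)
DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format. (GH7775)
Passing duplicate names in read_csv() will now raise a ValueError (GH17346)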
Passing duplicate names in read_csv() will now raise a ValueError (GH17346)

Deprecations#

Sparse subclasses#

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.

Previous way

df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes

New way

In [59]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})

In [60]: df.dtypes
Out[60]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical. See Migrating for more (GH19239).
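Functionality that used to live on the sparse subclasses is reachable through the .sparse accessor on a regular DataFrame. A brief sketch, assuming pandas 0.25 or later; it spells the array class as pd.arrays.SparseArray, the namespaced location of the same class used in the example above.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

# Fraction of values that are actually stored (not equal to the fill value 0).
density = df.sparse.density

# Convert back to an ordinary dense DataFrame when needed.
dense = df.sparse.to_dense()

print(df.dtypes, density, dense.dtypes, sep="\n")
```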

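msgpack format#

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH27084)

Other deprecations#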

The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH26438).
Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH16344)
pandas.concat() has deprecated the join_axes-keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH21951)
The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH26421).
The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64(). (GH24416)
The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH26405).
The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH26581).
The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH26705).
The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH19617).
The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH27186)
Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH21344)
read_table() has been undeprecated. (GH25220)
Index.dtype_str is deprecated. (GH18262)
Series.imag and Series.real are deprecated. (GH18262)
Series.put() is deprecated. (GH18262)
Index.item() and Series.item() are deprecated. (GH18262)
The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved. (GH26336)
Index.contains() is deprecated. Use key in index (__contains__) instead (GH17753).
DataFrame.get_dtype_counts() is deprecated. (GH18262)
Categorical.ravel() will return a Categorical instead of a np.ndarray (GH27199)
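Several of the replacements named in this list are one-liners. The sketch below, assuming pandas 0.25 or later, collects a few of them side by side; the deprecated spellings appear only in comments.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
idx = pd.Index(["a", "b"])
rng = pd.RangeIndex(0, 10, 2)

values = s.to_numpy()        # instead of s.get_values()
as_array = np.asarray(s)     # equivalent alternative to get_values()
has_a = "a" in idx           # instead of idx.contains("a")
bounds = (rng.start, rng.stop, rng.step)  # instead of rng._start, _stop, _step

print(values, as_array, has_a, bounds)
```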

Removal of prior version deprecations/changes#

Removed Panel (GH25047, GH25191, GH25231)
Removed the previously deprecated sheetname keyword in read_excel() (GH16442, GH20938)
Removed the previously deprecated TimeGrouper (GH16942)
Removed the previously deprecated parse_cols keyword in read_excel() (GH16488)
Removed the previously deprecated pd.options.html.border (GH16970)
Removed the previously deprecated convert_objects (GH11221)
Removed the previously deprecated select method of DataFrame and Series (GH17633)
Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH17982)
Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH17842)
Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH17842)
Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH17877, GH17820)
Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH17812)
Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH17744)
Removed the previously deprecated ordered and categories keyword arguments in astype (GH17742)
Removed the previously deprecated cdate_range (GH17691)
Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH17493)
Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH17352)
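Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH21152)

Performance improvements#

Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)
Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
Improved performance of slicing and other selected operation on a RangeIndex (GH26565, GH26617, GH26722)
RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH16685)
Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)
Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH24813)
Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)
Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for Series.all() and Series.any() (GH25070)
Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
Improved performance of IntervalIndex.intersection() (GH24813)
Improved performance of read_csv() by faster concatenating date columns without extra conversion to string for integer/float zero and float NaN; by faster checking the string for the possibility of being a date (GH25754)
Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH24813)
Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
Improved performance when building MultiIndex with at least one CategoricalIndex level (GH22044)
Improved performance by removing the need for a garbage collect when checking for SettingWithCopyWarning (GH27031)
For to_datetime() changed default value of cache parameter to True (GH26043)
Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH27136).
Improved performance of pd.read_json() for index-oriented data. (GH26773)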
Improved performance of MultiIndex.shape() (GH27384).

Bug fixes#

Categorical#

Bug in DataFrame.at() and Series.at() that would raise exception if the index was a CategoricalIndex (GH20629)
Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which sometimes incorrectly resulted in True (GH26504)
Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH25087)

Datetimelike#

Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)
Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)
Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)
Bug in to_datetime() which does not replace the invalid argument with NaT when errors is set to coerce (GH26122)
Bug in adding DateOffset with nonzero month to DatetimeIndex would raise ValueError (GH26258)
Bug in to_datetime() which raises unhandled OverflowError when called with mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH25512)
Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where the levels parameter was ignored. (GH26675)
Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)
Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH26206).
Bug in date_range() with unnecessary OverflowError being raised for very large or very small dates (GH26651)
Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH24775)
Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH26916)
Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offsets (GH26097)

Timedelta#

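Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH25913)
Bug with comparisons between Timedelta and NaT raising TypeError (GH26039)
Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH26381)
Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH26689)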
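
The Timedelta versus NaT comparison mentioned above can be sketched as follows. This is a small example assuming a recent pandas release, where such comparisons return booleans rather than raising; the names are illustrative.

```python
import pandas as pd

td = pd.Timedelta("1 day")

# Comparisons against NaT follow NaN-like semantics instead of raising TypeError.
equal = td == pd.NaT
not_equal = td != pd.NaT
less = td < pd.NaT

print(equal, not_equal, less)
```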

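Timezones#

Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive data (GH25809)
Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH24992)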
Bug in Timestamp.tz_localize() and Timestamp.tz_convert() does not propagate freq (GH25241)
Bug in Series.at() where setting Timestamp with timezone raises TypeError (GH25506)
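Bug in DataFrame.update() where updating with timezone aware data would return timezone naive data (GH25807)
Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH25978)
Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH26168)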
Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would result in a column of NaN (GH26335)
Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH27088)
Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware DatetimeIndex (GH21671)
Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series (GH15552)
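
To make the DatetimeIndex.to_frame() item above concrete, the sketch below checks that the timezone survives the conversion. It assumes a recent pandas release; the column name "ts" is an arbitrary choice for the example.

```python
import pandas as pd

# A timezone aware index converted to a frame should stay timezone aware.
idx = pd.date_range("2019-07-18", periods=2, tz="UTC")
frame = idx.to_frame(name="ts")

print(frame.dtypes)
```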

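Numeric#

Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)
Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH25729)
Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)
Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require a numeric index. (GH21662)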
Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
Fixed bug where casting an all-boolean array to an integer extension array failed (GH25211)
Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH26987)
Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH27321)
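
The divmod and floor-division items above can be illustrated with a short sketch. It assumes a recent pandas release, in which Series.divmod() returns a pair of Series and positive values floor-divided by zero give Inf; the names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# divmod returns a (quotient, remainder) pair of Series.
quotient, remainder = s.divmod(2)

# Floor-dividing positive values by zero fills with Inf rather than NaN.
floor_by_zero = pd.Series([1, 2]) // 0

print(quotient.tolist(), remainder.tolist())
print(floor_by_zero.tolist())
```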

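Conversion#

Bug in DataFrame.astype() where the errors parameter was ignored when passing a dict of columns and types (GH25905)

Strings#

Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH23551)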
Improved error message when passing Series of wrong dtype to Series.str.cat() (GH22722)

Interval#

Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH23013)
Fixed bug in Series/DataFrame not displaying NaN in IntervalIndex with missing values (GH25984)
Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH25860)
Bug in Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH27172)
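
A brief sketch of two Interval items above: mixed-closed intervals producing an object-dtype Index, and non-numeric endpoints being rejected. It assumes a recent pandas release; the variable names are illustrative.

```python
import pandas as pd

# Intervals closed on different sides cannot share an IntervalIndex,
# so the Index constructor falls back to object dtype.
mixed = pd.Index(
    [pd.Interval(0, 1, closed="left"), pd.Interval(1, 2, closed="right")]
)

# String endpoints are not allowed when constructing an Interval.
try:
    pd.Interval("a", "b")
    rejected = False
except ValueError:
    rejected = True

print(mixed.dtype, rejected)
```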

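Indexing#

Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH25753).
Improved exception message when calling .iloc or .loc with a boolean indexer with a different length (GH26658).
Bug in the KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH27250).
Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH26658).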
Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH14885).
Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH22252).
Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were single-level MultiIndex (GH26303).
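Fixed bug where assigning an arrays.PandasArray to a pandas.core.frame.DataFrame would raise an error (GH26390)
Allow keyword arguments for callable local references used in the DataFrame.query() string (GH26426)
Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH27148)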
Bug which produced AttributeError on partial matching Timestamp in a MultiIndex (GH26944)
Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH23705)
Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH27110)
Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when using the in operator (__contains__) (GH21729)
Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH22717)
Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH12862)
Bug in DataFrame.iloc() when indexing with a read-only indexer (GH17192)
Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH20441)
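
The boolean indexer item above (GH26658) can be sketched as follows. This assumes a recent pandas release; the helper raises_index_error is a hypothetical name introduced only for this example.

```python
import pandas as pd

s = pd.Series([10, 20, 30])


def raises_index_error(indexer):
    """Return True if a too-short boolean mask is rejected with IndexError."""
    try:
        indexer[[True, False]]  # two flags for a three-element Series
    except IndexError:
        return True
    return False


loc_raises = raises_index_error(s.loc)
iloc_raises = raises_index_error(s.iloc)

print(loc_raises, iloc_raises)
```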

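Missing#

Fixed misleading exception message in Series.interpolate() if argument order is required, but omitted (GH10633, GH24014).
Fixed class type displayed in exception message in DataFrame.dropna() if an invalid axis parameter is passed (GH25555)
A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH27042)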
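
A minimal sketch of the fillna() limit validation described above, assuming a recent pandas release. The frame contents and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan]})

# A positive limit caps how many missing values are filled.
filled = df.fillna(0.0, limit=1)

# A non-positive limit is rejected with ValueError.
try:
    df.fillna(0.0, limit=-1)
    rejected = False
except ValueError:
    rejected = True

print(filled)
print(rejected)
```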

MultiIndex#

Bug in which incorrect exception raised by Timedelta when testing the membership of MultiIndex (GH24570)

IO#

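Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)
Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)
Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)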
Bug in json_normalize() for errors='ignore' where missing values in the input data, were filled in resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH25608)
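Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)
Bug in read_csv() not properly interpreting UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)
Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)
Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH11188)
Bug in read_hdf() not properly closing the store after a KeyError is raised (GH25766)
Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)
Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH25941)
Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
Added use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
Added cache_dates=True parameter to read_csv(), which allows unique dates to be cached when they are parsed (GH25990)
DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH26051)
Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH26545)
read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and the engine param is passed, since pandas.io.excel.ExcelFile has an engine defined (GH26566)
Bug while selecting from HDFStore with where='' specified (GH26610).
Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH27006)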
Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError (GH11926)
Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH27160)
Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe has no columns (GH27339)
Allow parsing of PeriodDtype columns when using read_csv() (GH26934)
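
The cache_dates parameter of read_csv() mentioned above can be tried with an in-memory file. This is a sketch assuming a pandas release that still accepts cache_dates; the CSV text and column names are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content with a repeated date, which the cache can reuse.
csv_text = "date,value\n2019-07-18,1\n2019-07-18,2\n2019-07-19,3\n"

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], cache_dates=True)

print(df.dtypes)
```

The parsed result should be the same with or without the cache; the option only affects how repeated date strings are converted.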

Plotting#

Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH25587)
Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH25481)
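Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH7612, GH15912, GH22334)
Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH14763)
Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH17173)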

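GroupBy/resample/rolling#

Bug in pandas.core.resample.Resampler.agg() with a timezone aware index where OverflowError would raise when passing a list of functions (GH22660)
Bug in pandas.core.groupby.DataFrameGroupBy.nunique() in which the names of column levels were lost (GH23222)
Bug in pandas.core.groupby.GroupBy.agg() when applying an aggregation function to timezone aware data (GH23683)
Bug in pandas.core.groupby.GroupBy.first() and pandas.core.groupby.GroupBy.last() where timezone information would be dropped (GH21603)
Bug in pandas.core.groupby.GroupBy.size() when grouping only NA values (GH23050)
Bug in Series.groupby() where the observed kwarg was previously ignored (GH24880)
Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH25704)
Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH25692)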
Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH25871, GH25167)
Bug in pandas.core.window.Rolling.min() and pandas.core.window.Rolling.max() that caused a memory leak (GH25893)
Bug in pandas.core.window.Rolling.count() and pandas.core.window.Expanding.count which previously ignored the axis keyword (GH13503)
Bug in pandas.core.groupby.GroupBy.idxmax() and pandas.core.groupby.GroupBy.idxmin() with a datetime column would return an incorrect dtype (GH25444, GH15306)
Bug in pandas.core.groupby.GroupBy.cumsum(), pandas.core.groupby.GroupBy.cumprod(), pandas.core.groupby.GroupBy.cummin() and pandas.core.groupby.GroupBy.cummax() with a categorical column having absent categories would return an incorrect result or segfault (GH16771)
Bug in pandas.core.groupby.GroupBy.nth() where NA values in the grouping would return incorrect results (GH26011)
Bug in pandas.core.groupby.SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH26208)
Bug in pandas.core.frame.DataFrame.groupby() where passing a pandas.core.groupby.grouper.Grouper would return incorrect groups when using the .groups accessor (GH26326)
Bug in pandas.core.groupby.GroupBy.agg() where incorrect results are returned for uint64 columns. (GH26310)
Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where MemoryError is raised with empty window (GH26005)
Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH26005)
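Improved pandas.core.window.Rolling, pandas.core.window.Window and pandas.core.window.ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors, and to raise a DataError only if all columns are nuisance (GH12537)
Bug in pandas.core.window.Rolling.max() and pandas.core.window.Rolling.min() where incorrect results are returned with an empty variable window (GH26005)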
Raise a helpful exception when an unsupported weighted window function is used as an argument of pandas.core.window.Window.aggregate() (GH26597)
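
As a sketch of the first() and last() item above (GH21603), the example below groups a timezone aware column and checks that the timezone is kept. It assumes a recent pandas release; the frame and its column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "ts": pd.date_range("2019-07-18", periods=3, tz="UTC"),
    }
)

# first() and last() should keep the timezone of the grouped column.
first = df.groupby("key")["ts"].first()
last = df.groupby("key")["ts"].last()

print(first)
print(last)
```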

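Reshaping#

Bug in pandas.merge() which adds the string None if None is assigned in suffixes, instead of leaving the column name as-is (GH24782).
Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212, GH25009)
to_records() now accepts dtypes to its column_dtypes parameter (GH24895)
Bug in concat() where the order of OrderedDict (and dict in Python 3.6+) is not respected when passed in as the objs argument (GH21510)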
Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False, when the aggfunc argument contains a list (GH22159)
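Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).
Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
Bug in DataFrame where instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH26349).
Bug in DataFrame where instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH26342).
Bug in DataFrame constructor where passing non-empty tuples would cause a segmentation fault (GH25691)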
Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH25959)
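Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)
Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH26053)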
Bug in Series.nlargest() treats True as smaller than False (GH26154)
Bug in DataFrame.pivot_table() with a IntervalIndex as pivot index would raise TypeError (GH25814)
Bug in which DataFrame.from_dict() ignored order of OrderedDict when orient='index' (GH8425).
Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH26825)
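Bug in pivot_table() where pivoting a timezone aware column as the values would remove timezone information (GH14948)
Bug in merge_asof() when specifying multiple by columns where one is datetime64[ns, tz] dtype (GH26649)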
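
The two DataFrame instantiation items above (GH26349, GH26342) reuse the expressions quoted in the notes. A minimal sketch, assuming a recent pandas release; only the variable names are added here.

```python
import pandas as pd

# A dict whose value is an iterator is consumed into a column.
from_iterator = pd.DataFrame({"A": reversed(range(3))})

# A bare range becomes a single column labelled 0.
from_range = pd.DataFrame(range(3))

print(from_iterator)
print(from_range)
```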

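Sparse#

Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH24985)
Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)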
Bug in SparseDataFrame when adding a column in which the length of values does not match length of index, AssertionError is raised instead of raising ValueError (GH25484)
Introduced a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH26554)
Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH26946).
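
The numpy.modf() item above can be sketched as follows. This assumes a recent pandas release where pd.arrays.SparseArray is available; the explicit fill_value=0.0 is a choice made for this example so that the fill value has a well-defined fractional and integral part.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.5, 0.0, 2.25], fill_value=0.0)

# modf has two outputs, so a tuple of two SparseArray objects comes back.
fractional, integral = np.modf(arr)

print(np.asarray(fractional), np.asarray(integral))
```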

Build changes#

Fix install error with PyPy on macOS (GH26536)

ExtensionArray#

Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH25696).
Series.count() miscounts NA values in ExtensionArrays (GH26835)
Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH23293).
Keyword argument deep has been removed from ExtensionArray.copy() (GH27083)
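
A short sketch of the Series.count() and Series.__array_ufunc__ items above, using the nullable Int64 extension dtype as the example extension array. It assumes a recent pandas release; the values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.array([-1, None, 3], dtype="Int64"))

# count() skips the missing entry held by the extension array.
n_valid = s.count()

# A NumPy ufunc applied to the Series keeps the extension dtype.
absolute = np.abs(s)

print(n_valid)
print(absolute)
```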

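Other#

Removed unused C functions from vendored UltraJSON implementation (GH26198)
Allow Index and RangeIndex to be passed to numpy min and max functions (GH26125)
Use the actual class name in the repr of empty objects of a Series subclass (GH27001).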
Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH13287)
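
The item above about passing Index and RangeIndex to the numpy min and max functions can be illustrated with a small sketch, assuming recent pandas and NumPy releases. The index values are arbitrary.

```python
import numpy as np
import pandas as pd

# NumPy reductions delegate to the Index methods of the same name.
smallest = np.min(pd.RangeIndex(5))
largest = np.max(pd.Index([3, 1, 2]))

print(smallest, largest)
```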

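Contributors#

A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.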

1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +
