>>> from env_helper import info; info()
Page last updated: 2023-12-16 21:47:51
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-15-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

8.6. Pandas Sparse Data

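Sparse objects are "compressed" by omitting any data that matches a specific value (NaN / missing values by default, although any value can be chosen). A special SparseIndex object tracks where the data has been "sparsified". An example makes this clearer.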

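In older versions, every standard Pandas data structure had a to_sparse method. That method was deprecated and then removed in pandas 1.0; the current approach is to build a pandas.arrays.SparseArray: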

>>> import pandas as pd
>>> import numpy as np
>>>
>>> ts = pd.Series(np.random.randn(10))
>>> ts[2:-2] = np.nan
>>>
>>> sts = pd.arrays.SparseArray(ts)
>>>
>>> # The legacy ts.to_sparse() call is no longer available.
>>> print(sts)
[0.41409625411779355, -1.5475189937381009, nan, nan, nan, nan, nan, nan, 0.2557387139917992, -2.7711120403907983]
Fill: nan
IntIndex
Indices: array([0, 1, 8, 9], dtype=int32)
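
The pieces of the printed representation above can also be read individually. The following is a minimal sketch, assuming fixed illustrative values instead of the random data used in the text, so that the result is reproducible. It uses the sp_values, sp_index, fill_value and density attributes of SparseArray.

```python
import numpy as np
import pandas as pd

# Illustrative, fixed values (the text above uses random data).
dense = np.array([0.5, -1.5, np.nan, np.nan, np.nan,
                  np.nan, np.nan, np.nan, 0.25, -2.75])
arr = pd.arrays.SparseArray(dense)

stored_values = arr.sp_values             # only the non-fill values are kept
stored_positions = arr.sp_index.indices   # where those values sit (IntIndex)
fill_value = arr.fill_value               # the value that was compressed away
density = arr.density                     # fraction of points actually stored

print(stored_values)
print(stored_positions)
print(fill_value, density)
```

Here four of the ten points are stored, at positions 0, 1, 8 and 9; every other position is implied by the fill value.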

Sparse objects exist for the sake of memory efficiency.
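
One way to see the memory saving is to compare the footprint of a dense Series with that of its sparse counterpart. This is a rough sketch under assumed sizes (100,000 points, 10 of them non-missing); the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import numpy as np
import pandas as pd

# A mostly-missing Series: 100,000 points, only the last 10 hold data.
dense = pd.Series(np.full(100_000, np.nan))
dense.iloc[-10:] = 1.0

# Convert to a sparse dtype that treats NaN as the fill value.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Compare the memory used by the values only (index excluded).
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
print(dense_bytes, sparse_bytes)
```

The sparse version only stores the non-missing values plus their positions, so its footprint grows with the number of stored points rather than with the total length.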

Now suppose you have a large DataFrame that is almost entirely NA, and run the following code:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(10000, 4))
>>> df.loc[:9998] = np.nan
>>> # pd.SparseDtype describes a dtype and cannot wrap a DataFrame:
>>> # pd.SparseDtype(df) raises TypeError. Convert with astype instead
>>> # (the legacy df.to_sparse() was removed in pandas 1.0).
>>> sdf = df.astype(pd.SparseDtype("float64", np.nan))
>>> # density is exposed through the .sparse accessor
>>> print(sdf.sparse.density)
0.0001

Any sparse object can be converted back to its standard dense form by calling to_dense:

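>>> import pandas as pd
>>> import numpy as np
>>> ts = pd.Series(np.random.randn(10))
>>> ts[2:-2] = np.nan
>>> # ts.to_sparse() was removed in pandas 1.0; convert with astype instead.
>>> sts = ts.astype(pd.SparseDtype("float64", np.nan))
>>> print(sts.sparse.to_dense())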
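
The round trip from dense to sparse and back can be checked explicitly. This is a minimal sketch, assuming a small Series with fixed illustrative values and the Series.sparse accessor available in current pandas.

```python
import numpy as np
import pandas as pd

# Illustrative values; NaN marks the missing points.
ts = pd.Series([0.5, -1.5, np.nan, np.nan, 0.25])

# Dense -> sparse
sts = ts.astype(pd.SparseDtype("float64", np.nan))

# Sparse -> dense again, through the .sparse accessor
dense_again = sts.sparse.to_dense()

print(sts.dtype)
print(dense_again.dtype)
print(dense_again.equals(ts))
```

Series.equals treats NaN values in the same positions as equal, so it is a convenient check that nothing was lost in the conversion.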

8.6.1. Sparse Dtypes

Sparse data should have the same dtype as its dense representation. Currently, the float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype:

float64 − np.nan
int64 − 0
bool − False

Run the following code to see this in practice:

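>>> import pandas as pd
>>> import numpy as np
>>>
>>> s = pd.Series([1, np.nan, np.nan])
>>> print(s)
>>> print("=============================")
>>> # s.to_sparse() was removed in pandas 1.0; astype returns a new sparse Series.
>>> ss = s.astype(pd.SparseDtype("float64"))
>>> print(ss)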
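
The default fill values listed above can be checked directly on pd.SparseDtype. The sketch below also builds a small integer SparseArray (the values are illustrative, not from the text) to show that zeros, rather than NaN, are what gets compressed for integer data.

```python
import numpy as np
import pandas as pd

# Default fill_value for each of the dtypes listed above.
for name in ("float64", "int64", "bool"):
    dtype = pd.SparseDtype(name)
    print(name, "->", repr(dtype.fill_value))

# For integer data no fill_value is given, so the default (0) is used
# and only the non-zero entries are stored.
counts = pd.arrays.SparseArray([0, 0, 3, 0, 0, 7])
print(counts.dtype)
print(counts.sp_values)
```

A different fill value can be requested by passing it as the second argument to pd.SparseDtype, as the earlier examples do with np.nan.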