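pandas docstring guide#

About docstrings and standards#

A Python docstring is a string used to document a Python module, class, function or method, so programmers can understand what it does without having to read the details of the implementation.

It is also common practice to generate online (HTML) documentation automatically from docstrings. Sphinx serves this purpose.

The next example gives an idea of what a docstring looks like: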

def add(num1, num2):
    """
    Add up two integer numbers.

    This function simply wraps the ``+`` operator, and does not
    do anything interesting, except for illustrating what
    the docstring of a very simple function looks like.

    Parameters
    ----------
    num1 : int
        First number to add.
    num2 : int
        Second number to add.

    Returns
    -------
    int
        The sum of ``num1`` and ``num2``.

    See Also
    --------
    subtract : Subtract one integer from another.

    Examples
    --------
    >>> add(2, 2)
    4
    >>> add(25, 0)
    25
    >>> add(10, -10)
    0
    """
    return num1 + num2
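
A docstring is stored on the object it documents, so tools can read it without looking at the implementation. The sketch below uses only the standard library to retrieve the short summary from a shortened copy of the ``add`` example; the variable names are illustrative.

```python
import inspect


def add(num1, num2):
    """
    Add up two integer numbers.

    Parameters
    ----------
    num1 : int
        First number to add.
    num2 : int
        Second number to add.

    Returns
    -------
    int
        The sum of ``num1`` and ``num2``.
    """
    return num1 + num2


# inspect.getdoc() returns the docstring with the indentation and the
# leading blank line removed, so the first line is the short summary.
clean_doc = inspect.getdoc(add)
short_summary = clean_doc.splitlines()[0]
print(short_summary)
```

``help(add)`` displays the same text in an interactive session.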

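Some standards regarding docstrings exist, which make them easier to read, and allow them to be easily exported to other formats such as HTML or PDF.

The first conventions every Python docstring should follow are defined in PEP-257.

As PEP-257 is quite broad, other more specific standards also exist. In the case of pandas, the NumPy docstring convention is followed. These conventions are explained in this document:

numpydoc docstring guide (which is based on the original Guide to NumPy/SciPy documentation)

numpydoc is a Sphinx extension to support the NumPy docstring convention.

The standard uses reStructuredText (reST). reStructuredText is a markup language that allows encoding styles in plain text files. Documentation about reStructuredText is available from the Sphinx and Docutils projects.

pandas has some helpers for sharing docstrings between related classes, see Sharing docstrings.

The rest of this document summarizes all the above guidelines, and provides additional conventions specific to the pandas project.

Writing a docstring#

General rules#

Docstrings must be defined with three double-quotes. No blank lines should be left before or after the docstring. The text starts in the next line after the opening quotes. The closing quotes have their own line (meaning that they are not at the end of the last sentence).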

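On rare occasions reST styles like bold text or italics will be used in docstrings, but it is common to have inline code, which is presented between backticks. The following are considered inline code:

The name of a parameter
Python code, a module, a function, a built-in, a type, a literal... (e.g. os, list, numpy.abs, datetime.date, True)
A pandas class (in the form :class:`pandas.Series`)
A pandas method (in the form :meth:`pandas.Series.sum`)
A pandas function (in the form :func:`pandas.to_datetime`)

Note

To display only the last component of the linked class, method or function, prefix it with ~. For example, :class:`~pandas.Series` will link to pandas.Series but only display the last part, Series, as the link text. See Sphinx cross-referencing syntax for details.

Good: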

def add_values(arr):
    """
    Add the values in ``arr``.

    This is equivalent to Python ``sum`` of :meth:`pandas.Series.sum`.

    Some sections are omitted here for simplicity.
    """
    return sum(arr)

Bad:

def func():

    """Some function.

    With several mistakes in the docstring.

    It has a blank line after the signature ``def func():``.

    The text 'Some function' should go in the line after the
    opening quotes of the docstring, not in the same line.

    There is a blank line between the docstring and the first line
    of code ``foo = 1``.

    The closing quotes should be in the next line, not in this one."""

    foo = 1
    bar = 2
    return foo + bar
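
The layout rules above are mechanical enough to check automatically. The following is only a sketch, not the validation script pandas uses: ``check_layout`` is a hypothetical helper that inspects the first function in a source string and assumes a single-line signature and that the first statement is the docstring.

```python
import ast

GOOD = '''
def add_values(arr):
    """
    Add the values in ``arr``.
    """
    return sum(arr)
'''

BAD = '''
def func():

    """Some function.

    The closing quotes should be on their own line."""

    foo = 1
    return foo
'''


def check_layout(source):
    """
    Return the layout problems found in the first function of ``source``.
    """
    lines = source.splitlines()
    func = ast.parse(source).body[0]
    doc = func.body[0]  # assumed to be the docstring expression
    problems = []
    if lines[doc.lineno - 1].strip() != '"""':
        problems.append("text on the same line as the opening quotes")
    if lines[doc.end_lineno - 1].strip() != '"""':
        problems.append("closing quotes not on their own line")
    if doc.lineno != func.lineno + 1:
        problems.append("blank line before the docstring")
    if len(func.body) > 1 and func.body[1].lineno != doc.end_lineno + 1:
        problems.append("blank line after the docstring")
    return problems


print(check_layout(GOOD))
print(check_layout(BAD))
```

The first call reports no problems, while the second reports all four.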

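Section 1: short summary#

The short summary is a single sentence that expresses what the function does in a concise way.

The short summary must start with a capital letter, end with a dot, and fit in a single line. It needs to express what the object does without providing details. For functions and methods, the short summary must start with an infinitive verb.

Good: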

def astype(dtype):
    """
    Cast Series type.

    This section will provide further details.
    """
    pass

Bad:

def astype(dtype):
    """
    Casts Series type.

    Verb in third-person of the present simple, should be infinitive.
    """
    pass

def astype(dtype):
    """
    Method to cast Series type.

    Does not start with verb.
    """
    pass

def astype(dtype):
    """
    Cast Series type

    Missing dot at the end.
    """
    pass

def astype(dtype):
    """
    Cast Series type from its current type to the new type defined in
    the parameter dtype.

    Summary is too verbose and doesn't fit in a single line.
    """
    pass
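
Three of these rules can be checked with plain string operations. The helper below is a hypothetical sketch, not part of pandas; whether the first word is an infinitive verb is deliberately left unchecked, since that would need a dictionary of verbs.

```python
def check_short_summary(summary):
    """
    Return the short-summary rules that ``summary`` breaks.

    Parameters
    ----------
    summary : str
        Text found between the opening quotes and the first blank line.

    Returns
    -------
    list of str
        Description of each broken rule, empty if none is broken.
    """
    text = summary.strip()
    problems = []
    if "\n" in text:
        problems.append("does not fit in a single line")
    if not text[:1].isupper():
        problems.append("does not start with a capital letter")
    if not text.endswith("."):
        problems.append("does not end with a dot")
    return problems


print(check_short_summary("Cast Series type."))
print(check_short_summary("Cast Series type"))
```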

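Section 2: extended summary#

The extended summary provides details on what the function does. It should not go into the details of the parameters, or discuss implementation notes, which go in other sections.

A blank line is left between the short summary and the extended summary. Every paragraph in the extended summary ends with a dot.

The extended summary should provide details on why the function is useful and its use cases, if it is not too generic.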

def unstack():
    """
    Pivot a row index to columns.

    When using a MultiIndex, a level can be pivoted so each value in
    the index becomes a column. This is especially useful when a subindex
    is repeated for the main index, and data is easier to visualize as a
    pivot table.

    The index level will be automatically removed from the index when added
    as columns.
    """
    pass

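Section 3: parameters#

The details of the parameters are added in this section. This section has the title “Parameters”, followed by a line with a hyphen under each letter of the word “Parameters”. A blank line is left before the section title, but not after, and not between the line with the word “Parameters” and the one with the hyphens.

After the title, each parameter in the signature must be documented, including *args and **kwargs, but not self.

The parameters are defined by their name, followed by a space, a colon, another space, and the type (or types). Note that the space between the name and the colon is important. Types are not defined for *args and **kwargs, but must be defined for all other parameters. After the parameter definition, it is required to have a line with the parameter description, which is indented, and can have multiple lines. The description must start with a capital letter, and finish with a dot.

For keyword arguments with a default value, the default is listed after a comma at the end of the type. The exact form of the type in this case is “int, default 0”. In some cases it may be useful to explain what the default argument means, which can be added after a comma: “int, default -1, meaning all cpus”.

When the default value is None and this means that the value will not be used, write "str, optional" instead of "str, default None". When None is a value being used, keep the form “str, default None”. For example, in df.to_csv(compression=None), None is not a value being used, but means that compression is optional, and no compression is applied if it is not provided. In this case we use "str, optional". Only in cases like func(value=None), where None is used in the same way as 0 or foo would be used, do we specify “str, int or None, default None”.

Good: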

class Series:
    def plot(self, kind, color='blue', **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Parameters
        ----------
        kind : str
            Kind of matplotlib plot.
        color : str, default 'blue'
            Color name or rgb code.
        **kwargs
            These parameters will be passed to the matplotlib plotting
            function.
        """
        pass

Bad:

class Series:
    def plot(self, kind, **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Note the blank line between the parameters title and the first
        parameter. Also, note that after the name of the parameter ``kind``
        and before the colon, a space is missing.

        Also, note that the parameter descriptions do not start with a
        capital letter, and do not finish with a dot.

        Finally, the ``**kwargs`` parameter is missing.

        Parameters
        ----------

        kind: str
            kind of matplotlib plot
        """
        pass
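
The difference between “optional” and “default None” is easier to see side by side. The function below is a made-up example, not a pandas API: ``prefix`` is simply not used when it is omitted, while ``limit=None`` is passed on as a real value (an open-ended slice).

```python
def join_values(values, sep=", ", prefix=None, limit=None):
    """
    Join values into a single string.

    Parameters
    ----------
    values : list of str
        Values to join.
    sep : str, default ', '
        Separator placed between the values.
    prefix : str, optional
        Text added before the result. If not provided, no prefix is added.
    limit : int or None, default None
        Maximum number of values to join. ``None`` is used as a value
        here, and means that all the values are joined.

    Returns
    -------
    str
        The joined values.
    """
    # ``limit`` is used directly as the slice bound, None included.
    result = sep.join(values[:limit])
    if prefix is not None:
        result = prefix + result
    return result


print(join_values(["ant", "bear", "cow"]))
print(join_values(["ant", "bear", "cow"], prefix="> ", limit=2))
```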

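Parameter types#

When specifying the parameter types, Python built-in data types can be used directly (the Python type is preferred to the more verbose string, integer, boolean, etc.):

int
float
str
bool

For complex types, define the subtypes. For dict and tuple, as more than one type is present, we use brackets to help read the type (curly brackets for dict and normal brackets for tuple):

list of int
dict of {str : int}
tuple of (str, int, int)
tuple of (str,)
set of str

If only a set of values is allowed, list them in curly brackets, separated by commas (followed by a space). If the values are ordinal and they have an order, list them in this order. Otherwise, list the default value first, if there is one:

{0, 10, 25}
{'simple', 'advanced'}
{'low', 'medium', 'high'}
{'cat', 'dog', 'bird'}

If the type is defined in a Python module, the module must be specified:

datetime.date
datetime.datetime
decimal.Decimal

If the type is in a package, the module must also be specified:

numpy.ndarray
scipy.sparse.coo_matrix

If the type is a pandas type, also specify pandas, except for Series and DataFrame:

Series
DataFrame
pandas.Index
pandas.Categorical
pandas.arrays.SparseArray

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If any type that can be iterated is accepted, iterable can be used:

array-like
iterable

If more than one type is accepted, separate them by commas, except the last two types, which need to be separated by the word ‘or’:

int or float
float, decimal.Decimal or None
str or list of str

If None is one of the accepted values, it always needs to be the last in the list.

For axis, the convention is to use something like:

axis : {0 or 'index', 1 or 'columns', None}, default None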
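
Several of these notations can appear in the same docstring. The function below is invented for illustration only; it combines a list type, a dict type, a set of allowed values with the default listed first, and a type that needs its module.

```python
import decimal


def convert_total(amounts, rates, unit="day", factor=1):
    """
    Add up the amounts and convert the total to the given unit.

    Parameters
    ----------
    amounts : list of int
        Amounts to add up.
    rates : dict of {str : float}
        Conversion rate for each unit.
    unit : {'day', 'week', 'month'}, default 'day'
        Unit used to pick the conversion rate.
    factor : int, float or decimal.Decimal, default 1
        Multiplier applied to the converted total.

    Returns
    -------
    float
        Converted total.
    """
    return float(sum(amounts) * rates[unit] * float(factor))


total = convert_total(
    amounts=[1, 2, 3],
    rates={"day": 1.0, "week": 7.0},
    unit="week",
    factor=decimal.Decimal("2"),
)
print(total)
```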

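Section 4: returns or yields#

If the method returns a value, it is documented in this section. The same applies if the method yields its output.

The title of the section is defined in the same way as “Parameters”: the name “Returns” or “Yields”, followed by a line with as many hyphens as there are letters in the preceding word.

The documentation of the return is also similar to the parameters. But in this case, no name is provided, unless the method returns or yields more than one value (a tuple of values).

The types for “Returns” and “Yields” are the same as the ones for “Parameters”. Also, the description must finish with a dot.

For example, with a single value: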

def sample():
    """
    Generate and return a random number.

    The value is sampled from a continuous uniform distribution between
    0 and 1.

    Returns
    -------
    float
        Random number generated.
    """
    return np.random.random()

With more than one value:

import string

def random_letters():
    """
    Generate and return a sequence of random letters.

    The length of the returned string is also random, and is also
    returned.

    Returns
    -------
    length : int
        Length of the returned string.
    letters : str
        String of random letters.
    """
    length = np.random.randint(1, 10)
    letters = ''.join(np.random.choice(list(string.ascii_lowercase))
                      for i in range(length))
    return length, letters

If the method yields its value:

def sample_values():
    """
    Generate an infinite sequence of random numbers.

    The values are sampled from a continuous uniform distribution between
    0 and 1.

    Yields
    ------
    float
        Random number generated.
    """
    while True:
        yield np.random.random()
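
The remaining combination, a method that yields more than one value at a time, follows the same naming rule as multiple return values. A small sketch with a made-up function:

```python
import itertools


def numbered_letters(letters):
    """
    Generate each letter together with its position.

    Parameters
    ----------
    letters : str
        Letters to iterate over.

    Yields
    ------
    position : int
        Position of the letter, starting at 0.
    letter : str
        Letter found at that position.
    """
    for position, letter in enumerate(letters):
        yield position, letter


# Take only the first two items from the generator.
first_two = list(itertools.islice(numbered_letters("abc"), 2))
print(first_two)  # [(0, 'a'), (1, 'b')]
```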

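Section 5: see also#

This section is used to let users know about pandas functionality related to the one being documented. In rare cases, if no related methods or functions can be found at all, this section can be skipped.

An obvious example would be the head() and tail() methods. As tail() does the equivalent of head() but at the end of the Series or DataFrame instead of at the beginning, it is good to let users know about it.

To give an intuition on what can be considered related, here are some examples:

loc and iloc, as they do the same, but in one case providing indices and in the other positions
max and min, as they do the opposite
iterrows, itertuples and items, as a user looking for the method to iterate over columns can easily end up in the method to iterate over rows, and vice versa
fillna and dropna, as both methods are used to handle missing values
read_csv and to_csv, as they are complementary
merge and join, as one is a generalization of the other
astype and pandas.to_datetime, as users may be reading the documentation of astype to know how to cast as a date, and the way to do it is with pandas.to_datetime
where is related to numpy.where, as its functionality is based on it

When deciding what is related, you should mainly use your common sense and think about what can be useful for the users reading the documentation, especially the less experienced ones.

When relating to other libraries (mainly numpy), use the name of the module first (not an alias like np). If the function is in a module which is not the main one, like scipy.sparse, list the full module (e.g. scipy.sparse.coo_matrix).

This section has a header, “See Also” (note the capital S and A), followed by the line with hyphens and preceded by a blank line.

After the header, we add a line for each related method or function, followed by a space, a colon, another space, and a short description that illustrates what this method or function does, why it is relevant in this context, and what the key differences are between the documented function and the one being referenced. The description must also end with a dot.

Note that in “Returns” and “Yields”, the description is located on the line after the type. In this section, however, it is located on the same line, with a colon in between. If the description does not fit on the same line, it can continue onto other lines, which must be further indented.

For example: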

class Series:
    def head(self):
        """
        Return the first 5 elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying the whole of it.

        Returns
        -------
        Series
            Subset of the original series with the 5 first values.

        See Also
        --------
        Series.tail : Return the last 5 elements of the Series.
        Series.iloc : Return a slice of the elements in the Series,
            which can also be used to return the first or last n.
        """
        return self.iloc[:5]
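
References to other libraries follow the same layout, using the full module name. The class below is only a minimal stand-in for ``pandas.Series``, written so that the sketch runs without pandas; the wording of the descriptions is illustrative.

```python
class Series:
    """
    Minimal stand-in for ``pandas.Series``, used only for illustration.
    """

    def __init__(self, values):
        self.values = list(values)

    def where(self, cond, other=None):
        """
        Replace values where the condition is False.

        Parameters
        ----------
        cond : list of bool
            Values are kept where ``cond`` is True.
        other : object, optional
            Replacement used where ``cond`` is False.

        Returns
        -------
        Series
            Series with the replaced values.

        See Also
        --------
        Series.mask : Replace values where the condition is True.
        numpy.where : Return elements chosen from two arrays depending on
            a condition.
        """
        replaced = [
            value if keep else other
            for value, keep in zip(self.values, cond)
        ]
        return Series(replaced)


result = Series([1, 2, 3]).where([True, False, True], other=0)
print(result.values)  # [1, 0, 3]
```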

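Section 6: notes#

This is an optional section used for notes about the implementation of the algorithm, or to document technical aspects of the function behavior.

Feel free to skip it, unless you are familiar with the implementation of the algorithm, or you discover some counter-intuitive behavior while writing the examples for the function.

This section follows the same format as the extended summary section.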
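
As an illustration, a “Notes” section could look like the sketch below. The function is a made-up example and the header is assumed to follow the same pattern as the other sections (the word “Notes” with a hyphen under each letter).

```python
import math


def mean(values):
    """
    Compute the arithmetic mean of the values.

    Parameters
    ----------
    values : list of float
        Values to average.

    Returns
    -------
    float
        Mean of ``values``.

    Notes
    -----
    The sum is computed with :func:`math.fsum`, which tracks intermediate
    partial sums to avoid the loss of precision of a naive running total.

    An empty list raises ``ZeroDivisionError``.
    """
    return math.fsum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))  # 2.0
```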

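Section 7: examples#

This is one of the most important sections of a docstring, despite being placed in the last position, as people often understand concepts better by example than through accurate explanations.

Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python code that returns the given output in a deterministic way, and that can be copied and run by users.

Examples are presented as a session in the Python terminal. >>> is used to present code. ... is used for code continuing from the previous line. Output is presented immediately after the last line of code generating the output (no blank lines in between). Comments describing the examples can be added with blank lines before and after them.

The way to present examples is as follows:

Import required libraries (except numpy and pandas)
Create the data required for the example
Show a very basic example that gives an idea of the most common use case
Add examples with explanations that illustrate how the parameters can be used for extended functionality

A simple example could be: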

class Series:

    def head(self, n=5):
        """
        Return the first elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying all of it.

        Parameters
        ----------
        n : int
            Number of values to return.

        Returns
        -------
        pandas.Series
            Subset of the original series with the n first values.

        See Also
        --------
        tail : Return the last n elements of the Series.

        Examples
        --------
        >>> s = pd.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
        ...                'Lion', 'Monkey', 'Rabbit', 'Zebra'])
        >>> s.head()
        0       Ant
        1      Bear
        2       Cow
        3       Dog
        4    Falcon
        dtype: object

        With the ``n`` parameter, we can change the number of returned rows:

        >>> s.head(n=3)
        0     Ant
        1    Bear
        2     Cow
        dtype: object
        """
        return self.iloc[:n]

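The examples should be as concise as possible. In cases where the complexity of the function requires long examples, it is recommended to use blocks with headers in bold. Use double star ** to make a text bold, like in **this example**.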
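
Because the examples are written as a terminal session, they can be executed and compared with the stated output. The sketch below does this with the standard library ``doctest`` module on a plain-list version of ``head``; it is not the pandas validation script, and the function is a simplified stand-in.

```python
import doctest


def head(values, n=5):
    """
    Return the first elements of a list.

    Parameters
    ----------
    values : list
        Values to take the elements from.
    n : int, default 5
        Number of elements to return.

    Returns
    -------
    list
        The first ``n`` elements of ``values``.

    Examples
    --------
    >>> animals = ['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
    ...            'Lion', 'Monkey']
    >>> head(animals)
    ['Ant', 'Bear', 'Cow', 'Dog', 'Falcon']

    With the ``n`` parameter, we can change the number of returned values:

    >>> head(animals, n=3)
    ['Ant', 'Bear', 'Cow']
    """
    return values[:n]


# Extract the examples from the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(head.__doc__, {"head": head}, "head", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

A failing example would be reported on standard output and counted in ``results.failed``.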

Conventions for the examples#

Code in examples is assumed to always start with these two lines, which are not shown:

import numpy as np
import pandas as pd

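Any other module used in the examples must be explicitly imported, one per line (as recommended in PEP 8#imports) and avoiding aliases. Avoid excessive imports, but if needed, imports from the standard library go first, followed by third-party libraries (like matplotlib).

When illustrating examples with a single Series use the name s, and if illustrating with a single DataFrame use the name df. For indices, idx is the preferred name. If a set of homogeneous Series or DataFrame is used, name them s1, s2, s3... or df1, df2, df3... If the data is not homogeneous, and more than one structure is needed, name them with something meaningful, for example df_main and df_to_join.

Data used in the example should be as compact as possible. The number of rows is recommended to be around 4, but make it a number that makes sense for the specific example. For example, in the head method it needs to be higher than 5, to show the example with the default values. If doing the mean, we could use something like [1, 2, 3], so it is easy to see that the value returned is the mean.

For more complex examples (grouping for example), avoid using data without interpretation, like a matrix of random numbers with columns A, B, C, D... Instead, use a meaningful example, which makes it easier to understand the concept. Unless required by the example, use names of animals and their numerical properties, to keep examples consistent.

When calling the method, keyword arguments head(n=3) are preferred to positional arguments head(3).

Good: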

class Series:

    def mean(self):
        """
        Compute the mean of the input.

        Examples
        --------
        >>> s = pd.Series([1, 2, 3])
        >>> s.mean()
        2.0
        """
        pass


    def fillna(self, value):
        """
        Replace missing values by ``value``.

        Examples
        --------
        >>> s = pd.Series([1, np.nan, 3])
        >>> s.fillna(0)
        0    1.0
        1    0.0
        2    3.0
        dtype: float64
        """
        pass

    def groupby_mean(self):
        """
        Group by index and return mean.

        Examples
        --------
        >>> s = pd.Series([380., 370., 24., 26],
        ...               name='max_speed',
        ...               index=['falcon', 'falcon', 'parrot', 'parrot'])
        >>> s.groupby_mean()
        index
        falcon    375.0
        parrot     25.0
        Name: max_speed, dtype: float64
        """
        pass

    def contains(self, pattern, case_sensitive=True, na=np.nan):
        """
        Return whether each value contains ``pattern``.

        In this case, we are illustrating how to use sections, even
        if the example is simple enough and does not require them.

        Examples
        --------
        >>> s = pd.Series(['Antelope', 'Lion', 'Zebra', np.nan])
        >>> s.contains(pattern='a')
        0    False
        1    False
        2     True
        3      NaN
        dtype: bool

        **Case sensitivity**

        With ``case_sensitive`` set to ``False`` we can match ``a`` with both
        ``a`` and ``A``:

        >>> s.contains(pattern='a', case_sensitive=False)
        0     True
        1    False
        2     True
        3      NaN
        dtype: bool

        **Missing values**

        We can fill missing values in the output using the ``na`` parameter:

        >>> s.contains(pattern='a', na=False)
        0    False
        1    False
        2     True
        3    False
        dtype: bool
        """
        pass

Bad:

def method(foo=None, bar=None):
    """
    A sample DataFrame method.

    Do not import NumPy and pandas.

    Try to use meaningful data, when it makes the example easier
    to understand.

    Try to avoid positional arguments like in ``df.method(1)``. They
    can be all right if previously defined with a meaningful name,
    like in ``present_value(interest_rate)``, but avoid them otherwise.

    When presenting the behavior with different parameters, do not place
    all the calls one next to the other. Instead, add a short sentence
    explaining what the example shows.

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.randn(3, 3),
    ...                   columns=('a', 'b', 'c'))
    >>> df.method(1)
    21
    >>> df.method(bar=14)
    123
    """
    pass

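Tips for getting your examples to pass the doctests#

Getting the examples to pass the doctests in the validation script can sometimes be tricky. Here are some attention points:

Import all needed libraries (except for pandas and NumPy, which are already imported as import pandas as pd and import numpy as np) and define all variables you use in the example.
Try to avoid using random data. However, random data might be OK in some cases, for example if the function you are documenting deals with probability distributions, or if the amount of data needed to make the function result meaningful is so large that creating it manually is very cumbersome. In those cases, always use a fixed random seed to make the generated examples predictable. Example: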
```
>>> np.random.seed(42)
>>> df = pd.DataFrame({'normal': np.random.normal(100, 5, 20)})
```

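If you have a code snippet that wraps multiple lines, you need to use ‘...’ on the continued lines:

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'],
...                   columns=['A', 'B', 'C'])

If you want to show a case where an exception is raised, you can do: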
```
>>> pd.to_datetime(["712-01-01"])
Traceback (most recent call last):
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 712-01-01 00:00:00
```
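It is essential to include the “Traceback (most recent call last):”, but for the actual error only the error name is sufficient.
If there is a small part of the result that can vary (e.g. a hash in an object representation), you can use ... to represent this part.

If you want to show that s.plot() returns a matplotlib AxesSubplot object, this will fail the doctest: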
```
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd0c0b0690>
```
However, you can do (notice the comment that needs to be added):
```
>>> s.plot()  # doctest: +ELLIPSIS
<matplotlib.axes._subplots.AxesSubplot at ...>
```
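
The effect of these doctest directives can be tried without pandas or matplotlib. The sketch below uses only the standard library: ``object()`` stands in for an object whose representation contains a memory address, ``int('falcon')`` stands in for a call that raises, and ``run_examples`` is a hypothetical helper. Giving only the error name relies on ``IGNORE_EXCEPTION_DETAIL`` being enabled, which is done here with a directive comment.

```python
import doctest

EXAMPLES = '''
>>> object()  # doctest: +ELLIPSIS
<object object at ...>
>>> int('falcon')  # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
ValueError
'''


def run_examples(text):
    """
    Run the doctest examples in ``text``.

    Returns
    -------
    failed : int
        Number of failing examples.
    attempted : int
        Number of examples run.
    """
    test = doctest.DocTestParser().get_doctest(text, {}, "tips", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are of interest here.
    results = runner.run(test, out=lambda report: None)
    return results.failed, results.attempted


with_directive = run_examples(EXAMPLES)
# Removing the comment makes ``...`` a literal text that does not match.
without_directive = run_examples(
    EXAMPLES.replace("  # doctest: +ELLIPSIS", "")
)
print(with_directive)     # (0, 2)
print(without_directive)  # (1, 2)
```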

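Plots in examples#

There are some methods in pandas that return plots. To render the plots generated by the examples in the documentation, the .. plot:: directive exists.

To use it, place the next code after the “Examples” header as shown below. The plot will be generated automatically when building the documentation.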

class Series:
    def plot(self):
        """
        Generate a plot with the ``Series`` data.

        Examples
        --------

        .. plot::
            :context: close-figs

            >>> s = pd.Series([1, 2, 3])
            >>> s.plot()
        """
        pass
