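`tokenize` - Tokenizer for Python source

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing "pretty-printers", including colorizers for on-screen displays.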
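
As a minimal sketch of why comment tokens matter to tools such as colorizers, the snippet below collects every COMMENT token from a small, made-up source buffer. The sample source and the variable names are assumptions chosen for illustration only.

```python
import io
import tokenize

# Hypothetical sample source; tokenize() needs bytes.
source = b"x = 1  # set x\n# standalone comment\ny = 2\n"

# tokenize() expects a readline callable that returns one line of bytes per call.
tokens = tokenize.tokenize(io.BytesIO(source).readline)

# Keep only COMMENT tokens, paired with the row on which each one starts.
comments = [(tok.start[0], tok.string) for tok in tokens
            if tok.type == tokenize.COMMENT]

print(comments)  # [(1, '# set x'), (2, '# standalone comment')]
```

A compiler-oriented scanner would normally discard these tokens; here they remain in the stream, so a formatter can re-emit or style them.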

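To simplify token stream handling, all operator and delimiter tokens and Ellipsis are returned using the generic OP token type. The exact type can be determined by checking the exact_type property on the named tuple returned from tokenize.tokenize().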
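
The difference between the generic OP type and exact_type can be sketched as follows. The input line is an arbitrary example, and tok_name is the standard mapping from numeric token codes to names.

```python
import io
import tokenize

source = b"a = [1, ...]\n"  # illustrative input only

exact_names = []
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    if tok.type == tokenize.OP:
        # type is always OP here; exact_type tells the operators apart.
        exact_names.append((tok.string, tokenize.tok_name[tok.exact_type]))

print(exact_names)
# [('=', 'EQUAL'), ('[', 'LSQB'), (',', 'COMMA'), ('...', 'ELLIPSIS'), (']', 'RSQB')]
```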

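Tokenizing Input

The primary entry point is a generator:

tokenize.tokenize(readline)

The tokenize() generator requires one argument, readline, which must be a callable object that provides the same interface as the io.IOBase.readline() method of file objects. Each call to the function should return one line of input as bytes.

The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the physical line. The 5-tuple is returned as a named tuple with the field names: type string start end line.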
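
The five fields can be inspected as in the following sketch. The one-line source is a made-up example; rows are 1-based and columns are 0-based.

```python
import io
import tokenize

source = b"total = price * 2\n"  # illustrative input only
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Pick the NAME token for "price" and look at its fields by name.
tok = next(t for t in tokens if t.string == "price")
print(tokenize.tok_name[tok.type])  # NAME
print(tok.start, tok.end)           # (1, 8) (1, 13)
print(repr(tok.line))               # 'total = price * 2\n'

# The named tuple also unpacks like a plain 5-tuple.
tok_type, tok_string, start, end, line = tok
```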

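The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.

Changed in version 3.1: Added support for named tuples.

Changed in version 3.3: Added support for exact_type.

tokenize() determines the source encoding of the file by looking for a UTF-8 BOM or an encoding cookie, according to PEP 263.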

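tokenize.generate_tokens(readline)

Tokenize a source reading unicode strings instead of bytes.

Like tokenize(), the readline argument is a callable returning a single line of input. However, generate_tokens() expects readline to return a str object rather than bytes.

The result is an iterator yielding named tuples, exactly like tokenize(). It does not yield an ENCODING token.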
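
A small side-by-side sketch of the two entry points, assuming an in-memory source string rather than a real file:

```python
import io
import tokenize

source = "x = 1\n"  # illustrative input only

# generate_tokens() takes a readline that returns str.
str_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# tokenize() takes a readline that returns bytes.
byte_tokens = list(
    tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline)
)

# Only the bytes-based stream starts with an ENCODING token.
print(tokenize.tok_name[byte_tokens[0].type])  # ENCODING
print(tokenize.tok_name[str_tokens[0].type])   # NAME
```

Apart from that leading ENCODING token, the two streams carry the same token strings for this input.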

All constants from the token module are also exported from tokenize.
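
As a quick illustration, the constants can be reached through either module, so code that already imports tokenize does not need a separate import of token:

```python
import token
import tokenize

# The same numeric constants are reachable through both modules.
same_constant = tokenize.NAME == token.NAME
number_name = tokenize.tok_name[tokenize.NUMBER]

print(same_constant)  # True
print(number_name)    # NUMBER
```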

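Another function is provided to reverse the tokenization process. This is useful for creating tools that tokenize a script, modify the token stream, and write back the modified script.

tokenize.untokenize(iterable)

Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.

The reconstructed script is returned as a single string. The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string, as the spacing between tokens (column positions) may change.

It returns bytes, encoded using the ENCODING token, which is the first token sequence output by tokenize(). If there is no encoding token in the input, it returns a str instead.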
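
The bytes-versus-str behaviour and the round-trip guarantee can be sketched like this. The source line is an arbitrary example, and the exact spacing of the 2-tuple result is deliberately not asserted because it may differ from the input.

```python
import io
import tokenize

source = b"x = 1 + 2\n"  # illustrative input only

# Full 5-tuples, including the leading ENCODING token: the result is bytes.
full = list(tokenize.tokenize(io.BytesIO(source).readline))
as_bytes = tokenize.untokenize(full)

# (type, string) pairs without ENCODING: the result is str.
pairs = [(t.type, t.string) for t in full if t.type != tokenize.ENCODING]
as_str = tokenize.untokenize(pairs)

print(type(as_bytes).__name__, type(as_str).__name__)  # bytes str

# Round trip: tokenizing the rebuilt text yields the same (type, string) pairs.
retokenized = [
    (t.type, t.string)
    for t in tokenize.generate_tokens(io.StringIO(as_str).readline)
]
print(retokenized == pairs)  # True
```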

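tokenize() needs to detect the encoding of source files it tokenizes. The function it uses to do this is available:

tokenize.detect_encoding(readline)

The detect_encoding() function is used to detect the encoding that should be used to decode a Python source file. It requires one argument, readline, in the same way as the tokenize() generator.

It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (not decoded from bytes) it has read in.

It detects the encoding from the presence of a UTF-8 BOM or an encoding cookie as specified in PEP 263. If both a BOM and a cookie are present, but disagree, a SyntaxError will be raised. Note that if the BOM is found, 'utf-8-sig' will be returned as an encoding.

If no encoding is specified, then the default of 'utf-8' will be returned.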
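
The three outcomes described above (default, cookie, BOM) can be sketched with in-memory buffers. The helper name detect is hypothetical and exists only for this example.

```python
import io
import tokenize

def detect(source_bytes):
    """Hypothetical helper: run detect_encoding() on an in-memory buffer."""
    return tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# No BOM and no cookie: the default is returned.
default_enc, default_lines = detect(b"x = 1\n")

# A PEP 263 cookie on the first line.
cookie_enc, _ = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")

# A UTF-8 BOM at the start of the file.
bom_enc, _ = detect(b"\xef\xbb\xbfx = 1\n")

print(default_enc, default_lines)  # utf-8 [b'x = 1\n']
print(cookie_enc)                  # iso-8859-1
print(bom_enc)                     # utf-8-sig
```

Note that the cookie name is normalized, so latin-1 is reported as iso-8859-1.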

Use open() to open Python source files: it uses detect_encoding() to detect the file encoding.

tokenize.open(filename): Open a file in read only mode using the encoding detected by detect_encoding().

New in version 3.2.
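
A minimal sketch of tokenize.open() on a file with a non-default encoding. The temporary file and its contents are assumptions made for the example; any Python source file would do.

```python
import os
import tempfile
import tokenize

# Write a small Latin-1 encoded source file that carries a PEP 263 cookie.
source = "# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(source.encode("latin-1"))
    path = tmp.name

try:
    # tokenize.open() detects the encoding and returns a text-mode file.
    with tokenize.open(path) as f:
        detected = f.encoding
        text = f.read()
finally:
    os.remove(path)

print(detected)        # iso-8859-1
print(text == source)  # True
```

Opening the same file with the built-in open() and no encoding argument would decode it with the platform default instead, which may garble the non-ASCII character.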

exception tokenize.TokenError

Raised when either a docstring or expression that may be split over several lines is not completed anywhere in the file, for example:

"""Beginning of
docstring

or:

[1,
 2,
 3

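Note that unclosed single-quoted strings do not cause an error to be raised. They are tokenized as ERRORTOKEN, followed by the tokenization of their contents.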
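
The two multi-line cases shown earlier can be reproduced with the sketch below. The helper try_tokenize is hypothetical; it consumes the whole token stream because the error is raised lazily, only when the generator reaches the end of the input.

```python
import io
import tokenize

def try_tokenize(source):
    """Hypothetical helper: return the tokens, or the TokenError that was raised."""
    try:
        return list(tokenize.generate_tokens(io.StringIO(source).readline))
    except tokenize.TokenError as exc:
        return exc

unclosed_list = try_tokenize("[1,\n 2,\n 3\n")
unclosed_docstring = try_tokenize('"""Beginning of\ndocstring\n')

print(type(unclosed_list).__name__)       # TokenError
print(type(unclosed_docstring).__name__)  # TokenError
```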

Command-Line Usage

New in version 3.3.

The tokenize module can be executed as a script from the command line. It is as simple as:

python -m tokenize [-e] [filename.py]

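The following options are accepted:

-h, --help: show this help message and exit

-e, --exact: display token names using the exact type

If filename.py is specified its contents are tokenized to stdout. Otherwise, tokenization is performed on stdin.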
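
The stdin mode can also be driven from Python, which is convenient in tests. The following is a sketch that assumes the current interpreter (sys.executable) is the one whose tokenize module should run; the input line is arbitrary.

```python
import subprocess
import sys

# Run "python -m tokenize -e" with no filename so that it reads from stdin.
result = subprocess.run(
    [sys.executable, "-m", "tokenize", "-e"],
    input="x = 1\n",
    capture_output=True,
    text=True,
    check=True,
)

# One line per token; with -e the "=" operator is reported by its exact name.
print(result.stdout)
```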

Examples

Example of a script rewriter that transforms float literals into Decimal objects:

from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
from io import BytesIO

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print(+21.3e-5*-.1234/81.7)'
    >>> decistmt(s)
    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

    The format of the exponent is inherited from the platform C library.
    Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
    we're only showing 12 digits, and the 13th isn't close to 5, the
    rest of the output should be platform-independent.

    >>> exec(s)  #doctest: +ELLIPSIS
    -3.21716034272e-0...7

    Output from calculations with Decimal should be identical across all
    platforms.

    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7
    """
    result = []
    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result).decode('utf-8')

Example of tokenizing from the command line. The script:

def say_hello():
    print("Hello, World!")

say_hello()

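will be tokenized to the following output, where the first column is the range of the line/column coordinates where the token is found, the second column is the name of the token, and the final column is the value of the token (if any).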

$ python -m tokenize hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          OP             '('
1,14-1,15:          OP             ')'
1,15-1,16:          OP             ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           OP             '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           OP             '('
4,10-4,11:          OP             ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the -e option:

$ python -m tokenize -e hello.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,13:           NAME           'say_hello'
1,13-1,14:          LPAR           '('
1,14-1,15:          RPAR           ')'
1,15-1,16:          COLON          ':'
1,16-1,17:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
2,9-2,10:           LPAR           '('
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          RPAR           ')'
2,26-2,27:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            DEDENT         ''
4,0-4,9:            NAME           'say_hello'
4,9-4,10:           LPAR           '('
4,10-4,11:          RPAR           ')'
4,11-4,12:          NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode strings instead of bytes with generate_tokens():

import tokenize

with tokenize.open('hello.py') as f:
    tokens = tokenize.generate_tokens(f.readline)
    for token in tokens:
        print(token)

Or reading bytes directly with tokenize():

import tokenize

with open('hello.py', 'rb') as f:
    tokens = tokenize.tokenize(f.readline)
    for token in tokens:
        print(token)
