>>> from env_helper import info; info()
Page updated: 2022-12-22 21:11:43
Runtime environment:
    Linux distribution: Debian GNU/Linux 11 (bullseye)
    OS kernel: Linux-5.10.0-20-amd64-x86_64-with-glibc2.31
    Python version: 3.9.2

2.4. NLTK Part-of-Speech Tagging

One of the more powerful aspects of the NLTK module is the part-of-speech tagging it can do for you. This means labeling the words in a sentence as nouns, adjectives, verbs, and so on. Even more impressive, it can also tag by tense, and more. Here is a list of the tags, what they mean, and some examples (a quick demonstration follows the list):

POS tag list:

CC      coordinating conjunction
CD      cardinal digit
DT      determiner
EX      existential there (like: “there is” … think of it like “there exists”)
FW      foreign word
IN      preposition/subordinating conjunction
JJ      adjective ‘big’
JJR     adjective, comparative ‘bigger’
JJS     adjective, superlative ‘biggest’
LS      list marker 1)
MD      modal ‘could’, ‘will’
NN      noun, singular ‘desk’
NNS     noun, plural ‘desks’
NNP     proper noun, singular ‘Harrison’
NNPS    proper noun, plural ‘Americans’
PDT     predeterminer ‘all the kids’
POS     possessive ending parent’s
PRP     personal pronoun I, he, she
PRP$    possessive pronoun my, his, hers
RB      adverb very, silently
RBR     adverb, comparative better
RBS     adverb, superlative best
RP      particle give up
TO      to, as in ‘to go’ to the store
UH      interjection errrrrrrrm
VB      verb, base form take
VBD     verb, past tense took
VBG     verb, gerund/present participle taking
VBN     verb, past participle taken
VBP     verb, sing. present, non-3rd person take
VBZ     verb, 3rd person sing. present takes
WDT     wh-determiner which
WP      wh-pronoun who, what
WP$     possessive wh-pronoun whose
WRB     wh-adverb where, when
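
To get a quick feel for these tags before building the full script below, here is a minimal sketch. It assumes the punkt and averaged_perceptron_tagger NLTK data packages are already downloaded; the exact tags can vary between tagger versions:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]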

How do we use this? Along the way, we are going to cover a new sentence tokenizer called PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use. First, let's get the imports we're going to need:

>>> import nltk
>>> from nltk.corpus import state_union
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk import data
>>>
>>> # make a local copy of the NLTK data packages visible to the data loader
>>> data.path.append("/home/bk/nltk_data/packages")

Now, let's create the training and testing data, using the state_union corpus reader:

>>> train_text = state_union.raw("2005-GWBush.txt")
>>> sample_text = state_union.raw("2006-GWBush.txt")

One is the State of the Union address from 2005, and the other is the address from 2006, both by President George W. Bush.
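
As an aside, the state_union corpus reader can list every speech it ships, in case you want to train or test on a different year; a quick sketch (outputs omitted):

>>> state_union.fileids()                    # e.g. '1945-Truman.txt' through '2006-GWBush.txt'
>>> type(state_union.raw("2006-GWBush.txt"))  # .raw() returns the whole speech as one plain str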

Next, we can train the Punkt tokenizer like so:

>>> custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
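
Passing the raw text to the constructor performs the unsupervised training in one step. An equivalent two-step sketch, using the tokenizer's train method, would be:

>>> plain_tokenizer = PunktSentenceTokenizer()
>>> plain_tokenizer.train(train_text)   # learns sentence-boundary cues (abbreviations, etc.) from the raw text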

Then we can actually tokenize, using:

>>> tokenized = custom_sent_tokenizer.tokenize(sample_text)
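
The result is an ordinary Python list of sentence strings, so it can be inspected directly; for example (outputs omitted):

>>> len(tokenized)   # how many sentences Punkt found in the 2006 speech
>>> tokenized[0]     # the first sentence, as a plain string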

Now we can finish up this part-of-speech tagging script by creating a function that will run through and tag the parts of speech for each sentence, like so:

>>> def process_content():
...     try:
...         for i in tokenized[:5]:
...             words = nltk.word_tokenize(i)
...             tagged = nltk.pos_tag(words)
...             print(tagged)
...     except Exception as e:
...         print(str(e))
...
>>> process_content()
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]

The output should be a list of tuples, where the first element in each tuple is the word and the second element is the part-of-speech tag.
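
Because each tagged sentence is just a list of (word, tag) tuples, an ordinary list comprehension is enough to slice it up. For example, here is a small hypothetical helper that keeps only the noun-like tokens (the tags NN, NNS, NNP and NNPS all start with 'NN'), applied to the third sentence tagged above:

>>> def nouns_only(tagged):
...     # keep only the words whose tag marks a noun
...     return [word for word, tag in tagged if tag.startswith('NN')]
...
>>> nouns_only(nltk.pos_tag(nltk.word_tokenize(tokenized[2])))
['Tonight', 'hope', 'reunion', 'husband', 'life', 'Coretta', 'Scott', 'King']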

At this point, we can begin to derive meaning, but there is still some work to do. The next topic we will cover is chunking, where we group words into what may be meaningful chunks, based on their parts of speech.