>>> from env_helper import info; info()
Page updated: 2022-12-22 21:11:43
Runtime environment:
    Linux distribution: Debian GNU/Linux 11 (bullseye)
    OS kernel: Linux-5.10.0-20-amd64-x86_64-with-glibc2.31
    Python version: 3.9.2

2.4. NLTK Part-of-Speech Tagging

One of the more powerful aspects of the NLTK module is the part-of-speech tagging it can do for you. This means labeling the words in a sentence as nouns, adjectives, verbs, and so on. Even more impressive, it can also tag by tense, and more. Here is a list of the tags, what they mean, and some examples (a quick demonstration follows the list):

POS tag list:

CC      coordinating conjunction
CD      cardinal digit
DT      determiner
EX      existential there (like: “there is” … think of it like “there exists”)
FW      foreign word
IN      preposition/subordinating conjunction
JJ      adjective ‘big’
JJR     adjective, comparative ‘bigger’
JJS     adjective, superlative ‘biggest’
LS      list marker 1)
MD      modal ‘could’, ‘will’
NN      noun, singular ‘desk’
NNS     noun, plural ‘desks’
NNP     proper noun, singular ‘Harrison’
NNPS    proper noun, plural ‘Americans’
PDT     predeterminer ‘all the kids’
POS     possessive ending parent’s
PRP     personal pronoun I, he, she
PRP$    possessive pronoun my, his, hers
RB      adverb very, silently
RBR     adverb, comparative better
RBS     adverb, superlative best
RP      particle give up
TO      to, as in ‘to go’ to the store
UH      interjection errrrrrrrm
VB      verb, base form take
VBD     verb, past tense took
VBG     verb, gerund/present participle taking
VBN     verb, past participle taken
VBP     verb, sing. present, non-3rd person take
VBZ     verb, 3rd person sing. present takes
WDT     wh-determiner which
WP      wh-pronoun who, what
WP$     possessive wh-pronoun whose
WRB     wh-adverb where, when
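
To get a quick feel for these tags before building the full script below, here is a minimal sketch. It assumes the punkt and averaged_perceptron_tagger NLTK data packages are already downloaded; the exact tags can vary between tagger versions:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]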

How do we use this? Along the way, we are going to cover a new sentence tokenizer called PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use. First, let's get the imports we're going to need:

>>> import nltk
>>> from nltk.corpus import state_union
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk import data
>>>
>>> # make a local copy of the NLTK data packages visible to the data loader
>>> data.path.append("/home/bk/nltk_data/packages")

Now, let's create the training and testing data, using the state_union corpus reader:

>>> train_text = state_union.raw("2005-GWBush.txt")
>>> sample_text = state_union.raw("2006-GWBush.txt")

One is the State of the Union address from 2005, and the other is the address from 2006, both by President George W. Bush.
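
As an aside, the state_union corpus reader can list every speech it ships, in case you want to train or test on a different year; a quick sketch (outputs omitted):

>>> state_union.fileids()                    # e.g. '1945-Truman.txt' through '2006-GWBush.txt'
>>> type(state_union.raw("2006-GWBush.txt"))  # .raw() returns the whole speech as one plain str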

Next, we can train the Punkt tokenizer like so:

>>> custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
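
Passing the raw text to the constructor performs the unsupervised training in one step. An equivalent two-step sketch, using the tokenizer's train method, would be:

>>> plain_tokenizer = PunktSentenceTokenizer()
>>> plain_tokenizer.train(train_text)   # learns sentence-boundary cues (abbreviations, etc.) from the raw text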

Then we can actually tokenize, using:

>>> tokenized = custom_sent_tokenizer.tokenize(sample_text)
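
The result is an ordinary Python list of sentence strings, so it can be inspected directly; for example (outputs omitted):

>>> len(tokenized)   # how many sentences Punkt found in the 2006 speech
>>> tokenized[0]     # the first sentence, as a plain string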

Now we can finish up this part-of-speech tagging script by creating a function that will run through and tag the parts of speech for each sentence, like so:

>>> def process_content():
...     try:
...         for i in tokenized[:5]:
...             words = nltk.word_tokenize(i)
...             tagged = nltk.pos_tag(words)
...             print(tagged)
...     except Exception as e:
...         print(str(e))
...
>>> process_content()
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]

The output should be a list of tuples, where the first element in each tuple is the word and the second element is the part-of-speech tag.
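
Because each tagged sentence is just a list of (word, tag) tuples, an ordinary list comprehension is enough to slice it up. For example, here is a small hypothetical helper that keeps only the noun-like tokens (the tags NN, NNS, NNP and NNPS all start with 'NN'), applied to the third sentence tagged above:

>>> def nouns_only(tagged):
...     # keep only the words whose tag marks a noun
...     return [word for word, tag in tagged if tag.startswith('NN')]
...
>>> nouns_only(nltk.pos_tag(nltk.word_tokenize(tokenized[2])))
['Tonight', 'hope', 'reunion', 'husband', 'life', 'Coretta', 'Scott', 'King']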

At this point, we can begin to derive meaning, but there is still some work to do. The next topic we will cover is chunking, where we group words into what may be meaningful chunks, based on their parts of speech.