>>> from env_helper import info; info()
Page last updated: 2022-12-22 21:11:43
Runtime environment:
Linux distribution: Debian GNU/Linux 11 (bullseye)
OS kernel: Linux-5.10.0-20-amd64-x86_64-with-glibc2.31
Python version: 3.9.2
2.4. NLTK Part-of-Speech Tagging
One of the more powerful features of the NLTK module is that it can do part-of-speech (POS) tagging for you, that is, labeling the words in a sentence as nouns, adjectives, verbs, and so on. Even more impressively, it can also tag by tense, among other things. Here is a list of the tags, their meanings, and some examples:
POS tag list:

| Tag | Meaning and example |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal digit |
| DT | determiner |
| EX | existential there (like: "there is" … think of it like "there exists") |
| FW | foreign word |
| IN | preposition/subordinating conjunction |
| JJ | adjective: 'big' |
| JJR | adjective, comparative: 'bigger' |
| JJS | adjective, superlative: 'biggest' |
| LS | list marker: 1) |
| MD | modal: could, will |
| NN | noun, singular: 'desk' |
| NNS | noun, plural: 'desks' |
| NNP | proper noun, singular: 'Harrison' |
| NNPS | proper noun, plural: 'Americans' |
| PDT | predeterminer: 'all the kids' |
| POS | possessive ending: parent's |
| PRP | personal pronoun: I, he, she |
| PRP$ | possessive pronoun: my, his, hers |
| RB | adverb: very, silently |
| RBR | adverb, comparative: better |
| RBS | adverb, superlative: best |
| RP | particle: give up |
| TO | to: go 'to' the store |
| UH | interjection: errrrrrrrm |
| VB | verb, base form: take |
| VBD | verb, past tense: took |
| VBG | verb, gerund/present participle: taking |
| VBN | verb, past participle: taken |
| VBP | verb, sing. present, non-3rd person: take |
| VBZ | verb, 3rd person sing. present: takes |
| WDT | wh-determiner: which |
| WP | wh-pronoun: who, what |
| WP$ | possessive wh-pronoun: whose |
| WRB | wh-adverb: where, when |
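To make the table easier to use in code, you can keep an excerpt of it as a plain Python dictionary for decoding tags by hand. This is just a sketch; the `TAG_MEANINGS` dict and `describe` function below are illustrative helpers, not part of NLTK:

```python
# A small excerpt of the Penn Treebank tag list above,
# handy for decoding pos_tag output by hand.
TAG_MEANINGS = {
    "NN": "noun, singular",
    "NNS": "noun, plural",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "JJ": "adjective",
    "RB": "adverb",
    "PRP": "personal pronoun",
    "DT": "determiner",
}

def describe(tag):
    """Return a human-readable description for a POS tag."""
    return TAG_MEANINGS.get(tag, "unknown tag")

print(describe("VBD"))  # verb, past tense
print(describe("JJ"))   # adjective
```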
How do we use this?

While we work through this, we are going to introduce a new sentence tokenizer called PunktSentenceTokenizer. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use.

First, let's get the imports we're going to use:
>>> import nltk
>>> from nltk.corpus import state_union
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk import data
>>>
>>> data.path.append("/home/bk/nltk_data/packages")
Now let's create the training and testing data, using the state_union corpus reader:
>>> train_text = state_union.raw("2005-GWBush.txt")
>>> sample_text = state_union.raw("2006-GWBush.txt")
One is the State of the Union address from 2005, and the other is the 2006 address, both given by President George W. Bush.
Next, we can train the Punkt tokenizer like so:
>>> custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
Then we can actually tokenize, using:
>>> tokenized = custom_sent_tokenizer.tokenize(sample_text)
Now we can finish up this part-of-speech tagging script by creating a function that will run through and tag all of the parts of speech for each sentence, like so:
>>> def process_content():
...     try:
...         for i in tokenized[:5]:
...             words = nltk.word_tokenize(i)
...             tagged = nltk.pos_tag(words)
...             print(tagged)
...     except Exception as e:
...         print(str(e))
...
>>> process_content()
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
The output should be a list of tuples, where the first element in each tuple is the word and the second is the part-of-speech tag.
At this point, we can begin to derive meaning, but there is still some work to do. The next topic that we're going to cover is chunking (分块), which is where we group words, based on their parts of speech, into meaningful groups.
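Once you have these (word, tag) tuples, ordinary Python is enough to summarize them. For example, you can count how often each tag occurs with `collections.Counter`; the sketch below uses a small hand-written tagged sample in the same format that `nltk.pos_tag` returns, rather than the script's actual output:

```python
from collections import Counter

# A hand-written sample in the same (word, tag) format
# that nltk.pos_tag returns.
tagged = [('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'),
          ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'),
          ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'),
          ('glad', 'JJ'), ('reunion', 'NN')]

# Count how many times each POS tag appears.
tag_counts = Counter(tag for word, tag in tagged)
print(tag_counts.most_common(3))  # [('NN', 3), ('IN', 2), ('DT', 2)]
```

The same pattern works unchanged on the real `tagged` lists produced inside `process_content()`.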