2.2. NLTK and Stop Words

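The idea of natural language processing is to perform some form of analysis or processing such that a machine can understand, at least to some degree, what a text means, states, or implies.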

2.2.1. The Concept of Stop Words

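This is obviously a huge challenge, but there are steps that anyone can follow. The central point is that computers do not understand words directly. Perhaps surprisingly, neither do humans. In humans, memories are broken down into electrical signals in the brain, in the form of groups of neurons firing in patterns. Much about the brain is still unknown, but the further we break it down, the more basic the elements we find. As it turns out, computers store information in a broadly similar way. If we want to mimic how humans read and understand text, we need a method that comes as close to that as possible.

In general, computers represent everything with numbers, but in programming we often work directly with binary signals (True or False, which convert directly to 1 or 0, and which originate from an electrical signal being present (True, 1) or absent (False, 0)). So we need a way to convert words into numeric values or signal patterns. The process of converting data into something a computer can understand is called "preprocessing". One of the major forms of preprocessing is filtering out useless data. In natural language processing, useless words (data) are called stop words.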
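As a rough illustration of turning words into numbers, the sketch below assigns each distinct word an integer id. The helper names `build_vocabulary` and `encode` are hypothetical and are not part of NLTK; this is only one simple way such a mapping could look.

```python
def build_vocabulary(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its integer id."""
    return [vocab[tok] for tok in tokens]


# Naive whitespace tokenization, good enough for this illustration.
tokens = "the cat sat on the mat".split()
vocab = build_vocabulary(tokens)
encoded = encode(tokens, vocab)

print(vocab)
print(encoded)
```

Note that the repeated word "the" maps to the same id both times, which is what lets a program treat the two occurrences as the same word.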

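We can immediately recognize that some words carry more meaning than others. We can also see that some words are simply useless filler. In English, for example, we use them to pad out sentences so that they do not sound odd. One of the most common, unofficial examples of a useless word is "umm". People pad their speech with "umm" more often than with almost any other word. The word means nothing, unless we are looking for a speaker who may lack confidence, is confused, or does not have much to say. We all do it, and... uh... you can often hear me say "umm" or "uhh" in the videos. For most analyses, these words are useless.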

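We do not want these words taking up space in our database or consuming valuable processing time. We therefore call them "useless words" because they are of no use, and we want to do nothing with them. Another reading of the term "stop word" is more literal: a word that we stop on.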

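For example, if you find words that are typically used for sarcasm, you may want to stop processing right away. Sarcastic words or phrases will vary by lexicon and corpus. For now, we will treat stop words as words that carry no meaning, and we will remove them.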
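The idea of discarding filler words can be sketched without any library. The example below is a minimal sketch, assuming a hand-picked filler set and naive whitespace splitting; `FILLER_WORDS` and `remove_fillers` are hypothetical names chosen for illustration.

```python
# Hand-picked filler words; extend this set to suit your own data.
FILLER_WORDS = {"umm", "uhh"}


def remove_fillers(text, fillers=FILLER_WORDS):
    """Drop words that, ignoring case and surrounding punctuation, are fillers."""
    kept = []
    for word in text.split():
        # Strip surrounding punctuation and lowercase before comparing.
        if word.strip(".,!?").lower() not in fillers:
            kept.append(word)
    return " ".join(kept)


transcript = "Umm, so we load the data and uhh filter it."
cleaned = remove_fillers(transcript)
print(cleaned)
```

Splitting on whitespace is a simplification; a real tokenizer handles punctuation and contractions more carefully.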

2.2.2. Using Stop Words in NLTK

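You can implement this easily by storing a list of the words you consider stop words. NLTK gets you started with a set of words that it considers stop words, which you can access through the NLTK corpus: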

>>> from nltk import data
>>> from nltk.corpus import stopwords
>>>
>>> # Tell NLTK where the downloaded corpora live; adjust this path for your own machine.
>>> data.path.append("/git/nltk_data/packages")

Here is the list:

>>> set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 "shan't",
 'she',
 "she's",
 'should',
 "should've",
 'shouldn',
 "shouldn't",
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 "that'll",
 'the',
 'their',
 'theirs',
 'them',
 'themselves',
 'then',
 'there',
 'these',
 'they',
 'this',
 'those',
 'through',
 'to',
 'too',
 'under',
 'until',
 'up',
 've',
 'very',
 'was',
 'wasn',
 "wasn't",
 'we',
 'were',
 'weren',
 "weren't",
 'what',
 'when',
 'where',
 'which',
 'while',
 'who',
 'whom',
 'why',
 'will',
 'with',
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'y',
 'you',
 "you'd",
 "you'll",
 "you're",
 "you've",
 'your',
 'yours',
 'yourself',
 'yourselves'}

Here is how to use the stop_words set to remove stop words from a piece of text:

>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
>>>
>>> example_sent = "This is a sample sentence, showing off the stop words filtration."
>>>
>>> # Use a set so that membership tests are fast.
>>> stop_words = set(stopwords.words('english'))
>>>
>>> word_tokens = word_tokenize(example_sent)
>>>
>>> # Keep only the tokens that are not stop words (list comprehension).
>>> filtered_sentence = [w for w in word_tokens if w not in stop_words]
>>>
>>> # The same filter written as an explicit loop; both forms give the same result.
>>> filtered_sentence = []
>>> for w in word_tokens:
...     if w not in stop_words:
...         filtered_sentence.append(w)
...
>>> print(word_tokens)
>>> print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
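In the output above, 'This' survives the filter: the words in the stop list are all lowercase and the comparison is case-sensitive. The punctuation tokens survive as well. The sketch below shows one possible way to handle both, under two assumptions: the token list is copied from the output above instead of calling `word_tokenize`, and `stop_words` is a small hand-copied subset of the NLTK list, so that the example runs without the NLTK data files. In practice you would pass in `set(stopwords.words('english'))`.

```python
import string

# Tokens copied from the printed output above (normally produced by word_tokenize).
word_tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
               'off', 'the', 'stop', 'words', 'filtration', '.']

# Assumption: a small subset of the NLTK English stop list, hard-coded for illustration.
stop_words = {'this', 'is', 'a', 'off', 'the'}

filtered_sentence = [
    w for w in word_tokens
    if w.lower() not in stop_words      # compare in lowercase
    and w not in string.punctuation     # drop single punctuation tokens
]

print(filtered_sentence)
```

Whether to lowercase or drop punctuation depends on the analysis; some tasks need the original casing or punctuation preserved.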

Another form of data preprocessing is stemming, which is what we will discuss next.
