How to index documents¶

Creating an Index object¶

To create an index in a directory, use index.create_in:

import os, os.path
from whoosh import index

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

ix = index.create_in("indexdir", schema)

To open an existing index in a directory, use index.open_dir:

import whoosh.index as index

ix = index.open_dir("indexdir")

These are convenience functions for the following:

from whoosh.filedb.filestore import FileStorage
storage = FileStorage("indexdir")

# Create an index
ix = storage.create_index(schema)

# Open an existing index
ix = storage.open_index()

The schema you created the index with is pickled and stored with the index.

You can keep multiple indexes in the same directory using the indexname keyword argument:

# Using the convenience functions
ix = index.create_in("indexdir", schema=schema, indexname="usages")
ix = index.open_dir("indexdir", indexname="usages")

# Using the Storage object
ix = storage.create_index(schema, indexname="usages")
ix = storage.open_index(indexname="usages")

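Clearing the index¶

Calling index.create_in on a directory with an existing index will clear the current contents of the index.

To test whether a directory currently contains a valid index, use index.exists_in: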

exists = index.exists_in("indexdir")
usages_exists = index.exists_in("indexdir", indexname="usages")

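(Alternatively, you can simply delete the index's files from the directory. For example, if the directory contains only one index, use shutil.rmtree to remove the directory and then recreate it.)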

Indexing documents¶

Once you've created an Index object, you can add documents to the index with an IndexWriter object. The easiest way to get an IndexWriter is to call Index.writer():

ix = index.open_dir("index")
writer = ix.writer()

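Creating a writer locks the index for writing, so only one thread/process at a time can have a writer open.

Note

Because opening a writer locks the index for writing, in a multi-threaded or multi-process environment your code needs to be aware that opening a writer may raise an exception (whoosh.store.LockError) if a writer is already open. Whoosh includes a couple of example implementations (whoosh.writing.AsyncWriter and whoosh.writing.BufferedWriter) of ways to work around the write lock.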

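Note

While the writer is open and during the commit, the index is still available for reading. Existing readers are unaffected, and new readers can open the current index normally. Once the commit is finished, existing readers continue to see the previous version of the index (that is, they do not automatically see the newly committed changes). New readers will see the updated index.

The IndexWriter's add_document(**kwargs) method accepts keyword arguments that map field names to values: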

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")
writer.commit()

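You don't have to fill in a value for every field. Whoosh doesn't care if you leave out a field from a document.

Indexed fields must be passed a unicode value. Fields that are stored but not indexed (i.e. the STORED field type) can be passed any pickle-able object.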

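Whoosh will happily allow you to add documents with identical values, which can be useful or annoying depending on what you are using the library for: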

writer.add_document(path=u"/a", title=u"A", content=u"Hello there")
writer.add_document(path=u"/a", title=u"A", content=u"Deja vu!")

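This adds two documents to the index with identical path and title fields. See the "Updating documents" section below for the update_document method, which uses "unique" fields to replace old documents instead of appending.

Indexing and storing different values for the same field¶

If you have a field that is both indexed and stored, you can index a unicode value but store a different object if necessary (it usually isn't, but sometimes this is really useful) using a "special" keyword argument, _stored_<fieldname>. The normal value will be analyzed and indexed, but the "stored" value will show up in the results: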

writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")

Finishing adding documents¶

An IndexWriter object is kind of like a database transaction. You specify a bunch of changes to the index, and then "commit" them all at once.

Calling commit() on the IndexWriter saves the added documents to the index:

writer.commit()

Once your documents are in the index, you can search for them.

If you want to close the writer without committing the changes, call cancel() instead of commit():

writer.cancel()

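Keep in mind that while you have a writer open (including a writer you opened that is still in scope), no other thread or process can get a writer or modify the index. A writer also keeps several files open. So you should always remember to call either commit() or cancel() when you're done with a writer object.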

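Merging segments¶

A Whoosh filedb index is really a container for one or more "sub-indexes" called segments. When you add documents to an index, instead of integrating the new documents with the existing documents (which could be very expensive, since it involves re-sorting all the indexed terms on disk), Whoosh creates a new segment next to the existing segment. Then when you search the index, Whoosh searches both segments individually and merges the results, so the segments appear to be one unified index. (This smart design is copied from Lucene.)

So, having a few segments is more efficient than rewriting the entire index every time you add some documents. But searching multiple segments does slow down searching somewhat, and the more segments you have, the slower it gets. So Whoosh has an algorithm that runs when you call commit() and looks for small segments it can merge together to make fewer, larger segments.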
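To make the trade-off concrete, here is a toy model in plain Python. It is not Whoosh's actual merge algorithm; the size threshold and the function name are assumptions chosen for illustration, and each segment is represented only by its document count.

```python
def merge_small_segments(segment_sizes, threshold=10):
    """Toy merge policy: combine all segments smaller than `threshold` into one.

    `segment_sizes` is a list of document counts, one per segment.
    Returns a new list of segment sizes.
    """
    small = [size for size in segment_sizes if size < threshold]
    large = [size for size in segment_sizes if size >= threshold]
    if len(small) < 2:
        # Fewer than two small segments: nothing worth merging
        return list(segment_sizes)
    return large + [sum(small)]


# Five commits have left five segments behind
before = [500, 3, 4, 120, 2]
after = merge_small_segments(before)
```

The three tiny segments collapse into one, so a search has three segments to visit instead of five, while the two large segments are left untouched and do not have to be rewritten.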

To prevent Whoosh from merging segments during a commit, use the merge keyword argument:

writer.commit(merge=False)

To merge all segments together, optimizing the index into a single segment, use the optimize keyword argument:

writer.commit(optimize=True)

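Since optimizing rewrites all the information in the index, it can be slow on a large index. It's generally better to rely on Whoosh's merging algorithm than to optimize all the time.

(The Index object also has an optimize() method that lets you optimize the index, merging all the segments together. It simply creates a writer and calls commit(optimize=True) on it.)

For more control over segment merging, you can write your own merge policy function and pass it as the mergetype argument to the commit() method. See the implementation of the NO_MERGE, MERGE_SMALL, and OPTIMIZE functions in the whoosh.writing module.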

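Deleting documents¶

You can delete documents using the following methods on an IndexWriter object. You then need to call commit() on the writer to save the deletions to disk.

delete_document(docnum)

Low-level method to delete a document by its internal document number.

is_deleted(docnum)

Low-level method; returns True if the document with the given internal number is deleted.

delete_by_term(fieldname, termtext)

Deletes any documents where the given (indexed) field contains the given term. This is mostly useful for ID or KEYWORD fields.

delete_by_query(query)

Deletes any documents that match the given query.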

# Delete document by its path -- this field must be indexed
writer = ix.writer()
writer.delete_by_term('path', u'/a/b/c')
# Save the deletion to disk
writer.commit()

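In the filedb backend, "deleting" a document simply adds the document number to a list of deleted documents stored with the index. When you search the index, it knows not to return deleted documents in the results. However, the document's contents are still stored in the index, and certain statistics (such as term document frequencies) are not updated until you merge the segments containing deleted documents (see "Merging segments" above). (This is because removing the information immediately from the index would essentially involve rewriting the entire index on disk, which would be very inefficient.)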

Updating documents¶

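If you want to "replace" (re-index) a document, you can delete the old document using one of the delete_* methods on Index or IndexWriter, then use IndexWriter.add_document to add the new version. Or, you can use IndexWriter.update_document to do this in one step.

For update_document to work, you must have marked at least one of the fields in the schema as "unique". Whoosh will then use the contents of the "unique" field(s) to search for documents to delete: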

from whoosh import index
from whoosh.fields import Schema, ID, TEXT

schema = Schema(path=ID(unique=True), content=TEXT)

# The "index" directory must already exist
ix = index.create_in("index", schema)
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()

writer = ix.writer()
# Because "path" is marked as unique, calling update_document with path="/a"
# will delete any existing documents where the "path" field contains "/a".
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()

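The "unique" field(s) must be indexed.

If no existing document matches the unique fields of the document you're updating, update_document acts just like add_document.

"Unique" fields and update_document are simply convenient shortcuts for deleting and adding. Whoosh has no inherent concept of a unique identifier, and in no way enforces uniqueness when you use add_document.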

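Incremental indexing¶

When you're indexing a collection of documents, you'll often want two code paths: one to index all the documents from scratch, and one to update only the documents that have changed (leaving aside web applications, where you need to add/update documents in response to user actions).

Indexing everything from scratch is pretty easy. Here's a simple example: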

import os.path
from whoosh import index
from whoosh.fields import Schema, ID, TEXT

def clean_index(dirname):
  # Always create the index from scratch
  ix = index.create_in(dirname, schema=get_schema())
  writer = ix.writer()

  # Assume we have a function that gathers the filenames of the
  # documents to be indexed
  for path in my_docs():
    add_doc(writer, path)

  writer.commit()


def get_schema():
  return Schema(path=ID(unique=True, stored=True), content=TEXT)


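def add_doc(writer, path):
  # Decode the file contents, since indexed fields need unicode text
  # (UTF-8 is an assumption; use the encoding your documents are in)
  with open(path, "rb") as fileobj:
    content = fileobj.read().decode("utf-8")
  writer.add_document(path=path, content=content)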

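Now, for a small collection of documents, indexing from scratch every time might actually be fast enough. But for large collections, you'll want the script to re-index only the documents that have changed.

To start, we need to store each document's last-modified time so we can check whether the file has changed. In this example, we'll just use the mtime for simplicity: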

from whoosh.fields import STORED

def get_schema():
  return Schema(path=ID(unique=True, stored=True), time=STORED, content=TEXT)

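def add_doc(writer, path):
  # Decode the file contents, since indexed fields need unicode text
  # (UTF-8 is an assumption; use the encoding your documents are in)
  with open(path, "rb") as fileobj:
    content = fileobj.read().decode("utf-8")
  # Store the modification time so later runs can detect changes
  modtime = os.path.getmtime(path)
  writer.add_document(path=path, content=content, time=modtime)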

Now we can modify the script to allow either "clean" (from scratch) or incremental indexing:

def index_my_docs(dirname, clean=False):
  if clean:
    clean_index(dirname)
  else:
    incremental_index(dirname)


def incremental_index(dirname):
    ix = index.open_dir(dirname)

    # The set of all paths in the index
    indexed_paths = set()
    # The set of all paths we need to re-index
    to_index = set()

    with ix.searcher() as searcher:
      writer = ix.writer()

      # Loop over the stored fields in the index
      for fields in searcher.all_stored_fields():
        indexed_path = fields['path']
        indexed_paths.add(indexed_path)

        if not os.path.exists(indexed_path):
          # This file was deleted since it was indexed
          writer.delete_by_term('path', indexed_path)

        else:
          # Check if this file was changed since it
          # was indexed
          indexed_time = fields['time']
          mtime = os.path.getmtime(indexed_path)
          if mtime > indexed_time:
            # The file has changed, delete it and add it to the list of
            # files to reindex
            writer.delete_by_term('path', indexed_path)
            to_index.add(indexed_path)

      # Loop over the files in the filesystem
      # Assume we have a function that gathers the filenames of the
      # documents to be indexed
      for path in my_docs():
        if path in to_index or path not in indexed_paths:
          # This is either a file that's changed, or a new file
          # that wasn't indexed before. So index it!
          add_doc(writer, path)

      writer.commit()

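The incremental_index function:

Loops through all the paths that are currently indexed.
- If any of the files no longer exist, delete the corresponding document from the index.
- If the file still exists, but has been modified, add it to the list of paths to be re-indexed.
- If the file exists, whether it's been modified or not, add it to the list of all indexed paths.
Loops through all the paths of the files on disk.
- If a path is not in the set of all indexed paths, the file is new and we need to index it.
- If a path is in the set of paths to re-index, we need to index it.
- Otherwise, we can skip indexing the file.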
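The decision logic in the two loops above can be isolated from Whoosh and the filesystem, which makes it easy to test. The following is a sketch in plain Python; the function name and the dictionary-based inputs are assumptions made for the illustration.

```python
def plan_incremental_update(indexed, on_disk):
    """Decide what an incremental pass has to do.

    `indexed` maps each indexed path to the mtime stored in the index.
    `on_disk` maps each path currently on disk to its mtime.
    Returns (to_delete, to_index) as two sets of paths.
    """
    to_delete = set()
    to_index = set()

    # First loop: walk the paths that are already in the index
    for path, indexed_time in indexed.items():
        if path not in on_disk:
            to_delete.add(path)            # file is gone
        elif on_disk[path] > indexed_time:
            to_delete.add(path)            # stale copy must go ...
            to_index.add(path)             # ... and be re-indexed

    # Second loop: walk the paths on disk
    for path in on_disk:
        if path not in indexed:
            to_index.add(path)             # new file

    return to_delete, to_index


indexed = {"/a": 100.0, "/b": 100.0, "/c": 100.0}
on_disk = {"/b": 100.0, "/c": 250.0, "/d": 300.0}
to_delete, to_index = plan_incremental_update(indexed, on_disk)
```

In this run "/a" has disappeared, "/c" has changed, "/d" is new, and "/b" is unchanged and therefore skipped.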

Clearing the index¶

In some cases you may want to re-index from scratch. To clear the index without disrupting any existing readers:

from whoosh import writing

with myindex.writer() as mywriter:
    # You can optionally add documents to the writer here
    # e.g. mywriter.add_document(...)

    # Using mergetype=CLEAR clears all existing segments so the index will
    # only have any documents you've added to this writer
    mywriter.mergetype = writing.CLEAR

Or, if you don't use the writer as a context manager and call commit() directly, do it like this:

mywriter = myindex.writer()
# ...
mywriter.commit(mergetype=writing.CLEAR)

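Note

If you don't need to worry about existing readers, a more efficient method is to simply delete the contents of the index directory and start over.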
