Tips for speeding up batch indexing

Overview

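Indexing documents tends to fall into two general patterns: adding documents one at a time as they are created (as in a web application), and adding a set of documents all at once (batch indexing).

The following settings and alternate workflows can make batch indexing faster.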

Stemming analyzer cache

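By default, the stemming analyzer uses a least-recently-used (LRU) cache to limit the amount of memory it uses. This prevents the cache from growing very large when the analyzer is reused over a long period of time. However, the LRU cache can slow down indexing by almost 200% compared to a stemming analyzer with an "unbounded" cache.

When you are indexing in large batches with a one-shot instance of the analyzer, consider using an unbounded cache: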

w = myindex.writer()
# Get the analyzer object from a text field
stem_ana = w.schema["content"].format.analyzer
# Set the cachesize to -1 to indicate unbounded caching
stem_ana.cachesize = -1
# Reset the analyzer to pick up the changed attribute
stem_ana.clear()

# Use the writer to index documents...
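The trade-off between a bounded and an unbounded cache can be illustrated with the standard library alone. The sketch below is not Whoosh's internal cache: it uses `functools.lru_cache` as a stand-in, and `fake_stem` is a hypothetical toy stemmer invented for the illustration. It shows how a small LRU cache can miss on every lookup when the working set of words is larger than the cache, while an unbounded cache computes each stem only once.

```python
from functools import lru_cache


def fake_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Bounded cache holding only two entries (evicts the least recently used).
bounded_stem = lru_cache(maxsize=2)(fake_stem)
# Unbounded cache: never evicts, so memory grows with the vocabulary.
unbounded_stem = lru_cache(maxsize=None)(fake_stem)

# Three distinct words repeated in a cycle: one more than the bounded cache holds.
words = ["indexing", "writers", "segments"] * 2

for w in words:
    bounded_stem(w)
    unbounded_stem(w)

bounded_info = bounded_stem.cache_info()
unbounded_info = unbounded_stem.cache_info()

print("bounded:  ", bounded_info)
print("unbounded:", unbounded_info)
```

In this contrived access pattern the bounded cache never gets a hit, which is the kind of repeated work an unbounded cache avoids at the cost of holding every distinct word in memory.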

The `limitmb` parameter

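The limitmb parameter to whoosh.index.Index.writer() controls the maximum memory (in megabytes) the writer will use for the indexing pool. The higher the number, the faster indexing will be.

The default value of 128 is actually somewhat low, considering that many people have several gigabytes of RAM these days. Setting it higher can speed up indexing considerably: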

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(limitmb=256)

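Note

The actual memory used will be higher than this value because of interpreter overhead (up to twice as much). It is very useful as a tuning parameter, but not for exactly controlling the memory usage of Whoosh.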

The `procs` parameter

The procs parameter to whoosh.index.Index.writer() controls the number of processors the writer will use for indexing (via the multiprocessing module):

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(procs=4)
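One way to pick a value for procs is to derive it from the number of CPUs on the machine. The sketch below is only an illustration: `choose_procs` is a hypothetical helper, and reserving one core for the rest of the system is an assumption, not a Whoosh recommendation.

```python
import os


def choose_procs(cpu_count=None, reserve=1):
    """Return a process count that leaves `reserve` cores free (assumed policy)."""
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - reserve)


print(choose_procs(8))  # 8 cores available, one reserved
print(choose_procs(1))  # never drops below one process
```

The result could then be passed as `ix.writer(procs=choose_procs())`.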

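Note that when you use multiprocessing, the limitmb parameter controls the amount of memory used by each process, so the actual memory used will be limitmb * procs: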

# Each process will use a limit of 128, for a total of 512
writer = ix.writer(procs=4, limitmb=128)
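The arithmetic above can be turned around to derive a limitmb value from a total memory budget. This is a minimal sketch with hypothetical helper names; the `overhead_factor` of 2.0 reflects the "up to twice as much" interpreter overhead mentioned in the note on limitmb and is a worst-case assumption, not a measured figure.

```python
def estimated_peak_mb(limitmb, procs, overhead_factor=2.0):
    """Worst-case memory estimate: per-process limit times processes times overhead."""
    return limitmb * procs * overhead_factor


def per_process_limitmb(total_budget_mb, procs, overhead_factor=2.0):
    """Largest whole-number limitmb that keeps the worst case within the budget."""
    usable_mb = total_budget_mb / overhead_factor
    return max(1, int(usable_mb // procs))


# Four processes at 128 MB each, assuming the worst-case overhead.
print(estimated_peak_mb(128, 4))

# A 2048 MB budget split across four processes.
print(per_process_limitmb(2048, 4))
```

The computed value could then be passed as `ix.writer(procs=4, limitmb=per_process_limitmb(2048, 4))`.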

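The `multisegment` parameter

The procs parameter causes the default writer to use multiple processors for much of the indexing, but it still uses a single process to merge the pool of each sub-writer into a single segment.

You can also pass the multisegment=True keyword argument. Instead of merging the results of each sub-writer, it simply has each sub-writer write out its own new segment: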

from whoosh import index

ix = index.open_dir("indexdir")
writer = ix.writer(procs=4, multisegment=True)

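The drawback is that instead of creating a single new segment, this option creates a number of new segments at least equal to the number of processes you use.

For example, if you use procs=4, the writer will create four new segments. (If you merge old segments or call add_reader on the parent writer, the parent writer will also write a segment, meaning you will get five new segments.)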
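The segment count described above can be written as a small calculation. `min_new_segments` is a hypothetical helper for illustration only; it returns a lower bound, since the text says the number of new segments is at least equal to the number of processes.

```python
def min_new_segments(procs, parent_writes_segment=False):
    """Lower bound on segments created by one commit with multisegment=True."""
    if procs < 1:
        raise ValueError("procs must be at least 1")
    # One segment per sub-writer, plus one if the parent writer also writes.
    return procs + (1 if parent_writes_segment else 0)


print(min_new_segments(4))
print(min_new_segments(4, parent_writes_segment=True))
```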

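So, while multisegment=True is much faster than a normal writer, you should use it only for large batch indexing jobs (or perhaps only for indexing from scratch). It should not be the only method you use for indexing, because otherwise the number of segments will keep increasing forever!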
