Whoosh recipes

General

Get the stored fields for a document from the document number

stored_fields = searcher.stored_fields(docnum)
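In practice the document number often comes from a search hit; a minimal sketch, assuming searcher and myquery already exist:

# myquery is a hypothetical query object; any Query works here
results = searcher.search(myquery)
for hit in results:
    # hit.docnum is the document number backing this hit
    stored_fields = searcher.stored_fields(hit.docnum)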

Analysis

Eliminate words shorter or longer than N

Use a StopFilter with the minsize and maxsize keyword arguments. If you only want to filter on word size and not on common words, set stoplist to None:

sf = analysis.StopFilter(stoplist=None, minsize=2, maxsize=40)
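You can then chain the filter into an analyzer for a field; a rough sketch, where RegexTokenizer, LowercaseFilter, and the field name content are just illustrative choices:

from whoosh import analysis, fields

# Reuses the size-only StopFilter `sf` defined above
ana = analysis.RegexTokenizer() | analysis.LowercaseFilter() | sf
schema = fields.Schema(content=fields.TEXT(analyzer=ana))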

Allow optional case-sensitive searches

A quick and easy way to do this is to index both the original and lowercased versions of each word. If the user searches for an all-lowercase word, it acts as a case-insensitive search; if they search for a word containing any uppercase characters, it acts as a case-sensitive search:

class CaseSensitivizer(analysis.Filter):
    def __call__(self, tokens):
        for t in tokens:
            yield t
            if t.mode == "index":
                low = t.text.lower()
                if low != t.text:
                    t.text = low
                    yield t

ana = analysis.RegexTokenizer() | CaseSensitivizer()
[t.text for t in ana("The new SuperTurbo 5000", mode="index")]
# ["The", "the", "new", "SuperTurbo", "superturbo", "5000"]
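To use this for real searches, attach the analyzer to a field in your schema; a minimal sketch (the content field name is just an example):

from whoosh import fields

# At index time the filter above emits both the original and lowercased
# forms of each token; at query time tokens pass through unchanged, so a
# lowercase query term matches either form while a mixed-case term only
# matches the original casing.
schema = fields.Schema(content=fields.TEXT(analyzer=ana))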

Searching

Find every document

myquery = query.Every()
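A minimal usage sketch, assuming myindex is an already-open index:

from whoosh import query

with myindex.searcher() as searcher:
    # limit=None returns every matching document rather than the top N
    results = searcher.search(query.Every(), limit=None)
    for hit in results:
        print(hit.fields())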

iTunes-style search-as-you-type

Use whoosh.analysis.NgramWordAnalyzer as the analyzer for the field you want to search as the user types. You can save space in the index by turning off positions in the field with phrase=False, since phrase searching on N-gram fields usually doesn't make much sense:

# For example, to search the "title" field as the user types
analyzer = analysis.NgramWordAnalyzer()
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)

See the documentation for the NgramWordAnalyzer class for information on the available options.
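At search time you can then parse whatever the user has typed so far against the N-gram field; a rough sketch, assuming myindex was built with the schema above and the partial input "supert" is just an example:

from whoosh.qparser import QueryParser

parser = QueryParser("title", schema=myindex.schema)
with myindex.searcher() as searcher:
    # The partial word is broken into N-grams by the field's analyzer,
    # so it can match titles containing e.g. "SuperTurbo"
    q = parser.parse("supert")
    results = searcher.search(q, limit=10)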

Shortcuts

Look up documents by a field value

# Single document (unique field value)
stored_fields = searcher.document(id="bacon")

# Multiple documents
for stored_fields in searcher.documents(tag="cake"):
    ...

Sorting and scoring

See Sorting and faceting.

Score results based on the position of the matched term

The following scoring function uses the position of the first occurrence of the term in each document to calculate the score, so documents in which the given term occurs earlier will score higher:

from whoosh import scoring

def pos_score_fn(searcher, fieldname, text, matcher):
    poses = matcher.value_as("positions")
    return 1.0 / (poses[0] + 1)

pos_weighting = scoring.FunctionWeighting(pos_score_fn)
with myindex.searcher(weighting=pos_weighting) as s:
    ...
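Note that value_as("positions") only works if the field records term positions; a brief sketch of such a schema (TEXT fields record positions by default unless you pass phrase=False):

from whoosh import fields

# phrase=True (the default) makes the TEXT field store per-document
# positions, which is what matcher.value_as("positions") reads.
schema = fields.Schema(content=fields.TEXT(phrase=True))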

Results

How many hits were there?

The number of scored hits:

found = results.scored_length()

Depending on the arguments to the search, the exact total number of hits may be known:

if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")

Usually, however, the exact number of documents matching the query is not known, because the searcher can skip over blocks of documents it knows won't show up in the "top N" list. If you call len(results) on a query where the exact length is unknown, Whoosh will run an unscored version of the original query to get the exact number. This is faster than a scored search, but may still be noticeably slow on very large indexes or complex queries.

As an alternative, you can display the estimated total number of hits:

found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()

    print("Scored", found, "of between", low, "and", high, "documents")

Which terms matched in each hit?

# Use terms=True to record term matches for each hit
results = searcher.search(myquery, terms=True)

for hit in results:
    # Which terms matched in this hit?
    print("Matched:", hit.matched_terms())

    # Which terms from the query didn't match in this hit?
    print("Didn't match:", myquery.all_terms() - hit.matched_terms())

Global information

How many documents are in the index?

# Including documents that are deleted but not yet optimized away
numdocs = searcher.doc_count_all()

# Not including deleted documents
numdocs = searcher.doc_count()

What fields are in the index?

return myindex.schema.names()

Is term X in the index?

return ("content", "wobble") in searcher

How many times does term X occur in the index?

# Number of times content:wobble appears in all documents
freq = searcher.frequency("content", "wobble")

# Number of documents containing content:wobble
docfreq = searcher.doc_frequency("content", "wobble")

Is term X in document Y?

# Check if the "content" field of document 500 contains the term "wobble"

# Without term vectors, skipping through list...
postings = searcher.postings("content", "wobble")
postings.skip_to(500)
return postings.id() == 500

# ...or the slower but easier way
docset = set(searcher.postings("content", "wobble").all_ids())
return 500 in docset

# If field has term vectors, skipping through list...
vector = searcher.vector(500, "content")
vector.skip_to("wobble")
return vector.id() == "wobble"

# ...or the slower but easier way
wordset = set(searcher.vector(500, "content").all_ids())
return "wobble" in wordset