Whoosh recipes¶
Analysis¶
Eliminate words shorter/longer than N¶
Use a StopFilter with the minsize and maxsize keyword arguments. If you only want to filter based on size and not on common words, set stoplist to None::

    sf = analysis.StopFilter(stoplist=None, minsize=2, maxsize=40)
Allow optional case-sensitive searches¶
A quick and easy way to do this is to index both the original and lowercased versions of each word. If the user searches for an all-lowercase word, it acts as a case-insensitive search, but if they search for a word with any uppercase characters, it acts as a case-sensitive search:
class CaseSensitivizer(analysis.Filter):
    def __call__(self, tokens):
        for t in tokens:
            yield t
            if t.mode == "index":
                low = t.text.lower()
                if low != t.text:
                    t.text = low
                    yield t
ana = analysis.RegexTokenizer() | CaseSensitivizer()
[t.text for t in ana("The new SuperTurbo 5000", mode="index")]
# ["The", "the", "new", "SuperTurbo", "superturbo", "5000"]
Searching¶
Find every document¶
myquery = query.Every()
iTunes-style search-as-you-type¶
Use whoosh.analysis.NgramWordAnalyzer as the analyzer for the field you want to search as the user types. You can save space in the index by turning off positions in the field with phrase=False, since phrase searching on N-gram fields usually doesn't make much sense:
# For example, to search the "title" field as the user types
analyzer = analysis.NgramWordAnalyzer(minsize=2, maxsize=10)
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)
See the documentation for the NgramWordAnalyzer class for information on the available options.
Shortcuts¶
Look up documents by a field value¶
# Single document (unique field value)
stored_fields = searcher.document(id="bacon")
# Multiple documents
for stored_fields in searcher.documents(tag="cake"):
...
Sorting and scoring¶
See Sorting and faceting.
Score results based on the position of the matched term¶
The following scoring function uses the position of the first occurrence of a term in each document to calculate the score, so documents in which a given term occurs earlier will score higher:
from whoosh import scoring

def pos_score_fn(searcher, fieldname, text, matcher):
    poses = matcher.value_as("positions")
    return 1.0 / (poses[0] + 1)

pos_weighting = scoring.FunctionWeighting(pos_score_fn)
with myindex.searcher(weighting=pos_weighting) as s:
    ...
Results¶
How many hits were there?¶
The number of *scored* hits::

    found = results.scored_length()

Depending on the arguments to the search, the exact total number of hits may be known::

    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
However, the exact number of documents that match the query is usually not known, because the searcher can skip blocks of documents it knows won't show up in the "top N" list. If you call len(results) on a query where the exact length is unknown, Whoosh runs an unscored version of the original query to get the exact number. This is faster than a scored search, but may still be noticeably slow on very large indexes or complex queries.
As an alternative, you might display the number of *estimated* hits::

    found = results.scored_length()
    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
    else:
        low = results.estimated_min_length()
        high = results.estimated_length()
        print("Scored", found, "of between", low, "and", high, "documents")
Which terms matched in each hit?¶
# Use terms=True to record term matches for each hit
results = searcher.search(myquery, terms=True)
for hit in results:
# Which terms matched in this hit?
print("Matched:", hit.matched_terms())
# Which terms from the query didn't match in this hit?
print("Didn't match:", myquery.all_terms() - hit.matched_terms())
Global information¶
How many documents are in the index?¶
# Including documents that are deleted but not yet optimized away
numdocs = searcher.doc_count_all()
# Not including deleted documents
numdocs = searcher.doc_count()
What fields are in the index?¶
return myindex.schema.names()
Is term X in the index?¶
return ("content", "wobble") in searcher
How many times does term X occur in the index?¶
# Number of times content:wobble appears in all documents
freq = searcher.frequency("content", "wobble")
# Number of documents containing content:wobble
docfreq = searcher.doc_frequency("content", "wobble")
Is term X in document Y?¶
# Check if the "content" field of document 500 contains the term "wobble"
# Without term vectors, skipping through list...
postings = searcher.postings("content", "wobble")
postings.skip_to(500)
return postings.id() == 500
# ...or the slower but easier way
docset = set(searcher.postings("content", "wobble").all_ids())
return 500 in docset
# If field has term vectors, skipping through list...
vector = searcher.vector(500, "content")
vector.skip_to("wobble")
return vector.id() == "wobble"
# ...or the slower but easier way
wordset = set(searcher.vector(500, "content").all_ids())
return "wobble" in wordset