Software Guide, 2024, Vol. 23, Issue 6: 92-97, 6. DOI: 10.11907/rjdk.241015
基于前缀剪枝的大规模向量空间相似检索框架
A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning
Abstract
To address weighted similarity queries over large-scale text collections, an efficient retrieval framework supporting prefix pruning is proposed. First, we define similarity and its weighted prefix under the vector space model and theoretically prove the correctness of weighted prefix pruning. Then, for large-scale text queries, we propose a new inverted index structure that uses the index leaf nodes to maintain the prefix weights of the records, and we construct efficient similarity retrieval algorithms based on this index. Finally, we verify that the method can effectively support large-scale weighted similarity retrieval; the results show that its query efficiency is more than 5 times higher than that of Lucene's subsumption verification strategy.
Key words
prefix-based pruning; TF/IDF; vector space model; inverted index; information retrieval; database
Classification
Information Technology and Security Science
Cite this article
LIU Jianbo, DENG Lingfeng, LI Wenhai, TIAN Ye. A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning [J]. Software Guide, 2024, 23(6): 92-97, 6.
Funding
Wuhan Key Research and Development Program (2023010402040006)
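Illustrative sketch
The abstract does not give the paper's definitions or index layout, so the following is only a minimal sketch of the general idea of weighted prefix pruning, not the authors' method. It assumes cosine similarity over unit-normalized sparse weight vectors (for example TF/IDF weights), a lexicographic global token order, and a similarity threshold fixed at index time; all function names are hypothetical. The prefix of a record is the shortest leading run of tokens after which the remaining suffix has L2 norm below the threshold. By the Cauchy-Schwarz inequality, a query that shares no token with that prefix cannot reach the threshold, so only prefix tokens need to be posted to the inverted index.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse {token: weight} vector to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}


def weighted_prefix(vec, threshold):
    """Shortest prefix (lexicographic token order, an assumption of this
    sketch) whose remaining suffix has L2 norm strictly below threshold."""
    tokens = sorted(vec)
    # suffix_sq[i] holds the sum of squared weights of tokens[i:].
    suffix_sq = [0.0] * (len(tokens) + 1)
    for i in range(len(tokens) - 1, -1, -1):
        suffix_sq[i] = suffix_sq[i + 1] + vec[tokens[i]] ** 2
    for i in range(len(tokens) + 1):
        if suffix_sq[i] < threshold ** 2:
            return tokens[:i]
    return tokens  # only reached when threshold <= 0


def build_index(records, threshold):
    """Post each record only under the tokens of its weighted prefix."""
    vectors = {rid: normalize(v) for rid, v in records.items()}
    index = defaultdict(set)
    for rid, vec in vectors.items():
        for token in weighted_prefix(vec, threshold):
            index[token].add(rid)
    return index, vectors


def search(query, index, vectors, threshold):
    """Collect candidates from the prefix index, then verify each one
    with the exact cosine similarity."""
    q = normalize(query)
    candidates = set()
    for token in q:
        candidates |= index.get(token, set())
    results = []
    for rid in candidates:
        r = vectors[rid]
        sim = sum(w * r[t] for t, w in q.items() if t in r)
        if sim >= threshold:
            results.append((rid, sim))
    return sorted(results)
```

In this sketch the query probes the index with all of its tokens, which keeps the correctness argument simple; records pruned at the candidate stage are never scored. Changing the threshold requires rebuilding the index, a limitation of the sketch rather than a claim about the paper's structure.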