大数据 ›› 2021, Vol. 7 ›› Issue (6): 67-77.doi: 10.11959/j.issn.2096-0271.2021061

• 研究 • 上一篇    下一篇

基于特征选择的局部敏感哈希位选择算法

周文桦, 刘华文, 李恩慧   

  1. 浙江师范大学数学与计算机科学学院,浙江 金华 321001
  • 出版日期:2021-11-15 发布日期:2021-11-01
  • 作者简介:周文桦(1997- ),男,浙江师范大学数学与计算机科学学院硕士生,主要研究方向为信息检索、哈希学习
    刘华文(1977- ),男,博士,浙江师范大学数学与计算机科学学院教授,主要研究方向为数据挖掘
    李恩慧(1996- ),女,浙江师范大学数学与计算机科学学院硕士生,主要研究方向为聚类、异常点分析
  • 基金资助:
    国家自然科学基金资助项目(61976195)

Algorithm of locality sensitive hashing bit selection based on feature selection

Wenhua ZHOU, Huawen LIU, Enhui LI   

  1. College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua 321001, China
  • Online:2021-11-15 Published:2021-11-01
  • Supported by:
    The National Natural Science Foundation of China(61976195)

摘要:

作为主流的信息检索方法,局部敏感哈希往往需要生成较长的哈希码才能达到检索要求。然而,长哈希码需要消耗巨大的存储空间且携带大量的冗余哈希位。为了解决此问题,采用特征工程中10种简单高效的选择算法从长局部敏感哈希码中选择信息量丰富的哈希位,去除冗余、无效的哈希位。这10种选择算法使用不同的方式来刻画每一个哈希位的性能或两个哈希位之间的相关性,如方差、汉明距离等。通过去除长哈希码中性能较差或具有高相关性的哈希位进行哈希位的选择。将选择后的哈希码与原哈希码的性能进行比较。在4个常用数据集上的实验结果表明,去除冗余哈希位后的哈希码与原哈希码的性能几乎相同,且其哈希位的去除比率能达到30%~70%。

关键词: 近似近邻搜索, 哈希学习, 哈希位选择, 特征选择, 降维

Abstract:

Locality sensitive hashing is one of the most popular information retrieval methods, which needs to generate long hashing bits to meet the retrieval requirement.However, a long hashing bits requires huge storage space, and contains plenty of redundant hashing bits.In order to solve this problem, ten simple and efficient selection algorithms in feature engineering were adopted to extract the hashing bits which carry the largest amount of information from the long hashing bits which were generated by locality sensitive hashing, and the redundant and useless hash bits were removed.Those ten algorithms tried to capture the performance of each hashing bit or the correlation among bits, such as variance and hamming distance.During selection process, the useless or high-correlated hashing bits were removed.Then the selected hashing bits were compared with the original long hashing bits.The experimental results on four common datasets show that the selected hashing bits works as well as the original hashing bits, and their reduction ratio can reach from 30% to 70%.

Key words: approximate nearest neighbor search, hashing learning, hashing bit selection, feature selection, dimensionality reduction

中图分类号: 

No Suggested Reading articles found!