电信科学 (Telecommunication Science), 2017, Vol. 33, Issue (6): 73-85. doi: 10.11959/j.issn.1000-0801.2017100

• Research and Development •

  • About the authors: PENG Cong (1990- ), male, is a master's student at Ningbo University. His research interests include big data and data mining. | QIAN Jiangbo (1974- ), male, Ph.D., is a professor at Ningbo University. His research interests include data processing and mining, logic circuit design, and multidimensional indexing and query optimization. | CHEN Huahui (1964- ), male, Ph.D., is a professor at Ningbo University. His research interests include data processing and mining, and cloud computing. | DONG Yihong (1969- ), male, Ph.D., is a professor at Ningbo University. His research interests include big data, data mining, and artificial intelligence.

Nearest neighbor search algorithm for high dimensional data based on weighted self-taught hashing

Cong PENG,Jiangbo QIAN(),Huahui CHEN,Yihong DONG   

  1. College of Information Science and Engineering, Ningbo University, Ningbo 315211, China
  • Revised: 2017-03-31; Online: 2017-06-01; Published: 2017-06-27
  • Supported by:
    The National Natural Science Foundation of China (61472194, 61572266); The Zhejiang Provincial Natural Science Foundation of China (LY16F020003)


Abstract:

Because of its efficiency in query and storage, learning to hash is increasingly applied to the nearest neighbor search problem. A learning-to-hash method converts high-dimensional data into binary codes so that similarities between objects in the original high-dimensional space are preserved as small Hamming distances between their codes. In practical applications, however, each query returns many data points that lie at the same Hamming distance from the query point yet have different codes, and how to reorder these candidates is a difficult problem. An algorithm based on weighted self-taught hashing was proposed to address it. Experimental results show that the proposed algorithm can efficiently reorder different binary codes that share the same Hamming distance: after weighted reordering, the F1-score of a query is about twice that of the unweighted baseline and better than that of homologous algorithms, while the time cost is an order of magnitude lower than reranking by computing the original distances directly.
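The tie-breaking idea behind the paper — weighting hash bits so that codes at the same Hamming distance from the query can still be ranked — can be sketched in a few lines of Python. The per-bit weights below are hypothetical placeholders for illustration only; the paper derives its weights from the self-taught hashing model itself.

```python
def weighted_hamming(query, code, weights):
    """Sum the weights of the bit positions where two codes differ."""
    return sum(w for q, c, w in zip(query, code, weights) if q != c)

query = [1, 0, 1, 1]
candidates = [[1, 0, 0, 1],   # differs from the query at bit 2
              [1, 1, 1, 1],   # differs at bit 1
              [0, 0, 1, 1]]   # differs at bit 0

# Hypothetical per-bit weights, e.g. reflecting each hash bit's reliability.
weights = [0.9, 0.4, 0.7, 0.5]

# Plain Hamming distance cannot separate the candidates: all are at distance 1.
plain = [sum(q != c for q, c in zip(query, cand)) for cand in candidates]

# Weighted distances break the tie and induce a full ranking.
ranked = sorted(range(len(candidates)),
                key=lambda i: weighted_hamming(query, candidates[i], weights))
# plain == [1, 1, 1]; ranked == [1, 0, 2]
```

Since the weighted distance only refines the ordering among candidates that are tied under plain Hamming distance, the reranking stays cheap compared with recomputing original (e.g. Euclidean) distances for every candidate.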

Key words: nearest neighbor search, learning hash, weighted self-taught, high-dimensional data

