电信科学 (Telecommunication Science), 2017, Vol. 33, Issue (6): 73-85. doi: 10.11959/j.issn.1000-0801.2017100

• Research and Development •

  • About the authors: PENG Cong (1990- ), male, is a master's student at Ningbo University. His research interests include big data and data mining. | QIAN Jiangbo (1974- ), male, Ph.D., is a professor at Ningbo University. His research interests include data processing and mining, logic circuit design, and multidimensional indexing and query optimization. | CHEN Huahui (1964- ), male, Ph.D., is a professor at Ningbo University. His research interests include data processing and mining, and cloud computing. | DONG Yihong (1969- ), male, Ph.D., is a professor at Ningbo University. His research interests include big data, data mining, and artificial intelligence.

Nearest neighbor search algorithm for high dimensional data based on weighted self-taught hashing

Cong PENG,Jiangbo QIAN(),Huahui CHEN,Yihong DONG   

  1. College of Information Science and Engineering, Ningbo University, Ningbo 315211, China
  • Revised: 2017-03-31; Online: 2017-06-01; Published: 2017-06-27
  • Supported by:
    The National Natural Science Foundation of China (61472194, 61572266); The Zhejiang Provincial Natural Science Foundation of China (LY16F020003)


Abstract:

Because of its efficiency in query and storage, learning to hash is increasingly applied to the nearest neighbor search problem. A learning-to-hash method converts high-dimensional data into binary codes so that similarities between objects in the original high-dimensional space are preserved as small Hamming distances between their codes. In practical applications, however, each query returns many data points that lie at the same Hamming distance from the query point yet have different codes, and how to reorder these candidates is a difficult problem. An algorithm based on weighted self-taught hashing was proposed to address it. Experimental results show that the proposed algorithm can efficiently reorder different binary codes that share the same Hamming distance: after weighted reordering, the F1-score of a query is about twice that of the unweighted baseline and better than that of homologous algorithms, while the time cost is an order of magnitude lower than reranking by computing the original distances directly.
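The tie-breaking idea behind the paper — weighting hash bits so that codes at the same Hamming distance from the query can still be ranked — can be sketched in a few lines of Python. The per-bit weights below are hypothetical placeholders for illustration only; the paper derives its weights from the self-taught hashing model itself.

```python
def weighted_hamming(query, code, weights):
    """Sum the weights of the bit positions where two codes differ."""
    return sum(w for q, c, w in zip(query, code, weights) if q != c)

query = [1, 0, 1, 1]
candidates = [[1, 0, 0, 1],   # differs from the query at bit 2
              [1, 1, 1, 1],   # differs at bit 1
              [0, 0, 1, 1]]   # differs at bit 0

# Hypothetical per-bit weights, e.g. reflecting each hash bit's reliability.
weights = [0.9, 0.4, 0.7, 0.5]

# Plain Hamming distance cannot separate the candidates: all are at distance 1.
plain = [sum(q != c for q, c in zip(query, cand)) for cand in candidates]

# Weighted distances break the tie and induce a full ranking.
ranked = sorted(range(len(candidates)),
                key=lambda i: weighted_hamming(query, candidates[i], weights))
# plain == [1, 1, 1]; ranked == [1, 0, 2]
```

Since the weighted distance only refines the ordering among candidates that are tied under plain Hamming distance, the reranking stays cheap compared with recomputing original (e.g. Euclidean) distances for every candidate.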

Key words: nearest neighbor search, learning hash, weighted self-taught, high-dimensional data

