Journal on Communications ›› 2015, Vol. 36 ›› Issue (8): 1-7. doi: 10.11959/j.issn.1000-436x.2015226

• Academic paper •

Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata

Wen-bin YAO1,2, Peng-di YE3, Xiao-yong LI4, Jing-kun CHANG1,2

  1 Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  3 The Locomotive and Car Research Institute, China Academy of Railway Sciences, Beijing 100081, China
  4 Key Laboratory of Trustworthy Distributed Computing and Service of Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2015-08-25 Published: 2015-08-25
  • Supported by:
    The National Natural Science Foundation of China; The National High Technology Research and Development Program of China (863 Program); Fundamental Research Funds for the Central Universities

Abstract:

Building an effective in-memory deduplication index reduces the number of disk accesses and speeds up chunk fingerprint lookup, but doing so is a major challenge for deduplication algorithms in massive-data environments. Because deduplication data sets contain many highly similar samples, a deduplication algorithm based on the condensed nearest neighbor rule, called Dedup2, is proposed. Dedup2 uses a clustering algorithm to divide the original deduplication metadata into several categories. Guided by these categories, it applies the condensed nearest neighbor rule to remove the most similar entries from the deduplication metadata, yielding a subset of that metadata. New data objects are then deduplicated against this subset according to the principle of data similarity. Experimental results show that Dedup2 reduces the size of the deduplication data set by more than 50% while maintaining a similar deduplication ratio.
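
The condensing step can be illustrated with a minimal sketch of the classic condensed nearest neighbor rule (Hart's formulation). It is not the Dedup2 implementation, whose details the abstract does not give. The sketch assumes that each metadata entry has already been mapped to a numeric feature vector and assigned a category by a prior clustering step; the names `condense`, `samples`, and `labels`, and the toy two-dimensional data, are hypothetical.

```python
from math import dist


def condense(samples, labels):
    """Condensed nearest neighbor rule (illustrative sketch).

    Returns indices of a subset such that every sample is assigned its
    own label by a 1-nearest-neighbor lookup against that subset.
    """
    if not samples:
        return []
    store = [0]  # seed the store with the first sample
    changed = True
    while changed:  # repeat passes until no sample is added
        changed = False
        for i, x in enumerate(samples):
            if i in store:
                continue
            # Find the stored sample closest to x.
            nearest = min(store, key=lambda j: dist(samples[j], x))
            if labels[nearest] != labels[i]:
                store.append(i)  # misclassified, so it must be kept
                changed = True
    return store


# Hypothetical toy data: two tight groups standing in for two categories.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
kept = condense(samples, labels)
```

On this toy input, one representative per category is enough to classify every sample correctly, so the near-duplicate entries in each group are dropped from the retained subset.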

Key words: deduplication, deduplication metadata, condensed nearest neighbor rule
