通信学报 ›› 2016, Vol. 37 ›› Issue (12): 103-114.doi: 10.11959/j.issn.1000-436x.2016277

• 学术论文 • 上一篇    下一篇

FilterFA:一种基于字符集规约的模式串匹配算法

张萍1,2,3(),何慧敏4,张春燕1,3,曹聪1,3,刘燕兵1,3,谭建龙1,3   

  1. 1 中国科学院信息工程研究所,北京 100093
    2 中国科学院大学,北京 100049
    3 信息内容安全技术国家工程实验室,北京 100093
    4 中国移动(深圳)有限公司,深圳 518031
  • 出版日期:2016-12-25 发布日期:2017-05-15
  • 基金资助:
    中国科学院战略性科技先导专项基金资助项目;新疆自治区科技专项基金资助项目

FilterFA: a multiple string matching algorithm based on specification of character set

Ping ZHANG1,2,3(),Hui-min HE4,Chun-yan ZHANG1,3,Cong CAO1,3,Yan-bing LIU1,3,Jian-long TAN1,3   

  1. 1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
    2 University of Chinese Academy of Sciences, Beijing 100049, China
    3 National Engineering Laboratory for Information Security Technologies, Beijing 100093, China
    4 China Mobile (Shenzhen) Co., Ltd., Shenzhen 518031, China
  • Online:2016-12-25 Published:2017-05-15
  • Supported by:
    The Strategic Priority Research Program of the Chinese Academy of Sciences;Xinjiang Uygur Autonomous Region Science and Technology Project

摘要:

多模式串匹配技术是入侵检测系统的核心技术之一,Aho-Corasick算法广泛应用于其中。针对AC自动机内存开销巨大影响算法性能的问题,提出一种基于字符集规约的改进算法——FilterFA。利用字符集映射函数将原字符集压缩为多个像字符集,针对像字符集构造新的自动机FilterFA,将空间复杂度降至O(|P||Σ′|)。在随机数据集和真实数据集ClamAV上的测试结果表明,当像字符集大小为8,且保证误识别率小于2%时,FilterFA算法消耗的存储空间仅为AC算法的3%左右。

关键词: 入侵检测, 多模式串匹配, 字符集规约, 字符集映射

Abstract:

Multiple string matching is one of the core techniques of intrusion detection system, where Aho-Corasick al-gorithm is widely used. To solve the problem that huge storage overhead of AC would influence performance deeply, an improved algorithm ——FilterFA, based on specification of character set was proposed. This algorithm compressed large character by the character set mapping function, and constructed a new automata based on the mapped character set,then space complexity decreased to O(|P||Σ′|). Experiments on synthetic datasets and real-world datasets (such as ClamAV) show that the storage overhead of FilterFA is only about 3% of that of AC, while the size of the character set is 8, and the false recognition rate is less than 2%.

Key words: intrusion detection, multiple string matching, specification of character set, character set mapping

No Suggested Reading articles found!