Chinese Journal of Network and Information Security ›› 2024, Vol. 10 ›› Issue (1): 112-122. doi: 10.11959/j.issn.2096-109x.2024008

• Academic Paper •

Research on multi-granularity password analysis based on LLM

Meng HONG, Weidong QIU, Yangde WANG

  1. School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Revised: 2024-01-01 Online: 2024-02-01 Published: 2024-02-01
  • About the authors: Meng HONG (1995− ), female, born in Nantong, Jiangsu, is a master's student at Shanghai Jiao Tong University. Her main research interest is cryptanalysis.
    Weidong QIU (1973− ), male, born in Xiushui, Jiangxi, is a professor and doctoral supervisor at Shanghai Jiao Tong University. His main research interests are computer forensics, cryptanalysis, artificial intelligence security, and big data privacy protection.
    Yangde WANG (1987− ), male, born in Shenyang, Liaoning, is a doctoral student at Shanghai Jiao Tong University. His main research interests are computer forensics and cryptanalysis.
  • Supported by:
    The National Natural Science Foundation of China (61972249); The National Key R&D Program of China (2023YFB3106501)

Abstract:

Password-based authentication is a common identity authentication mechanism. However, large-scale password leaks occur from time to time, showing that passwords remain exposed to risks such as guessing and theft. Because passwords can be regarded as a special form of natural language, research applying natural language processing techniques to password analysis has grown in recent years. Few studies, however, have examined how the segmentation granularity of password text affects password analysis with large language models (LLMs). To address this, a multi-granularity password analysis framework based on an LLM is proposed. It follows the pre-training paradigm and autonomously learns prior knowledge of the password distribution from large unlabelled datasets. The framework comprises three modules: a synchronization network, a backbone network, and a tail network. The synchronization network module implements password segmentation at three granularities (char-level, template-level, and chunk-level) and extracts password features such as character distribution, structure, and chunk composition. The backbone network module builds a generic password model that learns the rules governing password composition. The tail network module generates candidate passwords for guessing analysis against target databases. Extensive experiments were conducted on eight password databases, including Tianya and Twitter, and the framework's password analysis performance under each segmentation granularity was analyzed and summarized across language environments. The results indicate that in Chinese user scenarios, the frameworks based on char-level and chunk-level segmentation perform nearly identically, and both significantly outperform the framework based on template-level segmentation. In English user scenarios, the framework based on chunk-level segmentation performs best.

Key words: large language model, password analysis, natural language processing, word segmentation
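
The three segmentation granularities named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the framework segments passwords, so everything below is an assumption made for illustration: the template-level function collapses runs of letters, digits, and symbols into class-plus-length tokens (the labels `L`, `D`, `S` are hypothetical), and the chunk-level function uses a greedy longest-match against a small hand-written vocabulary rather than a learned one. The function names are likewise hypothetical and are not taken from the paper.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to an assumed class label: letter, digit, or symbol."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "S"


def char_level(password: str) -> list[str]:
    """Char-level: every character is its own token."""
    return list(password)


def template_level(password: str) -> list[str]:
    """Template-level (assumed form): each run of same-class characters
    becomes one token made of the class label and the run length."""
    return [
        f"{cls}{len(list(run))}"
        for cls, run in groupby(password, key=char_class)
    ]


def chunk_level(password: str, vocab: set[str]) -> list[str]:
    """Chunk-level (assumed form): greedy longest-match against a chunk
    vocabulary, falling back to single characters when nothing matches."""
    max_len = max(map(len, vocab), default=1)
    tokens = []
    i = 0
    while i < len(password):
        # Try the longest candidate first; sizes below 2 are handled by the fallback.
        for size in range(min(max_len, len(password) - i), 1, -1):
            piece = password[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append(password[i])
            i += 1
    return tokens


if __name__ == "__main__":
    pw = "password123!"
    toy_vocab = {"password", "pass", "123"}  # hypothetical chunk vocabulary
    print(char_level(pw))
    print(template_level(pw))
    print(chunk_level(pw, toy_vocab))
```

In this sketch the same password yields twelve tokens at char-level, three structural tokens at template-level, and three content-bearing tokens at chunk-level, which shows why the granularity choice changes what a downstream language model can learn about password composition.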
