Research on multi-granularity password analysis based on LLM

doi:10.11959/j.issn.2096-109x.2024008

Abstract

Password-based authentication is widely used as the primary authentication mechanism. However, occasional large-scale password leaks have highlighted how vulnerable passwords are to guessing and theft. In recent years, research that treats passwords as a special form of natural language and analyzes them with natural language processing techniques has progressed. Nevertheless, few studies have investigated how the granularity of password text segmentation affects password analysis with large language models. This paper proposes a multi-granularity password-analysis framework based on a large language model. The framework follows the pre-training paradigm and autonomously learns prior knowledge of the password distribution from large unlabelled datasets. It comprises three modules: the synchronization network, the backbone network, and the tail network. The synchronization network implements char-level, template-level, and chunk-level password segmentation, extracting knowledge of character distribution, structure, chunk composition, and other password features. The backbone network builds a generic password model that learns the rules governing password composition. The tail network generates candidate passwords for guessing and analyzing target databases. Experiments were conducted on eight password databases, including Tianya and Twitter, to analyze and summarize the effectiveness of the framework under different language environments and segmentation granularities. The results indicate that, in Chinese user scenarios, the frameworks based on char-level and chunk-level segmentation perform comparably, and both significantly outperform the framework based on template-level segmentation. In English user scenarios, the framework based on chunk-level segmentation achieves the best password-analysis performance.

Key words: large language model, password analysis, natural language processing, word segmentation

CLC Number: TP309

Meng HONG, Weidong QIU, Yangde WANG. Research on multi-granularity password analysis based on LLM[J]. Chinese Journal of Network and Information Security, 2024, 10(1): 112-122.

No.	Name	Leak date	Number of passwords	Purpose
1	Gmail	2014.09	4 663 677	Training set
2	Twitter	2016.06	67 095 263	Test set
3	Rockyou	2009.12	28 705 927	Test set
4	Myheritage	2018.08	84 825 745	Test set
5	CSDN	2011.12	6 387 785	Training set
6	Tianya	2011.12	30 274 001	Test set
7	178	2011.12	9 072 688	Test set
8	17173	2011.12	17 942 621	Test set

Segmentation method	Segmentation level	Granularity	Extracted features
CLT	char level	Fine	Character distribution
TLT	template level	Coarse	Structure
CKLT	chunk level	Medium	Chunk composition
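
The difference between the two simplest granularities can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the class symbols (A for letters, D for digits, S for everything else) are an assumption, with S in particular chosen only for illustration. Chunk-level segmentation is left out because it depends on a learned chunk vocabulary.

```python
from itertools import groupby
from typing import List


def char_class(ch: str) -> str:
    """Map a character to a class symbol: A (letter), D (digit), S (other)."""
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "D"
    return "S"  # assumed symbol for special characters


def char_level(password: str) -> List[str]:
    """Char-level segmentation: every character is its own token."""
    return list(password)


def template_level(password: str) -> List[str]:
    """Template-level segmentation: one token per run of same-class characters."""
    return [
        f"{cls}{len(list(run))}"
        for cls, run in groupby(password, key=char_class)
    ]


print(char_level("wang1314"))      # ['w', 'a', 'n', 'g', '1', '3', '1', '4']
print(template_level("wang1314"))  # ['A4', 'D4']
```

The char-level tokens keep every character but produce long sequences, while the template-level tokens are short and discard the actual characters, which matches the fine versus coarse granularity listed in the table.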

String	Probability	String	Probability
qq	0.060 1	wang	0.031 3
aa	0.047 8	love	0.025 7
li	0.018 8	chen	0.023 3
as	0.013 8	yang	0.020 0
ab	0.012 1	abcd	0.017 8

String	Probability	String	Probability
12	0.069 7	1234	0.082 1
11	0.061 5	1314	0.064 2
88	0.054 1	1111	0.012 1
00	0.052 0	0000	0.007 5
99	0.033 0	8888	0.006 4

String	Probability	String	Probability
..	0.261 9	!@#$	0.137 0
**	0.084 6	....	0.116 7
@@	0.052 0	+-*/	0.062 7
++	0.047 4	****	0.048 1
!!	0.039 8	/*-+	0.033 5

No.	Word root	No.	Word root	No.	Word root
1	y	7	..	13	xy
2	888	8	jian	14	aini
3	27	9	111	15	gh
4	1990	10	19830	16	02
5	am	11	152	17	ha
6	pp	12	ke	18	1234567890

Model size	Decoder blocks (layers)	Hidden size	Self-attention heads	Trainable parameters	Coverage
Small	6	256	4	4 781 056	79.98%
Base	12	256	12	85 934 712	80.92%
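
The table does not state how coverage is computed. One common reading, assumed here, is the fraction of test-set passwords (counting repeated entries) that appear among the generated guesses. The sketch below follows that assumption, and the function name `coverage` is hypothetical.

```python
from typing import Iterable


def coverage(guesses: Iterable[str], test_passwords: Iterable[str]) -> float:
    """Fraction of test passwords that occur in the guess set.

    Assumption: duplicates in the test set are counted individually,
    so a frequently used password contributes more than a rare one.
    """
    guessed = set(guesses)          # set gives O(1) membership checks
    test = list(test_passwords)
    if not test:
        return 0.0
    hits = sum(1 for pw in test if pw in guessed)
    return hits / len(test)


# Toy data for illustration only, not taken from the paper.
guesses = ["123456", "password", "qwerty"]
test_set = ["123456", "123456", "iloveyou", "password"]
print(coverage(guesses, test_set))  # 0.75
```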

No.	Password structure	Probability
1	D₈	0.220 8
2	D₉	0.114 9
3	A₈	0.044 9
4	D₁₁	0.041 8
5	D₁₀	0.037 4
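
A structure distribution like the one in this table can be estimated by mapping each password to its structure and counting relative frequencies. The sketch below is an illustration under stated assumptions: D denotes a run of digits and A a run of letters as in the table, S for other characters is an assumed symbol, run lengths are written inline (D8 rather than D₈), and the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable


def structure(password: str) -> str:
    """Return the structure string, e.g. '12345678' -> 'D8', 'abc123' -> 'A3D3'."""
    def cls(ch: str) -> str:
        # S is an assumed symbol for anything that is neither letter nor digit.
        return "A" if ch.isalpha() else "D" if ch.isdigit() else "S"

    return "".join(
        f"{c}{len(list(run))}" for c, run in groupby(password, key=cls)
    )


def structure_distribution(passwords: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each structure, most frequent first."""
    counts = Counter(structure(pw) for pw in passwords)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.most_common()}


# Toy data for illustration only, not taken from the paper.
sample = ["12345678", "87654321", "abcdefgh", "123456789"]
print(structure_distribution(sample))  # {'D8': 0.5, 'A8': 0.25, 'D9': 0.25}
```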
