Generation-based linguistic steganography with controllable security

doi:10.11959/j.issn.2096-109x.2022039

Chinese Journal of Network and Information Security, 2022, Vol. 8, Issue (3): 53-65. doi: 10.11959/j.issn.2096-109x.2022039

Column: Multimedia Content Security


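About the authors:
Jiameng MEI (born 1997), male, from Yichang, Hubei; master's student at Wuhan University; main research interest: information hiding.
Yanzhen REN (born 1973), female, from Yan'an, Shaanxi; Ph.D., professor and doctoral supervisor at Wuhan University; main research interests: multimedia content security, AI interaction security, multimedia information hiding, and steganalysis.
Lina WANG (born 1964), female, from Shenyang, Liaoning; Ph.D., professor and doctoral supervisor at Wuhan University; main research interests: multimedia security, cloud computing security, and network security.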

Generation-based linguistic steganography with controllable security

Jiameng MEI¹,², Yanzhen REN¹,², Lina WANG¹,²

¹ Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China
² School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China

Revised: 2022-04-19  Online: 2022-06-15  Published: 2022-06-01
Supported by:
The National Natural Science Foundation of China (61872275, 62172306); The Key R&D Program of Hubei Province (2021BAA034, 2020BAB018)

Abstract:

Generation-based linguistic steganography hides secret information by selecting words from a candidate pool under the control of the secret bits and mapping them to the message. It usually consists of three modules: a text generation model, candidate pool probability distribution truncation, and a steganographic embedding algorithm. Because the probability distributions output by the text generation model differ greatly from one time step to the next, existing algorithms usually truncate the candidate pool distribution with top-k or top-p to reduce the number of low-probability generated words and improve the security of the generated text. When the candidate pool distribution is over-concentrated or over-flat, top-k or top-p truncation cannot adapt to the change in the distribution: it tends to admit low-probability words or to discard high-probability words, which leads to abnormal security metrics for the generated text. To address this problem, a generation-based linguistic steganography algorithm with controllable security is proposed. When selecting generated words from the candidate pool according to the secret information, the proposed algorithm truncates the candidate pool probability distribution dynamically under parameter constraints on perplexity and KL divergence, so that every word in the candidate pool satisfies the constraints, which improves the security of the generated text. Experimental results show that the perplexity and KL divergence of the steganographic text generated by the proposed algorithm are controllable. At the same KL divergence, the perplexity of the generated text is reduced by up to 20% to 30% compared with existing algorithms. The algorithm can also control perplexity and KL divergence at the same time, so that the generated text satisfies both constraints when the target values are reasonable. When three text steganalysis algorithms are used to detect the generated steganographic text, the detection accuracy is about 50% in each case, indicating good statistical security.

Key words: generation-based linguistic steganography, arithmetic coding, controllable security, candidate pool truncation
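
The top-k and top-p truncation schemes mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `top_k` and `top_p` and the two toy distributions are assumptions, chosen to show how a fixed k or p behaves on an over-concentrated versus an over-flat candidate pool.

```python
def top_k(sorted_probs, k):
    """Keep the k most probable candidates (input sorted in descending order)."""
    return sorted_probs[:k]


def top_p(sorted_probs, p):
    """Keep the smallest prefix whose cumulative probability reaches p."""
    kept, total = [], 0.0
    for prob in sorted_probs:
        kept.append(prob)
        total += prob
        if total >= p:
            break
    return kept


# Toy distributions (illustrative values, not taken from the paper).
concentrated = [0.90, 0.05, 0.03, 0.01, 0.01]  # one dominant word
flat = [1 / 64] * 64                           # 64 equally likely words

# A fixed k admits low-probability words when the pool is concentrated ...
pool_k_concentrated = top_k(concentrated, 4)   # includes a word with p = 0.01
# ... and discards equally plausible words when the pool is flat.
pool_k_flat = top_k(flat, 4)                   # keeps only 4 of the 64 words

# top-p adapts the pool size, but p itself is still fixed in advance.
pool_p_concentrated = top_p(concentrated, 0.85)
pool_p_flat = top_p(flat, 0.5)

print(len(pool_k_concentrated), len(pool_k_flat))
print(len(pool_p_concentrated), len(pool_p_flat))
```

With a fixed k of 4, the concentrated pool still contains a word of probability 0.01, while the flat pool keeps only 1/16 of the probability mass. This is the mismatch that the abstract attributes to fixed truncation parameters.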

CLC number: TP37

Cite this article: Jiameng MEI, Yanzhen REN, Lina WANG. Generation-based linguistic steganography with controllable security[J]. Chinese Journal of Network and Information Security, 2022, 8(3): 53-65.

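Figures/Tables (15): Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Table 1, Figure 6, Table 2, Figure 7, Figure 8, Figure 9, Table 3, Table 4, Table 5, Table 6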

References (35)

[1] BENNETT K. Linguistic steganography: survey, analysis, and robustness concerns for hiding information in text[R]. 2004.
[2] XIANG L, WANG X, YANG C, et al. A novel linguistic steganography based on synonym run-length encoding[J]. IEICE Transactions on Information and Systems, 2017, 100(2): 313-322.
[3] KHOSRAVI B, KHOSRAVI B, KHOSRAVI B, et al. A new method for PDF steganography in justified texts[J]. Journal of Information Security and Applications, 2019, 45(APR.): 61-70.
[4] ALATTAR A M, MEMON N D, HEITZENRATER C D, et al. Linguistic steganography on Twitter: hierarchical language modeling with manual interaction[C]// Media Watermarking, Security, & Forensics. International Society for Optics and Photonics. 2014: 902803.
[5] DAI W, YU Y, DAI Y, et al. Text steganography system using Markov chain source model and DES algorithm[J]. Journal of Software, 2010, 5(7): 785-792.
[6] MORALDO H H. An approach for text steganography based on Markov chains[J]. arXiv preprint arXiv:1409.0915, 2014.
[7] LUO Y, HUANG Y, LI F, et al. Text steganography based on Ci-poetry generation using Markov chain model[J]. KSII Transactions on Internet & Information Systems, 2016, 10(9).
[8] YANG Z, JIN S, HUANG Y, et al. Automatically generate steganographic text based on Markov model and Huffman coding[J]. arXiv preprint arXiv:1811.04720, 2018.
[9] ZAREMBA W, SUTSKEVER I, VINYALS O. Recurrent neural network regularization[J]. arXiv preprint arXiv:1409.2329, 2014.
[10] YANG Z L, ZHANG S Y, HU Y T, et al. VAE-Stega: linguistic steganography based on variational auto-encoder[J]. IEEE Transactions on Information Forensics and Security, 2020, 16: 880-895.
[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems. 2017: 5998-6008.
[12] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[13] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1(8): 9.
[14] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27.
[15] FANG T, JAGGI M, ARGYRAKI K. Generating steganographic text with LSTMs[J]. arXiv preprint arXiv:1705.10742, 2017.
[16] YANG Z, ZHANG P, JIANG M, et al. RITS: real-time interactive text steganography based on automatic dialogue model[C]// International Conference on Cloud Computing and Security. 2018: 253-264.
[17] YANG Z L, GUO X Q, CHEN Z M, et al. RNN-Stega: linguistic steganography based on recurrent neural networks[J]. IEEE Transactions on Information Forensics and Security, 2018, 14(5): 1280-1295.
[18] HUFFMAN D A. A method for the construction of minimum-redundancy codes[J]. Proceedings of the IRE, 1952, 40(9): 1098-1101.
[19] DAI F Z, CAI Z. Towards near-imperceptible steganographic text[J]. arXiv preprint arXiv:1907.06679, 2019.
[20] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// Advances in Neural Information Processing Systems. 2014: 3104-3112.
[21] ZIEGLER Z M, DENG Y, RUSH A M. Neural linguistic steganography[J]. arXiv preprint arXiv:1909.01496, 2019.
[22] SHEN J, JI H, HAN J. Near-imperceptible neural linguistic steganography via self-adjusting arithmetic coding[J]. arXiv preprint arXiv:2010.00677, 2020.
[23] WITTEN I H, NEAL R M, CLEARY J G. Arithmetic coding for data compression[J]. Communications of the ACM, 1987, 30(6): 520-540.
[24] FAN A, LEWIS M, DAUPHIN Y. Hierarchical neural story generation[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 889-898.
[25] HOLTZMAN A, BUYS J, FORBES M, et al. Learning to write with cooperative discriminators[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 1638-1649.
[26] HOLTZMAN A, BUYS J, DU L, et al. The curious case of neural text degeneration[C]// International Conference on Learning Representations. 2019.
[27] HERMANN K M, KOCISKY T, GREFENSTETTE E, et al. Teaching machines to read and comprehend[J]. Advances in Neural Information Processing Systems, 2015, 28: 1693-1701.
[28] NALLAPATI R, ZHOU B, GULCEHRE C, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond[J]. arXiv preprint arXiv:1602.06023, 2016.
[29] JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of tricks for efficient text classification[J]. arXiv preprint arXiv:1607.01759, 2016.
[30] YANG Z, WEI N, SHENG J, et al. TS-CNN: text steganalysis from semantic space based on convolutional neural network[J]. arXiv preprint arXiv:1810.08136, 2018.
[31] YANG Z, WANG K, LI J, et al. TS-RNN: text steganalysis based on recurrent neural networks[J]. IEEE Signal Processing Letters, 2019, 26(12): 1743-1747.
[32] WANG S I, MANNING C D. Baselines and bigrams: simple, good sentiment and topic classification[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2012: 90-94.
[33] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.
[34] PRECHELT L. Early stopping - but when?[M]// Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 1998: 55-69.
[35] ZHOU P, SHI W, TIAN J, et al. Attention-based bidirectional long short-term memory networks for relation classification[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 207-212.

Word index	Word	Probability	Perplexity	KL divergence
1	command	0.036	28.310	4.796
2	first	0.035	42.440	3.806
3	way	0.024	44.488	3.398
4	commander	0.022	49.481	3.091
5	headquarters	0.020	52.382	2.840
6	battle	0.019	60.474	2.645
7	reg	0.017	61.558	2.482
8	footing	0.016	67.646	2.355
9	last	0.015	69.348	2.240
10	ability	0.014	75.906	2.142
11	commanding	0.013	76.047	2.053
12	original	0.013	79.395	1.976
…	…	…	…	…

Word index	Word	Probability	Perplexity	KL divergence
1	s	0.746	11.248	0.423
2	,	0.089	13.695	0.261
3	in	0.063	20.257	0.140
4	\	0.049	115.357	0.063
5	of	0.009	329.106	0.050
6	The	0.003	330.929	0.046
7	and	0.003	437.421	0.041
…	…	…	…	…
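
The KL divergence column in the two candidate-pool tables above is consistent with a simple identity: if the pool keeps the top-k words and renormalizes their probabilities, the KL divergence between the truncated distribution and the original one equals the negative logarithm of the probability mass that was kept. The sketch below checks this against the tabulated probabilities. The function names are hypothetical, the base-2 logarithm is inferred from the table values, and nothing here confirms that this is the authors' exact definition. The perplexity column is not reproduced because its definition is not given on this page.

```python
import math


def truncation_kl(pool_probs):
    """KL(q || p) in bits, where q is p renormalized over the kept pool.

    With q_i = p_i / Z for kept words and Z = sum(pool_probs),
    KL(q || p) = sum(q_i * log2(q_i / p_i)) = -log2(Z).
    """
    return -math.log2(sum(pool_probs))


def smallest_pool_under_kl(sorted_probs, kl_thr):
    """Smallest k whose top-k pool has truncation KL <= kl_thr (hypothetical helper).

    Returns None if even the full list does not meet the threshold.
    """
    total = 0.0
    for k, prob in enumerate(sorted_probs, start=1):
        total += prob
        if -math.log2(total) <= kl_thr:
            return k
    return None


# Probabilities copied from the two tables above (rounded to 3 decimals there).
first_table = [0.036, 0.035, 0.024, 0.022, 0.020, 0.019,
               0.017, 0.016, 0.015, 0.014, 0.013, 0.013]
second_table = [0.746, 0.089, 0.063, 0.049, 0.009, 0.003, 0.003]

kl_first_k1 = truncation_kl(first_table[:1])    # table lists 4.796
kl_second_k1 = truncation_kl(second_table[:1])  # table lists 0.423
k_needed = smallest_pool_under_kl(second_table, 0.3)

print(round(kl_first_k1, 3), round(kl_second_k1, 3), k_needed)
```

Under this reading, a KL threshold acts like a top-p cut with p = 2^(-kl_thr); for example, kl_thr = 0.1 corresponds to keeping about 93% of the probability mass. Rows further down the tables match only approximately, since the listed probabilities are rounded.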

Dataset name	Group	Dataset size
cover	top-k (k=50)	2 000
stego-block[15]	block=3; block=4; block=5	2 000×3
stego-Huffman[17]	m=8; m=16; m=32	2 000×3
stego-Arithmetic[21]	top-k (k=300); top-k (k=600); top-k (k=900)	2 000×3
stego-SAAC[22]	δ=0.05; δ=0.10; δ=0.15	2 000×3
stego-ppl	ppl_thr=30; ppl_thr=60; ppl_thr=90	2 000×3
stego-KL	kl_thr=0.1; kl_thr=0.2; kl_thr=0.3	2 000×3

Steganography algorithm	Perplexity	KL divergence	Embedding capacity
stego-block[15] (block=3)	49.59±9.99	2.62±0.21	3
stego-block[15] (block=4)	113.17±26.83	2.80±0.24	4
stego-block[15] (block=5)	252.18±69.04	2.93±0.27	5
stego-Huffman[17] (k=8)	11.46±1.91	1.08±0.14	2.41±0.14
stego-Huffman[17] (k=16)	14.52±2.95	0.85±0.14	2.98±0.19
stego-Huffman[17] (k=32)	18.31±4.24	0.68±0.14	3.48±0.27
stego-Arithmetic[21] (k=300)	22.49±8.17	0.23±0.09	4.15±0.51
stego-Arithmetic[21] (k=600)	27.284±11.22	0.20±0.09	4.42±0.57
stego-Arithmetic[21] (k=900)	30.07±13.24	0.18±0.09	4.58±0.61
stego-SAAC[22] (δ=0.05)	34.28±18.33	0.16±0.08	4.74±0.72
stego-SAAC[22] (δ=0.10)	27.93±14.03	0.19±0.08	4.44±0.67
stego-SAAC[22] (δ=0.15)	24.63±11.61	0.21±0.08	4.25±0.64
stego-ppl_thr=30	23.52±4.15	0.22±0.16	4.31±0.24
stego-ppl_thr=60	35.63±12.63	0.13±0.10	4.89±0.51
stego-ppl_thr=90	39.99±17.46	0.08±0.07	5.06±0.64
stego-KL_thr=0.1	32.21±19.50	0.09±0.01	4.73±0.76
stego-KL_thr=0.2	22.79±10.94	0.18±0.01	4.15±0.66
stego-KL_thr=0.3	17.42±7.16	0.26±0.02	3.72±0.56
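
As a worked check of the abstract's claim that perplexity drops at the same KL divergence, the sketch below compares two rows of the table above that share a mean KL divergence of 0.18: stego-Arithmetic (k=900) and stego-KL_thr=0.2. The pairing is chosen here for illustration only; the page does not say which rows underlie the 20% to 30% figure, and the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, proposed):
    """Fractional decrease of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# Mean perplexities copied from the table above (both rows have mean KL 0.18).
ppl_arithmetic_k900 = 30.07   # stego-Arithmetic, k=900
ppl_kl_thr_02 = 22.79         # stego-KL_thr=0.2

reduction = relative_reduction(ppl_arithmetic_k900, ppl_kl_thr_02)
print(f"{reduction:.1%}")  # prints 24.2%
```

For this pair the mean perplexity falls by roughly 24%, which sits inside the range quoted in the abstract. The standard deviations in the table are large, so the comparison of means is indicative rather than conclusive.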

Step t	Text generated so far	k	Perplexity	KL divergence	Embedded bits	Generated word
t=1	“”	9	15.22	0.592		“In”
t=2	“In”	27	10.9	0.212		“18”
t=3	“In 18”	73	28.79	0.019		“39”
t=4	“In 1839”	10	13.46	0.061		“,”
t=5	“In 1839,”	519	18.70	0.065		“President”
t=6	“In 1839, President”	433	17.95	0.034		“Abraham”
t=7	“In 1839, President Abraham”	53	11.89	0.0010		“Lincoln”
t=8	“In 1839, President Abraham Lincoln”	5914	13.81	0.0013		“declared”
t=9	“In 1839, President Abraham Lincoln declared”	9789	13.78	0.0016		“war”
t=10	“In 1839, President Abraham Lincoln declared war”	3702	10.79	0.00021		“on”


Steganography algorithm	TS-CNN Acc	TS-CNN F1	TS-RNN Acc	TS-RNN F1	FastText Acc	FastText F1
stego-block[15] (block=3)	0.83	0.82	0.78	0.80	0.80	0.79
stego-block[15] (block=4)	0.86	0.87	0.88	0.86	0.88	0.87
stego-block[15] (block=5)	0.92	0.93	0.96	0.95	0.92	0.93
stego-Huffman[17] (k=8)	0.60	0.55	0.65	0.66	0.65	0.67
stego-Huffman[17] (k=16)	0.52	0.61	0.55	0.60	0.55	0.50
stego-Huffman[17] (k=32)	0.52	0.59	0.50	0.54	0.52	0.51
stego-Arithmetic[21] (k=300)	0.56	0.49	0.55	0.33	0.53	0.50
stego-Arithmetic[21] (k=600)	0.56	0.61	0.56	0.55	0.55	0.53
stego-Arithmetic[21] (k=900)	0.59	0.60	0.56	0.68	0.57	0.61
stego-SAAC[22] (δ=0.05)	0.52	0.61	0.60	0.65	0.60	0.66
stego-SAAC[22] (δ=0.10)	0.51	0.63	0.55	0.64	0.52	0.63
stego-SAAC[22] (δ=0.15)	0.55	0.57	0.51	0.55	0.54	0.53
stego-ppl_thr=30	0.54	0.52	0.52	0.37	0.53	0.51
stego-ppl_thr=60	0.51	0.53	0.52	0.43	0.54	0.62
stego-ppl_thr=90	0.51	0.53	0.55	0.56	0.50	0.51
stego-KL_thr=0.1	0.54	0.42	0.53	0.54	0.53	0.36
stego-KL_thr=0.2	0.53	0.58	0.55	0.58	0.51	0.63
stego-KL_thr=0.3	0.50	0.51	0.58	0.66	0.53	0.52
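
Acc and F1 in the table above are the usual binary-classification metrics. The sketch below is a minimal illustration under stated assumptions: stego text is treated as the positive class, the test set is taken to be balanced between cover and stego texts (the split is not stated on this page), and all confusion counts are hypothetical. It shows why an accuracy near 0.5 means the detector does no better than guessing, and why F1 can vary widely at nearly the same accuracy.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all texts classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive (stego) class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for 1000 stego and 1000 cover texts.
# Detector A flags half of each class as stego (pure guessing).
acc_guess = accuracy(tp=500, fp=500, tn=500, fn=500)
f1_guess = f1_score(tp=500, fp=500, fn=500)

# Detector B labels almost everything as cover.
acc_biased = accuracy(tp=100, fp=80, tn=920, fn=900)
f1_biased = f1_score(tp=100, fp=80, fn=900)

print(f"guessing: Acc={acc_guess:.2f}, F1={f1_guess:.2f}")
print(f"biased:   Acc={acc_biased:.2f}, F1={f1_biased:.2f}")
```

Both hypothetical detectors have accuracy close to 0.5, yet their F1 scores differ sharply. This is one possible explanation for rows in the table where Acc stays near 0.5 while F1 ranges from about 0.3 to 0.7.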
