大数据 ›› 2024, Vol. 10 ›› Issue (2): 80-93. doi: 10.11959/j.issn.2096-0271.2023067

• Research •

  • About the authors: LIU Dezhi (1996- ), male, is a Ph.D. candidate at the School of Computer Science and Engineering, Beihang University. His research interests include knowledge disambiguation and information extraction.
    HE Liu (1988- ), male, is a senior engineer at AVIC China Aero-Polytechnology Establishment. His research interests include artificial intelligence, computer vision, and multimodal machine learning.
    LIU Youfeng (1996- ), male, is a master's student at the School of Computer Science and Engineering, Beihang University. His research interests include multimodal fusion and knowledge graphs.
    HAN Dechun (1980- ), male, is the chief architect of the Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University. His research interests include big data applications and network security systems.

A text classification method based on multimodal fusion enhancement

Dezhi LIU1,2, Liu HE3, Youfeng LIU1,2, Dechun HAN2   

    1 School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    2 Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China
    3 AVIC China Aero-Polytechnology Establishment, Beijing 100028, China
  • Online: 2024-03-01 Published: 2024-03-01


Abstract:

Although multimodal text classification techniques have potential when applied to specific scenarios, they still have limitations. Existing multimodal fusion models require modal alignment of the input data, so a large amount of incomplete multimodal data is discarded outright, limiting the scale and flexibility of the data available for inference. To address this problem, we propose a text classification model based on multimodal fusion enhancement, together with an insufficient multimodal resource training method. Compared with traditional methods, the proposed model improves performance by an average of about 4.25% on a standard dataset. Furthermore, when the missing rate of modalities other than text is 50%, the insufficient multimodal resource training method outperforms traditional multi-route strategies by about 4%. The experimental results demonstrate the effectiveness of the proposed model and training method.

Key words: text classification, cross attention, multimodal fusion, insufficient multimodal resource training method
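The two ideas named in the keywords, cross-attention fusion and training that tolerates missing modalities, can be illustrated with a minimal NumPy sketch. This is a generic illustration of the technique class, not the paper's actual model: all dimensions, weight matrices, and the simple residual-plus-pooling fusion are hypothetical choices made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq, Wq, Wk, Wv):
    # Text tokens (queries) attend over another modality's features
    # (keys/values); returns one attended vector per text token.
    Q = query_seq @ Wq
    K = context_seq @ Wk
    V = context_seq @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))    # 5 text token embeddings (hypothetical)
image = rng.normal(size=(3, d))   # 3 image region features; may be absent
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def fuse(text, image):
    # When the non-text modality is missing, fall back to a text-only
    # route instead of discarding the sample; otherwise enrich the text
    # representation with cross-attended image information.
    if image is None:
        return text.mean(axis=0)
    attended = cross_attention(text, image, Wq, Wk, Wv)
    return (text + attended).mean(axis=0)  # residual fusion, then pooling

full = fuse(text, image)      # complete multimodal sample
text_only = fuse(text, None)  # incomplete sample remains usable
```

Both routes produce a fixed-size vector that a downstream classifier head can consume, which is what lets incomplete samples participate in training rather than being dropped.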

