Survey of research on multimodal semantic communication

Journal on Communications, 2023, Vol. 44, Issue (5): 28-41. doi: 10.11959/j.issn.1000-436x.2023105

Special topic: Multi-/cross-modal semantic communication

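Zhijin QIN¹, Tantan ZHAO², Fan LI², Xiaoming TAO¹

¹ Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
² School of Information and Communication Engineering, Xi’an Jiaotong University, Xi’an 710049, China

Revised: 2023-05-06 Online: 2023-05-25 Published: 2023-05-01
About the authors:
Zhijin QIN (1989- ), female, born in Taiyuan, Shanxi; Ph.D.; associate professor and doctoral supervisor at Tsinghua University. Her main research interests include semantic communication.
Tantan ZHAO (1991- ), female, born in Longnan, Gansu; Ph.D. candidate at Xi’an Jiaotong University. Her main research interests include wireless secure transmission, mobile edge computing, deep reinforcement learning, and federated learning.
Fan LI (1981- ), male, born in Baoji, Shaanxi; Ph.D.; professor and doctoral supervisor at Xi’an Jiaotong University. His main research interests include deep-learning-based image and video coding, machine-learning-based image and video quality assessment, and deep understanding and processing of images and video.
Xiaoming TAO (1981- ), female, born in Shijiazhuang, Hebei; Ph.D.; professor and doctoral supervisor at Tsinghua University. Her main research interests include the theory and key technology applications of wireless multimedia communication.
Supported by:
The National Natural Science Foundation of China (61925105); Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute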

Abstract

With the convergence of artificial intelligence and communications, technologies for processing multimodal data such as text, image, audio, and video are developing rapidly. The dimensions shared across modal semantics are being mined in depth, and the properties of multimodal semantic information, such as high abstraction and intelligent simplicity, are being fully exploited, bringing new ideas and methods to semantic communication. First, the fundamental theories and classifications of semantic communication are introduced, and the research status of single-modal semantic communication is reviewed for text, image, audio, and video respectively. Then, the research status of multimodal semantic communication is reviewed, and research on multimodal data fusion technology and secure semantic communication is introduced. Finally, the challenges faced by multimodal semantic communication are summarized.

Key words: semantic communication, multimodal data fusion, multimodal semantic communication

CLC number: TN919.8

Zhijin QIN, Tantan ZHAO, Fan LI, Xiaoming TAO. Survey of research on multimodal semantic communication[J]. Journal on Communications, 2023, 44(5): 28-41.

Figures/Tables: 5

References: 82


