[1] China Academy of Information and Communications Technology. 6G overall vision and potential key technology white paper[EB]. 2021. (in Chinese)
[2] VAN DEN BERG D, GLANS R, et al. Challenges in haptic communications over the Tactile Internet[J]. IEEE Access, 2017, 5: 23502-23518.
[3] ZHOU L, WU D, CHEN J, et al. Cross-modal collaborative communications[J]. IEEE Wireless Communications, 2019, 27(2): 112-117.
[4] WEI X, ZHOU L. AI-enabled cross-modal communications[J]. IEEE Wireless Communications, 2021, 28(4): 182-189.
[5] GAO Y, WEI X, ZHOU L. Preliminary study on theory and key technology of cross-modal communications[J]. Journal of Communication University of China (Science and Technology), 2021, 28(1): 55-63. (in Chinese)
[6] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[7] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[8] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[9] BAZZICA A, VAN GEMERT J C, LIEM C C S, et al. Vision-based detection of acoustic timed events: a case study on clarinet note onsets[J]. arXiv preprint arXiv:1706.09556, 2017.
[10] LI B, LIU X, DINESH K, et al. Creating a musical performance dataset for multimodal music analysis: challenges, insights, and applications[J]. IEEE Transactions on Multimedia, submitted. Available: https://arxiv.org/abs/1612.08727, 2016.
[11] ZHAO H, GAN C, ROUDITCHENKO A, et al. The sound of pixels[C]. Proceedings of the European Conference on Computer Vision. 2018: 570-586.
[12] MONTESINOS J F, SLIZOVSKAIA O, HARO G. Solos: a dataset for audio-visual music analysis[C]. 2020 IEEE 22nd International Workshop on Multimedia Signal Processing. 2020: 1-6.
[13] KURMI V K, BAJAJ V, PATRO B N, et al. Collaborative learning to generate audio-video jointly[C]. IEEE International Conference on Acoustics, Speech and Signal Processing. 2021: 4180-4184.
[14] ROTH J, CHAUDHURI S, KLEJCH O, et al. AVA Active Speaker: an audio-visual dataset for active speaker detection[C]. IEEE International Conference on Acoustics, Speech and Signal Processing. 2020: 4492-4496.
[15] TSUCHIDA S, FUKAYAMA S, HAMASAKI M, et al. AIST Dance Video Database: multi-genre, multi-dancer, and multi-camera database for dance information processing[C]. Proceedings of the 20th International Society for Music Information Retrieval Conference. 2019: 501-510.
[16] LI R, YANG S, ROSS D A, et al. AI Choreographer: music conditioned 3D dance generation with AIST++[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 13401-13412.
[17] HONG S, IM W, YANG H S. Content-based video-music retrieval using soft intra-modal structure constraint[J]. arXiv preprint arXiv:1704.06761, 2017.
[18] LI Y, ZHU J Y, TEDRAKE R, et al. Connecting touch and vision via cross-modal prediction[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10609-10618.
[19] YUAN W, DONG S, ADELSON E H. GelSight: high-resolution robot tactile sensors for estimating geometry and force[J]. Sensors, 2017, 17(12): 2762.
[20] SUNDARAM S, KELLNHOFER P, LI Y, et al. Learning the signatures of the human grasp using a scalable tactile glove[J]. Nature, 2019, 569(7758): 698-702.
[21] DUAN B, WANG W, TANG H, et al. Cascade attention guided residue learning GAN for cross-modal translation[C]. 2020 25th International Conference on Pattern Recognition. 2021: 1336-1343.
[22] HAO W, ZHANG Z, GUAN H. CMCGAN: a uniform framework for cross-modal visual-audio mutual generation[C]. Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).