6G-oriented cross-modal signal reconstruction technology

Journal on Communications, 2022, Vol. 43, Issue (6): 28-40. doi: 10.11959/j.issn.1000-436x.2022093

Special topic: Key technologies for 6G-oriented intelligent and simple networks

Ang LI¹,², Jianxin CHEN¹,², Xin WEI¹,², Liang ZHOU¹,²

¹ College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
² Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Revised: 2022-03-22 Online: 2022-06-01 Published: 2022-06-01

About the authors:
Ang LI (1995- ), male, born in Zhoukou, Henan. Ph.D. candidate at Nanjing University of Posts and Telecommunications. His research interests include multimedia communications and artificial intelligence.
Jianxin CHEN (1973- ), male, born in Nantong, Jiangsu. Ph.D., associate professor and master's supervisor at Nanjing University of Posts and Telecommunications. His research interests include wireless communications and human-computer interaction.
Xin WEI (1983- ), male, born in Nanjing, Jiangsu. Ph.D., professor and master's supervisor at Nanjing University of Posts and Telecommunications. His research interest is multimedia communications.
Liang ZHOU (1981- ), male, born in Wuhu, Anhui. Ph.D., professor and doctoral supervisor at Nanjing University of Posts and Telecommunications. His research interest is multimedia communications.

Supported by:
The National Natural Science Foundation of China (62071254); Priority Academic Program Development of Jiangsu Higher Education Institutions

Abstract

Objectives: Multimodal services that combine audio, video and haptics, such as mixed reality, digital twins and the metaverse, are expected to become killer applications of the 6G era. However, the large volume of multimodal data that these services generate can easily overload the signal processing, transmission and storage of existing communication systems. To meet users' demand for immersive experience while guaranteeing low-latency, high-reliability and high-capacity communication, a cross-modal signal reconstruction scheme is urgently needed to reduce the amount of transmitted data and thereby support immersive multimodal services in 6G.

Methods: First, a dataset named VisTouch, containing audio, visual and haptic signals, is constructed by controlling a robot to touch various materials, laying the foundation for subsequent research on cross-modal problems. Second, by exploiting the semantic correlation among multimodal signals, a universal and robust end-to-end cross-modal signal reconstruction architecture is designed, consisting of a feature extraction module, a reconstruction module and an evaluation module. The feature extraction module maps the source-modality signal to a semantic feature vector in a common semantic space, and the reconstruction module inversely transforms this vector into the target-modality signal; the cascade of these two modules is the key to crossing the modality "barrier". The evaluation module assesses reconstruction quality in the semantic dimension and in the spatio-temporal dimension of the signal itself, and during training it feeds optimization information back to the feature extraction and reconstruction modules, forming a closed loop that achieves accurate signal reconstruction through continuous iteration. Third, taking the reconstruction of haptic signals from video signals as an example, a video-assisted haptic reconstruction model is built, comprising a 3D CNN-based video feature extraction network, a GAN generator based on a fully convolutional network and a CNN-based GAN discriminator. Furthermore, a teleoperation platform is designed and the haptic reconstruction model is deployed in its codec to verify the model's running efficiency in practice. Finally, experiments verify the reliability of the cross-modal signal reconstruction architecture and the accuracy of the haptic reconstruction model.

Results: The VisTouch dataset covers three modalities (audio, video and haptics) and contains 47 sheet-like samples of materials common in everyday life. Data are collected by a script-controlled robotic hand that slides across each material. The sliding friction force between the fingertip and the material is recorded as the haptic signal, while a high-definition camera and a unidirectional microphone mounted on the robotic hand capture the video and audio signals; all streams are synchronized by timestamps. On the VisTouch dataset, the video-assisted haptic reconstruction model reaches a mean absolute error of 0.0135 and an accuracy of 0.78. To bring the proposed framework into a practical application scenario, a teleoperation platform was built from a robot and an NVIDIA development board for the task of remotely grasping objects in industrial settings. On this platform the measured mean absolute error is 0.0126, the total end-to-end delay is 127 ms and the reconstruction model delay is 98 ms. User satisfaction was assessed by questionnaire: satisfaction with haptic realism has a mean of 4.43 and a variance of 0.72, and satisfaction with delay has a mean of 3.87 and a variance of 1.07.

Conclusions: The results on the dataset demonstrate the practicality of the VisTouch dataset and the accuracy of the video-assisted haptic reconstruction model. The tests on the teleoperation platform indicate that users find the generated haptic signals fairly close to the actual signals but are only moderately satisfied with the running time of the algorithm, i.e., the complexity of the model needs further optimization.

Key words: 6G, cross-modal signal reconstruction, multi-modal dataset, 3D CNN, GAN
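
The closed loop described in the Methods paragraph (feature extraction, then reconstruction, then evaluation feeding back to both) can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions and not the authors' model: the 3D CNN and the GAN are replaced by two scalar gains, the evaluation module is a plain squared error, and all names (`train_closed_loop`, `PAIRS`) are hypothetical.

```python
# Toy closed loop: source -> feature (encoder gain) -> rebuilt target (decoder gain).
# The "evaluation" step measures the error and feeds gradients back to both gains.
# Assumption: scalar linear modules and squared error stand in for the real networks.

def train_closed_loop(pairs, steps=2000, lr=0.01):
    enc_gain, dec_gain = 0.5, 0.5  # feature extraction and reconstruction parameters
    n = len(pairs)
    for _ in range(steps):
        grad_enc = 0.0
        grad_dec = 0.0
        for source, target in pairs:
            feature = enc_gain * source   # feature extraction module
            rebuilt = dec_gain * feature  # reconstruction module
            error = rebuilt - target      # evaluation module
            # Gradients of the mean squared error with respect to each gain.
            grad_enc += 2.0 * error * dec_gain * source / n
            grad_dec += 2.0 * error * feature / n
        # Feedback: both modules are updated from the same evaluation signal.
        enc_gain -= lr * grad_enc
        dec_gain -= lr * grad_dec
    return enc_gain, dec_gain


# Hypothetical paired data in which the target is twice the source.
PAIRS = [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]

if __name__ == "__main__":
    enc, dec = train_closed_loop(PAIRS)
    print(enc * dec)  # overall source-to-target gain learned by the cascade
```

Only the product of the two gains is constrained by the data, which mirrors the point that the cascade of the two modules, rather than either module alone, carries the cross-modal mapping.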

CLC number: TP391

Ang LI, Jianxin CHEN, Xin WEI, Liang ZHOU. 6G-oriented cross-modal signal reconstruction technology[J]. Journal on Communications, 2022, 43(6): 28-40.

Figures/Tables (14): Fig. 1, Table 1, Table 2, Table 3, Fig. 2, Fig. 3, Fig. 4, Table 4, Fig. 5, Fig. 6, Table 5, Fig. 7, Fig. 8, Table 6

References (25)

[1]	China Academy of Information and Communications Technology. 6G overall vision and potential key technology white paper[R]. 2021.
[2]	VAN D B D, GLANS R, KONING D D, et al. Challenges in haptic communications over the tactile Internet[J]. IEEE Access, 2017, 5: 23502-23518.
[3]	ZHOU L, WU D, CHEN J X, et al. Cross-modal collaborative communications[J]. IEEE Wireless Communications, 2020, 27(2): 112-117.
[4]	WEI X, ZHOU L. AI-enabled cross-modal communications[J]. IEEE Wireless Communications, 2021, 28(4): 182-189.
[5]	GAO Y, WEI X, ZHOU L. Preliminary study on theory and key technology of cross-modal communications[J]. Journal of Communication University of China (Science and Technology), 2021, 28(1): 55-63.
[6]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[7]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv Preprint, arXiv:1409.1556, 2014.
[8]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
[9]	BAZZICA A, VAN GEMERT J C, LIEM C C S, et al. Vision-based detection of acoustic timed events: a case study on clarinet note onsets[J]. arXiv Preprint, arXiv:1706.09556, 2017.
[10]	LI B C, LIU X Z, DINESH K, et al. Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications[J]. IEEE Transactions on Multimedia, 2019, 21(2): 522-535.
[11]	ZHAO H, GAN C, ROUDITCHENKO A, et al. The sound of pixels[C]// Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 570-586.
[12]	MONTESINOS J F, SLIZOVSKAIA O, HARO G. Solos: a dataset for audio-visual music analysis[C]// Proceedings of 2020 IEEE 22nd International Workshop on Multimedia Signal Processing. Piscataway: IEEE Press, 2020: 1-6.
[13]	KURMI V K, BAJAJ V, PATRO B N, et al. Collaborative learning to generate audio-video jointly[C]// Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2021: 4180-4184.
[14]	ROTH J, CHAUDHURI S, KLEJCH O, et al. Ava active speaker: an audio-visual dataset for active speaker detection[C]// Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 4492-4496.
[15]	TSUCHIDA S, FUKAYAMA S, HAMASAKI M, et al. AIST dance video database: multi-genre, multi-dancer, and multi-camera database for dance information processing[C]// Proceedings of the 20th International Society for Music Information Retrieval Conference. [S.l.:s.n.], 2019: 501-510.
[16]	LI R L, YANG S, ROSS D A, et al. AI choreographer: music conditioned 3D dance generation with AIST++[C]// Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2021: 13381-13392.
[17]	HONG S, IM W, YANG H S. Content-based video-music retrieval using soft intra-modal structure constraint[J]. arXiv Preprint, arXiv:1704.06761, 2017.
[18]	LI Y Z, ZHU J Y, TEDRAKE R, et al. Connecting touch and vision via cross-modal prediction[C]// Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2019: 10601-10610.
[19]	YUAN W Z, DONG S Y, ADELSON E H. GelSight: high-resolution robot tactile sensors for estimating geometry and force[J]. Sensors (Basel, Switzerland), 2017, 17(12): 2762.
[20]	SUNDARAM S, KELLNHOFER P, LI Y Z, et al. Learning the signatures of the human grasp using a scalable tactile glove[J]. Nature, 2019, 569(7758): 698-702.
[21]	DUAN B, WANG W, TANG H, et al. Cascade attention guided residue learning GAN for cross-modal translation[C]// Proceedings of 2020 25th International Conference on Pattern Recognition (ICPR). Piscataway: IEEE Press, 2021: 1336-1343.
[22]	HAO W L, ZHANG Z X, GUAN H. CMCGAN: a uniform framework for cross-modal visual-audio mutual generation[C]// Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 6886-6893.
[23]	CHATTERJEE M, CHERIAN A. Sound2Sight: generating visual dynamics from sound and context[C]// European Conference on Computer Vision. Berlin: Springer, 2020: 701-719.
[24]	WEI X, SHI Y Y, ZHOU L. Haptic signal reconstruction for cross-modal communications[J]. IEEE Transactions on Multimedia, 2021. doi: 10.1109/TMM.2021.3119860.
[25]	WANG W L, LI Z R. Advances in generative adversarial network[J]. Journal on Communications, 2018, 39(2): 135-148.

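Acquisition device	Signal	Device parameters
Audio-Technica (铁三角) unidirectional microphone AT9912	Audio	Stereo/mono: mono; frequency response: 70-16000 Hz; sensitivity: -39 dB
Hikvision (海康威视) HD camera DS-U32W	Video	Maximum resolution: 1920×1280; frame rate: 30 frame/s; focal length: 2.7-13 mm
Inspire-Robots (因时) robotic hand RH56BFX-2L, finger-joint force sensor	Haptic	Sampling frequency: 100 Hz; force resolution: 0.5 N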
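
The table above gives a 100 Hz haptic sampling rate and a 30 frame/s video rate, and the abstract states that the streams are synchronized by timestamps. How the timestamps are matched is not described here, so the following is only a minimal sketch assuming nearest-timestamp matching; the function names and the microsecond timestamps are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Return the index of the timestamp in sorted_ts closest to t (ties go to the earlier one)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i if (after - t) < (t - before) else i - 1


def align_frames_to_haptics(video_ts, haptic_ts):
    """For every video frame timestamp, pick the index of the closest haptic sample."""
    return [nearest_index(haptic_ts, t) for t in video_ts]


if __name__ == "__main__":
    # Assumed integer timestamps in microseconds: 100 Hz haptics, 30 frame/s video.
    haptic_ts = [i * 10_000 for i in range(101)]
    video_ts = [round(k * 1_000_000 / 30) for k in range(4)]
    print(align_frames_to_haptics(video_ts, haptic_ts))
```

Integer timestamps are used so that the comparison of distances is exact; with floating-point seconds the same logic applies but ties become sensitive to rounding.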

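Material class	Sub-classes	Number of types
Plastic	Polyethylene terephthalate, high-density polyethylene, polyvinyl chloride, low-density polyethylene, polypropylene, polystyrene	6
Metal	Iron, aluminum, nickel, zinc, titanium, indium, copper, tantalum, molybdenum	9
Wood	Cherry, pine, black walnut, bamboo	4
Paper	Printing paper, newspaper, cardboard	3
Ceramics	Traditional ceramics, special ceramics	2
Rubber	Natural rubber, synthetic rubber	2
Natural textiles	Cotton, linen, silk	3
Synthetic textiles	Nylon, polyester, acrylic, vinylon, polypropylene fiber, spandex, carbon fiber	7
Glass	Ordinary glass, quartz glass	2
Leather	Cowhide, sheepskin, artificial leather	3
Stone	Granite, marble, limestone, slate, mudstone, andesite	6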

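Dataset	Modalities	Number of categories	Number of frames (order of magnitude)
C4S [9]	Audio, video	1	Hundreds of thousands
URMP [10]	Audio, video	14	Millions
MUSIC [11]	Audio, video	12	Millions
Solos [12]	Audio, video	13	Millions
HMMD [13]	Audio, video	7	Millions
AVA-ActiveSpeaker [14]	Audio, video	-	Millions
AIST [15]	Audio, video	-	Millions
AIST++ [16]	Audio, video	-	Millions
HIMV-200K [17]	Audio, video	-	Millions
VisGel [18]	Video, haptic	195	Millions
STAG [20]	Haptic	26	Hundreds of thousands
VisTouch	Audio, video, haptic	47	Tens of millions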

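Layer No.	Layer parameters	Output tensor size
1.1	Input	4096×7×7
2.1	Transposed convolution, k=(3,3), p=(1,0), s=1	256×7×9
2.2	Batch normalization	256×7×9
2.3	ReLU activation	256×7×9
3.1	Transposed convolution, k=(2,5), p=(0,0), s=2	128×14×21
3.2	Batch normalization	128×14×21
3.3	ReLU activation	128×14×21
4.1	Transposed convolution, k=(4,5), p=(2,2), s=2	64×26×41
4.2	Batch normalization	64×26×41
4.3	ReLU activation	64×26×41
5.1	Convolution, k=(1,1), p=(0,0), s=1	2×26×41
5.2	Batch normalization	2×26×41
5.3	Tanh activation	2×26×41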
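
The output sizes in the table above can be checked with the standard size formula for a transposed convolution, out = (in - 1)·s - 2p + k, and for an ordinary convolution, out = floor((in + 2p - k)/s) + 1. The following is a minimal sketch, assuming no output padding and unit dilation (neither is stated in the table); the function and variable names are illustrative and not taken from the paper.

```python
def deconv_out(size, kernel, padding, stride):
    """Output length of a transposed convolution (output_padding=0, dilation=1 assumed)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, padding, stride):
    """Output length of an ordinary convolution (dilation=1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer kind, output channels, kernel (h, w), padding (h, w), stride), copied from the table.
LAYERS = [
    ("deconv", 256, (3, 3), (1, 0), 1),
    ("deconv", 128, (2, 5), (0, 0), 2),
    ("deconv", 64, (4, 5), (2, 2), 2),
    ("conv", 2, (1, 1), (0, 0), 1),
]


def trace_shapes(input_shape=(4096, 7, 7)):
    """Return the (channels, height, width) produced by each layer in LAYERS."""
    _, height, width = input_shape
    shapes = []
    for kind, channels, (kh, kw), (ph, pw), stride in LAYERS:
        size_fn = deconv_out if kind == "deconv" else conv_out
        height = size_fn(height, kh, ph, stride)
        width = size_fn(width, kw, pw, stride)
        shapes.append((channels, height, width))
    return shapes


if __name__ == "__main__":
    for shape in trace_shapes():
        print(shape)
```

Batch normalization and the activation functions leave the tensor size unchanged, so only the four convolutional layers need to be traced.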

Model	MAE	ACC
Model 1	0.0933	0.57
Model 2	0.0586	0.65
Proposed model	0.0135	0.78
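
The MAE column reports the mean absolute error between reconstructed and reference haptic signals. A minimal sketch of that metric follows, assuming both signals are equal-length sequences of samples; the function name and the sample values are hypothetical, and the definition of ACC is not given in this excerpt, so it is not reproduced.

```python
def mean_absolute_error(reference, reconstructed):
    """Average of |reference - reconstructed| over all samples."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    total = sum(abs(r - p) for r, p in zip(reference, reconstructed))
    return total / len(reference)


if __name__ == "__main__":
    # Hypothetical three-sample signals, for illustration only.
    print(mean_absolute_error([0.0, 0.5, 1.0], [0.25, 0.5, 0.5]))
```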

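Metric	Result
MAE	0.0126
Total sending and feedback delay/ms	127
Reconstruction model delay/ms	98
Haptic realism satisfaction (mean, variance)	(4.43, 0.72)
Delay satisfaction (mean, variance)	(3.87, 1.07)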
