Journal on Communications (通信学报) ›› 2022, Vol. 43 ›› Issue (6): 28-40. doi: 10.11959/j.issn.1000-436x.2022093

• Special topic: Key technologies of intelligent and simplified networks for 6G

Cross-modal signal reconstruction technology for 6G

Ang LI (李昂)1,2, Jianxin CHEN (陈建新)1,2, Xin WEI (魏昕)1,2, Liang ZHOU (周亮)1,2

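  1. College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China
  2. Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China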
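  • Revised: 2022-03-22 Published: 2022-06-01 Released: 2022-06-01
  • About the authors: Ang LI (born 1995), male, from Zhoukou, Henan, is a Ph.D. student at Nanjing University of Posts and Telecommunications. His research interests are multimedia communication and artificial intelligence.
    Jianxin CHEN (born 1973), male, from Nantong, Jiangsu, holds a Ph.D. and is an associate professor and master's supervisor at Nanjing University of Posts and Telecommunications. His research interests are wireless communication and human-computer interaction.
    Xin WEI (born 1983), male, from Nanjing, Jiangsu, holds a Ph.D. and is a professor and master's supervisor at Nanjing University of Posts and Telecommunications. His research interest is multimedia communication.
    Liang ZHOU (born 1981), male, from Wuhu, Anhui, holds a Ph.D. and is a professor and doctoral supervisor at Nanjing University of Posts and Telecommunications. His research interest is multimedia communication.
  • Funding:
    National Natural Science Foundation of China (62071254); Priority Academic Program Development of Jiangsu Higher Education Institutions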

6G-oriented cross-modal signal reconstruction technology

Ang LI1,2, Jianxin CHEN1,2, Xin WEI1,2, Liang ZHOU1,2   

  1. College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  2. Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Ministry of Education), Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Revised: 2022-03-22 Online: 2022-06-01 Published: 2022-06-01
  • Supported by:
    The National Natural Science Foundation of China (62071254); Priority Academic Program Development of Jiangsu Higher Education Institutions

Abstract:

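Objective: Multimodal services that combine audio, video, and haptics, such as mixed reality, digital twins, and the metaverse, are expected to become killer applications in the 6G era. However, the large volume of multimodal data these services generate can easily burden the signal processing, transmission, and storage of existing communication systems. To meet users' demand for immersive experiences and to guarantee low-latency, high-reliability, high-capacity communication, a cross-modal signal reconstruction scheme that reduces the amount of transmitted data is urgently needed to support immersive multimodal 6G services.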

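Methods: First, a robot was controlled to touch a variety of materials, producing VisTouch, a dataset of audio, visual, and haptic signals that lays the groundwork for later research on cross-modal problems. Second, by exploiting the semantic correlation among multimodal signals, a universal and robust end-to-end cross-modal signal reconstruction architecture was designed, consisting of three parts: a feature extraction module, a reconstruction module, and an evaluation module. The feature extraction module maps the source-modality signal to a semantic feature vector in a common semantic space, and the reconstruction module inversely transforms this vector into the target-modality signal; cascading these two modules is the key to crossing the modality "barrier". The evaluation module assesses reconstruction quality in the semantic dimension and in the spatio-temporal dimensions of the signal itself, and during training it feeds optimization information back to the feature extraction and reconstruction modules, forming a closed loop that achieves accurate signal reconstruction through repeated iteration. Third, taking the reconstruction of haptic signals from video signals as an example, a video-assisted haptic reconstruction model was built, comprising a 3D CNN-based video feature extraction network, a GAN generator based on a fully convolutional network, and a CNN-based GAN discriminator. Further, a teleoperation platform was designed, and the haptic reconstruction model was deployed in its codec to verify the model's running efficiency in practice. Finally, experiments verified the reliability of the cross-modal signal reconstruction architecture and the accuracy of the haptic reconstruction model.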
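The closed loop described in the Methods paragraph (feature extraction, then reconstruction, then an evaluation step that feeds optimization information back to both) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the two modules are single scalar weights rather than the paper's 3D CNN and GAN networks, the evaluation step uses squared error only, and the names `extract_feature`, `reconstruct`, `evaluate`, and `train` are hypothetical.

```python
# Toy closed loop: scalar stand-ins for the three modules (NOT the paper's networks).

def extract_feature(x, w_enc):
    """Feature extraction stand-in: map a source sample to a 'semantic' feature."""
    return w_enc * x

def reconstruct(z, w_dec):
    """Reconstruction stand-in: map the feature back to a target-modality value."""
    return w_dec * z

def evaluate(pairs, w_enc, w_dec):
    """Evaluation stand-in: mean squared error plus its gradient for each module."""
    n = len(pairs)
    loss = grad_enc = grad_dec = 0.0
    for x, target in pairs:
        z = extract_feature(x, w_enc)
        y = reconstruct(z, w_dec)
        err = y - target
        loss += err * err / n
        grad_dec += 2.0 * err * z / n          # feedback to the reconstruction module
        grad_enc += 2.0 * err * w_dec * x / n  # feedback to the feature extraction module
    return loss, grad_enc, grad_dec

def train(pairs, steps=200, lr=0.1):
    """Iterate the loop: evaluate, then update both modules from the feedback."""
    w_enc, w_dec = 0.5, 0.5
    history = []
    for _ in range(steps):
        loss, g_enc, g_dec = evaluate(pairs, w_enc, w_dec)
        history.append(loss)
        w_enc -= lr * g_enc
        w_dec -= lr * g_dec
    return w_enc, w_dec, history

# Hypothetical paired data: the target is twice the source value.
PAIRS = [(k / 5.0, 2.0 * k / 5.0) for k in range(1, 6)]
W_ENC, W_DEC, HISTORY = train(PAIRS)
```

In this toy setting the cascade only has to learn that the product of the two weights equals 2; the point is the flow of feedback from the evaluation step to both upstream modules, not the model capacity.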

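Results: The VisTouch dataset covers three modalities (audio, video, and haptics) and contains 47 sheet-like samples of materials common in daily life. Data were collected by a script-controlled robotic hand sliding across each material. The sliding friction force between the fingertip and the material was recorded as the haptic signal, while a high-definition camera and a unidirectional microphone mounted on the robotic hand captured the video and audio signals; all signals were synchronized by timestamps. On VisTouch, the video-assisted haptic reconstruction model reached a mean absolute error of 0.0135 and an accuracy of 0.78. To bring the proposed framework to a practical scenario, a teleoperation platform was built with a robot and an NVIDIA development board for remotely grasping objects in an industrial setting. On this platform, the measured mean absolute error was 0.0126, the total end-to-end delay was 127 ms, and the reconstruction model delay was 98 ms. User satisfaction was assessed by questionnaire: satisfaction with haptic realism had a mean of 4.43 and a variance of 0.72, and satisfaction with delay had a mean of 3.87 and a variance of 1.07.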
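Timestamp-based synchronization of signals sampled at different rates can be sketched as a nearest-timestamp lookup. This is only one plausible way to do it; the abstract does not say how the alignment was implemented, and the function names, the millisecond timestamps, and the sample values below are all hypothetical.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Index of the timestamp closest to t; timestamps must be sorted ascending."""
    pos = bisect_left(timestamps, t)
    if pos == 0:
        return 0
    if pos == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[pos - 1], timestamps[pos]
    # Ties go to the earlier sample.
    return pos if after - t < t - before else pos - 1

def align_to_frames(frame_times, sample_times, samples):
    """Pick, for each video frame time, the sample recorded closest to it."""
    return [samples[nearest_index(sample_times, t)] for t in frame_times]

# Hypothetical timestamps in milliseconds and hypothetical friction readings.
FRAME_TIMES_MS = [0, 40, 80]
FORCE_TIMES_MS = [0, 10, 20, 30, 45, 70, 90]
FORCE_VALUES = [0.10, 0.11, 0.12, 0.13, 0.15, 0.17, 0.19]
ALIGNED = align_to_frames(FRAME_TIMES_MS, FORCE_TIMES_MS, FORCE_VALUES)
```

Integer timestamps keep the tie-breaking rule exact; with floating-point timestamps, near-ties could resolve either way.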

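Conclusions: The results on the dataset demonstrate the practicality of the VisTouch dataset and the accuracy of the video-assisted haptic reconstruction model. Tests on the teleoperation platform show that users found the haptic signals generated by the model fairly close to the real signals but were only moderately satisfied with the algorithm's running time; that is, the complexity of the model needs further optimization.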

Keywords: 6G, cross-modal signal reconstruction, multimodal dataset, 3D convolutional neural network, generative adversarial network

Abstract:

Objectives: Multimodal services that combine audio, video, and haptics, such as mixed reality, digital twins, and the metaverse, are expected to become killer applications in the 6G era. However, the large amount of multimodal data generated by such services is likely to burden the signal processing, transmission, and storage of existing communication systems. To meet users' immersive experience requirements and guarantee low-latency, high-reliability, high-capacity communication, a cross-modal signal reconstruction scheme that reduces the amount of transmitted data is urgently needed to support immersive multimodal 6G services.

Methods: First, by controlling a robot to touch various materials, a dataset containing audio, visual, and haptic signals, VisTouch, was constructed to lay the foundation for subsequent research on cross-modal problems. Second, by exploiting the semantic correlation between multimodal signals, a universal and robust end-to-end cross-modal signal reconstruction architecture was designed, comprising three parts: a feature extraction module, a reconstruction module, and an evaluation module. The feature extraction module maps the source-modality signal to a semantic feature vector in a common semantic space, and the reconstruction module inversely transforms this semantic feature vector into the target-modality signal; the cascade of these two modules is the key to crossing the modality "barrier". The evaluation module evaluates reconstruction quality in the semantic dimension and in the spatio-temporal dimensions of the signal itself, and feeds optimization information back to the feature extraction and reconstruction modules during training, forming a closed loop that achieves accurate signal reconstruction through continuous iteration. Third, taking the reconstruction of haptic signals from video signals as an example, a video-assisted haptic reconstruction model was built, consisting of a 3D CNN-based video feature extraction network, a GAN generator based on a fully convolutional network, and a CNN-based GAN discriminator. Further, a teleoperation platform was designed, and the haptic reconstruction model was deployed in its codec to verify the running efficiency of the model in practice. Finally, the reliability of the cross-modal signal reconstruction architecture and the accuracy of the haptic reconstruction model were verified experimentally.

Results: The constructed VisTouch dataset involves three modalities (audio, video, and haptics) and contains 47 sheet-like samples of materials common in daily life. The mean absolute error and accuracy of the video-assisted haptic reconstruction model on the VisTouch dataset reached 0.0135 and 0.78, respectively. To bring the proposed cross-modal signal reconstruction framework to a practical application scenario, a teleoperation platform was built with a robot and an NVIDIA development board for the industrial task of remotely grasping objects. Results on this platform show a measured mean absolute error of 0.0126, a total end-to-end delay of 127 ms, and a reconstruction model delay of 98 ms. A questionnaire was also used to assess user satisfaction: satisfaction with haptic realism had a mean of 4.43 and a variance of 0.72, and satisfaction with delay had a mean of 3.87 and a variance of 1.07.
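The two quantitative measures reported above can be made concrete with a short sketch. The `mean_absolute_error` helper is the textbook definition applied to two equal-length signals; how the paper normalizes or aggregates its signals before computing the metric is not stated here, so that part is an assumption. The delay figures are the ones reported in the abstract, and the sample signals are hypothetical.

```python
def mean_absolute_error(reference, reconstructed):
    """Average absolute difference between two equal-length signals."""
    if not reference or len(reference) != len(reconstructed):
        raise ValueError("signals must be non-empty and of equal length")
    total = sum(abs(r - p) for r, p in zip(reference, reconstructed))
    return total / len(reference)

# Hypothetical haptic samples, only to exercise the function.
EXAMPLE_MAE = mean_absolute_error([0.5, 0.25, 0.0], [0.25, 0.25, 0.5])

# Delay figures reported in the abstract (milliseconds).
TOTAL_DELAY_MS = 127
MODEL_DELAY_MS = 98
OTHER_DELAY_MS = TOTAL_DELAY_MS - MODEL_DELAY_MS  # everything outside the model
MODEL_SHARE = MODEL_DELAY_MS / TOTAL_DELAY_MS     # fraction spent in the model
```

With the reported figures, the reconstruction model accounts for roughly 77% of the end-to-end delay, which is consistent with the conclusion that model complexity is the main target for optimization.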

Conclusions: The results on the dataset demonstrate the practicality of the constructed VisTouch dataset and the accuracy of the video-assisted haptic reconstruction model. The test results on the teleoperation platform indicate that users consider the haptic signals generated by the model fairly close to the actual signals but are only moderately satisfied with the running time of the algorithm; that is, the complexity of the model needs further optimization.

Key words: 6G, cross-modal signal reconstruction, multi-modal dataset, 3D CNN, GAN

