[1] China Academy of Information and Communications Technology. 6G overall vision and potential key technology white paper[EB]. 2021. (in Chinese)
[2] VAN DEN BERG D, GLANS R, et al. Challenges in haptic communications over the Tactile Internet[J]. IEEE Access, 2017, 5: 23502-23518.
[3] ZHOU L, WU D, CHEN J, et al. Cross-modal collaborative communications[J]. IEEE Wireless Communications, 2019, 27(2): 112-117.
[4] WEI X, ZHOU L. AI-enabled cross-modal communications[J]. IEEE Wireless Communications, 2021, 28(4): 182-189.
[5] GAO Y, WEI X, ZHOU L. Preliminary study on theory and key technology of cross-modal communications[J]. Journal of Communication University of China (Science and Technology), 2021, 28(1): 55-63. (in Chinese)
[6] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[7] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[8] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[9] BAZZICA A, VAN GEMERT J C, LIEM C C S, et al. Vision-based detection of acoustic timed events: a case study on clarinet note onsets[J]. arXiv preprint arXiv:1706.09556, 2017.
[10] LI B, LIU X, DINESH K, et al. Creating a musical performance dataset for multimodal music analysis: challenges, insights, and applications[J]. IEEE Transactions on Multimedia, submitted. Available: https://arxiv.org/abs/1612.08727, 2016.
[11] ZHAO H, GAN C, ROUDITCHENKO A, et al. The sound of pixels[C]. Proceedings of the European Conference on Computer Vision. 2018: 570-586.
[12] MONTESINOS J F, SLIZOVSKAIA O, HARO G. Solos: a dataset for audio-visual music analysis[C]. 2020 IEEE 22nd International Workshop on Multimedia Signal Processing. 2020: 1-6.
[13] KURMI V K, BAJAJ V, PATRO B N, et al. Collaborative learning to generate audio-video jointly[C]. IEEE International Conference on Acoustics, Speech and Signal Processing. 2021: 4180-4184.
[14] ROTH J, CHAUDHURI S, KLEJCH O, et al. AVA Active Speaker: an audio-visual dataset for active speaker detection[C]. IEEE International Conference on Acoustics, Speech and Signal Processing. 2020: 4492-4496.
[15] TSUCHIDA S, FUKAYAMA S, HAMASAKI M, et al. AIST Dance Video Database: multi-genre, multi-dancer, and multi-camera database for dance information processing[C]. Proceedings of the 20th International Society for Music Information Retrieval Conference. 2019: 501-510.
[16] LI R, YANG S, ROSS D A, et al. AI Choreographer: music conditioned 3D dance generation with AIST++[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 13401-13412.
[17] HONG S, IM W, YANG H S. Content-based video-music retrieval using soft intra-modal structure constraint[J]. arXiv preprint arXiv:1704.06761, 2017.
[18] LI Y, ZHU J Y, TEDRAKE R, et al. Connecting touch and vision via cross-modal prediction[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10609-10618.
[19] YUAN W, DONG S, ADELSON E H. GelSight: high-resolution robot tactile sensors for estimating geometry and force[J]. Sensors, 2017, 17(12): 2762.
[20] SUNDARAM S, KELLNHOFER P, LI Y, et al. Learning the signatures of the human grasp using a scalable tactile glove[J]. Nature, 2019, 569(7758): 698-702.
[21] DUAN B, WANG W, TANG H, et al. Cascade attention guided residue learning GAN for cross-modal translation[C]. 2020 25th International Conference on Pattern Recognition. 2021: 1336-1343.
[22] HAO W, ZHANG Z, GUAN H. CMCGAN: a uniform framework for cross-modal visual-audio mutual generation[C]. Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).