Offline visual aid system for the blind based on image captioning

doi:10.11959/j.issn.1000-0801.2022014

Abstract

To address the inconveniences of existing visual aid devices for the blind, a method for running an image captioning model on portable mobile devices by means of model pruning is discussed. Image captioning models and model pruning techniques are reviewed, and an improved pruning algorithm tailored to image captioning models is proposed. Experimental results show that, while preserving accuracy, the pruned image captioning model substantially reduces both the processing time and the battery capacity consumed during operation, so that the system can describe the surrounding environment quickly and accurately and announce the description by voice anytime and anywhere.

Key words: visual aid system, image captioning model, model compression and acceleration, model pruning algorithm

CLC number: TP391

Yue CHEN, Yu GUO, Yuanyan XIE, Zhenqiang MI. Offline visual aid system for the blind based on image captioning[J]. Telecommunications Science, 2022, 38(1): 61-72.


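Table 1 Physical meaning of the symbols used in the pruning algorithm formulas

Parameter	Description
$D = \{X = \{x_{0}, x_{1}, \dots, x_{N}\},\ Y = \{y_{0}, y_{1}, \dots, y_{N}\}\}$	Training set composed of the input set $X$ and the output set $Y$; $x_i$ and $y_i$ denote the $i$-th input and output, respectively
$W = \{(w_{1}^{1}, b_{1}^{1}), (w_{1}^{2}, b_{1}^{2}), \dots, (w_{L}^{C_{l}}, b_{L}^{C_{l}})\}$	Network model parameters; $(w_{l}^{k}, b_{l}^{k})$ denotes the parameters of layer $l$
$W'$	Set of network model parameters after pruning, $W' \subset W$
$C(D \mid W)$	Loss function of the pretrained model
$C(D \mid W')$	Loss function of the pruned model
$C(D, h_i)$	Model loss function defined in terms of $h_i$
$B$	Number of nonzero parameters
$h = \{z_{0}^{(1)}, z_{0}^{(2)}, \dots, z_{L}^{(C_{l})}\}$	Set of feature maps
$z_l$	Feature map of the $l$-th convolutional layer
$w_{l}^{(k)}$	$k$-th convolution kernel of the $l$-th convolutional layer
$g_l$	Binary variable, $g_l \in \{0, 1\}$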
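
The symbols in Table 1 are the ones commonly used when pruning is posed as keeping the pruned loss $C(D \mid W')$ close to $C(D \mid W)$ with at most $B$ nonzero parameters, and when a first-order Taylor expansion is used to estimate how much the loss changes if a feature map $h_i$ is set to zero. The formulas themselves are not reproduced on this page, so the following is only a minimal sketch of that generic first-order Taylor criterion, not the improved algorithm the paper proposes. The function names and the toy numbers are hypothetical.

```python
def taylor_saliency(activations, gradients):
    """First-order Taylor estimate of |C(D, h_i = 0) - C(D, h_i)| per feature map.

    activations[i] and gradients[i] are flat lists holding feature map h_i and
    dC/dh_i at the same positions (toy stand-ins, not real network outputs).
    """
    scores = []
    for act, grad in zip(activations, gradients):
        # Average activation * gradient over all positions, then take the magnitude.
        mean_product = sum(a * g for a, g in zip(act, grad)) / len(act)
        scores.append(abs(mean_product))
    return scores


def prune_gates(scores, keep):
    """Return binary gates: 1 for the `keep` highest-scoring feature maps, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


# Toy example: three feature maps with four positions each (made-up numbers).
activations = [[1.0, 2.0, 0.5, 0.0], [0.1, 0.0, 0.2, 0.1], [3.0, 1.0, 2.0, 2.0]]
gradients = [[0.2, -0.1, 0.4, 0.0], [0.5, 0.5, -0.5, 0.0], [-0.3, -0.3, -0.1, -0.2]]

scores = taylor_saliency(activations, gradients)
gates = prune_gates(scores, keep=2)
print(scores, gates)
```

In this toy run the second feature map has the smallest estimated effect on the loss, so its gate is set to 0. A real pipeline would obtain activations and gradients from a framework's backward pass and fine-tune the network after each pruning step.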



Score range	0.91~1	0.71~0.9	0.51~0.7	0~0.5
ROUGE-1	0.62	0.28	0.08	0.02
ROUGE-2	0.59	0.12	0.22	0.07
ROUGE-L	0.62	0.28	0.08	0.02
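
For readers unfamiliar with the metric, ROUGE-N measures the n-gram overlap between a generated caption and a reference caption. The snippet below is a simplified sketch of the recall-oriented ROUGE-N score, assuming whitespace tokenization and a single reference; the evaluation pipeline actually used for the table is not described on this page, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate (clipped counts)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "a man is riding a bike on the street"
candidate = "a man rides a bike on the road"

r1 = rouge_n_recall(candidate, reference, n=1)
r2 = rouge_n_recall(candidate, reference, n=2)
print(f"ROUGE-1 = {r1:.2f}, ROUGE-2 = {r2:.2f}")
```

Each row of the table sums to 1, so the entries appear to be the share of test captions whose score falls in each range; under that reading, the example caption above would count toward the 0.51~0.7 column for ROUGE-1.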

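Model	Average processing time per image (s)	Average battery capacity consumed per image (mAh)
Image captioning model before pruning	4.049	0.164
Image captioning model after pruning	2.337	0.269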
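
As a quick worked example, the relative change in the timing column can be computed directly from the two values in the table. This is plain arithmetic on the reported numbers, with hypothetical variable names; it makes no claim about how the measurements were taken.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Average processing time per image, in seconds, as reported in the table.
time_before_s = 4.049
time_after_s = 2.337

reduction = relative_reduction(time_before_s, time_after_s)
speedup = time_before_s / time_after_s
print(f"{reduction:.1%} less time per image, {speedup:.2f}x speed-up")
```

With the reported values, pruning cuts the average processing time per image by roughly 42%.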