End-to-end scene text detection and recognition algorithm based on Transformer decoders

doi:10.11959/j.issn.1000-436x.2023070

Abstract:

To detect and recognize arbitrarily shaped text in natural scenes, a novel scene text detection and recognition algorithm that can be trained end to end is proposed. First, a segmentation-based detection branch equipped with a text aware module (TAM) detects scene text from the visual features extracted by a convolutional network. Then, a recognition branch composed of a Transformer vision module (TVM) and a Transformer language module (TLM) encodes text features from the detection results. Finally, a fusion gate in the recognition branch fuses the encoded text features and outputs the scene text. Experiments on three benchmark datasets, Total-Text, ICDAR2013 and ICDAR2015, show that the proposed algorithm performs well in recall, precision and F-score, and has a certain advantage in time efficiency.

Key words: text detection, text recognition, end-to-end, Transformer
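
The abstract says that a fusion gate merges the features encoded by the vision and language modules, but it does not give the formula. A minimal sketch of one common gating scheme follows, assuming an element-wise sigmoid gate computed from the concatenated features; the function name `fusion_gate` and the parameters `weight` and `bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(f_vision, f_language, weight, bias):
    """Blend two feature vectors of length D with an element-wise gate.

    Assumed form (not confirmed by the source):
        g = sigmoid(W [f_vision; f_language] + b)
        fused = g * f_vision + (1 - g) * f_language
    weight is a D x 2D matrix given as a list of rows, bias has length D.
    Both stand in for parameters that would be learned during training.
    """
    concat = f_vision + f_language  # list concatenation, length 2D
    fused = []
    for row, b, v, l in zip(weight, bias, f_vision, f_language):
        g = sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        fused.append(g * v + (1.0 - g) * l)
    return fused


# With zero weights and bias the gate is 0.5 everywhere, so the output
# is the plain average of the two feature vectors.
f_v = [0.2, -1.0, 0.5]
f_l = [0.4, 0.0, -0.5]
zero_w = [[0.0] * 6 for _ in range(3)]
zero_b = [0.0] * 3
print(fusion_gate(f_v, f_l, zero_w, zero_b))
```

A gate of this kind lets the model lean on the visual feature where the image is clear and on the language feature where the characters are ambiguous; whether the paper uses exactly this form cannot be determined from the abstract.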

Chinese Library Classification (CLC) number: TP391

Jinzhi ZHENG, Ruyi JI, Libo ZHANG, Chen ZHAO. End-to-end scene text detection and recognition algorithm based on Transformer decoders[J]. Journal on Communications, 2023, 44(5): 64-78.

Balance factor γ_v	Balance factor γ_L	Balance factor γ_F	Rotation angle 45°	Rotation angle 60°
2	2	1	73.0%	72.3%
1	1	1	74.4%	74.2%
2	1	1	72.3%	73.3%
1	2	1	73.6%	74.5%
1	1	2	73.7%	73.9%
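
The table above does not define the balance factors. One plausible reading, assumed here and not confirmed by the text, is that γ_v, γ_L and γ_F weight the losses of the vision module, the language module and the fused output in the recognition branch. Under that assumption the weighting can be sketched as follows; the function and argument names are hypothetical.

```python
def total_recognition_loss(loss_vision, loss_language, loss_fusion,
                           gamma_v=1.0, gamma_l=1.0, gamma_f=1.0):
    """Weighted sum of three loss terms (assumed form, not from the source).

    gamma_v, gamma_l and gamma_f play the role of the balance factors
    listed in the table; the defaults correspond to the 1/1/1 row.
    """
    return (gamma_v * loss_vision
            + gamma_l * loss_language
            + gamma_f * loss_fusion)


# The 1/1/1 setting treats the three terms equally.
print(total_recognition_loss(1.0, 2.0, 3.0))
# The 2/2/1 setting doubles the weight of the vision and language terms.
print(total_recognition_loss(1.0, 2.0, 3.0, gamma_v=2.0, gamma_l=2.0))
```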

Algorithm	Recall, 45°	Precision, 45°	F-score, 45°	Frame rate, 45° (frame/s)	Recall, 60°	Precision, 60°	F-score, 60°	Frame rate, 60° (frame/s)
CharNet	35.5%	34.2%	33.9%	0.3	8.4%	10.3%	9.3%	0.4
Mask TextSpotter	45.8%	66.4%	54.2%	2.9	48.3%	68.2%	56.6%	2.8
Proposed algorithm	63.7%	89.4%	74.4%	9.3	63.7%	88.9%	74.2%	8.9
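
The table reports recall, precision and F-score without stating how the F-score is computed. Assuming the standard definition, the harmonic mean of recall and precision, the short sketch below recomputes it from the last row of the table; the helper name `f_score` is chosen here for illustration.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (standard F1 definition)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision (in percent) from the last row of the table.
rows = {
    "rotation 45 degrees": (63.7, 89.4),
    "rotation 60 degrees": (63.7, 88.9),
}
for name, (recall, precision) in rows.items():
    print(f"{name}: F = {f_score(recall, precision):.1f}%")
```

Rounded to one decimal place, the two results are 74.4% and 74.2%, which agree with the F-scores listed in that row.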

Algorithm	End-to-end, G (generic) lexicon	End-to-end, W (weak) lexicon	End-to-end, S (strong) lexicon	End-to-end, no lexicon	Frame rate (frame/s)
Mask TextSpotter v1	62.4%	73.0%	79.3%	-	-
CharNet R-50	62.2%	74.5%	80.2%	60.72%	0.8
TextBoxes++	51.9%	65.9%	73.3%	-	-
TextDragon	65.2%	78.3%	82.5%	-	-
Text Perceptron	65.1%	76.6%	80.5%	-	-
Boundary TextSpotter	64.1%	75.2%	79.7%	-	-
PGNet	63.5%	78.3%	83.3%	-	-
MANGO	67.3%	78.9%	81.8%	-	-
ABCNet v2	73.0%	78.5%	82.7%	-	-
TOSS	52.4%	59.6%	65.9%	-	-
SPTS	65.8%	70.2%	77.5%	-	1.5
SPTS v2	70.3%	75.6%	81.7%	-	-
Proposed algorithm	73.7%	76.8%	80.5%	69.2%	4.4
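
The G, W and S columns report results when the recognized string is constrained by a lexicon of candidate words. The source does not describe how the constraint is applied; a common post-processing step, assumed here purely for illustration, replaces each raw prediction with the lexicon word at the smallest edit distance. The function names and the sample lexicon below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def constrain_to_lexicon(prediction, lexicon):
    """Return the lexicon word closest to the prediction (case-insensitive).

    With an empty lexicon the prediction is returned unchanged, which
    corresponds to the no-lexicon setting.
    """
    if not lexicon:
        return prediction
    pred = prediction.lower()
    return min(lexicon, key=lambda word: edit_distance(pred, word.lower()))


print(constrain_to_lexicon("starbvcks", ["coffee", "starbucks", "street"]))
# prints "starbucks"
```

A smaller lexicon leaves fewer wrong candidates to choose from, so scores generally rise as the lexicon narrows, which is consistent with the S column being the highest of the three in every row of the table.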

Algorithm	End-to-end, no lexicon	End-to-end, full lexicon	Frame rate (frame/s)
Mask TextSpotter v1	52.9%	71.8%	-
FOTS	32.2%	-	-
CharNet H-88	66.6%	-	0.5
TextDragon	48.8%	74.8%	-
Mask TextSpotter v2	65.3%	77.4%	3.1
Unconstrained	67.8%	-	-
ABCNet	64.2%	75.7%	-
Boundary TextSpotter	65.0%	76.1%	-
PGNet	63.1%	-	-
ABCNet v2	70.4%	78.1%	3.5
TOSS	65.1%	74.8%	-
Proposed algorithm	70.9%	78.1%	6.4

Modules	Recall (no lexicon)	Precision (no lexicon)	F-score (no lexicon)	Recall (full lexicon)	Precision (full lexicon)	F-score (full lexicon)
TVM	61.9%	79.6%	66.7%	68.7%	87.4%	76.9%
TAM+TVM	58.3%	79.4%	67.3%	69.4%	88.4%	77.7%
TVM+TLM	61.9%	81.3%	70.3%	70.3%	87.3%	77.8%
TAM+TVM+TLM	62.1%	82.5%	70.9%	70.3%	87.7%	78.1%