End-to-end scene text detection and recognition algorithm based on Transformer decoders

doi:10.11959/j.issn.1000-436x.2023070

Abstract:

To detect and recognize arbitrarily shaped text in natural scenes, a novel scene text detection and recognition algorithm that can be trained end to end was proposed. Firstly, a detection branch built on a segmentation-based text aware module was introduced to detect scene text from visual features extracted by a convolutional network. Then, a recognition branch based on a Transformer vision module and a Transformer language module encoded the text features of the detection results. Finally, a fusion gate in the recognition branch fused the encoded text features to output the scene text. Experimental results on three benchmark datasets, Total-Text, ICDAR2013 and ICDAR2015, show that the proposed algorithm performs well in recall, precision and F-score, and has certain advantages in efficiency.
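
The abstract does not give the form of the fusion gate. As a minimal sketch only, the Python below assumes a common gating scheme: a sigmoid gate computed from the concatenated visual and linguistic features, used to blend the two element-wise. The function name `fusion_gate`, the `weights` and `bias` parameters, and the gating formula itself are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(visual, linguistic, weights, bias):
    """Blend two feature vectors of equal length d with a learned gate.

    Assumed (hypothetical) form, not taken from the paper:
        g = sigmoid(W [visual; linguistic] + b)
        fused = g * visual + (1 - g) * linguistic
    weights has d rows of length 2 * d; bias has length d.
    """
    if len(visual) != len(linguistic):
        raise ValueError("visual and linguistic features must have equal length")
    concat = list(visual) + list(linguistic)
    fused = []
    for i, (v, l) in enumerate(zip(visual, linguistic)):
        # Gate value for dimension i, computed from both feature vectors.
        z = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * l)
    return fused


if __name__ == "__main__":
    d = 2
    zero_weights = [[0.0] * (2 * d) for _ in range(d)]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(fusion_gate([1.0, 0.0], [0.0, 1.0], zero_weights, [0.0, 0.0]))
```

With a gate of this kind, a large positive pre-activation keeps the visual feature and a large negative one keeps the linguistic feature, so the model can fall back on language context when the visual evidence is weak.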

Key words: text detection, text recognition, end-to-end, Transformer

CLC Number: TP391

Jinzhi ZHENG, Ruyi JI, Libo ZHANG, Chen ZHAO. End-to-end scene text detection and recognition algorithm based on Transformer decoders[J]. Journal on Communications, 2023, 44(5): 64-78.

Balance factor γ_v	Balance factor γ_L	Balance factor γ_F	Rotation angle 45°	Rotation angle 60°
2	2	1	73.0%	72.3%
1	1	1	74.4%	74.2%
2	1	1	72.3%	73.3%
1	2	1	73.6%	74.5%
1	1	2	73.7%	73.9%

Algorithm	Recall (45°)	Precision (45°)	F-score (45°)	Frame rate (45°) (frame/s)	Recall (60°)	Precision (60°)	F-score (60°)	Frame rate (60°) (frame/s)
CharNet	35.5%	34.2%	33.9%	0.3	8.4%	10.3%	9.3%	0.4
Mask TextSpotter	45.8%	66.4%	54.2%	2.9	48.3%	68.2%	56.6%	2.8
Proposed algorithm	63.7%	89.4%	74.4%	9.3	63.7%	88.9%	74.2%	8.9
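
As a quick check on how the F-score column relates to recall and precision, the sketch below assumes the standard definition, the harmonic mean F = 2PR / (P + R), and applies it to two 45° rows of the table above. The function name `f_score` is illustrative; the page does not state the exact evaluation protocol.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


# (recall, precision) at 45° rotation, copied from the table above.
rows = {
    "Mask TextSpotter": (45.8, 66.4),
    "Proposed algorithm": (63.7, 89.4),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: F = {f_score(recall, precision):.1f}%")
# Mask TextSpotter: F = 54.2%
# Proposed algorithm: F = 74.4%
```

Both rows reproduce the reported F-scores. Rows that do not match this formula exactly may have been scored under a different protocol, which the page does not specify.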

Algorithm	End-to-end, generic lexicon (G)	End-to-end, weak lexicon (W)	End-to-end, strong lexicon (S)	End-to-end, no lexicon	Frame rate (frame/s)
Mask TextSpotter v1	62.4%	73.0%	79.3%	-	-
CharNet R-50	62.2%	74.5%	80.2%	60.72%	0.8
TextBoxes++	51.9%	65.9%	73.3%	-	-
TextDragon	65.2%	78.3%	82.5%	-	-
Text Perceptron	65.1%	76.6%	80.5%	-	-
Boundary TextSpotter	64.1%	75.2%	79.7%	-	-
PGNet	63.5%	78.3%	83.3%	-	-
MANGO	67.3%	78.9%	81.8%	-	-
ABCNet v2	73.0%	78.5%	82.7%	-	-
TOSS	52.4%	59.6%	65.9%	-	-
SPTS	65.8%	70.2%	77.5%	-	1.5
SPTS v2	70.3%	75.6%	81.7%	-	-
Proposed algorithm	73.7%	76.8%	80.5%	69.2%	4.4

Algorithm	End-to-end, no lexicon	End-to-end, full lexicon	Frame rate (frame/s)
Mask TextSpotter v1	52.9%	71.8%	-
FOTS	32.2%	-	-
CharNet H-88	66.6%	-	0.5
TextDragon	48.8%	74.8%	-
Mask TextSpotter v2	65.3%	77.4%	3.1
Unconstrained	67.8%	-	-
ABCNet	64.2%	75.7%	-
Boundary TextSpotter	65.0%	76.1%	-
PGNet	63.1%	-	-
ABCNet v2	70.4%	78.1%	3.5
TOSS	65.1%	74.8%	-
Proposed algorithm	70.9%	78.1%	6.4

Module	Recall (no lexicon)	Precision (no lexicon)	F-score (no lexicon)	Recall (full lexicon)	Precision (full lexicon)	F-score (full lexicon)
TVM	61.9%	79.6%	66.7%	68.7%	87.4%	76.9%
TAM+TVM	58.3%	79.4%	67.3%	69.4%	88.4%	77.7%
TVM+TLM	61.9%	81.3%	70.3%	70.3%	87.3%	77.8%
TAM+TVM+TLM	62.1%	82.5%	70.9%	70.3%	87.7%	78.1%
