Medical named entity recognition algorithm based on probability distribution difference

Big Data Research, 2023, Vol. 9, Issue (4): 159-171. doi: 10.11959/j.issn.2096-0271.2023008

• Research •

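Cong LIU¹, Xuefeng LYU¹, Honglin WANG¹, Xiaowei WANG², Jin LU², Shun SUN¹, Songqi HU¹

¹ Information Center, Logistic Support Department of CMC, Beijing 100190, China
² Changsha Civi-military Advanced Technology Research Limited Company, Changsha 410205, China

Online: 2023-07-01 Published: 2023-07-01
About the authors:
Cong LIU (1985-), male, Ph.D., engineer at the Information Center, Logistic Support Department of CMC. His main research interests are healthcare big data and healthcare informatization.
Xuefeng LYU (1979-), male, senior engineer at the Information Center, Logistic Support Department of CMC. His main research interests are healthcare big data and healthcare informatization.
Honglin WANG (1988-), male, engineer at the Information Center, Logistic Support Department of CMC. His main research interest is logistics informatization.
Xiaowei WANG (1980-), male, Ph.D., senior engineer at Changsha Civi-military Advanced Technology Research Limited Company. His main research interests are natural language processing and big data.
Jin LU (1993-), male, engineer at Changsha Civi-military Advanced Technology Research Limited Company. His main research interests are natural language processing and artificial intelligence.
Shun SUN (1980-), male, engineer at the Information Center, Logistic Support Department of CMC. His main research interest is health informatization.
Songqi HU (1988-), male, engineer at the Information Center, Logistic Support Department of CMC. His main research interest is health informatization.
Supported by:
Key Program of Scientific Research of Army Logistics (BS220R007)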

Abstract

Medical named entity recognition extracts medical entities that refer to specific concepts from medical text and is a fundamental task in medical information extraction. Current mainstream medical named entity recognition algorithms are generally based on deep learning and require large numbers of high-quality labeled samples for model training. However, labeling samples in the medical domain is costly, which severely limits improvements in model performance. To reduce a model's demand for labeled samples, an important approach is to apply active learning: design a sound sampling strategy that automatically selects high-value samples for priority labeling, so that the model converges earlier. Existing algorithms generally design their sampling strategies around features such as sample length and recognition probability, and overlook the class distribution of samples, a deeper feature, which leads to low recall in named entity recognition. An active learning algorithm based on probability distribution difference was proposed, which evaluates the labeling value of a sample by computing the probability distribution difference between samples and dynamically optimizes the model whenever the labeled sample set is updated. Experiments on real medical examination texts show that, compared with existing algorithms, the proposed algorithm needs more than 10% less labeled data to reach the same model performance, and improves the F1 score by more than 5% given the same number of labeled samples.

Key words: medical named entity recognition, deep learning, active learning, probability distribution

CLC number: TP391

Cong LIU, Xuefeng LYU, Honglin WANG, Xiaowei WANG, Jin LU, Shun SUN, Songqi HU. Medical named entity recognition algorithm based on probability distribution difference[J]. Big Data Research, 2023, 9(4): 159-171.


References (29)

[1]	YANG W, LIU Y R, MENG Y, et al. Discussion on standardization management of clinical medical terminology[J]. China Health Standard Management, 2021, 12(12): 1-4.
[2]	ZHAO J Y, GAO P, ZHU Y J, et al. The application of artificial intelligence could improve primary health care provision in China[J]. Chinese General Practice, 2017, 20(34): 4219-4223.
[3]	ZENG X T, XU C Y, ZHANG Y, et al. Research progress on artificial intelligence in the standardization system construction of medical big data[J]. Beijing Biomedical Engineering, 2019, 38(6): 639-643.
[4]	ZHENG Q, LIU Q J, WANG Z H, et al. Research and development on biomedical named entity recognition[J]. Application Research of Computers, 2010, 27(3): 811-815, 832.
[5]	SETTLES B. Active learning literature survey[J]. Machine Learning, 2010, 15(2): 201-221.
[6]	HANISCH D, FUNDEL K, MEVISSEN H T, et al. ProMiner: rule-based protein and gene entity recognition[J]. BMC Bioinformatics, 2005, 6(Suppl 1): S14.
[7]	LIU Y J, CHE W X, LIU T, et al. A comparison study of sequence labeling methods for Chinese word segmentation, POS tagging models[C]// The 6th Youth Conference of Computational Linguistics. [S.l.: s.n.], 2012: 26-34.
[8]	WANG H C, ZHAO T J. SVM-based biomedical named entity recognition[J]. Journal of Harbin Engineering University, 2006, 27(S1): 570-574.
[9]	MORWAL S, CHOPRA D. NERHMM: a tool for named entity recognition based on hidden Markov model[J]. International Journal on Natural Language Computing, 2013, 2(2): 43-49.
[10]	PATIL N, PATIL A, PAWAR B V. Named entity recognition using conditional random fields[J]. Procedia Computer Science, 2020, 167: 1181-1188.
[11]	LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2016.
[12]	OUYANG E, LI Y X, JIN L, et al. Exploring N-gram character presentation in bidirectional RNN-CRF for Chinese clinical named entity recognition[C]// Proceedings of China Conference on Knowledge Graph and Semantic Computing 2017. [S.l.: s.n.], 2017.
[13]	DONG X S, CHOWDHURY S, QIAN L J, et al. Transfer bi-directional LSTM RNN for named entity recognition in Chinese electronic medical records[C]// Proceedings of 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services. Piscataway: IEEE Press, 2017: 1-4.
[14]	ZHANG Z C, ZHANG Y, ZHOU T. Medical knowledge attention enhanced neural model for named entity recognition in Chinese EMR[C]// Proceedings of China National Conference on Chinese Computational Linguistics, International Symposium on Natural Language Processing Based on Naturally Annotated Big Data. Cham: Springer, 2018: 376-385.
[15]	WANG Q, XIA Y H, ZHOU Y M, et al. Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition[J]. Journal of Biomedical Informatics, 2019, 92: 103133.
[16]	QIU J H, WANG Q, ZHOU Y M, et al. Fast and accurate recognition of Chinese clinical named entities with residual dilated convolutions[C]// Proceedings of 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Piscataway: IEEE Press, 2019: 935-942.
[17]	LI X Y, ZHANG H, ZHOU X H. Chinese clinical named entity recognition with variant neural structures based on BERT methods[J]. Journal of Biomedical Informatics, 2020, 107: 103422.
[18]	ZHANG C F. Named entity recognition algorithm based on active learning[J]. Computer and Modernization, 2021(7): 18-22.
[19]	LU N J. Research on Chinese medical named entity recognition combined with active learning[D]. Shanghai: East China Normal University, 2020.
[20]	SHANNON C E. A mathematical theory of communication[J]. Bell System Technical Journal, 1948, 27(4): 623-656.
[21]	LEWIS D D, CATLETT J. Heterogeneous uncertainty sampling for supervised learning[M]// Machine learning proceedings 1994. Amsterdam: Elsevier, 1994: 148-156.
[22]	SCHEFFER T, DECOMAIN C, WROBEL S. Active hidden Markov models for information extraction[M]// Advances in intelligent data analysis. Heidelberg: Springer, 2001: 309-318.
[23]	DEVLIN J, CHANG M, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint, 2018, arXiv:1810.04805.
[24]	GRAVES A, SCHMIDHUBER J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures[J]. Neural Networks, 2005, 18(5/6): 602-610.
[25]	SUTTON C. An introduction to conditional random fields[J]. Foundations and Trends® in Machine Learning, 2012, 4(4): 267-373.
[26]	KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv preprint, 2014, arXiv:1412.6980.
[27]	ZAN H Y, LI W X, ZHANG K L, et al. Building a pediatric medical corpus: word segmentation and named entity annotation[M]// Lecture notes in computer science. Cham: Springer, 2021: 652-664.
[28]	LAN Z, CHEN M, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations[J]. arXiv preprint, 2019, arXiv:1909.11942.
[29]	DIAO S Z, BAI J X, SONG Y, et al. ZEN: pre-training Chinese text encoder enhanced by N-gram representations[C]// Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: Association for Computational Linguistics, 2020.

Data	Annotation result
DR摄片（二次曝光）[右手正斜位]	[b:{[11,12,"右手"],1}, c:{[13,15,"正斜位"],1}]
磁共振3.0T平扫	[b:{[],0}, c:{[7,8,"平扫"],1}]
头颅神经外科移动CT平扫+三维（神外专用）	[b:{[0,1,"头颅"],1}, c:{[9,10,"平扫"],[12,13,"三维"],2}]
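
Each row of the annotation table above stores, per entity type, the character offsets of an entity, its surface string, and an entity count. As a minimal sketch of how such spans might be turned into character-level BIO tags for sequence labeling, the snippet below assumes 0-based inclusive offsets and the type keys `b` and `c` as they appear in the table. The helper `spans_to_bio` and the `B-`/`I-` tag names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_inclusive, surface text)


def spans_to_bio(text: str, spans_by_type: Dict[str, List[Span]]) -> List[str]:
    """Convert inclusive character spans to one BIO tag per character."""
    tags = ["O"] * len(text)
    for entity_type, spans in spans_by_type.items():
        for start, end, surface in spans:
            # Sanity check: the offsets must select the recorded surface string.
            if text[start:end + 1] != surface:
                raise ValueError(f"span {start}-{end} does not match {surface!r}")
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, end + 1):
                tags[i] = f"I-{entity_type}"
    return tags


# Second row of the table: no "b" entity, one "c" entity at offsets 7..8.
text = "磁共振3.0T平扫"
annotation = {"b": [], "c": [(7, 8, "平扫")]}
tags = spans_to_bio(text, annotation)
print(list(zip(text, tags)))
```

The surface-string check is a deliberate choice: it rejects offsets that do not line up with the text, which is a common failure when annotations and text are tokenized differently.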

Model	Epochs	Batch size	Initial learning rate
ALBERT-base[28]	10	32	5×10^-5
ALBERT-xxlarge[28]	10	12	1×10^-5
ZEN[29]	10	20	4×10^-5
BiLSTM-CRF[15]	10	32	3×10^-4
Proposed method	10	32	1×10^-4

Model	Performance (F1 score)	Training samples
ALBERT-base[28]	58.5	15 000
ALBERT-xxlarge[28]	61.7	15 000
ZEN[29]	60.9	15 000
BiLSTM-CRF[15]	57.6	15 000
Proposed method	61.6	10 000
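
The comparison above is reported as an F1 score. The page does not state how F1 is computed, so the following is only a sketch under the assumption of exact-match, entity-level scoring, where a predicted entity counts as correct only if its type and both boundaries match. The function `entity_f1` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[str, int, int]  # (entity type, start, end_inclusive)


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Exact-match entity-level F1 as a fraction in [0, 1]."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("b", 11, 12), ("c", 13, 15)}
pred = {("b", 11, 12), ("c", 13, 14)}  # second span has a wrong right boundary
score = entity_f1(gold, pred)
print(round(score, 3))
```

Here one of two predictions is correct and one of two gold entities is found, so precision and recall are both 0.5 and the function returns 0.5.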


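Algorithm 1: Active learning procedure based on probability distribution difference
Input: unlabeled sample set U
Step 1: Randomly draw a subset of samples L from U and annotate them on annotation platform A;
Step 2: Build the BERT-BiLSTM-CRF entity recognition model M and train it on the currently labeled samples;
Step 3: Use sampling strategy P to select from the unlabeled set U the samples with the largest difference values;
Step 4: Annotate the selected samples on annotation platform A to obtain newly labeled samples;
Step 5: Update the labeled sample set L;
Step 6: Update the sampling strategy function P based on the updated sample set L;
Step 7: Retrain model M on the updated sample set L;
Step 8: Validate the updated model M on the test set;
if the convergence condition is met: stop iterating;
else: repeat Step 3 to Step 8;
Output: the augmented sample set L and the final trained model M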
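
The loop in Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the page does not define the difference measure, so Jensen-Shannon divergence between a sample's predicted label distribution and the mean distribution over the labeled set is swapped in as a stand-in for sampling strategy P. The model, the annotation platform, and the evaluation scores are all hypothetical stubs.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Dist = List[float]
Model = Callable[[str], Dist]


def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Dist, y: Dist) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_dist(dists: Sequence[Dist]) -> Dist:
    """Component-wise mean of several distributions."""
    return [sum(col) / len(dists) for col in zip(*dists)]


def active_learning(
    unlabeled: List[str],
    annotate: Callable[[str], str],            # stand-in for platform A
    train: Callable[[Dict[str, str]], Model],  # stand-in for fitting model M
    evaluate: Callable[[Model], float],        # stand-in for test-set validation
    seed_size: int = 1,
    batch_size: int = 1,
    target: float = 0.9,
    max_rounds: int = 10,
) -> Tuple[Dict[str, str], Model]:
    pool = list(unlabeled)
    random.Random(0).shuffle(pool)
    # Step 1: annotate a random seed set L.
    labeled = {x: annotate(x) for x in pool[:seed_size]}
    pool = pool[seed_size:]
    # Step 2: train the initial model M.
    model = train(labeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Steps 3 and 6: the reference is rebuilt from the current L each round,
        # so the sampling strategy follows the updated labeled set.
        reference = mean_dist([model(x) for x in labeled])
        pool.sort(key=lambda x: js_divergence(model(x), reference), reverse=True)
        picked, pool = pool[:batch_size], pool[batch_size:]
        # Steps 4-5: annotate the picked samples and add them to L.
        labeled.update({x: annotate(x) for x in picked})
        # Steps 7-8: retrain, validate, and stop once the target is reached.
        model = train(labeled)
        if evaluate(model) >= target:
            break
    return labeled, model


def toy_predict(text: str) -> Dist:
    """Stub model: smoothed share of digit / ASCII-letter / other characters."""
    counts = [1.0, 1.0, 1.0]
    for ch in text:
        counts[0 if ch.isdigit() else 1 if ch.isascii() and ch.isalpha() else 2] += 1
    return [c / sum(counts) for c in counts]


fake_scores = iter([0.5, 0.95])  # hypothetical validation scores per round
samples = ["DR摄片（二次曝光）[右手正斜位]", "磁共振3.0T平扫", "头颅神经外科移动CT平扫+三维（神外专用）"]
labeled, model = active_learning(
    samples,
    annotate=lambda x: "<labels for " + x + ">",
    train=lambda current: toy_predict,  # a real system would fit BERT-BiLSTM-CRF here
    evaluate=lambda m: next(fake_scores),
)
print(len(labeled))
```

Passing `annotate`, `train`, and `evaluate` as callables keeps the loop independent of any particular model or labeling tool, so a real recognizer could replace the stubs without changing the control flow.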