Big Data Research (大数据)


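Medical Named Entity Recognition Algorithm Based on Probability Distribution Difference

LIU Cong1, LV Xuefeng1, WANG Honglin1, WANG Xiaowei2, LU Jin2, SUN Shun1, HU Songqi1

  1. Information Center, Logistic Support Department of CMC, Beijing 100190, China

  2. Changsha Civil-military Advanced Technology Research Limited Company, Changsha 410205, China

  • About the authors: LIU Cong (1985-), male, Ph.D., engineer; his research interests include healthcare big data and healthcare informatization. LV Xuefeng (1979-), male, master's degree, senior engineer; his research interests include healthcare big data and healthcare informatization. WANG Honglin (1988-), male, master's degree, engineer; his research interest is logistics informatization. WANG Xiaowei (1980-), male, Ph.D., senior engineer; his research interests include natural language processing and big data. LU Jin (1993-), male, master's degree, engineer; his research interests include natural language processing and artificial intelligence. SUN Shun (1980-), male, bachelor's degree, engineer; his research interest is health informatization. HU Songqi (1988-), male, master's degree, engineer; his research interest is health informatization.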

Abstract:

Medical named entity recognition extracts spans of medical text that refer to specific concepts and is a fundamental task in medical information extraction. Current mainstream approaches are generally based on deep learning and require large numbers of high-quality labeled samples for model training. However, labeling samples in the medical domain is costly, which severely limits improvements in model performance. An important way to reduce a model's demand for labeled samples is active learning: a well-designed sampling strategy automatically selects high-value samples to be labeled first, so that the model converges earlier. Existing algorithms typically design their sampling strategies around features such as sample length or recognition probability and ignore the deeper feature of sample category distribution, which leads to low recall in named entity recognition. This paper proposes an active learning algorithm based on probability distribution difference. It evaluates the labeling value of samples by computing the difference in probability distribution between samples, and it dynamically optimizes the model whenever the labeled samples are updated. Experiments on real medical examination texts show that, compared with existing algorithms, the proposed algorithm needs more than 10% less labeled data to reach the same model performance, and with the same amount of labeled data it improves the F1 score by more than 5%.

Key words:

 medical named entity recognition, deep learning, active learning, probability distribution
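
The sampling idea summarized in the abstract can be illustrated with a small sketch. The abstract does not say which distributions are compared or which divergence measure is used, so everything below is an assumption made for illustration only: each unlabeled sentence is summarized by the average of its per-token predicted label probabilities, that summary is compared with the label distribution of the already labeled set using Jensen-Shannon divergence, and the sentences with the largest divergence are selected for labeling. The function names (`js_divergence`, `sentence_distribution`, `select_for_labeling`) are hypothetical and do not come from the paper.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in values]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions.

    Assumption: the paper's 'probability distribution difference' is not
    specified in the abstract; JS divergence is one possible choice.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sentence_distribution(token_probs):
    """Average per-token label probabilities into one sentence-level distribution."""
    n_labels = len(token_probs[0])
    sums = [0.0] * n_labels
    for probs in token_probs:
        for i, pr in enumerate(probs):
            sums[i] += pr
    return normalize(sums)


def select_for_labeling(unlabeled, labeled_dist, k):
    """Return indices of the k sentences that differ most from labeled_dist.

    unlabeled: list of sentences, each a list of per-token label distributions
    labeled_dist: label distribution of the currently labeled set (assumed given)
    """
    scored = [
        (js_divergence(sentence_distribution(tp), labeled_dist), idx)
        for idx, tp in enumerate(unlabeled)
    ]
    # Highest divergence first; ties broken by original index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]
```

In an active learning loop built on this sketch, the selected sentences would be sent to annotators, `labeled_dist` would be recomputed from the enlarged labeled set, and the model would be retrained before the next selection round. How the paper itself performs the dynamic model update is not described in the abstract.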
