Big Data Research, 2022, Vol. 8, Issue (3): 128-139. doi: 10.11959/j.issn.2096-0271.2022025

• STUDY

A fast text structuring methodology of TCM medical records based on NLP

Xiaoxia XIAO1, Mingting LIU2, Fengtianci YANG3, Jianjianxian LIU4, Yang YANG5, Yue SHI6   

    1 School of Informatics, Hunan University of Chinese Medicine, Changsha 410208, China
    2 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
    3 The College of Chemistry of Xiangtan University, Xiangtan 411105, China
    4 Hunan Zeta Technology Co., Ltd., Changsha 410012, China
    5 College of Engineering and Technology, Northeast Forestry University, Harbin 150040, China
    6 Beijing Ruidi Hongxin Science and Trade Co., Ltd., Beijing 100071, China
  • Online: 2022-05-15  Published: 2022-05-01
  • Supported by:
    The National Key Research and Development Program of China (2017YFC1703300); Open Fund Program of School of Informatics, Hunan University of Chinese Medicine (2018DK02)

Abstract:

Traditional Chinese medicine (TCM) medical records are the most valuable documents from which TCM doctors learn clinical experience. Structured TCM medical records make it easier to extract clinical knowledge with machine learning and other methods, which can accelerate the inheritance of TCM. A fast text structuring methodology for TCM medical records based on natural language processing (NLP) was proposed to structure clinical cases. Essence of Chinese Modern Famous Chinese Medical Records was selected as the source of medical records to be structured; the text in screenshots of the medical records was recognized by optical character recognition (OCR) and then initially structured. A simple symptom dictionary was constructed, and an improved N-gram model combined with the dictionary was used to recognize symptoms, signs and other terms in the text, with the dictionary updated during the structuring process. In total, 4,754 text medical records were structured. The final model was tested on 666 medical records selected randomly from the corpus, and its F1 value reached 82.99%.
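
The abstract does not spell out the improved N-gram model, so the following is only a minimal sketch of the general idea of combining character n-grams with a symptom dictionary. It assumes a simple longest-match scan over n-grams of up to `max_n` characters; the function names `extract_terms` and `f1_score`, the `max_n` parameter, the sample dictionary, and the sample record are all hypothetical and are not taken from the paper.

```python
def extract_terms(text, dictionary, max_n=4):
    """Scan left to right, keeping the longest character n-gram found in the dictionary.

    This is an illustrative longest-match heuristic, not the paper's improved N-gram model.
    """
    found = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest n-gram first so longer dictionary entries win over their substrings.
        for n in range(min(max_n, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in dictionary:
                match = candidate
                break
        if match:
            found.append(match)
            i += len(match)  # skip past the recognized term
        else:
            i += 1  # no dictionary entry starts here; move one character on
    return found


def f1_score(predicted, gold):
    """F1 over the sets of predicted and gold terms (harmonic mean of precision and recall)."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, for illustration only.
symptom_dict = {"头痛", "发热", "咳嗽", "恶寒", "舌苔薄白"}
record = "患者头痛发热,咳嗽三日,舌苔薄白。"

terms = extract_terms(record, symptom_dict, max_n=4)
gold_terms = ["头痛", "发热", "咳嗽", "舌苔薄白", "恶寒"]
score = f1_score(terms, gold_terms)
```

In this sketch, updating the dictionary during structuring would amount to adding newly confirmed terms to `symptom_dict` before the next record is processed; how the paper selects such terms is not stated in the abstract.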

Key words: N-gram model, NLP, TCM medical records, Chinese word segmentation, OCR

