Big Data ›› 2022, Vol. 8 ›› Issue (6): 26-39. doi: 10.11959/j.issn.2096-0271.2022058

• Special Topic: Big Data Technologies and Methods for the Humanities •

Research on information extraction methods for historical classics from the perspective of digital humanities

Lifan HAN1,2, Zijing JI1,2, Zirui CHEN1,2, Xin WANG1,2

  1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  2 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China
  • Online: 2022-11-15  Published: 2022-11-01
  • About the authors: Lifan HAN (1999- ), male, is a master's student at the College of Intelligence and Computing, Tianjin University. His research interests include natural language processing and knowledge graph construction.
    Zijing JI (1997- ), female, is a master's student at the College of Intelligence and Computing, Tianjin University. Her research interests include natural language processing and knowledge graph construction.
    Zirui CHEN (1998- ), male, is a master's student at the College of Intelligence and Computing, Tianjin University. His research interests include knowledge representation learning, knowledge graph question answering, and knowledge graph construction.
    Xin WANG (1981- ), male, Ph.D., is a professor and doctoral supervisor at the College of Intelligence and Computing, Tianjin University. His research interests include knowledge graph data management, graph databases, and large-scale knowledge processing.
  • Supported by:
    Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0108504); The National Natural Science Foundation of China (61972275)

Abstract:

Digital humanities aims to use modern computer and network technology to support traditional humanities research. Classical Chinese historical books are an important basis for historical research and learning, but because they are written in classical Chinese, which differs considerably from modern vernacular Chinese in both grammar and word meaning, they are not easy to read and understand. To address this problem, a method based on pre-trained models was proposed to extract entities and relations from historical books, so as to effectively obtain the rich information contained in historical texts. The model first replaces BERT's original pre-training tasks with multi-level pre-training tasks to fully capture semantic information, and adds structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. Then, to address the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics was built to obtain high-quality, large-scale entity and relation data, from which a classical Chinese knowledge extraction dataset was constructed to evaluate the model's performance and to fine-tune the model. Experiments on the constructed dataset and on the GulianNER dataset demonstrate the effectiveness of the proposed model.

Key words: historical classics, pre-trained model, information extraction, crowdsourcing mechanism
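The representation head described in the abstract (a convolution over BERT's contextual token embeddings, combined with a sentence-level aggregate) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the kernel weights, embedding dimensions, and mean-pooling aggregation are assumptions, and random vectors stand in for BERT outputs.

```python
# Hypothetical sketch: refine contextual token embeddings with a 1-D
# convolution over neighboring tokens, then concatenate each token vector
# with a sentence-level aggregate (here, mean pooling).

import random

def conv1d(embeddings, kernel, width=3):
    """Depthwise 1-D convolution over the token axis (zero padding, same length)."""
    dim = len(embeddings[0])
    pad = width // 2
    padded = [[0.0] * dim] * pad + embeddings + [[0.0] * dim] * pad
    out = []
    for t in range(len(embeddings)):
        vec = [sum(kernel[k] * padded[t + k][d] for k in range(width))
               for d in range(dim)]
        out.append(vec)
    return out

def sentence_aggregate(embeddings):
    """Sentence-level aggregation: mean pooling over all token vectors."""
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]

def token_representations(embeddings, kernel=(0.25, 0.5, 0.25)):
    """Each token's final representation: convolved vector + sentence vector."""
    convolved = conv1d(embeddings, kernel)
    sent = sentence_aggregate(convolved)
    return [vec + sent for vec in convolved]  # concatenation -> 2*dim features

random.seed(0)
tokens = [[random.random() for _ in range(4)] for _ in range(5)]  # 5 tokens, dim 4
reps = token_representations(tokens)
print(len(reps), len(reps[0]))  # → 5 8
```

In the paper's actual model these per-token vectors would feed a sequence-labeling layer for entity and relation extraction; the sketch only shows how local (convolutional) and global (sentence-level) context can be fused into one token representation.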

