Big Data Research (大数据)

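Research on information extraction methods for historical classics under the perspective of digital humanities

JI Zijing1, 2, CHEN Zirui1, 2, HAN Lifan1, 2, WANG Xin1, 2, *

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China

  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China

  • About the authors: JI Zijing (季紫荆, born 1997), female, is a master's student at the College of Intelligence and Computing, Tianjin University; her main research interests are natural language processing and knowledge graph construction. CHEN Zirui (陈子睿, born 1998), male, is a master's student at the College of Intelligence and Computing, Tianjin University; his main research interests are knowledge representation learning, knowledge graph question answering, and knowledge graph construction. HAN Lifan (韩立帆, born 1999), male, is a master's student at the College of Intelligence and Computing, Tianjin University; his main research interests are natural language processing and knowledge graph construction. WANG Xin (王鑫, born 1981), corresponding author, male, PhD, is a professor and doctoral supervisor at the College of Intelligence and Computing, Tianjin University, and a member of the China Computer Federation (CCF) (14972S); his main research areas are knowledge graph data management, graph databases, and large-scale knowledge processing.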

Abstract:

Digital humanities aims to apply modern computing and network technology to traditional humanities research. Historical classics are an important basis for historical research and study, but they are written in classical Chinese, which differs considerably from modern vernacular Chinese in both grammar and word meaning, so they are difficult to read and understand. To address this problem, this paper proposes extracting knowledge such as entities and relations from historical classics with a pre-trained model, so as to effectively obtain the rich information these texts contain. The model first replaces BERT's original pre-training tasks with multi-level pre-training tasks to fully capture semantic information, and then adds structures such as a convolutional layer and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. Next, to address the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics is built. It collects high-quality, large-scale entity and relation data, from which a classical Chinese knowledge extraction dataset is constructed to fine-tune the model and evaluate its performance. Experiments on this dataset and on the GulianNER dataset demonstrate the effectiveness of the proposed model.

Key words: historical classics, pre-trained models, information extraction, crowdsourcing mechanism
