Big Data Research ›› 2022, Vol. 8 ›› Issue (6): 26-39. doi: 10.11959/j.issn.2096-0271.2022058

• TOPIC: BIG DATA TECHNOLOGY AND METHOD IN DIGITAL HUMANITIES •

Research on information extraction methods for historical classics under the threshold of digital humanities

Lifan HAN1,2, Zijing JI1,2, Zirui CHEN1,2, Xin WANG1,2   

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China
  • Online: 2022-11-15 Published: 2022-11-01
  • Supported by:
    Science and Technology Innovation 2030 “New Generation Artificial Intelligence” Major Project (2020AAA0108504); The National Natural Science Foundation of China (61972275)

Abstract:

Digital humanities aims to use modern computer network technology to support traditional humanities research. Classical Chinese historical books are an important basis for historical research and study, but they are written in classical Chinese, which differs considerably from vernacular Chinese in grammar and meaning, so they are not easy to read and understand. To address this problem, a method based on pre-trained models was proposed for extracting entities and relations from historical books, so that the rich information contained in historical texts can be obtained effectively. The model used multi-level pre-training tasks in place of BERT's original pre-training tasks to capture semantic information more fully, and it added structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. Then, in view of the scarcity of annotated classical Chinese data, a crowdsourcing system for labeling historical classics was built, through which high-quality, large-scale entity and relation data were obtained and a classical Chinese knowledge extraction dataset was constructed. This dataset was used to fine-tune the model and to evaluate its performance. Experiments on the dataset constructed in this paper and on the GulianNER dataset demonstrated the effectiveness of the proposed model.
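
As a rough illustration of the idea of refining word representations with a convolutional layer and a sentence-level aggregation, the following minimal sketch may help. It is not the authors' implementation: the token vectors stand in for BERT outputs, the fixed kernel stands in for learned convolution weights, and the choice of mean pooling plus concatenation is an assumption made only for this example. The function names `conv1d_same`, `mean_pool`, and `refine` are hypothetical.

```python
from typing import List

Vector = List[float]


def conv1d_same(tokens: List[Vector], kernel: List[float]) -> List[Vector]:
    """Slide a 1-D kernel along the token axis with zero padding,
    producing one output vector per input token."""
    half = len(kernel) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        acc = [0.0] * dim
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(tokens):  # positions outside the sentence act as zeros
                for d in range(dim):
                    acc[d] += weight * tokens[j][d]
        out.append(acc)
    return out


def mean_pool(tokens: List[Vector]) -> Vector:
    """Sentence-level aggregation (assumed here to be a simple mean)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def refine(tokens: List[Vector], kernel: List[float]) -> List[Vector]:
    """Concatenate each token's local (convolved) vector with the
    sentence-level vector, so every token carries both kinds of context."""
    local = conv1d_same(tokens, kernel)
    sentence = mean_pool(tokens)
    return [vec + sentence for vec in local]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for four BERT token embeddings.
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    for row in refine(toy_tokens, kernel=[0.5, 1.0, 0.5]):
        print(row)
```

In a real system the kernel weights would be learned and the refined vectors would feed entity and relation classifiers; the sketch only shows how local and sentence-level context can be combined per token.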

Key words: historical classics, pre-trained model, information extraction, crowdsourcing mechanism

