Research on information extraction methods for historical classics under the threshold of digital humanities

doi:10.11959/j.issn.2096-0271.2022058

Abstract

Digital humanities aims to apply modern computing and network technology to traditional humanities research. Classical Chinese historical books are an important basis for historical research and study, but because they are written in classical Chinese, which differs considerably from vernacular Chinese in grammar and meaning, they are not easy to read and understand. To address this problem, a method based on pre-trained models was proposed for extracting entities and relations from historical books, so that the rich information contained in historical texts can be obtained effectively. The model used multi-level pre-training tasks in place of BERT's original pre-training tasks to capture semantic information more fully, and added structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. In addition, given the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics was built; it yielded high-quality, large-scale entity and relation data, from which a classical Chinese knowledge extraction dataset was constructed. This dataset was used to fine-tune the model and to evaluate its performance. Experiments on the dataset constructed in this paper and on the GulianNER dataset demonstrated the effectiveness of the proposed model.
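
Entity extraction of the kind described in the abstract is commonly framed as character-level sequence labeling. The following is a minimal sketch of the final decoding step only, assuming a BIO tagging scheme; the abstract does not state which scheme the authors use, and the function name `decode_bio`, the label names `PER` and `LOC`, and the example sentence are all illustrative assumptions rather than part of the paper.

```python
from typing import List, Tuple

# (text, label, start, end) with `end` exclusive. Hypothetical span format.
Span = Tuple[str, str, int, int]


def decode_bio(chars: List[str], tags: List[str]) -> List[Span]:
    """Turn per-character BIO tags into entity spans.

    Assumes tags look like "B-PER", "I-PER", or "O". An "I-" tag that does
    not continue an open span of the same label is ignored.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    spans: List[Span] = []
    start = None  # index where the currently open span began
    label = None  # label of the currently open span

    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append(("".join(chars[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Illustrative sentence and tags (not taken from the paper's dataset).
sentence = list("司马迁生于龙门")
predicted = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
for span in decode_bio(sentence, predicted):
    print(span)
```

In a full pipeline the tag sequence would come from the fine-tuned model's per-character predictions; relation extraction would then operate on pairs of the decoded spans.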

Key words: historical classics, pre-trained model, information extraction, crowdsourcing mechanism

CLC Number: TP391.1

Lifan HAN, Zijing JI, Zirui CHEN, Xin WANG. Research on information extraction methods for historical classics under the threshold of digital humanities[J]. Big Data Research, 2022, 8(6): 26-39.


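Entity type	Training set (count)	Validation set (count)	Test set (count)	Total (count)
Person name	9 467	1 267	701	11 435
Place name	2 962	391	167	3 520
Position title	1 750	242	139	2 131
Organization name	1 698	266	100	2 064
Other	110	18	9	137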
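
The counts in the entity table above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the table, while the dictionary layout and the helper name `split_shares` are assumptions made for illustration.

```python
# Counts copied from the entity table above: (train, validation, test, total).
ENTITY_COUNTS = {
    "person name (人名)": (9467, 1267, 701, 11435),
    "place name (地名)": (2962, 391, 167, 3520),
    "position title (职位名)": (1750, 242, 139, 2131),
    "organization name (组织名)": (1698, 266, 100, 2064),
    "other (其他)": (110, 18, 9, 137),
}


def split_shares(train: int, dev: int, test: int) -> tuple:
    """Return each split's share of the row total, in percent (1 decimal)."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


for name, (train, dev, test, total) in ENTITY_COUNTS.items():
    # Each row's three splits should add up to the reported total.
    assert train + dev + test == total, name
    print(name, split_shares(train, dev, test))

# Total number of annotated entities across all types.
print(sum(row[3] for row in ENTITY_COUNTS.values()))
```

Running the check confirms that every row is internally consistent and shows how unevenly the entity types are represented, with person names making up the majority of annotations.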

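Relation type	Training set (count)	Validation set (count)	Test set (count)	Total (count)
Person name-Person name	1 139	324	130	1 593
Person name-Place name	462	129	53	644
Person name-Position title	1 093	319	162	1 574
Person name-Organization name	231	60	38	329
Other	157	40	28	225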

Entity type	Training set (count)	Validation set (count)	Test set (count)	Total (count)
Book title	27 445	11 531	5 633	44 609
Other proper noun	91 917	20 972	5 552	118 441
