大数据 (Big Data)


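A structured exploration of historical classics: construction and visualization of a digital humanities knowledge base for the Biographies (Liezhuan) of the Shiji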

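郑童哲恒1, 李斌1, 冯敏萱1, 常博林1, 王东波2

  1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, Jiangsu, China

  2. College of Information Management, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China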

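  • About the authors: ZHENG Tongzheheng (郑童哲恒, born 1998), female, is a master's student at the School of Chinese Language and Literature, Nanjing Normal University; her main research interests are computational linguistics and digital humanities. LI Bin (李斌, born 1981), male, is an associate professor at the School of Chinese Language and Literature, Nanjing Normal University; his main research interests are computational linguistics and digital humanities. FENG Minxuan (冯敏萱, born 1978), female, is an associate professor at the School of Chinese Language and Literature, Nanjing Normal University; her main research interests are language information processing, corpus linguistics, and digital humanities. CHANG Bolin (常博林, born 1999), male, is an undergraduate student at the School of Chinese Language and Literature, Nanjing Normal University; his main research interests are digital humanities, computational linguistics, and corpus linguistics. WANG Dongbo (王东波, born 1981), male, is a professor and doctoral supervisor at the College of Information Management, Nanjing Agricultural University; his main research interests are intelligent information processing and natural language processing.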

Exploring the structuring of historical books: the construction and quantitative analysis of a digital humanities database of the Biographies of the Shiji

ZHENG Tongzheheng1, LI Bin1, FENG Minxuan1, CHANG Bolin1, WANG Dongbo2   

  1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China

  2. College of Information Management, Nanjing Agricultural University, Nanjing 210095, China

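Abstract:

China's ancient classics are vast in number and hold a wealth of historical and humanistic knowledge. The digitization of ancient books, built mainly on electronic text and full-text retrieval, has become an important basic resource and tool for disciplines such as language and literature, history, and philosophy. With the development of artificial intelligence and big data technology, the research paradigm of digital humanities keeps evolving. Converting the text of traditional classics into a new, highly structured digital humanities database that organically organizes elements such as words, persons, and geographical entities is a new exploration, and it is of great significance for visualizing historical phenomena and quantifying historical patterns. This paper takes the Liezhuan (Biographies) of the Shiji as its object. Through automatic word segmentation and part-of-speech tagging of ancient Chinese, manual proofreading, and manual annotation of entity information, it builds a multi-level, high-quality digital humanities knowledge base that supports quantitative analysis and visual retrieval of elements such as words, persons, and places. From this knowledge base it mines the distribution of persons and places, the relationships between persons, and the relationships between persons and places in the Liezhuan. The Liezhuan is found to contain 1,787 persons and 1,173 places; compared with the Benji and the Shijia, 1,092 persons and 556 places appear only in the Liezhuan. The work offers new ideas and a framework for building digital humanities knowledge bases of ancient books.

Key words:

digital humanities, Shiji Liezhuan (《史记·列传》), knowledge service, big data, ancient Chinese information processing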

Abstract:

Ancient Chinese classical books are vast in number and contain a great deal of historical and humanistic knowledge. The digitization of ancient books, based on electronic text and full-text retrieval, has become an important basic resource and tool for language and literature, history, philosophy, and other disciplines. With the development of artificial intelligence and big data technology, the research paradigm of digital humanities is constantly evolving. Transforming the text of historical books into a new, highly structured digital humanities database that organically organizes elements such as words, persons, and geographical entities is a new exploration, and it is of great significance for the visualization of historical phenomena and the quantification of historical patterns. This paper takes the Liezhuan as its object. Automatic word segmentation and part-of-speech tagging, manual proofreading, and manual annotation of entity information were used to construct a multi-level, high-quality structured digital humanities knowledge base. The knowledge base supports quantitative analysis and visual retrieval of elements such as words, persons, and locations, and it was used to mine the distribution of persons and locations, the relationships between persons, and the relationships between persons and locations. The Liezhuan is found to contain 1,787 persons and 1,173 locations; compared with the Benji and the Shijia, 1,092 persons and 556 locations are unique to the Liezhuan. This paper provides new ideas and a framework for the construction of digital humanities knowledge bases of ancient books.

Key words:

digital humanities, the Biographies of the Shiji, knowledge service, big data, ancient Chinese information processing
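
The counts of persons and locations "unique" to the Liezhuan can be read as a set difference over annotated entities. The following is only a minimal sketch of that idea, not the authors' implementation: the record layout, the function names, and the toy annotations are all hypothetical and stand in for the manually annotated knowledge base.

```python
# Hypothetical toy annotations: (section, entity_type, entity_name).
# These records are illustrative only and are not taken from the knowledge base.
annotations = [
    ("Liezhuan", "person", "Han Xin"),
    ("Liezhuan", "person", "Liu Bang"),
    ("Liezhuan", "place", "Huaiyin"),
    ("Liezhuan", "place", "Xianyang"),
    ("Benji", "person", "Liu Bang"),
    ("Benji", "place", "Xianyang"),
    ("Shijia", "person", "Xiao He"),
    ("Liezhuan", "person", "Han Xin"),  # repeated mention, counted once
]


def distinct_entities(records, section, entity_type):
    """Return the set of distinct entity names of one type in one section."""
    return {
        name
        for sec, typ, name in records
        if sec == section and typ == entity_type
    }


def unique_to(records, target, others, entity_type):
    """Return entities found in `target` but in none of the `others` sections."""
    found = distinct_entities(records, target, entity_type)
    for other in others:
        found -= distinct_entities(records, other, entity_type)
    return found


liezhuan_persons = distinct_entities(annotations, "Liezhuan", "person")
unique_persons = unique_to(annotations, "Liezhuan", ["Benji", "Shijia"], "person")
unique_places = unique_to(annotations, "Liezhuan", ["Benji", "Shijia"], "place")

print(len(liezhuan_persons), sorted(unique_persons), sorted(unique_places))
```

On real data the same two functions would be applied to the full annotation table; the sketch assumes that entity names have already been normalized so that one person or place has exactly one name.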
