Big Data Research (大数据) ›› 2022, Vol. 8 ›› Issue (6): 40-55. doi: 10.11959/j.issn.2096-0271.2022067

• Topic: Big data technologies and methods for the humanities •


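Tongzheheng ZHENG (郑童哲恒)1, Bin LI (李斌)1, Minxuan FENG (冯敏萱)1, Bolin CHANG (常博林)1, Dongbo WANG (王东波)2   

  1. 1 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, Jiangsu, China
    2 College of Information Management, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China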
  • Online: 2022-11-15 Published: 2022-11-01
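  • About the authors: Tongzheheng ZHENG (1998- ), female, is a master's student at the School of Chinese Language and Literature, Nanjing Normal University. Her main research interests are computational linguistics and digital humanities.
    Bin LI (1981- ), male, is an associate professor at the School of Chinese Language and Literature, Nanjing Normal University. His main research interests are computational linguistics and digital humanities.
    Minxuan FENG (1978- ), female, is an associate professor at the School of Chinese Language and Literature, Nanjing Normal University. Her main research interests are language information processing, corpus linguistics, and digital humanities.
    Bolin CHANG (1999- ), male, is an undergraduate student at the School of Chinese Language and Literature, Nanjing Normal University. His main research interests are digital humanities, computational linguistics, and corpus linguistics.
    Dongbo WANG (1981- ), male, is a professor and doctoral supervisor at the College of Information Management, Nanjing Agricultural University. His main research interests are intelligent information processing and natural language processing.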

Explore the structuration of historical books: the construction and quantitative analysis of digital humanities database of the Biographies of the Shiji

Tongzheheng ZHENG1, Bin LI1, Minxuan FENG1, Bolin CHANG1, Dongbo WANG2   

  1. 1 School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
    2 College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
  • Online: 2022-11-15 Published: 2022-11-01
  • Supported by:
    The Social Science Fund of Jiangsu (20JYB004); The National Social Science Foundation of China (18BYY127); The National Social Science Foundation of China (21&ZD331)


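China's ancient classics and documents are as vast as the sea and hold a great deal of historical and humanistic knowledge. The digitization of ancient books, which relies mainly on electronic conversion and full-text retrieval, has become an important basic resource and tool for disciplines such as language and literature, history, and philosophy. As artificial intelligence and big data technology develop, the research paradigm of digital humanities keeps evolving. Converting the text of traditional classics into a new kind of highly structured digital humanities database is a new line of exploration, and organically organizing textual elements such as words, persons, and geographical entities is of great significance for visualizing historical phenomena and quantifying historical patterns. Taking the Biographies of the Shiji (《史记·列传》) as the object of study, this work carries out automatic word segmentation and part-of-speech tagging of ancient Chinese, manual proofreading, and manual annotation of entity information to form a multi-level, high-quality digital humanities knowledge base. The knowledge base supports quantitative analysis and visual retrieval of elements such as words, persons, and places in the text, and is used to mine information such as the distribution of persons and places, relationships between persons, and relationships between persons and places in the Biographies. The results are as follows: 1,787 persons and 1,173 places appear in the Biographies of the Shiji; compared with the Benji (《史记·本纪》) and Shijia (《史记·世家》) sections, 1,092 persons and 556 places are unique to the Biographies. The study offers new ideas and a framework for building digital humanities knowledge bases of ancient books.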

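Key words: digital humanities, the Biographies of the Shiji (《史记·列传》), knowledge service, big data, ancient Chinese information processing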


Ancient Chinese classical books are vast in number and contain a wealth of historical and humanistic knowledge. The digitization of ancient books, based mainly on electronic conversion and full-text retrieval, has become an important basic resource and tool for language and literature, history, philosophy, and other disciplines. With the development of artificial intelligence and big data technology, the research paradigm of digital humanities is constantly evolving, and converting the text of traditional books into a highly structured digital humanities database is a new line of exploration. Organically organizing elements of the text such as words, persons, and geographical entities is of great significance for visualizing historical knowledge and quantifying historical information. The Biographies of the Shiji was selected as the object of study. Automatic word segmentation and part-of-speech tagging, manual proofreading, and manual annotation of entity information were performed to construct a multi-level, high-quality, structured digital humanities knowledge base. The knowledge base supports quantitative analysis and visual retrieval of elements such as words, persons, and locations, and was used to mine information such as the distribution of persons and locations, relationships between persons, and relationships between persons and locations. The results show that 1,787 persons and 1,173 locations appear in the Biographies of the Shiji and that, compared with the Benji and Shijia sections of the Shiji, 1,092 persons and 556 locations are unique to the Biographies. The study provides new ideas and a framework for constructing digital humanities knowledge bases of ancient books.

Key words: digital humanities, the Biographies of the Shiji, knowledge service, big data, ancient Chinese information processing
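
The comparison reported in the abstract (persons and locations that appear only in the Biographies and not in the Benji or Shijia sections) amounts to a set difference over annotated entities. The following is a minimal sketch of that idea, not the authors' implementation. The `(token, tag)` format, the `PER`/`LOC` tag names, the helper names `entity_set` and `unique_to`, and the toy samples are all assumptions made for illustration; they do not reflect the database's actual schema or the real distribution of names in the Shiji.

```python
from collections import Counter

# Hypothetical annotation format: (token, tag) pairs, where "PER" marks a
# person and "LOC" a location. Tag names and samples are illustrative
# assumptions, not the schema of the database described above.
def entity_set(tagged_tokens, tag):
    """Return the distinct surface forms that carry the given entity tag."""
    return {token for token, t in tagged_tokens if t == tag}

def unique_to(target, *others):
    """Return entities found in `target` but in none of the `others` sets."""
    return target - set().union(*others)

# Toy samples standing in for the three annotated sections.
liezhuan = [("韩信", "PER"), ("刘邦", "PER"), ("淮阴", "LOC"), ("韩信", "PER")]
benji = [("刘邦", "PER"), ("项羽", "PER"), ("咸阳", "LOC")]
shijia = [("萧何", "PER"), ("刘邦", "PER"), ("沛", "LOC")]

# Persons that occur only in the toy Liezhuan sample.
persons_only_in_liezhuan = unique_to(
    entity_set(liezhuan, "PER"),
    entity_set(benji, "PER"),
    entity_set(shijia, "PER"),
)

# Locations that occur only in the toy Liezhuan sample.
locations_only_in_liezhuan = unique_to(
    entity_set(liezhuan, "LOC"),
    entity_set(benji, "LOC"),
    entity_set(shijia, "LOC"),
)

# Mention frequency per person, a starting point for distribution analysis.
person_mentions = Counter(token for token, t in liezhuan if t == "PER")
```

Counting distinct surface forms assumes that each person or place has already been normalized to a single name during annotation; aliases and courtesy names would otherwise inflate the counts.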

