Big Data Research

Big Data Technology and Method in Digital Humanities

2022, 8(6): 1-2. doi:10.11959/j.issn.2096-0271.2022086-1

Humanities big data and its application in the field of digital humanities

Jing CHEN

2022, 8(6): 3-14. doi:10.11959/j.issn.2096-0271.2022086

Humanities big data refers to large-scale data, either digitized or born digital, that falls within the realm of the humanities and arts. Compared with data from science, engineering, and the social sciences, humanities data is a kind of “deep data” with more mixed sources, more diverse formats and dimensions, more complex data levels, and richer connotations, which makes it harder to analyze. Focusing on humanities big data and its characteristics, the key issues in applying humanities big data in research were identified. The complexity of big data as a collective concept and the misunderstandings it may cause were highlighted, along with the value of humanities big data.

Research on text annotation method of ancient works from the perspective of digital humanities: a case study on MARKUS

Yaxiu YU, Xin LI

2022, 8(6): 15-25. doi:10.11959/j.issn.2096-0271.2022046

Text annotation is an important step in text analysis and mining. Manual labeling can no longer meet the needs of humanities research on large-scale text resources, and because of the special grammatical structure and language characteristics of ancient works, annotation techniques developed for modern corpora cannot be applied to them directly. Based on an analysis of the challenges faced by humanities researchers, a general, standardized text annotation process for ancient works was proposed, and a model based on MARKUS was given. An annotation method for ancient works based on this model was then explored through a specific example, with the aim of promoting tool-supported research methods in digital humanities and expanding the scale of research.

Research on information extraction methods for historical classics under the threshold of digital humanities

Lifan HAN, Zijing JI, Zirui CHEN, Xin WANG

2022, 8(6): 26-39. doi:10.11959/j.issn.2096-0271.2022058

Digital humanities aims to use modern computer and network technology to support traditional humanities research. Classical Chinese historical books are an important basis for historical research and learning, but they are written in classical Chinese, which differs considerably from vernacular Chinese in grammar and meaning, so they are not easy to read and understand. To address this problem, a solution based on pre-trained models was proposed for extracting entities and relations from historical books, so that the rich information in historical texts can be obtained effectively. The model used multi-level pre-training tasks instead of BERT's original pre-training tasks to fully capture semantic information, and added structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. Then, given the scarcity of annotated classical Chinese data, a crowdsourcing system for labeling historical classics was built, through which high-quality, large-scale entity and relation data was obtained and a classical Chinese knowledge extraction dataset was constructed. This dataset helped to evaluate and fine-tune the model. Experiments on the constructed dataset and on the GulianNER dataset demonstrated the effectiveness of the proposed model.

Explore the structuration of historical books: the construction and quantitative analysis of digital humanities database of the Biographies of the Shiji

Tongzheheng ZHENG, Bin LI, Minxuan FENG, Bolin CHANG, Dongbo WANG

2022, 8(6): 40-55. doi:10.11959/j.issn.2096-0271.2022067

Ancient Chinese classical books are vast in number and contain a wealth of historical and humanistic knowledge. Digitized ancient books with full-text retrieval have become an important basic resource and tool for language and literature, history, philosophy, and other disciplines. With the development of artificial intelligence and big data technology, the research paradigm of digital humanities is constantly evolving. Converting the text of traditional books into a highly structured digital humanities database is a new line of exploration. Organizing elements such as words, characters, and geographical entities in the text is of great significance for visualizing historical knowledge and quantifying historical information. The Biographies of the Shiji was selected as the object of study. Automatic word segmentation and part-of-speech tagging, manual proofreading, and manual annotation of entity information were performed to construct a multi-level, high-quality structured digital humanities knowledge base. This knowledge base supports quantitative analysis and visual retrieval of elements such as words, characters, and locations, and makes it possible to mine information such as the distribution of characters and locations, relationships between characters, and relationships between people and locations. It was found that the Biographies of the Shiji contain 1,787 persons and 1,173 locations, of which 1,092 persons and 556 locations do not appear in the Benji and Shijia of the Shiji. New ideas and frameworks for constructing digital humanities knowledge bases of ancient books were provided.

Text sentiment visual analysis technology and its application in humanities

Lingli ZHANG, Qikai CHU, Guijuan WANG, Weihan ZHANG, Hui PU, Zhenjin SONG, Yadong WU

2022, 8(6): 56-73. doi:10.11959/j.issn.2096-0271.2022050

Sentiment analysis mines the sentiment tendency of information and is mainly used for public opinion monitoring, product review analysis, and information retrieval. With the rapid development of social media, the volume of text data has grown explosively, and text sentiment analysis has become an important research hotspot in natural language processing. At the same time, because sentiment data is massive, time-varying, unstructured, and strongly correlated, visual analysis techniques that present sentiment tendencies intuitively and efficiently are widely used in this field. Recent research on visual analysis of sentiment was reviewed, text sentiment visual analysis methods were described from four aspects according to presentation form (“topic words”, “association”, “evolution”, and “spatial and temporal distribution”), and future directions for sentiment analysis techniques and text sentiment visual analysis research were discussed.

Visualization in digital humanities

Yuchu LUO, Hao WU, Yuhan GUO, Shaocong TAN, Can LIU, Ruike JIANG, Xiaoru YUAN

2022, 8(6): 74-93. doi:10.11959/j.issn.2096-0271.2022085

The development of information technology has promoted the emergence of a new scientific research paradigm, and the social sciences and humanities have gradually developed data-driven research methods in recent years. From the perspective of visualization, the current status of visualization applications in digital humanities was summarized at three levels (task, data, and application) through an analysis of papers from the Digital Humanities conference organized by the Alliance of Digital Humanities Organizations. An analysis of projects set up by experts with different backgrounds (humanities, visualization, and art) revealed the great potential of multidisciplinary cooperation to improve the quality of projects that combine digital humanities and visualization. The practice of Peking University in exploring this new paradigm of multidisciplinary cooperation was shared, including multidisciplinary education, practical experience in promotion, and research experience in intelligent visualization. Finally, two directions for the future development of the interdisciplinary field between digital humanities and visualization were identified: cooperation between experts and collaboration between humans and computers.

Emerging scientific topic prediction based on Poincare graph embedding

Jun DAI

2022, 8(6): 94-104. doi:10.11959/j.issn.2096-0271.2022041

Scientific topic prediction is central to scientific research and can substantially improve the allocation of scientific resources. Machine learning and data mining approaches have been widely applied to scientific topic prediction, including topic models based on paper content and citation prediction models. A novel scientific topic prediction algorithm, PKGM (Poincare keywords graph embedding), was proposed, which used keywords and their relations to build a keyword network and calculated the distance between two nodes in this network to predict the probability that an edge exists between them. A comparison of PKGM with seven baselines showed that PKGM improved AUROC by 7.3% and AP by 5.8% over the best method in Euclidean space, and improved AUROC by 10.8% and AP by 7.2% over the best approach in hyperbolic space. The results demonstrated the effectiveness of PKGM.
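
The distance-based edge prediction mentioned in this abstract can be illustrated with a minimal sketch. It assumes keyword embeddings that lie inside the unit Poincaré ball and uses the standard Poincaré distance. The Fermi-Dirac link function and the names `poincare_distance`, `edge_probability`, `radius`, and `temperature` are illustrative assumptions taken from common Poincaré embedding practice; the abstract does not state how PKGM maps distance to probability.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def edge_probability(u, v, radius=1.0, temperature=1.0):
    """Assumed Fermi-Dirac link function: closer nodes get higher scores.

    radius and temperature are hypothetical hyperparameters.
    """
    d = poincare_distance(u, v)
    return 1.0 / (math.exp((d - radius) / temperature) + 1.0)


# Hypothetical 2-D keyword embeddings, for illustration only.
keyword_a = [0.0, 0.0]
keyword_b = [0.1, 0.0]
keyword_c = [0.9, 0.0]
print(edge_probability(keyword_a, keyword_b))  # nearby pair, higher score
print(edge_probability(keyword_a, keyword_c))  # distant pair, lower score
```

Because distances grow quickly near the boundary of the ball, points close to the edge are far from almost everything else, which is why hyperbolic embeddings are often used for hierarchical keyword networks.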

Research on emotion monitoring of public based on social network big data

Aili LI, Zishuai ZHANG, Yin LIN, Qiuju WANG, Jianan YANG, Weicheng MENG, Yanfeng ZHANG

2022, 8(6): 105-126. doi:10.11959/j.issn.2096-0271.2022054

In recent years, social networking platforms such as Sina Weibo and Twitter have gradually become one of the main carriers of public opinion, providing a convenient platform for netizens to express their opinions and emotions. Public opinion monitoring based on social network big data has become a new research hotspot. Monitoring people’s emotions in various countries using social network big data helps to directly grasp people’s emotional tendencies in international relations, and has a great impact on diplomacy, foreign trade, and other areas. On this basis, a public sentiment monitoring system for Chinese and Japanese data was proposed, which could simultaneously analyze the emotional tendencies contained in Chinese and Japanese data on social platforms such as Sina Weibo and Twitter and display them to users in visual form. For the sentiment analysis algorithm, a new sentiment analysis model, EmoBERT, was proposed, which builds on the BERT model and incorporates a self-expanding Chinese and Japanese sentiment lexicon. The experimental results show that EmoBERT outperforms the original BERT model on both Chinese and Japanese sentiment classification tasks: EmoBERT-C increases the accuracy of the Chinese BERT model from 89.68% to 92.15%, and EmoBERT-J increases the accuracy of the Japanese BERT model from 74.73% to 78.26%.

Automatic key information extraction of police records based on deep learning

Yumeng CUI, Jingya WANG, Shangyi YAN, Zhizhong TAO

2022, 8(6): 127-142. doi:10.11959/j.issn.2096-0271.2022052

With the emergence of intelligent policing, the channels through which the public can call the police have widened, the volume of unstructured police records has increased immensely, and police entity recognition has become more difficult. To address this problem, the BERT model was introduced to generate word vectors, a self-attention mechanism was integrated to capture long-distance dependencies between words, and the BERT-BiGRU-SelfAtt-CRF police entity recognition model was constructed. To verify the performance and generalization ability of the model, experiments were carried out on public datasets. To demonstrate its feasibility and efficiency in the police field, experiments were run on an annotated police dataset. The results showed that the BERT-BiGRU-SelfAtt-CRF model outperformed other models on the police dataset, with a precision of 82.45%, a recall of 79.03%, and an F1 value of 80.72%. It is concluded that the model can meet the requirements of actual police work and is feasible and effective for police entity recognition.
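
For readers unfamiliar with the metrics quoted in this abstract, the sketch below shows how an F1 value relates to precision and recall as their harmonic mean. The function name `f1_score` is illustrative, and the abstract does not say how its scores were averaged, so this is a consistency check under that assumption and not a reproduction of the authors' evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported_precision = 82.45
reported_recall = 79.03
print(round(f1_score(reported_precision, reported_recall), 2))
```

Applied to the reported precision and recall, the harmonic mean comes to roughly 80.70, close to the reported 80.72%. The small gap may come from rounding or from averaging over entity types, which the abstract does not specify.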

Enlightenment of open access to public data in the European Union

Qun ZHANG, Zhuo YIN, Hao YU, Weizhong WANG, Xiaojie JIA

2022, 8(6): 143-152. doi:10.11959/j.issn.2096-0271.2022047

Open access to public data contributes to the high-quality development of the digital economy. In the early stage, China actively introduced policies to guide the opening and utilization of public data, and many localities issued local rules and regulations. However, rules and regulations on the opening and utilization of public data have not yet been issued at the national level. By contrast, the European Union has continuously issued and revised directives related to open access to public data to promote technological innovation in the digital economy. The relevant practices of open access to public data in China were reviewed, and the main directions and characteristics of the EU’s open data and public sector information reuse directives were analyzed. In light of China’s situation, implications and suggestions for open access to public data in China were put forward, in the hope of helping to further improve the policies, regulations, and mechanisms for open access to public data and to promote the deep sharing and orderly opening of public data in China.

Knowledge Graph in Marvel Cinematic Universe

2022, 8(6): 153-155. doi:10.11959/j.issn.2096-0271.2022083

