Chinese Journal of Network and Information Security (网络与信息安全学报), 2023, Vol. 9, Issue (4): 53-63. doi: 10.11959/j.issn.2096-109x.2023053

• Academic Paper •

Twitter user geolocation method based on single-point toponym matching and local toponym filtering

Jin XUE1,2, Fuxiang YUAN2, Yimin LIU2, Meng ZHANG2, Yaqiong QIAO2,3, Xiangyang LUO2

  1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450003, China
  2 Henan Key Laboratory of Cyberspace Situation Awareness, Zhengzhou 450001, China
  3 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China
  • Revised: 2023-04-18 Online: 2023-08-01 Published: 2023-08-01
  • About the authors:
    Jin XUE (薛锦, born 1996), male, from Ulanqab, Inner Mongolia, master's student at Zhengzhou University. Main research interests: social network geolocation and data mining.
    Fuxiang YUAN (袁福祥, born 1991), male, from Jining, Shandong, Ph.D., lecturer at Information Engineering University. Main research interests: network target geolocation, network topology analysis, and cyberspace surveying and mapping.
    Yimin LIU (刘毅敏, born 1995), female, from Yantai, Shandong, Ph.D. student at Information Engineering University. Main research interests: network security and social network data analysis.
    Meng ZHANG (张萌, born 1996), female, from Yanshi, Henan, Ph.D. student at Information Engineering University. Main research interests: data mining and social network analysis.
    Yaqiong QIAO (乔亚琼, born 1981), female, from Kaifeng, Henan, Ph.D., lecturer at North China University of Water Resources and Electric Power. Main research interests: data mining and social network analysis.
    Xiangyang LUO (罗向阳, born 1978), male, from Jingmen, Hubei, Ph.D., professor and doctoral supervisor at Information Engineering University. Main research interests: network and information security.
  • Supported by:
    The National Natural Science Foundation of China (U1804263, U2172435, 62272163); The National Key Research and Development Program of China (2022YFB3102900); Zhongyuan Science and Technology Innovation Leading Talent Project of China (214200510019); The Key Science and Technology Project of Henan Province (222102210036); The Henan Province Science Foundation for Youths (222300420230)

Abstract:

Toponyms mentioned in tweets are an important basis for geolocating Twitter users. However, the toponyms extracted by existing Twitter user geolocation methods are limited in both quantity and reliability, which reduces geolocation accuracy. To address this issue, a Twitter user geolocation method based on single-point toponym matching and local toponym filtering is proposed. First, a toponym type discrimination algorithm based on the aggregation degree of toponym locations is designed. It builds a single-point toponym database according to how tightly the locations associated with each toponym are clustered, so that more reliable toponyms can be obtained from tweets. Second, a local toponym filtering algorithm based on the aggregation degree of user locations is proposed. It computes the user location aggregation degree twice, centered first on the longitude and latitude of the toponym and then on the average longitude and latitude of the users, and keeps the local toponyms with the higher aggregation degree, which are more reliable. Finally, a user-toponym heterogeneous graph is constructed from users' social relationships and users' mentions of toponyms, and users are located by graph representation learning and a neural network. Extensive user geolocation experiments were conducted on two commonly used public datasets, GEOTEXT and TW-US, and the proposed method was compared with nine existing typical Twitter user geolocation methods, including HGNN, ReLP, and GCN. The results show that the proposed method has a clear advantage in geolocation accuracy. Compared with the nine existing methods, on the GEOTEXT dataset the mean error is reduced by 7.3 to 342.8 km, the median error is reduced by 2.4 to 354.4 km, and the large-area-level geolocation accuracy is improved by 1.3% to 26.3%. On the TW-US dataset, the mean error is reduced by 8.6 to 246.6 km, the median error is reduced by 5.7 to 149.7 km, and the large-area-level geolocation accuracy is improved by 1.5% to 20.5%.
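The two aggregation-based steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the definition of aggregation degree as the fraction of points within a radius of a center, the function names, the radii and thresholds, and the rule that combines the two centers by taking the larger value. It is not the authors' implementation.

```python
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Plain average of latitudes and longitudes (a simplification that
    ignores wrap-around at the antimeridian)."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def aggregation_degree(center, points, radius_km):
    """Assumed definition: fraction of points within radius_km of center."""
    if not points:
        return 0.0
    inside = sum(1 for p in points if haversine_km(center, p) <= radius_km)
    return inside / len(points)


def is_single_point(gazetteer_locations, radius_km=50.0, min_degree=0.9):
    """Hypothetical rule: a toponym is 'single-point' if the locations listed
    for it in a gazetteer cluster tightly around their centroid."""
    if not gazetteer_locations:
        return False
    center = centroid(gazetteer_locations)
    return aggregation_degree(center, gazetteer_locations, radius_km) >= min_degree


def is_local(toponym_location, user_locations, radius_km=161.0, min_degree=0.5):
    """Hypothetical rule: keep a toponym if the users who mention it cluster
    either around the toponym itself or around their own average position."""
    if not user_locations:
        return False
    around_toponym = aggregation_degree(toponym_location, user_locations, radius_km)
    around_users = aggregation_degree(centroid(user_locations), user_locations, radius_km)
    return max(around_toponym, around_users) >= min_degree
```

Under these assumptions, a name such as "Springfield" that maps to several distant places would fail `is_single_point`, while a toponym mentioned mostly by users living near it would pass `is_local`. The radius and threshold values are placeholders and would need tuning on real data.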

Key words: user geolocation, user-generated text, toponym, social media

