GDPR-oriented intelligent checking method of privacy policies compliance

doi:10.11959/j.issn.2096-109x.2023088

Chinese Journal of Network and Information Security, 2023, Vol. 9, Issue 6: 127-139. doi: 10.11959/j.issn.2096-109x.2023088


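Xin LI¹, Peng TANG¹, Xiheng ZHANG¹, Weidong QIU¹, Hong HUI²

¹ School of Cyberspace Security, Shanghai Jiao Tong University, Shanghai 200240, China
² Institute of Cyber Science and Technology, Shanghai Jiao Tong University, Shanghai 200240, China

Revised: 2023-07-04 Online: 2023-12-01 Published: 2023-12-01
About the authors:
Xin LI (1999- ), male, born in Suqian, Jiangsu, is a master's student at Shanghai Jiao Tong University. His main research interests are natural language processing and privacy protection.
Peng TANG (1992- ), male, born in Fuzhou, Jiangxi, is a doctoral student at Shanghai Jiao Tong University. His main research interests are artificial intelligence security and privacy protection.
Xiheng ZHANG (1999- ), male, born in Liaocheng, Shandong, is a master's student at Shanghai Jiao Tong University. His main research interests are federated learning and privacy protection.
Weidong QIU (1973- ), male, born in Jiujiang, Jiangxi, holds a PhD and is a professor and doctoral supervisor at Shanghai Jiao Tong University. His main research interests are cryptanalysis/cryptographic engineering, artificial intelligence security, and big data privacy protection.
Hong HUI (1969- ), female, born in Tianjin, holds a PhD and is an associate professor at Shanghai Jiao Tong University. Her main research interests are image processing and pattern recognition, and information security.
Supported by:
The National Natural Science Foundation of China (61972249); The National Key R&D Program of China (2023YFB3106500)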

Abstract

Since the EU's General Data Protection Regulation (GDPR) took effect in 2018, more than 300 fines have been issued under it, including heavy penalties for well-known companies such as Google that failed to provide transparent and comprehensible privacy policies. This strict data protection law has made companies especially cautious when offering cross-border services, particularly to the European Union. Under its territorial scope, the GDPR applies to any company that provides services to EU citizens, whether or not the company is registered in the EU. Companies anywhere in the world with overseas business, domestic (Chinese) enterprises included, therefore have to consider whether their privacy policies comply with the GDPR. To meet this need, an intelligent detection method was constructed. It automatically extracts the privacy policies of online service companies and uses machine learning and automation techniques to convert them into a standard format with a hierarchical structure. Text classification based on natural language processing then identifies the GDPR concepts that each policy covers. Using a purpose-built GDPR knowledge graph as the reference, the method checks whether a policy omits concepts that the GDPR requires to be disclosed. This enables intelligent detection of GDPR-oriented privacy policy compliance and supports domestic enterprises that provide cross-border services to EU users. Analysis of the samples in the corpus further shows that mainstream online service companies generally fall short of GDPR compliance requirements.

Key words: General Data Protection Regulation (GDPR), privacy policy, hierarchical structure, compliance checking
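
The "standard format with a hierarchical structure" mentioned in the abstract can be pictured as a tree of titled sections. The following is a minimal sketch and not the authors' implementation: the `Section` class, the `build_hierarchy` function, and the `(kind, level, text)` input format are all hypothetical, and the title/paragraph labels are assumed to come from an upstream classifier that is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the hierarchical policy: a title plus its content."""
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_hierarchy(blocks):
    """Nest (kind, level, text) blocks into a tree of Section objects.

    kind is "title" or "paragraph"; level is the heading depth (1 = top)
    and is ignored for paragraphs. Both values are ASSUMED to be produced
    by an upstream title/paragraph classifier.
    """
    root = Section(title="ROOT", level=0)
    stack = [root]  # path from the root to the section currently open
    for kind, level, text in blocks:
        if kind == "title":
            # Close sections at the same or a deeper level first.
            while stack[-1].level >= level:
                stack.pop()
            node = Section(title=text, level=level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # A paragraph belongs to the most recently opened section.
            stack[-1].paragraphs.append(text)
    return root


# Made-up example input, for illustration only.
blocks = [
    ("title", 1, "Information we collect"),
    ("paragraph", 0, "We collect your email address when you register."),
    ("title", 2, "Cookies"),
    ("paragraph", 0, "We use cookies to remember your preferences."),
    ("title", 1, "Your rights"),
    ("paragraph", 0, "You may request erasure of your data."),
]
policy = build_hierarchy(blocks)
```

Keeping each paragraph attached to its enclosing titles means a later classification step can use the section heading as context for the paragraph, which is one plausible reason for recovering the hierarchy before classifying.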

CLC number: TP393

Cite this article: Xin LI, Peng TANG, Xiheng ZHANG, Weidong QIU, Hong HUI. GDPR-oriented intelligent checking method of privacy policies compliance[J]. Chinese Journal of Network and Information Security, 2023, 9(6): 127-139.

Figures and tables (7): Table 1, Figure 1, Figure 2, Table 2, Table 3, Table 4, Table 5


Level-1 concept node	Level-2 concept node	Level-3 concept node
CONTROLLER 96.6%	IDENTITY 31.5%	REGISTER NUMBER 1.3%
	CONTACT 93.9%	/
CONTROLLER REPRESENTATIVE 18.7%	IDENTITY 2.6%	REGISTER NUMBER 0%
	CONTACT 14.1%	/
DPO 42.9%	IDENTITY 0.6%	REGISTER NUMBER 0%
	CONTACT 40.3%	/
PD CATEGORY 99.3%	TYPE 4.0%	THIRD-PARTY 0%
		PUBLICLY 0%
	SPECIAL 2.0%	CONDITION 1.3%
PD ORIGIN 100%	DIRECT ACTIVE 95.3%	/
	DIRECT PASSIVE 95.9%	COOKIE 6.0%
	INDIRECT 74.4%	THIRD-PARTY 2.0%
		PUBLICLY 0%
PROCESSING PURPOSES 99.3%	/	/
DATA SHARING 98.6%	RECIPIENTS 92.6%	/
	CONDITION 93.9%	/
LAWFUL BASIS 90.6%	CONTRACT 48.9%	/
	PUBLIC TASK 12.7%	/
	LEGITIMATE INTEREST 58.3%	/
	VITAL INTEREST 12.0%	/
	CONSENT 51.0%	/
	LEGAL OBLIGATION 52.3%	/
PD STORAGE DETAILS 89.2%	TIME 67.1%	/
	LOCATION 26.8%	/
	DISPOSAL METHOD 46.9%	/
DATA SUBJECT RIGHT 100%	INFORMATION 97.9%	POLICY CHANGE 79.8%
		DATA BREACH NOTIFICATION 9.3%
		OTHERS 84.5%
	ACCESS 91.9%	/
	RECTIFICATION 95.3%	/
	ERASURE 88.5%	/
	RESTRICTION 71.1%	/
	OBJECTION 75.1%	/
	PORTABILITY 61.7%	/
	AUTO DECISION MAKING 14.7%	/
	WITHDRAW CONSENT 89.2%	/
COMPLAINT 75.1%	SA 1.0%	/
NON-GDPR 79.8%	COOKIE 54.3%	/
	OTHER LEGISLATIONS 45.6%	/
OTHERS	/	/

Vectorization method	Classifier4title			Classifier4paragraph
	P	R	F1	P	R	F1
TF-IDF	0.92	0.84	0.88	0.92	0.73	0.81
GloVe	0.87	0.84	0.86	0.94	0.68	0.79
fastText	0.88	0.87	0.87	0.92	0.65	0.76
Word2vec	0.90	0.81	0.85	0.91	0.71	0.80
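
The P, R, and F1 columns above are precision, recall, and their harmonic mean. As a quick check of how the columns relate, the sketch below recomputes F1 for the TF-IDF row; the function name `f1_score` is a local helper written for this example, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TF-IDF row of the table above.
f1_title = round(f1_score(0.92, 0.84), 2)      # Classifier4title
f1_paragraph = round(f1_score(0.92, 0.73), 2)  # Classifier4paragraph

print(f1_title)      # 0.88
print(f1_paragraph)  # 0.81
```

Because the published P and R values are already rounded to two decimals, recomputing F1 from them can differ from the table by 0.01 in some rows (the GloVe row for Classifier4title gives 0.85 this way, against 0.86 in the table).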

GDPR concept	Samples	Classifier4paragraph (TF-IDF + RF)			Torre's (GloVe + SVM, RBF kernel)			BERT (bert-base-uncased)
		P	R	F1	P	R	F1	P	R	F1
CONTROLLER	522	1	0.41	0.58	0.01	1	0.01	0.79	0.69	0.74
CONTROLLER.CONTACT	443	0.95	0.65	0.77	0.05	0.91	0.1	0.79	0.68	0.73
PD CATEGORY	1715	0.93	0.91	0.92	0.25	0.93	0.39	0.79	0.84	0.81
PD ORIGIN	1710	0.97	0.65	0.78	0.06	0.9	0.11	0.77	0.69	0.72
PROCESSING PURPOSES	1682	0.96	0.74	0.83	0.24	0.92	0.38	0.84	0.54	0.66
DATA SHARING.CONDITION	709	0.91	0.79	0.84	0.1	0.91	0.17	0.85	0.7	0.76
PD STORAGE DETAILS	441	0.99	0.55	0.71	0.03	0.88	0.06	0.89	0.76	0.82
DATA SUBJECT RIGHT	1895	0.98	0.56	0.71	0.04	0.93	0.07	0.83	0.71	0.76
COMPLAINT	165	0.99	0.7	0.81	0.02	0.92	0.05	0.8	0.53	0.64

Criterion	Precondition	Postcondition
C1	-	CONTROLLER.CONTACT
C2	-	PD CATEGORY
C3	-	PD ORIGIN.{DIRECT ACTIVE or DIRECT PASSIVE} and PROCESSING PURPOSES
C4	When applicable	PD ORIGIN.INDIRECT and PROCESSING PURPOSES
C5	-	DATA SHARING
C6	-	LAWFUL BASIS.{CONSENT, CONTRACT, LEGAL OBLIGATION, VITAL INTEREST, PUBLIC TASK or LEGITIMATE INTEREST}
C7	-	PD STORAGE DETAILS.{LOCATION, TIME and DISPOSAL METHOD}
C8	-	DATA SUBJECT RIGHT.{ACCESS, RECTIFICATION, ERASURE, RESTRICTION, OBJECTION and PORTABILITY}
C9	LAWFUL BASIS.CONSENT	DATA SUBJECT RIGHT.WITHDRAW CONSENT
C10	-	COMPLAINT
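
The criteria above read naturally as precondition/postcondition rules over the set of concepts detected in a policy. The following is a minimal sketch of that reading, not the authors' checker: `check_policy`, `has_any`, `has_all`, and the `indirect_collection_applies` flag are hypothetical names, the input is assumed to be a set of dotted concept paths produced by an earlier classification step, and a result of `None` is this sketch's own convention for "precondition not met".

```python
LAWFUL_BASES = ["CONSENT", "CONTRACT", "LEGAL OBLIGATION",
                "VITAL INTEREST", "PUBLIC TASK", "LEGITIMATE INTEREST"]
RIGHTS = ["ACCESS", "RECTIFICATION", "ERASURE",
          "RESTRICTION", "OBJECTION", "PORTABILITY"]
STORAGE = ["LOCATION", "TIME", "DISPOSAL METHOD"]


def has_any(found, parent, children):
    """True if at least one child concept of parent was detected ("or")."""
    return any(f"{parent}.{c}" in found for c in children)


def has_all(found, parent, children):
    """True if every listed child concept of parent was detected ("and")."""
    return all(f"{parent}.{c}" in found for c in children)


def check_policy(found, indirect_collection_applies=False):
    """Map each criterion to True, False, or None (precondition not met).

    found: set of dotted concept paths detected in one policy (ASSUMED input).
    indirect_collection_applies: stands in for C4's "when applicable",
    which the table does not define further.
    """
    purposes = "PROCESSING PURPOSES" in found
    report = {
        "C1": "CONTROLLER.CONTACT" in found,
        "C2": "PD CATEGORY" in found,
        "C3": has_any(found, "PD ORIGIN", ["DIRECT ACTIVE", "DIRECT PASSIVE"]) and purposes,
        "C4": None,
        "C5": "DATA SHARING" in found,
        "C6": has_any(found, "LAWFUL BASIS", LAWFUL_BASES),
        "C7": has_all(found, "PD STORAGE DETAILS", STORAGE),
        "C8": has_all(found, "DATA SUBJECT RIGHT", RIGHTS),
        "C9": None,
        "C10": "COMPLAINT" in found,
    }
    if indirect_collection_applies:
        report["C4"] = "PD ORIGIN.INDIRECT" in found and purposes
    if "LAWFUL BASIS.CONSENT" in found:
        report["C9"] = "DATA SUBJECT RIGHT.WITHDRAW CONSENT" in found
    return report


# Made-up detection result for one policy, for illustration only.
found = {
    "CONTROLLER.CONTACT", "PD CATEGORY", "PD ORIGIN.DIRECT ACTIVE",
    "PROCESSING PURPOSES", "DATA SHARING", "LAWFUL BASIS.CONSENT",
    "PD STORAGE DETAILS.TIME", "DATA SUBJECT RIGHT.ACCESS",
    "DATA SUBJECT RIGHT.WITHDRAW CONSENT",
}
report = check_policy(found)
failed = {name for name, ok in report.items() if ok is False}
```

In this example the policy fails C7, C8, and C10: it names a retention time but no storage location or disposal method, lists only one of the six rights, and never mentions complaints. C4 stays `None` because the applicability flag was left unset.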

Criterion	Compliance rate	Criterion	Compliance rate
C1	140/150 (93.3%)	C6	136/150 (90.6%)
C2	149/150 (99.3%)	C7	16/150 (10.7%)
C3	131/150 (87.3%)	C8	58/150 (38.7%)
C4	-	C9	59/62 (95.2%)
C5	148/150 (98.6%)	C10	113/150 (75.3%)
