Chinese Journal of Network and Information Security, 2023, Vol. 9, Issue (6): 127-139. doi: 10.11959/j.issn.2096-109x.2023088
• Academic Paper •
Xin LI1, Peng TANG1, Xiheng ZHANG1, Weidong QIU1, Hong HUI2
Revised: 2023-07-04
Online: 2023-12-01
Published: 2023-12-01
About the author: Xin LI (1999- ), male, born in Suqian, Jiangsu, is a master's student at Shanghai Jiao Tong University. His main research interests are natural language processing and privacy protection.
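Abstract:
Since the EU General Data Protection Regulation (GDPR) took effect in 2018, more than 300 fines have been issued under it, including heavy penalties against well-known companies such as Google for failing to provide transparent and comprehensible privacy policies. This strict data protection law has made enterprises in every country particularly cautious when providing cross-border services, especially to the EU. Its territorial scope also stipulates that the GDPR applies to any enterprise that serves EU citizens, regardless of whether the enterprise is registered within the EU. Enterprises with overseas business anywhere in the world, Chinese enterprises included, therefore have to consider whether their privacy policies comply with the GDPR. To meet this need, an intelligent checking method is constructed. It automatically extracts the privacy policies of online service companies and uses machine learning and automation techniques to convert them into a standard format with a hierarchical structure. Text classification based on natural language processing then identifies the GDPR concepts that each policy covers, and a purpose-built GDPR knowledge graph serves as the basis for checking whether the policy omits any concept that the GDPR requires to be disclosed. The result is an intelligent, GDPR-oriented compliance check for privacy policies that supports Chinese enterprises in providing cross-border services to EU users. Analysis of the samples in the corpus further reveals that mainstream online service companies generally fall short of GDPR compliance requirements.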
Xin LI, Peng TANG, Xiheng ZHANG, Weidong QIU, Hong HUI. GDPR-oriented intelligent checking method of privacy policies compliance[J]. Chinese Journal of Network and Information Security, 2023, 9(6): 127-139.
Table 1 GDPR-oriented privacy policy taxonomy
Level-1 concept node | Level-2 concept node | Level-3 concept node |
CONTROLLER 96.6% | IDENTITY 31.5% | REGISTER NUMBER 1.3% |
| CONTACT 93.9% | / |
CONTROLLER REPRESENTATIVE 18.7% | IDENTITY 2.6% | REGISTER NUMBER 0% |
| CONTACT 14.1% | / |
DPO 42.9% | IDENTITY 0.6% | REGISTER NUMBER 0% |
| CONTACT 40.3% | / |
PD CATEGORY 99.3% | TYPE 4.0% | THIRD-PARTY 0% |
| | PUBLICLY 0% |
| SPECIAL 2.0% | CONDITION 1.3% |
PD ORIGIN 100% | DIRECT ACTIVE 95.3% | / |
| DIRECT PASSIVE 95.9% | COOKIE 6.0% |
| INDIRECT 74.4% | THIRD-PARTY 2.0% |
| | PUBLICLY 0% |
PROCESSING PURPOSES 99.3% | / | / |
DATA SHARING 98.6% | RECIPIENTS 92.6% | / |
| CONDITION 93.9% | / |
LAWFUL BASIS 90.6% | CONTRACT 48.9% | / |
| PUBLIC TASK 12.7% | / |
| LEGITIMATE INTEREST 58.3% | / |
| VITAL INTEREST 12.0% | / |
| CONSENT 51.0% | / |
| LEGAL OBLIGATION 52.3% | / |
PD STORAGE DETAILS 89.2% | TIME 67.1% | / |
| LOCATION 26.8% | / |
| DISPOSAL METHOD 46.9% | / |
INFORMATION 97.9% | POLICY CHANGE 79.8% | / |
| DATA BREACH NOTIFICATION 9.3% | / |
| OTHERS 84.5% | / |
DATA SUBJECT RIGHT 100% | ACCESS 91.9% | / |
| RECTIFICATION 95.3% | / |
| ERASURE 88.5% | / |
| RESTRICTION 71.1% | / |
| OBJECTION 75.1% | / |
| PORTABILITY 61.7% | / |
| AUTO DECISION MAKING 14.7% | / |
| WITHDRAW CONSENT 89.2% | / |
COMPLAINT 75.1% | SA 1.0% | / |
NON-GDPR 79.8% | COOKIE 54.3% | / |
| OTHER LEGISLATIONS 45.6% | / |
OTHERS | / | / |
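The dotted concept names that appear in the later tables (for example CONTROLLER.CONTACT) read as paths through this three-level taxonomy. A minimal sketch of that idea, assuming the taxonomy is stored as a nested Python dict: the names `TAXONOMY_FRAGMENT` and `concept_paths` are illustrative and not taken from the paper, and only a few branches of Table 1 are included.

```python
from typing import Dict, Iterator

# Hypothetical nested-dict encoding of a small fragment of Table 1.
# An empty dict marks a concept with no sub-concepts ("/" in the table).
TAXONOMY_FRAGMENT: Dict[str, dict] = {
    "CONTROLLER": {
        "IDENTITY": {"REGISTER NUMBER": {}},
        "CONTACT": {},
    },
    "PD ORIGIN": {
        "DIRECT ACTIVE": {},
        "DIRECT PASSIVE": {"COOKIE": {}},
        "INDIRECT": {"THIRD-PARTY": {}, "PUBLICLY": {}},
    },
    "PROCESSING PURPOSES": {},
}


def concept_paths(tree: Dict[str, dict], prefix: str = "") -> Iterator[str]:
    """Yield every concept as a dotted path, each parent before its children."""
    for name, children in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        yield path
        yield from concept_paths(children, path)


if __name__ == "__main__":
    for path in concept_paths(TAXONOMY_FRAGMENT):
        print(path)
```

Flattening the tree this way gives one label per node, which is a convenient form for naming classifier outputs and for writing compliance rules against them.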
Table 3 Comparison of GDPR classifiers' performance
GDPR concept | Samples | Classifier4paragraph (TF-IDF + RF) | | | Torre's {GloVe + SVM (RBF kernel)} | | | BERT (bert-base-uncased) | | |
| | P | R | F1 | P | R | F1 | P | R | F1 |
CONTROLLER | 522 | 1 | 0.41 | 0.58 | 0.01 | 1 | 0.01 | 0.79 | 0.69 | 0.74 |
CONTROLLER.CONTACT | 443 | 0.95 | 0.65 | 0.77 | 0.05 | 0.91 | 0.1 | 0.79 | 0.68 | 0.73 |
PD CATEGORY | 1,715 | 0.93 | 0.91 | 0.92 | 0.25 | 0.93 | 0.39 | 0.79 | 0.84 | 0.81 |
PD ORIGIN | 1,710 | 0.97 | 0.65 | 0.78 | 0.06 | 0.9 | 0.11 | 0.77 | 0.69 | 0.72 |
PROCESSING PURPOSES | 1,682 | 0.96 | 0.74 | 0.83 | 0.24 | 0.92 | 0.38 | 0.84 | 0.54 | 0.66 |
DATA SHARING.CONDITION | 709 | 0.91 | 0.79 | 0.84 | 0.1 | 0.91 | 0.17 | 0.85 | 0.7 | 0.76 |
PD STORAGE DETAILS | 441 | 0.99 | 0.55 | 0.71 | 0.03 | 0.88 | 0.06 | 0.89 | 0.76 | 0.82 |
DATA SUBJECT RIGHT | 1,895 | 0.98 | 0.56 | 0.71 | 0.04 | 0.93 | 0.07 | 0.83 | 0.71 | 0.76 |
COMPLAINT | 165 | 0.99 | 0.7 | 0.81 | 0.02 | 0.92 | 0.05 | 0.8 | 0.53 | 0.64 |
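The F1 columns in Table 3 are the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). The short sketch below recomputes F1 for a few rows as a sanity check. The helper name `f1_score` is illustrative, and because the published P and R values are already rounded to two decimals, a recomputed F1 can differ from the printed one in the last digit for some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (concept, classifier, P, R) copied from Table 3.
ROWS = [
    ("CONTROLLER", "TF-IDF + RF", 1.00, 0.41),
    ("PD CATEGORY", "TF-IDF + RF", 0.93, 0.91),
    ("CONTROLLER", "BERT", 0.79, 0.69),
]

if __name__ == "__main__":
    for concept, classifier, p, r in ROWS:
        print(f"{concept:12s} {classifier:12s} F1 = {f1_score(p, r):.2f}")
```

The first row shows why F1 is reported alongside precision: a classifier with perfect precision but a recall of 0.41 still ends up with an F1 of only about 0.58.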
Table 4 Compliance criteria based on the ICO template
Criterion | Precondition | Postcondition |
C1 | None | CONTROLLER.CONTACT |
C2 | None | PD CATEGORY |
C3 | None | PD ORIGIN.{DIRECT ACTIVE or DIRECT PASSIVE} and PROCESSING PURPOSES |
C4 | When applicable | PD ORIGIN.INDIRECT and PROCESSING PURPOSES |
C5 | None | DATA SHARING |
C6 | None | LAWFUL BASIS.{CONSENT, CONTRACT, LEGAL OBLIGATION, VITAL INTEREST, PUBLIC TASK or LEGITIMATE INTEREST} |
C7 | None | PD STORAGE DETAILS.{LOCATION, TIME and DISPOSAL METHOD} |
C8 | None | DATA SUBJECT RIGHT.{ACCESS, RECTIFICATION, ERASURE, RESTRICTION, OBJECTION and PORTABILITY} |
C9 | LAWFUL BASIS.CONSENT | DATA SUBJECT RIGHT.WITHDRAW CONSENT |
C10 | None | COMPLAINT |
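The criteria in Table 4 can be read as rules over the set of GDPR concepts that a classifier detects in a policy. The sketch below is one possible encoding, not the authors' implementation. It assumes that a brace group joined by "or" is satisfied by any one member and a group joined by "and" requires every member, that a concept counts as present only if its exact dotted label is in the detected set, and that the "when applicable" precondition of C4 is supplied by the caller as a flag. The names `CRITERIA`, `violated_criteria` and `indirect_collection_applies` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

ANY, ALL = "any", "all"
WHEN_APPLICABLE = object()  # marker for the C4 precondition (caller decides)

LAWFUL = [f"LAWFUL BASIS.{b}" for b in (
    "CONSENT", "CONTRACT", "LEGAL OBLIGATION",
    "VITAL INTEREST", "PUBLIC TASK", "LEGITIMATE INTEREST")]
STORAGE = [f"PD STORAGE DETAILS.{d}" for d in (
    "LOCATION", "TIME", "DISPOSAL METHOD")]
RIGHTS = [f"DATA SUBJECT RIGHT.{r}" for r in (
    "ACCESS", "RECTIFICATION", "ERASURE",
    "RESTRICTION", "OBJECTION", "PORTABILITY")]

Group = Tuple[str, List[str]]

# criterion -> (precondition, requirement groups that must all be satisfied)
CRITERIA: Dict[str, Tuple[object, List[Group]]] = {
    "C1": (None, [(ALL, ["CONTROLLER.CONTACT"])]),
    "C2": (None, [(ALL, ["PD CATEGORY"])]),
    "C3": (None, [(ANY, ["PD ORIGIN.DIRECT ACTIVE", "PD ORIGIN.DIRECT PASSIVE"]),
                  (ALL, ["PROCESSING PURPOSES"])]),
    "C4": (WHEN_APPLICABLE, [(ALL, ["PD ORIGIN.INDIRECT", "PROCESSING PURPOSES"])]),
    "C5": (None, [(ALL, ["DATA SHARING"])]),
    "C6": (None, [(ANY, LAWFUL)]),
    "C7": (None, [(ALL, STORAGE)]),
    "C8": (None, [(ALL, RIGHTS)]),
    "C9": ("LAWFUL BASIS.CONSENT", [(ALL, ["DATA SUBJECT RIGHT.WITHDRAW CONSENT"])]),
    "C10": (None, [(ALL, ["COMPLAINT"])]),
}


def violated_criteria(detected: Set[str],
                      indirect_collection_applies: bool = False) -> List[str]:
    """Return the IDs of criteria whose precondition holds but whose
    postcondition is not met by the detected concepts."""
    violations = []
    for cid, (precondition, groups) in CRITERIA.items():
        if precondition is WHEN_APPLICABLE:
            applies = indirect_collection_applies
        elif precondition is None:
            applies = True
        else:
            applies = precondition in detected
        if not applies:
            continue
        satisfied = all(
            (any if mode == ANY else all)(c in detected for c in concepts)
            for mode, concepts in groups
        )
        if not satisfied:
            violations.append(cid)
    return violations


if __name__ == "__main__":
    example = {
        "CONTROLLER.CONTACT", "PD CATEGORY", "PD ORIGIN.DIRECT ACTIVE",
        "PROCESSING PURPOSES", "DATA SHARING", "LAWFUL BASIS.CONSENT",
        "PD STORAGE DETAILS.TIME", "DATA SUBJECT RIGHT.ACCESS", "COMPLAINT",
    }
    print(violated_criteria(example))  # ['C7', 'C8', 'C9']
```

In the example, the policy names consent as a lawful basis, so the precondition of C9 holds and the missing WITHDRAW CONSENT concept is reported; C7 and C8 fail because they require every listed sub-concept rather than just one.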
References:
[1] MCDONALD A, CRANOR L. The cost of reading privacy policies[J]. Journal of Law and Policy for the Information Society, 2008, 4: 543-568.
[2] LI H X, ZHU H J, DU S G, et al. Privacy leakage of location sharing in mobile social networks: attacks and defense[J]. IEEE Transactions on Dependable and Secure Computing, 2016, 15(4): 646-660.
[3] TUCKER S. Google fined $57m by data protection watchdog over GDPR violations[EB].
[4] TOM J, SING E, MATULEVICIUS R. Conceptual representation of the GDPR: model and application directions[C]// Proceedings of 17th International Conference on Perspectives in Business Informatics Research (BIR). 2018: 18-28.
[5] LIPPI M, PALKA P, CONTISSA G, et al. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service[J]. Artificial Intelligence and Law, 2019, 27(2): 117-139.
[6] TORRE D, ABUALHAIJA S, SABETZADEH M, et al. An AI-assisted approach for checking the completeness of privacy policies against GDPR[C]// Proceedings of the 2020 IEEE 28th International Requirements Engineering Conference. 2020: 136-146.
[7] AMARAL O, ABUALHAIJA O, ABUALHAIJA S, et al. AI-enabled automation for completeness checking of privacy policies[J]. IEEE Transactions on Software Engineering, 2021, (N/A): 1-21.
[8] WILSON S, SCHAUB F, DARA A, et al. The creation and analysis of a website privacy policy corpus[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). 2016: 1330-1340.
[9] SATHYENDRA K, WILSON S, SCHAUB F, et al. Identifying the provision of choices in privacy policy text[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017: 2774-2779.
[10] ZIMMECK S, STORY P, SMULLEN D, et al. MAPS: scaling privacy compliance analysis to a million apps[C]// Proceedings of the Conference on Privacy Enhancing Technologies. 2019: 66-86.
[11] SRINATH M, WILSON S, GILES C. Privacy at scale: introducing the PrivaSeer corpus of Web privacy policies[J]. arXiv preprint arXiv:2020.11131.
[12] KUZNETSOV M, NOVIKOVA E, KOTENKO I, et al. Privacy policies of IoT devices: collection and analysis[J]. Sensors, 2022, 22(5): 1-23.
[13] TESFAY W, HOFMANN P, NAKAMURA T, et al. Privacy guide: towards an implementation of the EU GDPR on internet privacy policy evaluation[C]// Proceedings of the 4th ACM International Workshop on Security and Privacy Analytics. 2018: 15-21.
[14] NEJAD N, SCERRI S, LEHMANN J. KnIGHT: mapping privacy policies to GDPR[C]// Proceedings of 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018). 2018: 258-272.
[15] TORRE D, SOLTANA G, SABETZADEH M, et al. Using models to enable compliance checking against the GDPR: an experience report[C]// Proceedings of the 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems. 2019: 10-20.
[16] HAMDANI R, MUSTAPHA M, AMARILES D, et al. A combined rule-based and machine learning approach for automated GDPR compliance checking[C]// Proceedings of the 18th International Conference on Artificial Intelligence and Law. 2021: 40-49.
[17] ZIMMECK S, BELLOVIN S. Privee: an architecture for automatically analyzing web privacy policies[C]// Proceedings of the 23rd USENIX Security Symposium. 2014: 1-16.
[18] HARKOUS H, FAWAZ K, LEBRET R, et al. Automated analysis and presentation of privacy policies using deep learning[C]// Proceedings of the 27th USENIX Security Symposium. 2018: 531-548.
[19] AYALA-RIVERA V, PASQUALE L. The grace period has ended: an approach to operationalize GDPR requirements[C]// Proceedings of the 2018 IEEE 26th International Requirements Engineering Conference. 2018: 136-146.
[20] DEGELING M, UTZ C, LENTZSCH C, et al. We value your privacy now take some cookies: measuring the GDPR’s impact on web privacy[C]// Proceedings of the 2019 Network and Distributed Systems Security Symposium. 2018: 1-14.
[21] LINDEN T, KHANDELWAL R, HARKOUS H, et al. The privacy policy landscape after the GDPR[C]// Proceedings on Privacy Enhancing Technologies, 2020(1): 47-64.
[22] ABHIJITHG, SHOMIR W, NORMAN S. Supervised and unsupervised methods for robust separation of section titles and prose text in web documents[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 850-855.
[23] WILLIAMS B, HALLOIN C, LOBEL W, et al. Data-driven model development for cardiomyocyte production experimental failure prediction[C]// Proceedings of 30th European Symposium on Computer Aided Process Engineering. 2020: 1639-1644.
[24] MOGUERZA J, MUNOZ A. Support vector machines with applications[J]. Statistical Science, 2006, 21(3): 322-336.
[25] GEURTS P, ERNST D, WEHENKEL L. Extremely randomized trees[J]. Machine Learning, 2006, 63: 3-42.
[26] CHEN T, GUESTRIN C. XGBoost: a scalable tree boosting system[R]. 2016.
[27] COHEN J. A coefficient of agreement for nominal scales[J]. Educational and Psychological Measurement, 1960, 20(1): 37-46.
[28] SALTON G, BUCKLEY C. Term weighting approaches in automatic text retrieval[J]. Information Processing & Management, 1988, 24(5): 513-523.
[29] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.
[30] BOJANOWSKI P, GRAVE E, JOULIN A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
[31] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. Computer Science, 2013, 5: 324-352.
[32] JACOB D, CHANG M W, KENTON L, et al. BERT: pre-training of deep bidirectional transformers for language understanding[R]. 2018.