Generate medical synthetic data based on generative adversarial network

doi:10.11959/j.issn.1000-436x.2022057

Abstract

Modeling the probability distribution of rows in structured electronic health records and generating realistic synthetic data is a non-trivial task. Tabular data usually contain discrete columns, and traditional encoding approaches may suffer from the curse of dimensionality. The Poincaré ball model was used to capture the hierarchical structure of nominal variables, and a Gaussian copula-based generative adversarial network was employed to generate synthetic structured electronic health records. In experiments, the generated training data differed from the original data by only 2% in utility while still preserving privacy.
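As a small illustration of why the Poincaré ball suits hierarchical nominal variables, the sketch below computes the standard Poincaré distance between points inside the unit ball. The 2-D coordinates and the names `root`, `leaf_a`, and `leaf_b` are hypothetical; the article's actual embedding procedure is not reproduced here.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical embeddings: a root-like concept at the origin and two
# leaf-like concepts close to the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

print(poincare_distance(root, leaf_a))    # root to leaf
print(poincare_distance(leaf_a, leaf_b))  # leaf to leaf, noticeably larger
```

Distances grow rapidly near the boundary, so sibling leaves can sit far from each other while each stays comparatively close to a shared ancestor near the origin.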

Key words: generative adversarial network, representation learning, privacy-utility analysis, electronic health record

CLC Number: TP309.2

Xiayu XIANG, Jiahui WANG, Zirui WANG, Shaoming DUAN, Hezhong PAN, Rongfei ZHUANG, Peiyi HAN, Chuanyi LIU. Generate medical synthetic data based on generative adversarial network[J]. Journal on Communications, 2022, 43(3): 211-224.




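Disease category	ICD-9 codes
Circulatory system diseases	390-459, 785
Respiratory system diseases	460-519, 786
Digestive system diseases	520-579, 787
Diabetes	250.xx
Injury and poisoning	800-999
Musculoskeletal diseases	710-739
Genitourinary system diseases	580-629, 788
Neoplasms	140-239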
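The grouping in the table above can be applied as a simple lookup. The following is a minimal sketch, not the authors' preprocessing code: the function name `icd9_category` and the fallback label "Other" (for codes outside the listed ranges, including V and E codes) are assumptions made for illustration.

```python
# (category, list of inclusive integer ranges), taken from the table above.
ICD9_GROUPS = [
    ("Circulatory", [(390, 459), (785, 785)]),
    ("Respiratory", [(460, 519), (786, 786)]),
    ("Digestive", [(520, 579), (787, 787)]),
    ("Diabetes", [(250, 250)]),  # 250.xx
    ("Injury and poisoning", [(800, 999)]),
    ("Musculoskeletal", [(710, 739)]),
    ("Genitourinary", [(580, 629), (788, 788)]),
    ("Neoplasms", [(140, 239)]),
]


def icd9_category(code: str) -> str:
    """Map an ICD-9 code string such as '428' or '250.83' to a category."""
    major = code.strip().split(".")[0]
    if not major.isdecimal():
        # Supplementary codes (V.., E..) or malformed input: assumed "Other".
        return "Other"
    value = int(major)
    for name, ranges in ICD9_GROUPS:
        if any(low <= value <= high for low, high in ranges):
            return name
    return "Other"


print(icd9_category("428"))     # Circulatory
print(icd9_category("250.83"))  # Diabetes
print(icd9_category("V57"))     # Other
```

Collapsing hundreds of raw codes into a handful of groups keeps the number of distinct values per column small before any encoding or embedding step.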

Dataset	Nearest-neighbor adversarial accuracy (training set)	Nearest-neighbor adversarial accuracy (test set)	Privacy loss
1,000,000	0.782	0.703	-0.079
750,000	0.800	0.729	-0.071
500,000	0.786	0.712	-0.073
250,000	0.779	0.702	-0.077
100,000	0.791	0.719	-0.072
50,000	0.842	0.769	-0.073
20,000	0.805	0.732	-0.073
No Embedding	0.909	0.820	0.089
CTGAN	0.912	0.831	0.081
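For readers unfamiliar with the metric, the sketch below shows one common way to compute nearest-neighbor adversarial accuracy and the derived privacy loss. It assumes Euclidean distance, the usual leave-one-out definition, and privacy loss taken as test accuracy minus training accuracy (which agrees with, for example, 0.703 - 0.782 = -0.079 in the first row). The function names and toy points are hypothetical, and the article's exact implementation may differ.

```python
import math


def _nearest(point, others, skip_index=None):
    """Distance from point to its nearest neighbor in others."""
    return min(
        math.dist(point, other)
        for j, other in enumerate(others)
        if j != skip_index
    )


def nn_adversarial_accuracy(real, synth):
    """0.5 means real and synthetic are indistinguishable; 0 suggests copying."""
    # Real records whose nearest real neighbor is closer than any synthetic one.
    real_side = sum(
        _nearest(r, synth) > _nearest(r, real, skip_index=i)
        for i, r in enumerate(real)
    )
    # Synthetic records whose nearest synthetic neighbor is closer than any real one.
    synth_side = sum(
        _nearest(s, real) > _nearest(s, synth, skip_index=i)
        for i, s in enumerate(synth)
    )
    return 0.5 * (real_side / len(real) + synth_side / len(synth))


def privacy_loss(train_accuracy, test_accuracy):
    """Assumed definition: test adversarial accuracy minus training accuracy."""
    return test_accuracy - train_accuracy


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
copied = list(real)                                   # synthetic data that memorizes
far = [(100.0, 100.0), (101.0, 100.0), (100.0, 101.0), (101.0, 101.0)]

print(nn_adversarial_accuracy(real, copied))  # 0.0
print(nn_adversarial_accuracy(real, far))     # 1.0
print(round(privacy_loss(0.782, 0.703), 3))   # -0.079
```

This brute-force version is quadratic in the number of records, so it is only suitable for small samples; a spatial index would be needed at the dataset sizes listed in the table.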

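Dataset	Divergence (real vs. synthetic training set)	Divergence (real vs. synthetic test set)	Discrepancy score (real training vs. real test set)	Discrepancy score (real vs. synthetic training set)	Discrepancy score (real vs. synthetic test set)	Discrepancy score (synthetic dataset)
1,000,000	0.186	0.189	-	2.181	2.186	1.356
750,000	0.200	0.205	-	2.186	2.212	1.312
500,000	0.203	0.205	-	2.171	2.189	1.288
250,000	0.209	0.208	2.412	2.176	2.190	1.347
100,000	0.183	0.177	-	2.188	2.194	1.344
50,000	0.183	0.178	-	2.218	2.224	1.385
20,000	0.240	0.234	-	2.190	2.194	1.234
No Embedding	5.429	5.649	3.142	3.590	5.540	3.128
CTGAN	4.839	5.069	3.142	3.300	5.350	3.584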
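The table does not state which divergence is reported. Assuming it is a Kullback-Leibler divergence between the empirical distributions of a discrete column, a minimal sketch could look like the following; the function name, the `eps` floor for categories missing from the synthetic data, and the toy values are all assumptions for illustration.

```python
import math
from collections import Counter


def kl_divergence(real_values, synth_values, eps=1e-9):
    """KL(P_real || P_synth) for one discrete column, in nats."""
    p_counts = Counter(real_values)
    q_counts = Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    total = 0.0
    for category, count in p_counts.items():
        p = count / n_p
        # Floor q so that categories absent from the synthetic data do not
        # produce a division by zero (an assumption, not from the article).
        q = max(q_counts[category] / n_q, eps)
        total += p * math.log(p / q)
    return total


real_column = ["a", "a", "b", "b"]
synth_column = ["a", "a", "a", "b"]

print(kl_divergence(real_column, real_column))   # 0.0 for identical distributions
print(kl_divergence(real_column, synth_column))  # positive when they differ
```

Lower values indicate that the synthetic column's category frequencies track the real ones more closely.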

Algorithm	Original dataset			100,000-record synthetic dataset
Algorithm	Accuracy	F1 score	Time/s	Accuracy	F1 score	Time/s
NearestCentroid	0.63	0.71	0.23	0.72	0.77	0.53
DecisionTreeClassifier	0.82	0.83	0.89	0.84	0.84	18.49
ExtraTreeClassifier	0.84	0.84	0.28	0.76	0.80	21.98
LabelPropagation	0.85	0.84	1 941.31	0.91	0.87	2 578.94
LabelSpreading	0.85	0.84	2 361.60	0.91	0.87	2 709.98
PassiveAggressiveClassifier	0.87	0.86	0.29	0.91	0.87	0.74
BaggingClassifier	0.91	0.87	4.69	0.79	0.81	121.11
XGBClassifier	0.91	0.87	7.02	0.77	0.80	41.74
LinearDiscriminantAnalysis	0.91	0.87	0.77	0.90	0.87	2.12
KNeighborsClassifier	0.91	0.87	78.30	0.90	0.87	675.94
QuadraticDiscriminantAnalysis	0.11	0.05	0.32	0.91	0.87	0.81
CalibratedClassifierCV	0.91	0.87	65.22	0.91	0.87	107.71
LogisticRegression	0.91	0.87	0.62	0.91	0.87	1.24
LinearSVC	0.91	0.87	17.02	0.91	0.87	25.61
RidgeClassifier	0.91	0.87	0.29	0.90	0.87	0.66
RidgeClassifierCV	0.91	0.87	0.40	0.90	0.87	1.10
DummyClassifier	0.84	0.84	0.21	0.64	0.72	0.58
GaussianNB	0.09	0.02	0.28	0.83	0.83	0.75
BernoulliNB	0.91	0.87	0.27	0.80	0.82	0.78
LGBMClassifier	0.91	0.87	0.77	0.68	0.75	3.32
SGDClassifier	0.91	0.87	0.60	0.91	0.87	1.46
ExtraTreesClassifier	0.91	0.87	8.54	0.75	0.79	0.79
AdaBoostClassifier	0.91	0.87	3.59	0.87	0.86	68.82
SVC	0.91	0.87	864.79	0.91	0.87	578.69
CheckingClassifier	0.91	0.87	0.17	0.91	0.87	0.48
RandomForestClassifier	0.91	0.87	7.47	0.28	0.35	81.50
Perceptron	0.81	0.82	0.34	0.91	0.87	0.62
Mean utility	0.821	0.794	198.751	0.827	0.822	260.981

Dataset	Accuracy	F1 score	Time/s
Original data	0.821	0.794	198.75
20,000	0.751	0.752	12.60
50,000	0.758	0.743	51.20
100,000	0.827	0.821	260.98
250,000	0.434	0.441	609.49
500,000	0.302	0.310	2 008.68
750,000	0.421	0.455	4 363.73
1,000,000	0.486	0.499	6 637.02
CTGAN	0.646	0.667	6 637.02


Component	Model	F1 score
Representation learning method	Data embedding	+2.2%
	One-hot encoding	-3.7%
Deep learning model architecture	Vanilla GAN	-7.2%
	WGAN	+1.87%
