Chinese Journal on Internet of Things ›› 2023, Vol. 7 ›› Issue (2): 76-87. doi: 10.11959/j.issn.2096-3750.2023.00337

• Theory and Technology •

基于SMOTE和gcForest的医疗小样本数据分类研究

刘文昌1, 魏赟1, 袁浩轩2, 高跃2   

  1. 1 上海理工大学光电信息与计算机工程学院,上海 200093
    2 复旦大学计算机科学技术学院,上海 200438
  • 修回日期:2023-03-07 出版日期:2023-06-30 发布日期:2023-06-01
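  • About the authors: Wenchang LIU (刘文昌, 1998- ), male, is a master's student at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology. His main research interests include machine learning and data classification.
    Yun WEI (魏赟, 1976- ), female, Ph.D., is an associate professor at the University of Shanghai for Science and Technology. Her main research interests include distributed systems and network information control.
    Haoxuan YUAN (袁浩轩, 1996- ), male, is a Ph.D. student at the School of Computer Science and Technology, Fudan University. His main research interests include deep learning and intelligent spectrum sensing.
    Yue GAO (高跃, 1978- ), male, Ph.D., is a professor at Fudan University. His main research interests include satellite internet, space-air-ground integrated networks, compressed sensing and machine learning, and smart antennas.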
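  • Funding:
    National Key Research and Development Program of China (2018YFB1700902); the project "Construction of a 5G-Network-Based Smart Medical Emergency Rescue System for Megacity Regions", funded by the National Development and Reform Commission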

Research on medical small sample data classification based on SMOTE and gcForest

Wenchang LIU¹, Yun WEI¹, Haoxuan YUAN², Yue GAO²

  ¹ School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  ² School of Computer Science and Technology, Fudan University, Shanghai 200438, China
  • Revised: 2023-03-07 Online: 2023-06-30 Published: 2023-06-01
  • Supported by:
    The National Key Research and Development Program of China (2018YFB1700902)

摘要:

针对传统机器学习模型在医疗小样本数据上由浅层模型结构和复杂数据特征导致的分类表现不佳的问题,提出了一种联合多粒度改进级联森林(cgicForest,combine multi-grained improved cascade forest)模型。通过在多粒度扫描中加入随机抽样环节以及对变换特征进行优化来提高模型表征学习能力,并改进级联森林部分的层级结构来提升模型分类能力。针对存在类别不平衡问题的数据集,提出安全边界过采样(SBS, safe-borderline-SMOTE)算法在属于安全边界的少数样本周围进行动态插值,提高训练数据质量,再通过cgicForest模型进行训练学习,最终得到支持不平衡医疗小样本数据的SBS-cgicForest分类模型。在3种医疗数据集上应用SBS-cgicForest分类模型进行测试,结果表明,cgicForest模型在具有复杂特征的医疗小样本数据上分类的性能指标较多粒度级联森林(gcForest, multi-grained cascade forest)模型提升了4.1~5.4个百分点,与SBS算法结合后各性能指标提升6.6~11.2个百分点,比与传统采样方法结合后的F1评分高出2~2.5个百分点,为解决医疗小样本数据的分类问题提供了参考,并为智慧医疗场景下的物联网应用提供了支持。

关键词: 医疗数据, 小样本, SMOTE, gcForest

Abstract:

To address the poor classification performance of traditional machine learning models on small-sample medical data, which stems from shallow model structures and complex data features, a combine multi-grained improved cascade forest (cgicForest) model is proposed. The model strengthens representation learning by adding a random sampling step to multi-grained scanning and by optimizing the transformed features, and it improves classification ability by modifying the hierarchical structure of the cascade forest. For datasets with class imbalance, a safe-borderline-SMOTE (SBS) algorithm is proposed that dynamically interpolates around minority-class samples lying on the safe borderline, which improves the quality of the training data. Training cgicForest on the resampled data yields the SBS-cgicForest classification model, which supports imbalanced small-sample medical data. The model was evaluated in classification experiments on three medical datasets. The results show that, on small-sample medical data with complex features, the performance metrics of cgicForest are 4.1 to 5.4 percentage points higher than those of the multi-grained cascade forest (gcForest) model. Combining cgicForest with the SBS algorithm raises each metric by 6.6 to 11.2 percentage points, and the resulting F1 score is 2 to 2.5 percentage points higher than that obtained with traditional sampling methods. The work provides a reference for classifying small-sample medical data and supports internet of things applications in smart healthcare scenarios.
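
The abstract does not spell out the SBS procedure, so the sketch below only illustrates the general idea it builds on: SMOTE-style interpolation restricted to minority samples near the class boundary. It assumes the classic Borderline-SMOTE rule (a minority sample counts as borderline when at least half, but not all, of its k nearest neighbours are majority samples) and a fixed number of synthetic samples. The function names `nearest`, `borderline_oversample` and their parameters are hypothetical, and the "safe" filtering and "dynamic" interpolation of SBS are not reproduced.

```python
import math
import random


def nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), skipping `point` itself."""
    others = [c for c in candidates if c[0] is not point]
    return sorted(others, key=lambda c: math.dist(point, c[0]))[:k]


def borderline_oversample(minority, majority, k=3, n_new=4, seed=0):
    """Generate `n_new` synthetic minority points around borderline minority samples.

    Assumption (classic Borderline-SMOTE rule, not the paper's SBS):
    a minority sample is borderline if k/2 <= (#majority neighbours) < k.
    """
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]

    # Step 1: find minority samples that sit close to the class boundary.
    borderline = []
    for p in minority:
        n_major = sum(1 for _, label in nearest(p, labelled, k) if label == 0)
        if k / 2 <= n_major < k:
            borderline.append(p)

    if not borderline or len(minority) < 2:
        return []

    # Step 2: interpolate between a borderline sample and a minority neighbour.
    minority_labelled = [(p, 1) for p in minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(borderline)
        mate, _ = rng.choice(nearest(base, minority_labelled, k))
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class is densified near the decision boundary without duplicating records. The resampled set would then be passed to the classifier for training.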

Key words: medical data, small sample, SMOTE, gcForest
