Chinese Journal on Internet of Things ›› 2023, Vol. 7 ›› Issue (2): 76-87. doi: 10.11959/j.issn.2096-3750.2023.00337

• Theory and Technology •

Research on medical small sample data classification based on SMOTE and gcForest

Wenchang LIU1, Yun WEI1, Haoxuan YUAN2, Yue GAO2   

  1 School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  2 School of Computer Science and Technology, Fudan University, Shanghai 200438, China
  • Revised: 2023-03-07 Online: 2023-06-30 Published: 2023-06-01
  • Supported by:
    The National Key Research and Development Program of China (2018YFB1700902)

Abstract:

To address the poor classification performance of traditional machine learning models on small medical datasets, which is caused by shallow model structures and complex data characteristics, a combined multi-grained improved cascade forest (cgicForest) model is proposed. It strengthens the model's representation learning by adding random sampling to multi-grained scanning and optimizing the transformed features, and it improves classification ability by updating the hierarchical structure of the cascade forest. To handle class imbalance in the datasets, the safe-borderline-SMOTE (SBS) algorithm is proposed; it dynamically interpolates around the minority-class samples that belong to the safe boundary, which improves the quality of the training data. cgicForest is then applied for training, yielding the SBS-cgicForest classification model, which supports imbalanced small-sample medical data. The model was evaluated in classification experiments on three medical datasets. The results show that, on small-sample medical data with complex characteristics, the performance indexes of the cgicForest model are 4.1 to 5.4 percentage points higher than those of the multi-grained cascade forest (gcForest) model. After combination with the SBS algorithm, the performance indexes increase by 6.6 to 11.2 percentage points, and the F1 score is 2 to 2.5 percentage points higher than that obtained with traditional sampling methods. The work provides a reference for solving the classification problem of small medical sample data and supports Internet of Things applications in smart medical scenarios.
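
The interpolation idea behind SMOTE-style oversampling can be illustrated with a minimal sketch. The abstract does not state how SBS selects "safe boundary" samples or how its dynamic interpolation works, so the sketch below is an assumption, not the authors' algorithm: it substitutes the classic Borderline-SMOTE rule (a minority sample is a seed when at least half, but not all, of its k nearest neighbours belong to another class) and uses plain uniform interpolation. The names `k_nearest` and `borderline_smote` are hypothetical.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def borderline_smote(X, y, minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic minority samples around borderline seeds.

    ASSUMPTION: seed selection follows the classic Borderline-SMOTE rule,
    used here as a stand-in for the paper's SBS criterion.
    """
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]

    # Step 1: find minority samples whose neighbourhood is mixed.
    seeds = []
    for i in min_idx:
        others = [j for j in range(len(X)) if j != i]
        nn = k_nearest(X[i], [X[j] for j in others], k)
        n_other = sum(1 for n in nn if y[others[n]] != minority)
        if len(nn) / 2 <= n_other < len(nn):
            seeds.append(i)

    # Step 2: interpolate between each seed and a nearby minority sample.
    synthetic = []
    if not seeds or len(min_idx) < 2:
        return synthetic
    for _ in range(n_new):
        i = rng.choice(seeds)
        mates = [j for j in min_idx if j != i]
        nn = k_nearest(X[i], [X[j] for j in mates], k)
        j = mates[rng.choice(nn)]
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data: four minority points (label 1) next to five majority points (label 0).
X = [[0, 0], [1, 0], [2, 0], [3, 0],
     [3.4, 0], [4.1, 0], [5, 0], [6, 0], [7, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0]
new_points = borderline_smote(X, y, minority=1, k=3, n_new=5)
```

Every synthetic point is a convex combination of two existing minority samples, so it stays inside the region the minority class already occupies; restricting the seeds to borderline samples concentrates the new points near the class boundary.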

Key words: medical data, small sample, SMOTE, gcForest
