A Novel Feature Selection Method for Ultra High Dimensional Survival Data

Authors

  • Nahid Salma, School of Mathematical Sciences, Universiti Sains Malaysia, Penang, Malaysia; Department of Statistics and Data Science, Jahangirnagar University, Savar, Dhaka-1342, Bangladesh
  • Ali Hussain Mohammed Al-Rammahi, School of Mathematical Sciences, Universiti Sains Malaysia, Penang, Malaysia
  • Majid Khan Majahar Ali, School of Mathematical Sciences, Universiti Sains Malaysia, Penang, Malaysia

DOI:

https://doi.org/10.11113/mjfas.v20n5.3665

Keywords:

Ultra-high dimension, renal cell carcinoma, Cox model, Freund model, feature selection.

Abstract

Identifying relevant features in ultra-high dimensional survival data is a fundamental objective in biological discovery and statistical learning. Conventional survival regression algorithms are challenged by the exponential growth of raw data. In real-world settings, ultra-high dimensionality is particularly consequential for two-component organ structures such as the kidneys, lungs, and eyes, where gene interactions between the two components affect both future system stability and the frequency of disease. Traditional statistical procedures for survival systems, however, are restricted to a single component, and to date no feature selection method is available for ultra-high-dimensional survival data with two compartments. To determine the optimal methods for this situation, this study therefore proposed and compared the performance of ten variable selection approaches on ultra-high dimensional Renal Cell Carcinoma (RCC) survival data with two compartments. The study adopted Freund's baseline hazard function as the baseline hazard of the Cox model (Lasso Freund, Robust Lasso Freund, Elastic Net Freund) and integrated these with sure independence screening (SIS) and iterative sure independence screening (ISIS), yielding LF-SIS, RLF-SIS, ENF-SIS, LF-ISIS, RLF-ISIS, and ENF-ISIS. Two baseline approaches, LASSO and Elastic Net (EN), were also considered, with EN additionally combined with SIS and ISIS (EN-SIS, EN-ISIS). Based on the model validation measures, including MSE (340.000), SSE (25300.0), and RMSE (16.490), the Robust Lasso Freund-Iterative Sure Independence Screening (RLF-ISIS) and Robust Lasso Freund-Sure Independence Screening (RLF-SIS) strategies outperform the other proposed approaches in the precision of variable selection.
However, both methods showed a relatively low R² (0.71), which suggests the presence of outliers in the dataset; box-plots of selected predictive genes confirm this. Furthermore, RLF-ISIS and RLF-SIS identified 49 and 68 genes, respectively, that have both direct and indirect effects on patients with RCC. In conclusion, although RLF-SIS and RLF-ISIS outperform the other proposed approaches and can be regarded as viable variable selection strategies, they may not be the optimal choice for ultra-high dimensional survival data with outliers. The study can be expanded in the future by applying competing risk theory to sequential and parallel structures, which underlie most complex mechanical systems found in manufacturing facilities. Notably, no feature selection method is currently available for ultra-high-dimensional survival data with both outliers and two compartments; further research should therefore focus on developing an advanced hybrid feature selection approach, with particular emphasis on deep learning strategies.
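The Freund baseline hazard used in the LF, RLF, and ENF methods comes from Freund's (1961) bivariate exponential model for a two-component system, cited in the references below. A sketch of its standard form (notation is ours, not the authors'): components A and B initially fail at rates α and β, and once one component fails, the surviving component's rate switches to α′ or β′, giving the joint density

```latex
% Freund (1961) bivariate exponential for lifetimes (X, Y) of a
% two-component system: initial rates \alpha, \beta; after the first
% failure the survivor's rate switches to \alpha' or \beta'.
f(x, y) =
\begin{cases}
  \alpha \beta' \, e^{-\beta' y - (\alpha + \beta - \beta') x}, & 0 \le x < y, \\[4pt]
  \alpha' \beta \, e^{-\alpha' x - (\alpha + \beta - \alpha') y}, & 0 \le y < x.
\end{cases}
```

The rate change after the first failure is what lets the model capture dependence between the two compartments, which a product of independent exponentials cannot.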
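The SIS step in the pipelines above reduces dimension before any penalized Cox fit: each feature is ranked by its marginal association with the outcome and only the top few survive to the model-fitting stage (ISIS repeats this after adjusting for features already selected). A minimal NumPy sketch of the screening idea on synthetic data follows; the function name, toy data, and use of simple correlation with a continuous response are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure Independence Screening (illustrative sketch): rank features
    by absolute marginal correlation with the response, keep the top d."""
    Xc = X - X.mean(axis=0)          # center columns
    yc = y - y.mean()                # center response
    # marginal Pearson correlation of each column with y
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:d]

# Toy ultra-high dimensional setting: n = 100 samples, p = 2000 features,
# with only features 0 and 1 truly driving the response.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 3 * X[:, 1] + 0.1 * rng.standard_normal(n)

kept = sis_screen(X, y, d=10)
print("screened down to", len(kept), "of", p, "features")
```

In the paper's setting the retained features would then be passed to a penalized (Lasso, robust Lasso, or elastic net) Cox-type fit, which is only computationally feasible after this screening step.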

References

AL-Rammahi, A. H., & Dikheel, T. R. (2021, October). Sure independent screening elastic net for ultra-high dimensional survival data. In AIP Conference Proceedings (Vol. 2404, No. 1). AIP Publishing.

AL-Rammahi, A. H., & Dikheel, T. R. (2022, October). Freund’s model with iterated sure independence screening in Cox proportional hazard model. In AIP Conference Proceedings (Vol. 2398, No. 1). AIP Publishing.

Araveeporn, A. (2022). The penalized regression and penalized logistic regression of Lasso and elastic net methods for high-dimensional data: A modelling approach.

Ba, Z., Xiao, Y., He, M., Liu, D., Wang, H., Liang, H., & Yuan, J. (2022). Risk factors for the comorbidity of hypertension and renal cell carcinoma in the cardio-oncologic era and treatment for tumor-induced hypertension. Frontiers in Cardiovascular Medicine, 9, 810262.

Bhattacharjee, A., Dey, J., & Kumari, P. (2022). A combined iterative sure independence screening and Cox proportional hazard model for extracting and analyzing prognostic biomarkers of adenocarcinoma lung cancer. Healthcare Analytics, 2, 100108.

Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.

Cox, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2), 187–202.

Chamlal, H., Benzmane, A., & Ouaderhman, T. (2024). Elastic net-based high dimensional data selection for regression. Expert Systems with Applications, 244, 122958.

Cheon, S., Agarwal, A., Popovic, M., Milakovic, M., Lam, M., Fu, W., et al. (2016). The accuracy of clinicians’ predictions of survival in advanced cancer: A review. Annals of Palliative Medicine, 5(1), 22–29.

Zhang, L., Zhang, J., Gao, W., Bai, F., Li, N., & Ghadimi, N. (2024). A deep learning outline aimed at prompt skin cancer detection utilizing gated recurrent unit networks and improved orca predation algorithm. Biomedical Signal Processing and Control, 90, 105858.

Cheng, X., & Wang, H. (2023). A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. Computer Methods and Programs in Biomedicine, 229, 107269.

Chen, Y., Gu, D., Wen, Y., Yang, S., Duan, X., Lai, Y., et al. (2020). Identifying the novel key genes in renal cell carcinoma by bioinformatics analysis and cell experiments. Cancer Cell International, 20, 1–16.

Huang, J. W., Chen, Y. H., Phoa, F. K. H., Lin, Y. H., & Lin, S. P. (2024). An efficient approach for identifying important biomarkers for biomedical diagnosis. Biosystems, 105163.

Chicco, D., Warrens, M. J., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE, and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623.

Domingo-Relloso, A., Feng, Y., Rodriguez-Hernandez, Z., Haack, K., Cole, S. A., Navas-Acien, A., et al. (2024). Omics feature selection with the extended SIS R package: Identification of a body mass index epigenetic multi-marker in the Strong Heart Study. American Journal of Epidemiology, kwae006.

Fan, J., Samworth, R., & Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. The Journal of Machine Learning Research, 10, 2013–2038.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 70(5), 849–911.

Feng, Y., & Wu, Q. (2022). A statistical learning assessment of Huber regression. Journal of Approximation Theory, 273, 105660.

Fregoso-Aparicio, L., Noguez, J., Montesinos, L., & García-García, J. A. (2021). Machine learning and deep learning predictive models for type 2 diabetes: A systematic review. Diabetology & Metabolic Syndrome, 13(1), 1–22.

Freund, J. E. (1961). A bivariate extension of the exponential distribution. Journal of the American Statistical Association, 56(296), 971–977.

Gujarati, D. N., Bernier, B., & Bernier, B. (2004). Econométrie (pp. 17–5). Brussels: De Boeck.

Freijeiro-González, L., Febrero-Bande, M., & González-Manteiga, W. (2022). A critical review of LASSO and its derivatives for variable selection under dependence among covariates. International Statistical Review, 90(1), 118–145.

Han, X., & Song, D. (2022). Using a machine learning approach to identify key biomarkers for renal clear cell carcinoma. International Journal of General Medicine, 3541–3558.

Kong, S., Yu, Z., Zhang, X., & Cheng, G. (2021). High-dimensional robust inference for Cox regression models using desparsified Lasso. Scandinavian Journal of Statistics, 48(3), 1068–1095.

Zhou, H., & Zou, H. (2024). The nonparametric Box–Cox model for high-dimensional regression analysis. Journal of Econometrics, 239(2), 105419.

Huber, G. P. (1981). The nature of organizational decision making and the design of decision support systems. MIS Quarterly, 1–10.

Jaffe, S. (2015). Planning for US precision medicine initiative underway. The Lancet, 385(9986), 2448–2449.

Sathasivam, S., Adebayo, S. A., Velavan, M., Yee, T. H., & Yi, T. P. (2024, January). Transmission of hepatitis B dynamics in Malaysia using modified SIS hybrid model with Euler and Runge-Kutta method. In AIP Conference Proceedings (Vol. 3016, No. 1). AIP Publishing.

Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., & Fotiadis, D. I. (2015). Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13, 8–17.

Legates, D. R., & McCabe Jr, G. J. (1999). Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1), 233–241.

Liu, Z., Elashoff, D., & Piantadosi, S. (2019). Sparse support vector machines with l0 approximation for ultra-high dimensional omics data. Artificial Intelligence in Medicine, 96, 134–141.

Lu, B., Wang, F., Wang, S., Chen, J., Wen, G., & Fu, R. (2024). Improvement of motor imagery electroencephalogram decoding by iterative weighted Sparse-Group Lasso. Expert Systems with Applications, 238, 122286.

Mayer, D. G., & Butler, D. G. (1993). Statistical validation. Ecological Modelling, 68(1-2), 21–32.

Montazeri, M., Montazeri, M., Montazeri, M., & Beigzadeh, A. (2016). Machine learning models in breast cancer survival prediction. Technology and Health Care, 24(1), 31–42.

Mihaylov, I., Nisheva, M., & Vassilev, D. (2019). Application of machine learning models for survival prognosis in breast cancer studies. Information, 10(3), 93.

Sartori, S. (2011). Penalized regression: Bootstrap confidence intervals and variable selection for high-dimensional data sets.

Salerno, S., & Li, Y. (2023). High-dimensional survival analysis: Methods and applications. Annual Review of Statistics and Its Application, 10, 25–49.

Saldana, D. F., & Feng, Y. (2018). SIS: An R package for sure independence screening in ultrahigh-dimensional statistical models. Journal of Statistical Software, 83(2), 1–25.

Spooner, A., Chen, E., Sowmya, A., Sachdev, P., Kochan, N. A., Trollor, J., & Brodaty, H. (2020). A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific Reports, 10(1), 1–10.

Sültmann, H., Heydebreck, A. V., Huber, W., Kuner, R., Buneβ, A., Vogt, M., & Poustka, A. (2005). Gene expression in kidney cancer is associated with cytogenetic abnormalities, metastasis formation, and patient survival. Clinical Cancer Research, 11(2), 646–655.

Shuch, B., Amin, A., Armstrong, A. J., Eble, J. N., Ficarra, V., Lopez-Beltran, A., & Kutikov, A. (2015). Understanding pathologic variants of renal cell carcinoma: Distilling therapeutic opportunities from biologic complexity. European Urology, 67(1), 85–97.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 58(1), 267–288.

Vasilevsky, N. A., Matentzoglu, N. A., Toro, S., Flack IV, J. E., Hegde, H., Unni, D. R., & Haendel, M. A. (2022). Mondo: Unifying diseases for the world, by the world. medRxiv, 2022-04.

Wang, H., & Li, G. (2017). A selective review on random survival forests for high dimensional data. Quantitative Bio-science, 36(2), 85.

Xu, X., Liang, T., Zhu, J., Zheng, D., & Sun, T. (2019). Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing, 328, 5–15.

Yarahmadi, M. N., MirHassani, S. A., & Hooshmand, F. (2024). Handling the significance of regression coefficients via optimization. Expert Systems with Applications, 238, 121910.

Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 67(2), 301–320.

Al-Thanoon, N. A., Qasim, O. S., & Algamal, Z. Y. (2018). Tuning parameter estimation in SCAD-support vector machine using firefly algorithm with application in gene selection and cancer classification. Computers in Biology and Medicine, 103, 262–268.

Ryan, R. J., Nitta, M., Borger, D., Zukerberg, L. R., Ferry, J. A., Harris, N. L., & Le, L. P. (2011). EZH2 codon 641 mutations are common in BCL2-rearranged germinal center B cell lymphomas. PLOS ONE, 6(12), e28585.

Wang, Z., Song, Q., Yang, Z., Chen, J., Shang, J., & Ju, W. (2019). Construction of immune‐related risk signature for renal papillary cell carcinoma. Cancer Medicine, 8(1), 289–304.

Walton, J., Lawson, K., Prinos, P., Finelli, A., Arrowsmith, C., & Ailles, L. (2023). PBRM1, SETD2 and BAP1—the trinity of 3p in clear cell renal cell carcinoma. Nature Reviews Urology, 20(2), 96–115.

Yu, C., & Yao, W. (2017). Robust linear regression: A review and comparison. Communications in Statistics—Simulation and Computation, 46(8), 6261–6282.

Xiong, T., Wang, Y., & Zhu, C. (2024). A risk model based on 10 ferroptosis regulators and markers established by LASSO-regularized linear Cox regression has a good prognostic value for ovarian cancer patients. Diagnostic Pathology, 19(1), 4.

Ghosh, A., Jaenada, M., & Pardo, L. (2024). Robust adaptive variable selection in ultra-high dimensional linear regression models. Journal of Statistical Computation and Simulation, 94(3), 571–603.

Madadjim, R., An, T., & Cui, J. (2024). MicroRNAs in pancreatic cancer: Advances in biomarker discovery and therapeutic implications. International Journal of Molecular Sciences, 25(7), 3914.

Lopes, M. B., Veríssimo, A., Carrasquinha, E., Casimiro, S., Beerenwinkel, N., & Vinga, S. (2018). Ensemble outlier detection and gene selection in triple-negative breast cancer data. BMC Bioinformatics, 19, 1–15.

Yin, Q., Chen, W., Zhang, C., & Wei, Z. (2022). A convolutional neural network model for survival prediction based on prognosis-related cascaded Wx feature selection. Laboratory Investigation, 102(10), 1064–1074.

Li, K., Wang, F., Yang, L., & Liu, R. (2023). Deep feature screening: Feature selection for ultra-high-dimensional data via deep neural networks. Neurocomputing, 538, 126186.

Afshar, M., & Usefi, H. (2020). High-dimensional feature selection for genomic datasets. Knowledge-Based Systems, 206, 106370.

Huang, C. (2021). Feature selection and feature stability measurement method for high-dimensional small sample data based on big data technology. Computational Intelligence and Neuroscience, 2021.

Zambom, A. Z., & Matthews, G. J. (2021). Sure independence screening in the presence of missing data. Statistical Papers, 62, 817–845.

Yi, G. Y., He, W., & Carroll, R. J. (2022). Feature screening with large-scale and high-dimensional survival data. Biometrics, 78(3), 894–907.

Reese, R., Dai, X., & Fu, G. (2018). Strong sure screening of ultra-high dimensional categorical data. arXiv Preprint arXiv:1801.03539.

Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109–148.

Li, F., Yang, M., Li, Y., Zhang, M., Wang, W., Yuan, D., & Tang, D. (2020). An improved clear cell renal cell carcinoma stage prediction model based on gene sets. BMC Bioinformatics, 21, 1–15.

Sim, K. C., Han, N. Y., Cho, Y., Sung, D. J., Park, B. J., Kim, M. J., & Han, Y. E. (2023). Machine learning–based magnetic resonance radiomics analysis for predicting low-and high-grade clear cell renal cell carcinoma. Journal of Computer Assisted Tomography, 47(6), 873–881.

Sofia, D., Zhou, Q., & Shahriyari, L. (2023). Mathematical and machine learning models of renal cell carcinoma: A review. Bioengineering, 10(11), 1320.

Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31–50.

Published

15-10-2024