Imputation Method based on Adaptive Group Lasso for High-dimensional Compositional Data with Missing Values
DOI: https://doi.org/10.11113/mjfas.v21n1.3621

Keywords: compositional data, imputation method, machine learning, adaptive group lasso

Abstract
Compositional data describe the relative proportions of the components that make up a whole. Such data are common in chemistry, biology, geology, and other scientific and engineering fields. In practice, however, collected datasets often contain a large number of missing values, and the constrained structure of compositional data makes traditional estimation methods difficult to apply. How to perform effective statistical inference on compositional data with missing values has therefore attracted considerable attention in recent years. Log-ratio transformations make standard statistical methods applicable to compositional data, but they place strict requirements on the components, which must be strictly positive and fully observed; zeros and missing entries violate these constraints. Exploring new estimation methods for compositional data is therefore of theoretical significance. In this paper, a compositional data imputation method based on the adaptive group least absolute shrinkage and selection operator (AGLasso) is proposed. AGLasso adapts the imputation to different data distributions and missingness patterns based on the characteristics of the data; whereas traditional methods may lose or bias information, AGLasso imputes while preserving the integrity of the data. Through data analysis, the imputation performance on compositional data with missing values is compared under different missing rates and correlation coefficients, and a comparative study is conducted against the Lasso, adaptive Lasso, and group Lasso imputation methods. The results show that the adaptive group Lasso is superior to the other three imputation methods. In domains such as healthcare, where data quality strongly affects decision making, AGLasso can help improve data integrity and usability.
In future work, imputation methods based on Generative Adversarial Networks (GANs) and novel deep learning approaches using techniques such as autoencoders are expected to show even greater power in handling missing values.
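The idea described above can be sketched in code. The following is a minimal NumPy illustration, not the authors' exact algorithm: it fits an adaptive group lasso by proximal gradient descent (adaptive group weights taken from an initial ridge fit) and uses the fitted regression to fill the missing entries of one column. The functions `adaptive_group_lasso` and `impute_column`, the ridge penalty, step size, and regularization level are all illustrative choices, and the sketch assumes only the target column contains missing values.

```python
import numpy as np

def adaptive_group_lasso(X, y, groups, lam=0.01, n_iter=500):
    """Proximal-gradient solver for the adaptive group lasso.

    Minimises 0.5/n * ||y - X b||^2 + lam * sum_g w_g ||b_g||_2,
    where the adaptive weights w_g = 1 / (||b_init_g|| + eps) come
    from an initial ridge fit, so strong groups are penalised less.
    """
    n, p = X.shape
    b_init = np.linalg.solve(X.T @ X + 0.1 * np.eye(p), X.T @ y)  # ridge pilot
    gids = np.unique(groups)
    w = {g: 1.0 / (np.linalg.norm(b_init[groups == g]) + 1e-6) for g in gids}
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the loss
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b - step * (X.T @ (X @ b - y) / n)    # gradient step on the loss
        for g in gids:                            # blockwise soft-thresholding
            idx = groups == g
            norm = np.linalg.norm(b[idx])
            t = step * lam * w[g]
            b[idx] = 0.0 if norm <= t else b[idx] * (1.0 - t / norm)
    return b

def impute_column(X, j, groups, lam=0.01):
    """Impute NaNs in column j by regressing it on the remaining columns.

    Assumes (for this sketch) that only column j has missing values.
    `groups` assigns a group id to every column of X.
    """
    miss = np.isnan(X[:, j])
    others = np.delete(np.arange(X.shape[1]), j)
    Z = X[:, others]
    b = adaptive_group_lasso(Z[~miss], X[~miss, j], groups[others], lam)
    X_out = X.copy()
    X_out[miss, j] = Z[miss] @ b   # predicted values replace the NaNs
    return X_out
```

Because the penalty acts on whole groups of coefficients, irrelevant groups are zeroed out jointly while the adaptive weights keep relevant groups nearly unshrunk, which is what gives the adaptive group lasso its advantage over plain Lasso-style imputation when predictors have a natural group structure.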
License
Copyright (c) 2025 Ying Tian, Majid Khan Majahar Ali, Lili Wu, Tao Li

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.