Application of Imputation Method for Compositional Data with Missing Values based on Adaptive LASSO Model: the Composition of Employment Industry in Taiyuan, China

Authors

  • Ying Tian, ᵃSchool of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Gelugor, Pulau Pinang, Malaysia; ᵇDepartment of Science, Taiyuan Institute of Technology, 030008, Taiyuan, Shanxi, China
  • Majid Khan Majahar Ali, School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Gelugor, Pulau Pinang, Malaysia
  • Fam Pei Shan, School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Gelugor, Pulau Pinang, Malaysia
  • Lili Wu, School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Gelugor, Pulau Pinang, Malaysia
  • Siti Zulaikha Mohd Jamaludin, School of Mathematical Sciences, Universiti Sains Malaysia, USM, 11800 Penang, Malaysia

DOI:

https://doi.org/10.11113/mjfas.v20n1.3034

Keywords:

Missing Values, Compositional Data, Adaptive Lasso, Industry Composition

Abstract

The tripartite industry classification, which divides all economic activities into three parts, reflects the dynamic process of economic development and the historical trend of changes in the structure of resource allocation. The proportion of each industry has become an important indicator of the level of national economic development. These proportions are compositional data, a kind of complex multidimensional data used in many fields. All components of compositional data are non-negative and carry only relative information. In practice, compositional data may contain missing values. However, general statistical analysis methods cannot be applied directly to compositional data with missing values, and the complexity of the missing values makes traditional imputation methods unsuitable. Thus, how to carry out effective statistical inference for compositional data with missing values has recently attracted the attention of many scholars. In this paper, we focus on the imputation problem for compositional data containing missing values and propose an Adaptive Least Absolute Shrinkage and Selection Operator (ALASSO) imputation method that obtains a complete dataset through variable selection and parameter estimation. The new method is then evaluated through simulation and empirical analysis, and compared with mean imputation, k-nearest neighbor imputation, and iterative regression imputation. The results show that the ALASSO imputation method has the highest accuracy across different missing rates, dimensions and correlation coefficients.
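The abstract does not give the algorithmic details, so the following is only a minimal sketch of what a regression-based imputation with an adaptive LASSO penalty might look like, not the authors' implementation. It assumes that only one part of the composition is missing, that the parts are mapped to additive log-ratio coordinates against an observed reference part, and that the penalized regression is solved by plain coordinate descent. The function names (`closure`, `adaptive_lasso`, `impute_part`), the simulated data and the values of `lam` and `gamma` are all hypothetical choices made for illustration.

```python
import numpy as np

def closure(x):
    """Rescale each row of non-negative parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)

def soft_threshold(z, t):
    """Soft-thresholding operator used by LASSO-type coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adaptive_lasso(X, y, lam=0.05, gamma=1.0, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| on centred data,
    with adaptive weights w_j = 1 / |b_ols_j|^gamma (illustrative values)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    b_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]   # initial OLS estimate
    w = 1.0 / (np.abs(b_ols) ** gamma + 1e-8)        # adaptive penalty weights
    col_sq = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = yc - Xc @ b + Xc[:, j] * b[j]      # partial residual without x_j
            rho = Xc[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam * w[j]) / col_sq[j]
    return y_mean - x_mean @ b, b                    # intercept, coefficients

def impute_part(comp, k=0, ref=-1, lam=0.05):
    """Impute NaNs in part k from the log-ratios of the other parts to part `ref`.
    Assumes every part other than k is fully observed."""
    comp = np.array(comp, dtype=float)
    D = comp.shape[1]
    ref = ref % D
    others = [j for j in range(D) if j not in (k, ref)]
    miss = np.isnan(comp[:, k])
    Z = np.log(comp[:, others] / comp[:, [ref]])     # predictors: log-ratios to ref
    y = np.log(comp[~miss, k] / comp[~miss, ref])    # response on complete rows
    b0, b = adaptive_lasso(Z[~miss], y, lam=lam)
    comp[miss, k] = comp[miss, ref] * np.exp(b0 + Z[miss] @ b)
    return closure(comp)                             # rows sum to 1 again

# Simulated 5-part compositions with correlated log-ratios (made-up data).
rng = np.random.default_rng(0)
n, D = 200, 5
cov = 0.6 * np.ones((D - 1, D - 1)) + 0.4 * np.eye(D - 1)
alr = rng.multivariate_normal(np.zeros(D - 1), cov, size=n)
full = closure(np.hstack([np.exp(alr), np.ones((n, 1))]))

data = full.copy()
missing_rows = rng.choice(n, size=20, replace=False)
data[missing_rows, 0] = np.nan                       # part 0 missing in 20 rows

completed = impute_part(data, k=0)

# Naive baseline: fill with the observed column mean, then re-close.
baseline = data.copy()
baseline[missing_rows, 0] = np.nanmean(data[:, 0])
baseline = closure(baseline)

err_alasso = np.abs(completed[missing_rows, 0] - full[missing_rows, 0]).mean()
err_mean = np.abs(baseline[missing_rows, 0] - full[missing_rows, 0]).mean()
print("mean absolute error, adaptive LASSO sketch:", err_alasso)
print("mean absolute error, mean imputation:", err_mean)
```

Working on log-ratios rather than raw proportions is one common way to respect the fact that the parts carry only relative information; the final `closure` call restores the unit-sum constraint after the missing part has been filled in.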

Published

04-12-2023