Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation

Authors

  • Mohamed Shantal, Center for Artificial Intelligence Technology, Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor Darul Ehsan, Malaysia
  • Zalinda Othman, Center for Artificial Intelligence Technology, Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor Darul Ehsan, Malaysia
  • Azuraliza Abu Bakar, Center for Artificial Intelligence Technology, Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor Darul Ehsan, Malaysia

DOI:

https://doi.org/10.11113/mjfas.v19n6.3098

Keywords:

Correlation Coefficient, Pearson's Correlation, Missing data, Mean Imputation, k-NN imputation, Expectation Maximization imputation

Abstract

The correlation coefficient is an essential statistical technique for discovering relationships among variables. Depending on the data type, correlation can be quantified with Pearson's, Spearman's, or Kendall's correlation coefficient. Missing data reduces the amount of data available and can therefore affect the results. Furthermore, removing records with missing values, as in complete case analysis or available case analysis, may introduce selection bias. In this paper, we investigate the impact of missing data on the correlation coefficient by calculating the difference between the correlation coefficients of the original complete dataset and those of the same dataset with missing data. Two deletion strategies (Listwise and Pairwise) and three imputation strategies (Mean, k-Nearest Neighbors (k-NN), and Expectation-Maximization) were used to prepare the data before calculating the correlation coefficients. The unique correlation coefficient values were converted to a one-dimensional array, and the root mean square error (RMSE) was used to evaluate the experiments. Eight UCI and Kaggle datasets with different sizes and numbers of attributes were used in this study. The experimental results demonstrate that the Pairwise strategy and k-NN imputation both give good results on the correlation coefficient when the missing rate is moderate or lower. Pairwise uses all the available values and discards only the missing values of the related attributes, while k-NN fills the missing values with new values that produce correlation coefficient values close to the actual values.
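The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the 10% missing rate and the completely-at-random missingness are assumptions, the "unique values" are taken to be the strict upper triangle of the Pearson correlation matrix, and every function name is hypothetical. Only Listwise deletion, Pairwise deletion, and Mean imputation are shown; k-NN and Expectation-Maximization imputation are left out to keep the example dependent on NumPy alone.

```python
import numpy as np

def unique_corr(corr):
    """Flatten the strict upper triangle of a correlation matrix to 1-D."""
    rows, cols = np.triu_indices_from(corr, k=1)
    return corr[rows, cols]

def listwise_corr(X):
    """Listwise deletion: drop every row that has any missing value."""
    complete_rows = ~np.isnan(X).any(axis=1)
    return np.corrcoef(X[complete_rows], rowvar=False)

def pairwise_corr(X):
    """Pairwise deletion: for each attribute pair, keep rows where both are present."""
    n_attrs = X.shape[1]
    corr = np.eye(n_attrs)
    for i in range(n_attrs):
        for j in range(i + 1, n_attrs):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            r = np.corrcoef(X[both, i], X[both, j])[0, 1]
            corr[i, j] = corr[j, i] = r
    return corr

def mean_imputed_corr(X):
    """Mean imputation: replace each missing value with its column mean."""
    filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    return np.corrcoef(filled, rowvar=False)

def rmse(a, b):
    """Root mean square error between two 1-D arrays of coefficients."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic complete dataset with correlated attributes (assumption).
rng = np.random.default_rng(0)
complete = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

# Remove about 10% of the cells completely at random (assumption).
with_missing = complete.copy()
with_missing[rng.random(complete.shape) < 0.10] = np.nan

# Reference coefficients come from the original complete dataset.
reference = unique_corr(np.corrcoef(complete, rowvar=False))

strategies = {
    "listwise": listwise_corr,
    "pairwise": pairwise_corr,
    "mean": mean_imputed_corr,
}
results = {
    name: rmse(reference, unique_corr(strategy(with_missing)))
    for name, strategy in strategies.items()
}
for name, error in results.items():
    print(f"{name}: RMSE = {error:.4f}")
```

A lower RMSE means the strategy recovers correlation coefficients closer to those of the complete data. The ranking produced by this toy setup should not be read as a reproduction of the paper's results.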

Published

04-12-2023