Simulation Study of the Test on Covariance Estimator for Outlier Detection in Multivariate Data with Mean and Covariance Shifts

Authors

  • Sharifah Sakinah Syed Abd Mutalib, Faculty of Computer Science and Mathematics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia
  • Siti Zanariah Satari, Centre for Mathematical Sciences, Universiti Malaysia Pahang Al-Sultan Abdullah, Lebuh Persiaran Tun Khalil Yaakob, 26300 Gambang, Pahang, Malaysia
  • Wan Nur Syahidah Wan Yusoff, Faculty of Computer Science and Mathematics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia

DOI:

https://doi.org/10.11113/mjfas.v21n2.3660

Keywords:

Test on Covariance, robust estimator, Mahalanobis distance, outliers, multivariate data.

Abstract

Outlier detection is more complex in multivariate data than in univariate data, where outliers can often be identified by graphical inspection. It is also a common issue in multivariate analysis and has been applied to tax fraud detection and industrial food inspection. Outlier studies are closely related to robust estimators of the mean and covariance matrix, because these estimators are resistant to outliers. The Test on Covariance (TOC) is a recently developed robust estimator for multivariate data. Until now, TOC’s performance has been investigated for two outlier scenarios in which the mean and the covariance are shifted separately. TOC shows good results in both scenarios and is found to be applicable to outlier detection. In this study, the performance of TOC in detecting outliers is investigated further via a simulation study of additional outlier scenarios in which the mean and covariance are shifted simultaneously. Four other robust estimators are used for comparison: Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME), and Index Set Equality (ISE). Various sample sizes, numbers of variables, and percentages of outliers are considered in the simulation study. The performance of all robust estimators is measured by the probability of detecting outliers (pout), the masking error (pmask), and the swamping error (pswamp). The results show that TOC can be the best robust estimator, can perform as well as the other robust estimators in detecting outliers, and has a low masking error when outliers and inliers are far from each other. Moreover, TOC has low swamping errors in most cases, which means that it has a lower probability of misclassifying inliers as outliers than the other robust estimators. In conclusion, TOC is an applicable and promising approach for outlier detection in multivariate data and can be incorporated into other multivariate analyses.
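The simulation design summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not implement TOC, FMCD, MVV, CME, or ISE. It assumes a bivariate normal setting in which a fraction of the observations has both its mean and its covariance shifted, uses the classical (non-robust) sample mean and covariance as a stand-in estimator, and flags observations whose squared Mahalanobis distance exceeds the 0.975 chi-square quantile. The function names, the parameter values, and the per-observation definitions of pout, pmask, and pswamp are assumptions made for illustration; the paper may define these measures differently.

```python
import math
import random

# 0.975 quantile of the chi-square distribution with 2 degrees of freedom.
# For 2 df the CDF is 1 - exp(-x / 2), so the quantile has a closed form.
CUTOFF = -2.0 * math.log(0.025)


def simulate(n, eps, mean_shift, cov_scale, rng):
    """Hypothetical design: inliers ~ N(0, I), outliers ~ N(mean_shift, cov_scale * I)."""
    n_out = int(round(n * eps))
    data, is_outlier = [], []
    for i in range(n):
        out = i < n_out
        mu = mean_shift if out else 0.0
        sd = math.sqrt(cov_scale) if out else 1.0
        data.append((rng.gauss(mu, sd), rng.gauss(mu, sd)))
        is_outlier.append(out)
    return data, is_outlier


def squared_mahalanobis(data):
    """Squared Mahalanobis distances from the classical sample mean and covariance."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form with the explicit inverse of the 2 x 2 covariance matrix.
    return [
        ((x - mx) ** 2 * syy - 2.0 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in data
    ]


def detection_rates(is_outlier, flagged):
    """Assumed definitions: share of outliers flagged, share missed, share of inliers flagged."""
    n_out = sum(is_outlier)
    n_in = len(is_outlier) - n_out
    hits = sum(1 for o, f in zip(is_outlier, flagged) if o and f)
    false_alarms = sum(1 for o, f in zip(is_outlier, flagged) if not o and f)
    pout = hits / n_out if n_out else 0.0
    pmask = 1.0 - pout if n_out else 0.0
    pswamp = false_alarms / n_in if n_in else 0.0
    return pout, pmask, pswamp


rng = random.Random(0)
data, truth = simulate(n=100, eps=0.10, mean_shift=5.0, cov_scale=1.5, rng=rng)
d2 = squared_mahalanobis(data)
flagged = [d > CUTOFF for d in d2]
pout, pmask, pswamp = detection_rates(truth, flagged)
print(f"pout={pout:.2f} pmask={pmask:.2f} pswamp={pswamp:.2f}")
```

Because the classical estimates are themselves pulled toward the shifted observations, this stand-in is prone to masking. Replacing `squared_mahalanobis` with distances computed from a robust location and scatter estimate is the step at which estimators such as those compared in the study would enter.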

 

Published

23-04-2025