Comparative Analysis of Improved Dirichlet Process Mixture Model

Authors

  • Lili Wu ᵃSchool of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia; ᵇDepartment of Computer Science, Xinzhou Teachers University, 034000 Xinzhou, Shanxi, China
  • Pei Shan Fam School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia
  • Majid Khan Majahar Ali School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia
  • Ying Tian School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia
  • Mohd. Tahir Ismail School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia
  • Siti Zulaikha Mohd Jamaludin School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia

DOI:

https://doi.org/10.11113/mjfas.v19n6.3062

Keywords:

Bayesian non-parametric model; PCA; t-SNE; DPMM

Abstract

With the development of information technology, large amounts of data are generated every day in fields such as engineering, healthcare, finance, anomaly detection, image recognition, and artificial intelligence. Analyzing and classifying such massive data accurately is a challenge. Traditional clustering methods require the number of clusters to be specified in advance and are mostly distance-based, so they cannot effectively account for correlations between the different indicators of high-dimensional, multi-source data. Moreover, the number of clusters cannot adjust automatically when new data arrive. To improve the clustering analysis of high-dimensional, multi-source data in a big data environment, this study uses non-parametric mixture models based on distribution clustering, which do not require the number of clusters to be specified and can update automatically with the data. By combining Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE) with the non-parametric Bayesian Dirichlet Process Mixture Model (DPMM), a Bayesian non-parametric PCA model (PCA-DPMM) and a Bayesian non-parametric t-SNE model (TSNE-DPMM) are proposed. The Chinese restaurant process of DPMM is used for sampling by introducing a finite normal mixture distribution. The clustering results on the iris dataset are compared and analyzed. DPMM and TSNE-DPMM both reach a maximum accuracy of 0.97, while PCA-DPMM achieves a maximum accuracy of only 0.94. Across different numbers of iterations, the accuracy of TSNE-DPMM ranges from 0.92 to 0.97, that of DPMM from 0.66 to 0.97, and that of PCA-DPMM from 0.73 to 0.94. The proposed TSNE-DPMM therefore maintains accuracy while giving more stable clustering results. Future research can explore improving the model, for example by incorporating deep learning algorithms, and applying TSNE-DPMM to data analysis in other fields. Such work would help address the challenges of analyzing high-dimensional, multi-source data in a big data environment and extracting valuable information from it.
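
The abstract states that the Chinese restaurant process is used for sampling and that the number of clusters need not be fixed in advance. The minimal sketch below illustrates only that prior, not the authors' full sampler: it omits the normal likelihood and any PCA or t-SNE step. The function name crp_assignments, the concentration parameter alpha, and the choice of 150 points (the size of the iris dataset) are assumptions made for illustration.

```python
import random


def crp_assignments(n_points, alpha, seed=0):
    """Draw cluster labels for n_points items from a Chinese restaurant
    process prior with concentration alpha (hypothetical helper, alpha > 0)."""
    rng = random.Random(seed)
    labels = []   # labels[i] = cluster index of item i
    counts = []   # counts[k] = number of items currently in cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    # Larger alpha tends to produce more clusters for the same sample size.
    for alpha in (0.5, 2.0):
        labels, counts = crp_assignments(150, alpha, seed=1)
        print(f"alpha={alpha}: {len(counts)} clusters, sizes={counts}")
```

In a full DPMM, each of these prior weights would additionally be multiplied by the likelihood of the data point under the corresponding cluster's normal distribution before sampling; that step is left out here.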

Published

04-12-2023