Improving the Prediction of Labour Skill Classification Model in Malaysia using Tree-based Machine Learning Algorithms
DOI:
https://doi.org/10.11113/mjfas.v21n5.4435
Keywords:
Classification, machine learning, skill level, SMOTE, tree-based algorithm
Abstract
Understanding the skill level of the workforce is essential for analysing the quality of the labour market and supporting economic development. The objective of this study is to classify the skill level of the Malaysian workforce into skilled, semi-skilled and low-skilled categories using supervised machine learning techniques. Current approaches rely on descriptive statistics, which limits their ability to capture interactions between multiple features and to predict future outcomes. As the labour market has shifted towards digitalisation and automation, more effective and robust methods are needed to identify the key factors influencing skill level. This study applied the Cross-Industry Standard Process for Data Mining (CRISP-DM) to analyse 120,518 cases from the 2023 Salaries and Wages Survey dataset. The dataset underwent comprehensive preprocessing comprising data cleaning, data transformation, data splitting and handling of multiclass imbalanced data using the Synthetic Minority Oversampling Technique (SMOTE). Five tree-based algorithms, which are consistently recognised for their strong classification performance, were applied: Decision Tree, Random Forest, Gradient Boosted Trees, Adaptive Boosting and Extreme Gradient Boosting. Model performance was evaluated using four metrics: specificity, sensitivity, F1-score and accuracy. Random Forest achieved the best performance, with an accuracy of 86.45%, sensitivity of 86.45%, specificity of 90.89% and F1-score of 86.36%. The findings indicate that Random Forest is effective in predicting the skill level category. The most relevant factors contributing to the prediction were salaries and wages received, economic activity, education level, certificate obtained and year of birth. The study provides valuable insights for enhancing skill development initiatives and contributes to academic research by applying machine learning techniques in labour market studies.
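The four evaluation metrics named in the abstract can be illustrated for a three-class problem with a short Python sketch. This is not the authors' code: the toy labels, the function names and the choice of support-weighted averaging over one-vs-rest class scores are all assumptions made for illustration. The abstract does not state which averaging scheme was used, although the identical accuracy and sensitivity it reports are consistent with weighted averaging, since support-weighted recall always equals accuracy.

```python
from typing import Dict, List, Sequence

# Class names taken from the abstract; the data below is a made-up toy example.
LABELS = ["skilled", "semi-skilled", "low-skilled"]


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def weighted_metrics(matrix: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus support-weighted one-vs-rest sensitivity, specificity, F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    sensitivity = specificity = f1 = 0.0
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # true k, predicted other
        fp = sum(matrix[i][k] for i in range(n)) - tp   # predicted k, true other
        tn = total - tp - fn - fp
        weight = (tp + fn) / total                      # share of cases in class k
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        spec_k = tn / (tn + fp) if (tn + fp) else 0.0
        f1_k = (2 * precision * recall / (precision + recall)
                if (precision + recall) else 0.0)
        sensitivity += weight * recall
        specificity += weight * spec_k
        f1 += weight * f1_k
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


# Hypothetical true and predicted labels for ten workers.
y_true = ["skilled"] * 4 + ["semi-skilled"] * 4 + ["low-skilled"] * 2
y_pred = ["skilled", "skilled", "skilled", "semi-skilled",
          "semi-skilled", "semi-skilled", "semi-skilled", "low-skilled",
          "low-skilled", "semi-skilled"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
result = weighted_metrics(matrix)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With weighted averaging, the printed accuracy and sensitivity coincide by construction, whereas macro averaging (an unweighted mean over classes) would generally give different values on imbalanced data.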
License
Copyright (c) 2025 Rabi’atul’adawiah Shablia, Prof. Dr. Ts. Ahmad Zia Ul-Saufie, Nurain Ibrahima

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.