Science for Education Today, 2024, vol. 14, no. 2, pp. 125–151
UDC: 37.012.4+159.9.072

Analysis of the effectiveness of clustering algorithms for multimodal samples using computer simulation of an educational experiment

Abitov R. N. 1 (Kazan, Russian Federation), Safin R. S. 1 (Kazan, Russian Federation)
1 Kazan State University of Architecture and Engineering
Abstract: 

Introduction. The article addresses the problem of primary processing of multimodal data from pedagogical experiments. The purpose of the study is to identify the most effective and universal clustering algorithms for pedagogical experiments.
Materials and Methods. The study used computer modeling of a pedagogical experiment. Five clustering algorithms were analyzed. The effectiveness of the algorithms was evaluated by the proportion of observations with clustering errors at various tolerance levels and by the Jaccard similarity coefficient. Regression analysis was used to assess the influence of the experiment's modeling parameters and of descriptive statistics on the effectiveness of the clustering algorithms.
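The Jaccard similarity coefficient mentioned above can be computed for two cluster labelings as a pair-counting index: of all point pairs co-clustered in at least one labeling, the fraction co-clustered in both. A minimal sketch in pure Python (the function name is illustrative, not taken from the article):

```python
from itertools import combinations

def jaccard_clustering(labels_a, labels_b):
    """Pairwise Jaccard similarity between two clusterings:
    a / (a + b + c), where a = pairs co-clustered in both labelings,
    b = co-clustered only in A, c = co-clustered only in B."""
    a = b = c = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
    return a / (a + b + c) if (a + b + c) else 1.0

# Identical clusterings (up to label renaming) score 1.0.
print(jaccard_clustering([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the index counts co-clustered pairs rather than raw labels, it is invariant to renaming clusters, which makes it suitable for comparing an algorithm's output against a known ground-truth partition.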
Results. The effectiveness of the various data clustering algorithms was assessed, and a correlation and regression analysis of the factors affecting clustering effectiveness was carried out.
Conclusions. The most effective clustering algorithms for multimodal samples are the K-means algorithm and the agglomerative hierarchical algorithm. The results obtained in this research can be used for statistical analysis of pedagogical, psychological, sociological, biological, and medical research data.
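The conclusion that K-means recovers the modes of a multimodal sample can be illustrated with a minimal one-dimensional sketch. The code below is a toy Lloyd's-algorithm implementation on a simulated bimodal "test score" sample, not the authors' implementation; all names and parameters are illustrative:

```python
import random

def kmeans_1d(data, k=2, iters=50, seed=0):
    """Minimal Lloyd's k-means for 1-D data (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(list(data), k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        # Recompute each center as its cluster mean (keep old center if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Simulated bimodal sample: two score modes around 40 and 80.
rng = random.Random(42)
sample = ([rng.gauss(40, 5) for _ in range(100)]
          + [rng.gauss(80, 5) for _ in range(100)])
print(kmeans_1d(sample))  # two centers, one near each mode
```

With well-separated modes the recovered centers land close to the true mode locations, which is the behavior the study's effectiveness metrics quantify for more general multimodal samples.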

Keywords: 

Educational experiment modeling; Data clustering algorithms; Multimodal samples; Data analysis in education.

For citation:
Abitov R. N., Safin R. S. Analysis of the effectiveness of clustering algorithms for multimodal samples using computer simulation of an educational experiment. Science for Education Today, 2024, vol. 14, no. 2, pp. 125–151. DOI: http://dx.doi.org/10.15293/2658-6762.2402.06
References: 
  1. Abitov R. N. On the ways to increase the validity and repeatability of experimental pedagogical research. Kazan Pedagogical Journal, 2022, no. 4, pp. 79–90. (In Russian) DOI: https://doi.org/10.51379/kpj.2022.154.4.009  URL: https://elibrary.ru/item.asp?id=49482910  
  2. Ershov K. S., Romanova T. N. Analysis and classification of clustering algorithms. New Information Technologies in Automated Systems, 2016, no. 19, pp. 274–279. (In Russian) URL: https://elibrary.ru/item.asp?id=25864070   
  3. Podvalny S. L., Plotnikov A. V., Belyanin A. M. Comparison of cluster analysis algorithms on a random set of data. Bulletin of Voronezh State Technical University, 2012, vol. 8 (5), pp. 4–6. (In Russian) URL: https://elibrary.ru/item.asp?id=17743528  
  4. Sivogolovko E. V. Methods for assessing the quality of clear clustering. Computer Tools in Education, 2011, no. 4, pp. 14–31. (In Russian) URL: https://elibrary.ru/item.asp?id=21786023  
  5. Xiaowei Xu, Ester M., Kriegel H.-P., Sander J. A distribution-based clustering algorithm for mining in large spatial databases. Proceedings of the 14th International Conference on Data Engineering, 1998. DOI: https://doi.org/10.1109/icde.1998.655795
  6. Azzalini A., Valle A. D. The multivariate skew-normal distribution. Biometrika, 1996, vol. 83 (4), pp. 715–726. DOI: https://doi.org/10.1093/biomet/83.4.715   
  7. Banfield J. D., Raftery A. E. Model-based Gaussian and non-Gaussian clustering. Biometrics, 1993, vol. 49 (3), pp. 803–821. DOI: https://doi.org/10.2307/2532201  
  8. Cheng M.-Y., Hall P. Calibrating the excess mass and dip tests of modality. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 1998, vol. 60 (3), pp. 579–589. DOI: https://doi.org/10.1111/1467-9868.00141
  9. Rodriguez M. Z., Comin C. H., Casanova D., Bruno O. M., Amancio D. R., Costa L. da F., Rodrigues F. A. Clustering algorithms: A comparative approach. PloS One, 2019, vol. 14 (1), e0210236. DOI: https://doi.org/10.1371/journal.pone.0210236
  10. Reynolds A. P., Richards G., de la Iglesia B., Rayward-Smith V. J. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modeling and Algorithms, 2006, vol. 5 (4), pp. 475–504. DOI: https://doi.org/10.1007/s10852-005-9022-1
  11. Kinnunen T., Sidoroff I., Tuononen M., Fränti P. Comparison of clustering methods: A case study of text-independent speaker modeling. Pattern Recognition Letters, 2011, vol. 32 (13), pp. 1604–1617. DOI: https://doi.org/10.1016/j.patrec.2011.06.023
  12. Ameijeiras-Alonso J., Crujeiras R. M., Rodríguez-Casal A. Mode testing, critical bandwidth and excess mass. TEST, 2018, vol. 28 (3), pp. 900–919. DOI: https://doi.org/10.1007/s11749-018-0611-5
  13. Fisher N. I., Marron J. S. Mode testing via the excess mass estimate. Biometrika, 2001, vol. 88 (2), pp. 499–517. DOI: https://doi.org/10.1093/biomet/88.2.499   
  14. Fowlkes E. B., Mallows C. L. A method for comparing two hierarchical clusterings: Rejoinder. Journal of the American Statistical Association, 1983, vol. 78 (383), pp. 584. DOI: https://doi.org/10.2307/2288123   
  15. Guha S., Rastogi R., Shim K. Cure: an efficient clustering algorithm for large databases. Information Systems, 2001, vol. 26 (1), pp. 35–58. DOI: https://doi.org/10.1016/s0306-4379(01)00008-4
  16. Guha S., Rastogi R., Shim K. ROCK: a robust clustering algorithm for categorical attributes. Proceedings 15th International Conference on Data Engineering, 1999. Cat. No.99CB36337. DOI: https://doi.org/10.1109/icde.1999.754967
  17. Hartigan J. A., Hartigan P. M. The dip test of unimodality. The Annals of Statistics, 1985, vol. 13 (1), pp. 70–84. DOI: https://doi.org/10.1214/aos/1176346577
  18. Jung Y. G., Kang M. S., Heo J. Clustering performance comparison using K-means and expectation maximization algorithms. Biotechnology & Biotechnological Equipment, 2014, vol. 28 (sup1), pp. S44–S48. DOI: https://doi.org/10.1080/13102818.2014.949045
  19. Karypis G., Eui-Hong Han, Kumar V.  Chameleon: Hierarchical clustering using dynamic modeling. Computer, 1999, vol. 32 (8), pp. 68–75. DOI: https://doi.org/10.1109/2.781637
  20. Kruskal W. H., Wallis W. Errata: Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association, 1953, vol. 48 (264), pp. 907. DOI: https://doi.org/10.2307/2281082
  21. Ankerst M., Breunig M. M., Kriegel H.-P., Sander J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod Record, 1999, vol. 28 (2), pp. 49–60. DOI: https://doi.org/10.1145/304181.304187
  22. Rand W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, vol. 66 (336), pp. 846–850. DOI: https://doi.org/10.1080/01621459.1971.10482356
  23. Sculley D. Web-scale k-means clustering. Proceedings of the 19th international conference on World wide web, 2010, pp. 1177–1178. DOI: https://doi.org/10.1145/1772690.1772862
  24. Shi J., Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, vol. 22 (8), pp. 888–905. DOI: https://doi.org/10.1109/34.868688
  25. Silverman B. W. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological), 1981, vol. 43 (1), pp. 97–99. DOI: https://doi.org/10.1111/j.2517-6161.1981.tb01155.x
  26. Ward J. H. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 1963, vol. 58 (301), pp. 236–244. DOI: https://doi.org/10.1080/01621459.1963.10500845
  27. Wilkin G. A., Huang X. K-means clustering algorithms: Implementation and comparison. Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007), 2007, pp. 133–136. DOI: https://doi.org/10.1109/imsccs.2007.51
  28. Xu D., Tian Y. A comprehensive survey of clustering algorithms. Annals of Data Science, 2015, vol. 2 (2), pp. 165–193. DOI: https://doi.org/10.1007/s40745-015-0040-1
  29. Zhang T., Ramakrishnan R., Livny M. BIRCH: An efficient data clustering method for very large databases. ACM Sigmod Record, 1996, vol. 25 (2), pp. 103–114. DOI: https://doi.org/10.1145/235968.233324
Date of publication: 30.04.2024