Analysis of the effectiveness of clustering algorithms for multimodal samples using computer simulation of an educational experiment
Introduction. The article addresses the primary processing of data from pedagogical experiments whose samples are multimodal. The purpose of the study is to identify the most effective and universal clustering algorithms for such experiments.
Materials and Methods. The study used computer modeling of a pedagogical experiment. Five clustering algorithms were analyzed. Their effectiveness was evaluated by the proportion of observations with clustering errors at various tolerance levels and by the Jaccard similarity coefficient. Regression analysis was used to assess how the modeling parameters of the pedagogical experiment and the descriptive statistics of the samples influence the effectiveness of the clustering algorithms.
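As an illustration of the second effectiveness measure, the sketch below computes a Jaccard similarity coefficient between a known group assignment and a clustering result. It assumes the pair-counting form of the coefficient (pairs of observations placed together in both partitions, divided by pairs placed together in at least one); the abstract does not state which form the study used. The function name `pair_jaccard` and the sample labels are hypothetical.

```python
from itertools import combinations


def pair_jaccard(true_labels, pred_labels):
    """Pair-counting Jaccard coefficient between two partitions (assumed form)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    both = only_true = only_pred = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            both += 1          # pair grouped together in both partitions
        elif same_true:
            only_true += 1     # together only in the reference partition
        elif same_pred:
            only_pred += 1     # together only in the clustering result
    denom = both + only_true + only_pred
    # Convention chosen here: two partitions with no co-grouped pairs count as identical.
    return both / denom if denom else 1.0


# Hypothetical example: six observations, two true groups, one observation misplaced.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 0, 0]
score = pair_jaccard(true_labels, pred_labels)
print(round(score, 3))
```

Because the coefficient is computed over pairs, it does not depend on how cluster numbers are assigned: relabeling the clusters leaves the value unchanged.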
Results. The effectiveness of the data clustering algorithms is assessed, and a correlation and regression analysis of the factors affecting the clustering efficiency indicators is presented.
Conclusions. The most effective clustering algorithms for multimodal samples include the K-means algorithm and the agglomerative hierarchical algorithm. The results obtained in this research can be used for statistical analysis of pedagogical, psychological, sociological, biological and medical research data.
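A minimal sketch of the kind of simulation described here: a bimodal sample is generated, clustered with k-means, and scored by the proportion of misassigned observations. Everything specific is an assumption made for illustration, not the study's model: the two normal components, their means and standard deviations, the group sizes, the quantile-based initialization, and the helper name `kmeans_1d`.

```python
import random


def kmeans_1d(values, k, iters=100):
    """Lloyd's k-means for one-dimensional data; returns (labels, centers)."""
    ordered = sorted(values)
    # Deterministic start (an assumption): centers at evenly spaced quantiles.
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical bimodal sample: two groups of 100 scores each.
true_labels = [0] * 100 + [1] * 100
scores = [rng.gauss(40, 5) for _ in range(100)] + [rng.gauss(70, 5) for _ in range(100)]

labels, centers = kmeans_1d(scores, k=2)

# Cluster 0 starts at the lower quantile, so it is matched to the lower-mean group.
error_rate = sum(lab != true for lab, true in zip(labels, true_labels)) / len(scores)
print(centers, error_rate)
```

With components this well separated the error proportion stays close to zero; moving the component means closer together or increasing their spread is one way to explore how sample parameters affect clustering quality.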
Keywords. Educational experiment modeling; Data clustering algorithms; Multimodal samples; Data analysis in education.