Science for Education Today, 2025, vol. 15, no. 6, pp. 151–174
UDC: 004.8+51-77+37.031

Research on the potential of generative artificial intelligence for providing expert-level evaluative feedback in open-ended mathematical problems assessment

Lukoyanova M. A. 1 (Kazan, Russian Federation), Danilov A. V. 1 (Kazan, Russian Federation), Zaripova R. R. 1 (Kazan, Russian Federation), Salekhova L. L. 1 (Kazan, Russian Federation), Batrova N. I. 1 (Kazan, Russian Federation)
1 Kazan Federal University
Abstract: 

Introduction. Modern education faces a contradiction between the active integration of generative artificial intelligence and its underexplored potential for providing evaluative feedback in developing students’ mathematical literacy. The purpose of the article is to identify the potential of using a generative language model as a teacher’s tool for generating expert-level evaluative feedback when assessing open-ended mathematical problems.
Materials and Methods. The research is based on systemic-activity, criteria-oriented, and comparative approaches. The methods employed included theoretical analysis of scholarly literature, criteria-based assessment combined with prompt engineering techniques, and quantitative and qualitative analysis of the agreement between the evaluative feedback generated by the language model and that provided by a human expert. The sample consisted of 51 students.
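To make the prompt-engineering step concrete, the following minimal sketch (in Python) shows how a criteria-based assessment prompt for a generative language model might be assembled; the rubric wording, criteria, and point values are hypothetical illustrations, not the instrument used in the study.

# Hypothetical sketch of a criteria-based assessment prompt; the rubric
# and its point values are illustrative, not the study's instrument.
RUBRIC = """Score the student's solution against each criterion:
1. Correct mathematical model of the problem situation (0-2 points).
2. Validity of the reasoning and computations (0-2 points).
3. Interpretation of the result in the context of the problem (0-1 point).
Return a score for each criterion and a short textual comment."""

def build_assessment_prompt(problem: str, solution: str) -> str:
    """Combine the rubric, the open-ended problem, and the student's
    solution into a single prompt for the language model."""
    return (
        "You are an experienced mathematics teacher.\n"
        f"{RUBRIC}\n\nProblem:\n{problem}\n\nStudent solution:\n{solution}"
    )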
Results. The research experimentally confirmed the feasibility of using generative artificial intelligence for providing evaluative feedback in mathematics education. An effective strategy for automating the assessment of open-ended mathematical problems was developed and substantiated, based on criteria-based assessment and prompt engineering techniques using the GigaChat Pro language model. Empirical data revealed moderate agreement between the evaluative feedback generated by GigaChat Pro and that provided by an expert teacher: accuracy reached 73%, Cohen’s kappa (κ) was 0.57, and the semantic similarity of the textual comments (BERTScore F1) was 0.614.
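For reference, agreement metrics of this kind can be computed with standard tooling. The sketch below assumes the model’s and the expert’s per-problem scores are available as parallel lists; the score values shown are illustrative, not the study’s data.

# Minimal sketch of the agreement analysis; the score lists are
# hypothetical examples, not the study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

expert_scores = [2, 1, 0, 2, 1]  # expert teacher's ratings (illustrative)
model_scores = [2, 1, 1, 2, 1]   # GigaChat Pro ratings (illustrative)

print("Accuracy:", accuracy_score(expert_scores, model_scores))
print("Cohen's kappa:", cohen_kappa_score(expert_scores, model_scores))

# Semantic similarity of textual comments can be estimated with BERTScore:
# from bert_score import score
# P, R, F1 = score(model_comments, expert_comments, lang="ru")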
Conclusions. The research concludes that generative language models hold significant potential for transforming the assessment of open-ended mathematical problems. Key applications include automating and personalizing expert-level evaluative feedback and scaling criteria-based assessment. Feedback quality can be enhanced by optimizing assessment prompts, implementing multi-agent verification, and introducing selective assessment.

Keywords: 

Evaluative feedback; Generative language model; Criteria-based assessment; Prompt engineering techniques; Open-ended problems; Mathematical literacy

For citation:
Lukoyanova M. A., Danilov A. V., Zaripova R. R., Salekhova L. L., Batrova N. I. Research on the potential of generative artificial intelligence for providing expert-level evaluative feedback in open-ended mathematical problems assessment. Science for Education Today, 2025, vol. 15, no. 6, pp. 151–174. DOI: http://dx.doi.org/10.15293/2658-6762.2506.07
References: 
1. Crompton H., Burke D. Artificial intelligence in higher education: The state of the field. International Journal of Educational Technology in Higher Education, 2023, vol. 20, pp. 1-22. DOI: https://doi.org/10.1186/s41239-023-00392-8
2. Pospelova E. A., Ototsky P. L., Gorlacheva E. N., Faizullin R. V. Generative artificial intelligence in education: Analysis of trends and prospects. Vocational Education and Labour Market, 2024, vol. 12 (3), pp. 6-21. (In Russian) URL: https://www.elibrary.ru/item.asp?id=69176655 DOI: https://doi.org/10.52944/PORT.2024.58.3.001
3. Chekalina T. A. AI-didactics: A new trend or evolution of the learning process? Vestnik of Minin University, 2025, vol. 13 (2), pp. 5. (In Russian) URL: https://elibrary.ru/item.asp?id=82539976 DOI: https://doi.org/10.26795/2307-1281-2025-13-2-5
4. Alotaibi N. S., Alshehri A. H. Prospers and obstacles in using artificial intelligence in Saudi Arabia higher education institutions - The potential of AI-based learning outcomes. Sustainability, 2023, vol. 15 (13), pp. 10723. DOI: https://doi.org/10.3390/su151310723
5. Awidi I. T. Comparing expert tutor evaluation of reflective essays with marking by generative artificial intelligence (AI) tool. Computers and Education: Artificial Intelligence, 2024, vol. 6, pp. 100226. DOI: https://doi.org/10.1016/j.caeai.2024.100226
6. Kinder A., Briese F. J., Jacobs M., Dern N., Glodny N., Jacobs S., Leßmann S. Effects of adaptive feedback generated by a large language model: A case study in teacher education. Computers and Education: Artificial Intelligence, 2025, vol. 8, pp. 100349. DOI: https://doi.org/10.1016/j.caeai.2024.100349
7. Bearman M., Tai J., Dawson P., Boud D., Ajjawi R. Developing evaluative judgement for a time of generative artificial intelligence. Assessment & Evaluation in Higher Education, 2024, vol. 49 (6), pp. 893-905. DOI: https://doi.org/10.1080/02602938.2024.2335321
8. Chiang C.-H., Lee H.-Y. Can large language models be an alternative to human evaluations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023, vol. 1, pp. 15607-15631. DOI: https://doi.org/10.18653/v1/2023.acl-long.870
9. Meyer J., Jansen T., Schiller R., Liebenow W., Steinbach M., Horbach A., Fleckenstein J. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence, 2024, vol. 6, pp. 100199. DOI: https://doi.org/10.1016/j.caeai.2023.100199
10. Pak L. E., Kryukova A. A. Capabilities of artificial intelligence programs in teaching a foreign language. The Territory of New Opportunities: The Herald of Vladivostok State University, 2024, vol. 16 (2), pp. 81-95. (In Russian) URL: https://elibrary.ru/item.asp?id=67900721
11. Hahn M. G., Navarro S. M. B., De La Valentín L., Burgos D. A systematic review of the effects of automatic scoring and automatic feedback in educational settings. IEEE Access, 2021, vol. 9, pp. 108190-108198. DOI: https://doi.org/10.1109/ACCESS.2021.3100890
12. Bogolepova S. V., Zharkova M. G. Researching the potential of generative language models for essay evaluation and feedback provision. Domestic and Foreign Pedagogy, 2024, vol. 1 (5), pp. 123-137. (In Russian) URL: https://elibrary.ru/item.asp?id=73431773
13. Zeevy-Solovey O. Comparing peer, ChatGPT and teacher corrective feedback in EFL writing: Students' perceptions and preferences. Technology in Language Teaching & Learning, 2024, vol. 6 (3), pp. 1482. DOI: https://doi.org/10.29140/tltl.v6n3.1482
14. Kincl T., Gunina D., Novák M., Pospíšil J. Comparing human and AI-based essay evaluation in the Czech higher education: Challenges and limitations. Trendy v Podnikání - Business Trends, 2024, vol. 14 (2), pp. 25-34. DOI: https://doi.org/10.24132/jbt.2024.14.2.25_34
15. Núñez-Peña M. I., Bono R., Suárez-Pellicioni M. Feedback on students’ performance: A possible way of reducing the negative effect of math anxiety in higher education. International Journal of Educational Research, 2015, vol. 70, pp. 80-87. DOI: https://doi.org/10.1016/j.ijer.2015.02.005
16. Fyfe E. R., Brown S. A. Feedback influences children’s reasoning about math equivalence: A meta-analytic review. Thinking & Reasoning, 2017, vol. 24 (2), pp. 157-178. DOI: https://doi.org/10.1080/13546783.2017.1359208
17. Kouzminov Y., Kruchinskaia E. The evaluation of GenAI capabilities to implement professional tasks. Foresight and STI Governance, 2024, vol. 18 (4), pp. 67-76. DOI: https://doi.org/10.17323/2500-2597.2024.4.67.76
18. Schorcht S., Buchholtz N., Baumanns L. Prompt the problem – investigating the mathematics educational quality of AI-supported problem solving by comparing prompt techniques. Frontiers in Education, 2024, vol. 9, pp. 1-15. DOI: https://doi.org/10.3389/feduc.2024.1386075
19. Qian Y. Prompt engineering in education: A systematic review of approaches and educational applications. Journal of Educational Computing Research, 2025, vol. 63 (7–8), pp. 1782-1818. DOI: https://doi.org/10.1177/07356331251365189
20. Lee G. G., Latif E., Wu X., Liu N., Zhai X. Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 2024, vol. 6, pp. 100213. DOI: https://doi.org/10.1016/j.caeai.2024.100213
21. Albakkosh I. Using Fleiss’ kappa coefficient to measure the intra- and inter-rater reliability of three AI software programs in the assessment of EFL learners’ story writing. International Journal of Educational Sciences and Arts, 2024, vol. 3 (1), pp. 69-96. DOI: https://doi.org/10.59992/IJESA.2024.v3n1p4

Date of publication: 31.12.2025