arxiv:2602.00095

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Published on Apr 30 · Submitted by Weiyu Sun on May 8

Abstract

AI-generated summary: The EDU-CIRCUIT-HW dataset reveals significant limitations in MLLMs' ability to accurately interpret complex STEM handwritten solutions, motivating a hybrid approach that combines automated recognition with minimal human oversight for more reliable educational grading.

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions, with their intertwined mathematical formulas, diagrams, and textual reasoning, remains a major challenge, and progress is hampered by the lack of authentic, domain-specific benchmarks. Additionally, current evaluation paradigms rely predominantly on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content and thereby fail to capture MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset of 1,300+ authentic student handwritten solutions from a university-level STEM course. Using the expert-verified verbatim transcriptions and grading reports of the student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of a deployed AI-enabled grading system. Code and dataset are available at the project page: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
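To make the routing idea in the abstract concrete, here is a minimal sketch of such a hybrid policy, assuming a pattern-based error detector. All names (Assignment, detect_recognition_errors, route) and the marker list are hypothetical placeholders for illustration, not the paper's actual implementation:

```python
# Minimal sketch of the hybrid routing policy described in the abstract.
# All names here and the marker list are hypothetical placeholders, not the
# paper's actual system, which reportedly routes ~3.3% of assignments to
# human graders and the remainder to a GPT-5.1 grader.

from dataclasses import dataclass


@dataclass
class Assignment:
    student_id: str
    transcription: str  # MLLM-recognized text of the handwritten solution


def detect_recognition_errors(transcription: str) -> bool:
    """Flag transcriptions matching known recognition-error patterns
    (placeholder heuristics; the paper derives these from its error taxonomy)."""
    suspicious_markers = ["???", "[illegible]", "\\frac{}{"]
    return any(marker in transcription for marker in suspicious_markers)


def route(assignments: list[Assignment]) -> tuple[list[Assignment], list[Assignment]]:
    """Split assignments: flagged ones go to human graders, the rest to the LLM grader."""
    to_human, to_llm = [], []
    for a in assignments:
        (to_human if detect_recognition_errors(a.transcription) else to_llm).append(a)
    return to_human, to_llm
```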

Community

Paper author · Paper submitter

Hello all! Thanks for taking a look at our work, "EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions".

This paper investigates how MLLMs handle authentic university-level STEM handwritten solutions, with a particular focus on recognition errors and their downstream impact on auto-grading. We also release the EDU-CIRCUIT-HW dataset to support future research on handwritten STEM understanding, recognition robustness, and educational AI evaluation.
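As a rough illustration of what "upstream recognition fidelity" can mean in practice, the sketch below computes a character error rate (CER) between an MLLM transcription and the expert-verified verbatim transcription. CER is an assumed metric here for illustration; the paper may use different or additional measures.

```python
# Minimal sketch: character error rate (CER) between an MLLM transcription
# and an expert-verified reference. CER is an assumed fidelity metric for
# illustration; the paper may use different or additional measures.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """CER = edit distance normalized by reference length."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)


print(cer("V = IR", "V = I R"))  # small but nonzero: spacing differences count here
```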

Dataset:
https://repository.gatech.edu/handle/1853/81252

Please note that the current public release does not include the original problem statements: they come from the textbook below and cannot be redistributed due to copyright restrictions. This may cause some performance degradation for methods that rely on problem-statement context. If you are using this project purely for research purposes, feel free to contact us about the pre-processed problem statements.

[1] James A. Svoboda and Richard C. Dorf. 2013. Introduction to Electric Circuits (9th Edition). John Wiley & Sons.

Project page:
https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL/

GitHub:
https://github.com/gt-learning-innovation/CIRCUIT_EDU_HW_ACL

If you have any suggestions or encounter any issues, feel free to reach out. We really appreciate your interest and feedback ^ ^.

The four-way taxonomy of recognition errors and the diagnostic LLM detector that ties those errors to downstream grading is the standout part for me. It's refreshing to see upstream recognition and auto-grading evaluated together, and the claim that fewer than 5% of cases need human regrading to reach near-expert performance feels realistically deployable. I'd worry a bit about distribution shift (handwriting styles, new circuit diagrams, or course topics): would those four categories still cover the space, or would the detector need to adapt? arXivLens does a nice job unpacking this kind of method; the breakdown at https://arxivlens.com/PaperView/Details/edu-circuit-hw-evaluating-multimodal-large-language-models-on-real-world-university-level-stem-student-handwritten-solutions-6427-5d618776 explains how the error signals cascade into rubrics. Going a bit further, uncertainty-aware routing or an active-learning loop that targets the high-impact error types could push robustness even more with only modest human effort.
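A minimal sketch of that uncertainty-aware routing idea, assuming each assignment comes with a detector score estimating the probability of a high-impact recognition error. All names and the budget value here are hypothetical, not from the paper:

```python
# Hypothetical sketch of uncertainty-aware routing with room for an
# active-learning loop, as suggested in the comment above. The scores stand
# in for a detector's estimated probability of a high-impact recognition error.

def route_by_uncertainty(scores: dict[str, float], human_budget: float = 0.05):
    """Send the top `human_budget` fraction of assignments (by error score)
    to human graders; the rest go to the LLM grader."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * human_budget))
    return ranked[:k], ranked[k:]


# Human-corrected cases could then be folded back into the detector's training
# set, so the error model adapts to new handwriting styles or course topics.
to_human, to_llm = route_by_uncertainty(
    {"hw_001": 0.91, "hw_002": 0.07, "hw_003": 0.42, "hw_004": 0.02}
)
print(to_human)  # ['hw_001'] under the default 5% budget
```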

