Download Our Latest Research Findings

This paper examines how well large language models (LLMs), specifically GPT-4o and LLaMA-3, grade secondary school geography worksheets in Singapore compared with a human teacher, focusing on accuracy, reasoning, and reproducibility. The study finds that LLMs such as GPT-4o can reliably assess structured questions with clear marking schemes and produce reproducible results, but they still require careful prompt design and human oversight for nuanced or context-dependent tasks.