Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Type: Preprint
Publication Date: 2025-02-21
Citations: 0
DOI: https://doi.org/10.1145/3706468.3706530

Abstract

The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used because they are easy to grade, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized controlled design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant differences in learning across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. The GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.
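The abstract describes autograding open responses with GPT-4o and GPT-4-turbo against human annotation rubrics. Below is a minimal sketch of how such rubric-based autograding could be set up with the OpenAI Chat Completions API; the rubric wording, the 0/1 scoring scale, and the `autograde` helper are illustrative assumptions, not the authors' released prompts or rubrics.

```python
# Hypothetical sketch of rubric-based autograding of open responses with GPT models.
# The rubric text and binary scoring scale are assumptions for illustration only;
# the paper's actual prompts and rubrics are released with its dataset.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = (
    "You are grading a tutor's open response from a lesson on advocacy.\n"
    "Award 1 if the response names a concrete advocacy action the tutor would take "
    "on behalf of the student; otherwise award 0.\n"
    "Reply with a single digit: 1 or 0."
)

def autograde(response_text: str, model: str = "gpt-4o") -> int:
    """Return a 0/1 score for one open response using the given GPT model."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": response_text},
        ],
    )
    answer = completion.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0

if __name__ == "__main__":
    sample = "I would meet with the student's teacher to ask for extra practice time."
    print(autograde(sample))                  # grade with GPT-4o
    print(autograde(sample, "gpt-4-turbo"))   # or with GPT-4-turbo
```

In a study like this one, scores produced this way would typically be compared against human rubric annotations (e.g., percent agreement or Cohen's kappa) before being trusted even for low-stakes assessment.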

Locations

  • arXiv (Cornell University)

Similar Works

  • The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models (2024-10-12). Subhankar Maity, Aniket Deroy
  • Do Tutors Learn from Equity Training and Can Generative AI Assess It? (2025-02-21). Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger
  • Applying large language models and chain-of-thought for automatic scoring (2024-02-27). Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai
  • Applying Large Language Models and Chain-of-Thought for Automatic Scoring (2023-01-01). Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai
  • Generating AI Literacy MCQs: A Multi-Agent LLM Approach (2025-02-18). Jiayi Wang, Ruiwei Xiao, Ying-Jui Tseng
  • From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-choice Questions for Programming Classes in Higher Education (2023-01-01). Jaromír Šavelka, Arav Agarwal, Christopher Bogart, Majd Sakr
  • Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education (2024-07-09). Owen Henkel, Libby Hills, Adam Boxer, Bill Roberts, Zachary Levonian
  • Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education (2024-05-05). Owen Henkel, Adam Boxer, L. V. Hills, Bill Roberts
  • A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education (2024-01-02). Jacob Arthur Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal
  • Practical and ethical challenges of large language models in education: A systematic scoping review (2023-08-06). Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martínez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, Dragan Gašević
  • CoTAL: Human-in-the-Loop Prompt Engineering, Chain-of-Thought Reasoning, and Active Learning for Generalizable Formative Assessment Scoring (2025-04-03). Clayton Cohn, Nicole Hutchins, T. S. Ashwin, Gautam Biswas
  • When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options (2024-08-27). Gracjan Góral, Emilia Wiśnios
  • Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review (2023-01-01). Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martínez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, Dragan Gašević
  • A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science (2024-03-24). Clayton Cohn, Nicole Hutchins, Tuan Anh Le, Gautam Biswas
  • Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models (2024-08-20). Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao
  • Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large (2024-01-01). Jussi S. Jauhiainen, Agustín Garagorry Guerra
  • A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science (2024-03-21). Clayton Cohn, Nicole Hutchins, Tuan Anh Le, Gautam Biswas
  • Can LLMs Grade Short-answer Reading Comprehension Questions: Foundational Literacy Assessment in LMICs (2023-01-01). Owen Henkel, L. V. Hills, Bill Roberts, Joshua A. McGrane
  • Fine-tuning ChatGPT for Automatic Scoring (2023-01-01). Ehsan Latif, Xiaoming Zhai
  • Simulation as Reality? The Effectiveness of LLM-Generated Data in Open-ended Question Assessment (2025-02-10). Long Zhang, Meng Zhang, Wei Lin Wang, Yumei Luo

Cited by (0)