Exploring the Potential of Generative AI in Essay-Based Assessments: Evidence from the Regents Exam Data



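English title: Exploring the Potential of Generative AI in Essay-Based Assessments: Evidence from the Regents Exam Data
Publisher: 한국교육평가학회 (Korean Society for Educational Evaluation)
Author: 안해연 (Haeyeon Ahn)
Publication: 『교육평가연구』 (Journal of Educational Evaluation), Vol. 38, No. 3, pp. 823-846 (24 pages)
Subject: Social Sciences > Education
File format: PDF
Publication date: 2025.09.30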


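Korean Abstract (English translation)

This study compared the scoring performance of GPT-4o, Gemini 2.0, and Gemini 2.5 on essay and constructed-response answers from the New York State Regents Examinations, in order to analyze empirically whether generative AI can be used in essay and constructed-response assessment. An analysis of the quadratic weighted kappa (QWK), mean absolute error (MAE), and Pearson correlation coefficient (PCC) showed high accuracy for all models, with QWK of 0.889 to 0.935, MAE of 0.210 to 0.410, and PCC of 0.904 to 0.944. Gemini 2.5 performed best on evidence-based argument items, and GPT-4o performed best on text-analysis items. In the confusion matrix analysis, most errors fell within ±1 point, but some limitations were identified, including confusion at score-level boundaries and overestimation of zero-point responses. The study differs from prior work in that it used detailed scoring rubrics and example answers for each score level to calibrate differences between levels, and applied system instruction prompts to guard against LLM bias. The results confirm the potential of generative AI to serve as a reliable tool in essay and constructed-response assessment, and the study proposes a human-AI collaborative scoring framework.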

English Abstract

This study empirically analyzed the potential of generative AI in constructed-response assessment by comparing the scoring performance of GPT-4o, Gemini 2.0, and Gemini 2.5 on written responses from the New York State Regents Examinations. Analyses using the Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (PCC) showed that all models achieved high accuracy, with QWK scores ranging from 0.889 to 0.935, MAE from 0.210 to 0.410, and PCC from 0.904 to 0.944. Gemini 2.5 performed best on evidence-based argument tasks, while GPT-4o showed the highest accuracy on text-analysis items. Confusion matrix analysis revealed that most errors were within ±1 point, though some limitations were observed, including misclassifications at score-level boundaries and overestimation of zero-point responses. This study distinguishes itself from prior research by employing a refined scoring rubric and score-level anchor papers as preparatory materials, and by implementing system prompts to mitigate large language model bias. The findings suggest that generative AI can serve as a reliable tool for evaluating constructed responses, and the study proposes a collaborative human-AI scoring framework.
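
The three agreement metrics named in the abstract, together with the share of scores within ±1 point, can be illustrated with a minimal Python sketch. This is not the author's implementation, which the abstract does not describe. The function names are illustrative, and the score lists are hypothetical values on an assumed 0 to 4 scale, not data from the study.

```python
from math import sqrt


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK between two integer score lists on the scale [min_score, max_score].

    Assumes at least two score levels and that the scores are not all identical
    to a single value (otherwise the denominator is zero).
    """
    n = max_score - min_score + 1
    total = len(human)
    # Confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            expected = hist_h[i] * hist_m[j] / total  # count expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den


def mean_absolute_error(human, model):
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def pearson_correlation(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def within_one_point(human, model):
    """Share of responses where the two scores differ by at most one point."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, model)) / len(human)


# Hypothetical scores for eight responses (not from the study).
human_scores = [0, 1, 2, 3, 4, 4, 2, 1]
model_scores = [1, 1, 2, 3, 4, 3, 2, 1]

print("QWK:", quadratic_weighted_kappa(human_scores, model_scores, 0, 4))
print("MAE:", mean_absolute_error(human_scores, model_scores))
print("PCC:", pearson_correlation(human_scores, model_scores))
print("Within ±1:", within_one_point(human_scores, model_scores))
```

In this toy data the model overrates the single zero-point response by one point, the kind of error the abstract describes as overestimation of zero scores. QWK penalizes disagreements by the square of their distance, so two one-point misses reduce it far less than a single four-point miss would.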

Keywords

constructed-response (essay) assessment, automated essay scoring, generative AI, LLM, Regents Exam
