A NOVEL HYBRID EVALUATION FRAMEWORK FOR COMPARING AI-BASED ESSAY GRADING WITH HUMAN ASSESSMENTS USING THE DREsS DATASET
Keywords:
AI assessment, essay scoring, DREsS dataset, automated grading, human-AI comparison, educational measurement

Abstract
This research proposes and validates the Hybrid Human-AI Evaluation Framework (HHAEF) to evaluate the validity and reliability of AI-based essay grading relative to human teachers. The framework uses the open-access DREsS dataset, which comprises English as a Foreign Language (EFL) essays scored by expert human raters on content, organization, and language dimensions. Empirical results show strong agreement between AI and human ratings, with Pearson correlation coefficients of r = 0.89 across rubric dimensions and Intraclass Correlation Coefficient (ICC) values of 0.86. The proposed composite measure, Weighted Hybrid Accuracy (WHA), produced an overall score of 0.85, further supporting close alignment between AI and human raters. A qualitative error analysis indicated that AI systems are more accurate in language scoring but less sensitive to creativity and idea development. The proposed framework offers a scalable and transparent methodology for assessing AI graders and establishes a benchmark for integrating automated scoring into secondary education assessment systems.
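The sketch below illustrates, under stated assumptions, how per-dimension agreement between human and AI scores might be computed; the exact WHA formula is defined in the full paper, so the equal weighting of dimensions used here is purely illustrative, and the function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def weighted_hybrid_accuracy(human, ai, weights=(1/3, 1/3, 1/3)):
    """Illustrative composite: weighted mean of per-dimension Pearson r
    between human and AI scores (content, organization, language).
    NOTE: the weighting scheme is an assumption, not the paper's WHA definition."""
    rs = [pearsonr(human[:, d], ai[:, d])[0] for d in range(human.shape[1])]
    return float(np.average(rs, weights=weights))

# Toy usage: 5 essays x 3 rubric dimensions (synthetic scores, not DREsS data)
human = np.array([[4, 3, 5], [2, 2, 3], [5, 4, 4], [3, 3, 2], [4, 5, 5]], dtype=float)
ai    = np.array([[4, 3, 4], [2, 3, 3], [5, 4, 5], [3, 2, 2], [4, 5, 4]], dtype=float)
print(round(weighted_hybrid_accuracy(human, ai), 2))
```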
