Documentation
Overview
This software automates the grading of written assignments by integrating course management systems with large language models (LLMs). Its interface and workflow are designed to minimize the complexity of this process and the time it takes users to learn to use it effectively. The performance of this software is governed by the computing principle known as “garbage in, garbage out”: no matter how seemingly intelligent or well-designed a computing system is, its output is only as good as the input provided. This means that this software has a slight learning curve, but once you have a reasonable idea of how to create an effective grading rubric, it can do all of your grading for you!
Rubric
The workflow of this app is designed to help users build effective grading rubrics. A rubric is a set of criteria used to score written academic work. It maps evaluations of student performance on a series of tasks to the numerical value of such a performance. Although this is conceptually simple, constructing rubrics to evaluate written assignments that LLMs can use correctly and consistently at large scales is a complex task.
Prompts, Criteria, and Instructions
1. Prompts
To manage this complexity, we break rubrics down into prompts, grading criteria, and special grading instructions. Prompts can be thought of as the tasks that students must complete, the questions that they must answer, or the standards that their work must meet. We provide a number of examples here and in the workflow:
- Discuss the quantitative relationship between energy and economic output
- Identify the conceptual contributions of the Scottish enlightenment to modern social science
- Give three examples of convergence in biological or cultural evolution
- Compare the sociological assumptions of John Stuart Mill and Karl Marx
- Technical writing: Does the spelling, grammar, capitalization, punctuation, and word choice of the submission conform to academic standards?
2. Criteria
Grading criteria are detailed descriptions of how written responses ought to be categorized and assessed. For example, using the prompt “Identify the conceptual contributions of the Scottish enlightenment to modern social science” we suggest criteria like the following:
- Excellent (100) Demonstrates thorough understanding of key Scottish enlightenment concepts, with precise explanations and insightful analysis, supported by well-chosen evidence from readings. Writing is clear, focused, and free from significant errors.
- Average (85) Shows a good understanding of the concepts, with generally accurate explanations and some analytical depth. Writing is mostly clear, with minor issues, and evidence is used appropriately but could be more thoroughly integrated.
- Poor (70) Displays weak understanding of the concepts, with significant errors or lack of detail. Writing is unclear or disjointed, and little to no relevant evidence is used.
As another, simpler example, when grading “Give three examples of convergence in biological or cultural evolution,” the criteria can simply be tied to the number of examples provided:
- Excellent (100) Provides exactly three examples of convergence
- Good (85) Provides exactly two examples of convergence
- Average (70) Provides exactly one example of convergence
- Poor (60) Provides no examples of convergence
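At bottom, criteria like these are a mapping from observed performance to a label and a score. The sketch below illustrates that idea using the example-counting rubric above; the dictionary and function names are hypothetical illustrations, not part of the software:

```python
# Minimal sketch: criteria as a mapping from performance to (label, score).
# Mirrors the "three examples of convergence" rubric above; names are hypothetical.

CRITERIA = {
    3: ("Excellent", 100),
    2: ("Good", 85),
    1: ("Average", 70),
    0: ("Poor", 60),
}

def score_by_example_count(n_examples):
    """Map the number of examples found in a submission to (label, score)."""
    n = max(0, min(n_examples, 3))  # clamp to the range the rubric covers
    return CRITERIA[n]
```

A prose rubric like the Scottish enlightenment example works the same way conceptually, except the LLM, rather than a counter, decides which performance level a submission falls into.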
3. Special Grading Instructions
Users have the option of providing special grading instructions. These are particularly helpful if the user chooses GPT-4o-mini instead of the more expensive GPT-4o. They should be included if, after testing and refining your prompts and criteria, the software still does not produce the output you would like. This can happen, for example, if the LLM incorrectly assumes the level of background knowledge that students are required to have, or if it applies the grading rubric too strictly. If the model is assuming unnecessary background knowledge, include an instruction such as “DO NOT grade students on their background knowledge or level of understanding of X; grade them on whether or not they answered the prompt questions.” Or, when grading the first assignment of an introductory course for which the highest grade is “Excellent,” users may want to include the following instruction: “Grade the submission as Excellent unless there are overwhelmingly strong reasons to grade it otherwise.” These instructions should all be as clear, concise, and consistent as possible.
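Conceptually, the prompt, criteria, and special instructions are combined into a single instruction set that the LLM sees alongside each submission. The sketch below shows one plausible way the pieces might be assembled; the function and its layout are illustrative assumptions, not the software's actual internals:

```python
# Illustrative sketch: combining a prompt, criteria, and optional special
# instructions into one grading instruction string. Hypothetical, not the
# software's real implementation.

def build_grading_prompt(prompt, criteria, special_instructions=None):
    """Assemble the text an LLM would receive along with a submission."""
    parts = [
        "You are grading a student submission.",
        f"Prompt: {prompt}",
        "Grading criteria:",
        *criteria,  # one line per performance level
    ]
    if special_instructions:
        parts.append(f"Special grading instructions: {special_instructions}")
    return "\n".join(parts)
```

Because the special instructions are appended last and only when provided, refining them does not require touching the prompt or criteria you have already tested.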
Settings, Gradebook, and Upload
We have included functionality that allows users to test their rubrics and iteratively improve them before uploading evaluations. Four gradebook settings support this goal. Users can create smaller samples of submissions to evaluate by restricting the numerical and/or alphabetical range of students; by default, 100% of students with last names in the A-Z range are graded. (1) Users can restrict the range of students alphabetically by last name, e.g., A-K or N-Z. (2) Users can further restrict this subset by selecting a percentage of students in that alphabetical range to grade (they are randomly chosen). (3) With the Comments? form, users can select whether or not the LLM will generate written feedback to be uploaded to Canvas. Comments are included by default. (4) Finally, users can select either GPT-4o-mini or GPT-4o. GPT-4o-mini is much cheaper and generally capable of simple to moderately complex grading tasks. GPT-4o is OpenAI's flagship model; it can handle complex instructions, make much more nuanced evaluations, and provide written comments of much higher quality. Which model to use is entirely up to the user.
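The sampling behind settings (1) and (2) can be pictured as a two-step filter: keep students whose last-name initial falls in the chosen range, then randomly draw the chosen percentage of that subset. This is a hypothetical sketch of that logic, not the software's actual code:

```python
# Illustrative sketch of gradebook sampling: an alphabetical filter followed
# by a random percentage draw. Names and data shapes are assumptions.
import random

def sample_students(students, letter_range=("A", "Z"), pct=100, seed=None):
    """Filter by last-name initial, then randomly keep pct% of the subset."""
    lo, hi = letter_range
    subset = [s for s in students if lo <= s["last_name"][:1].upper() <= hi]
    rng = random.Random(seed)  # seed only for reproducible illustration
    k = round(len(subset) * pct / 100)
    return rng.sample(subset, k)
```

With the defaults (A-Z, 100%), every student is selected, matching the documented default behavior.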
The gradebook displays a table with student ID numbers along with the grades and comments generated by the LLM. The gradebook is used to review and, if necessary, edit grades and comments before uploading them to Canvas. The text of a submission can be viewed by clicking on the book icon next to the student's ID number. Clicking on a grade or comment in a row activates the editor, and pressing the Enter key completes the edit and updates the submission. The software does not provide feedback for papers that receive the highest grade possible; comments are generated only for submissions that have points deducted. If the user wishes to provide feedback for these papers, they can enter it manually.
Once student evaluations are satisfactorily completed, navigate to the upload page. This software was originally designed for university teaching assistants who are required to spend roughly 20 hours a week grading. Users can specify how many hours (e.g., 0.5 or 5) the grading process should take, and the system will upload evaluations accordingly. It is important not to close the browser window or navigate away from the upload page, as this will stop the upload process. Multiple uploads will create multiple text comments that will have to be deleted manually.
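Spreading uploads over a user-specified number of hours amounts to spacing them evenly in time. A minimal sketch of that arithmetic, assuming evenly spaced uploads (the software's actual pacing may differ):

```python
# Illustrative sketch: evenly space uploads across a chosen duration.
# E.g., 40 evaluations over 0.5 hours means one upload every 45 seconds.

def upload_delay_seconds(n_submissions, hours):
    """Seconds to wait between consecutive uploads."""
    if n_submissions <= 0:
        return 0.0
    return hours * 3600 / n_submissions
```

This is why closing the browser window interrupts the process: the remaining uploads are still waiting out their scheduled delays.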