
Custom GPTs for Coursework: Where They Help and Where They Don’t


OpenAI’s GPT Store launched in January 2024, at first for paying ChatGPT subscribers. By OpenAI’s own count, users had already built more than three million custom GPTs by then. A non-trivial fraction were built by college students or for college students: flashcard generators, essay editors, organic-chemistry tutors, rubric graders, “study buddy for BIO 201” bots stood up by TAs and never taken down. The promise was seductive. A professor could distill a syllabus into a tutor. A student could paste in a lecture transcript and get a drill partner. In practice, two years in, the picture is messier and more interesting than the marketing.

Start with what works. The most reliable custom GPTs in coursework are the narrow ones with a clear input-output contract. A flashcard generator that takes a page of notes and returns Anki-formatted cue-response pairs is useful, not because the pedagogy is clever but because the task is bounded. The bot doesn’t have to reason about what the student knows. It has to parse a definition and return a card. A chemistry TA I spoke with at a large state university built one for her general chemistry section; it front-loads the instruction with a standard format, rejects cards that combine more than one concept, and refuses to generate cards without source text pasted in. She estimates it’s cut the TA team’s flashcard-review questions by about a third. It works because the scope is small and the failure mode is loud: a bad card is easy to spot.
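The contract she describes is small enough to sketch in Python. This is an illustration of the idea, not her bot: the `Card` fields, the function names, and the one-concept heuristic are all assumptions invented for the example, and a real generator would need a better test for "more than one concept" than looking for "and" in the cue.

```python
from dataclasses import dataclass


@dataclass
class Card:
    cue: str
    response: str


def is_single_concept(card: Card) -> bool:
    # Hypothetical heuristic: a cue with " and " in it, or with more than
    # one question mark, probably asks about more than one concept.
    cue = card.cue.lower()
    return cue.count("?") <= 1 and " and " not in cue


def validate_cards(source_text: str, cards: list[Card]) -> list[Card]:
    """Keep only cards that satisfy the contract; refuse if no source text."""
    if not source_text.strip():
        raise ValueError("No source text pasted in; refusing to generate cards.")
    return [
        c for c in cards
        if c.cue.strip() and c.response.strip() and is_single_concept(c)
    ]


def to_anki_tsv(cards: list[Card]) -> str:
    """One tab-separated cue/response pair per line, importable into Anki as text."""
    def clean(field: str) -> str:
        # Tabs and newlines inside a field would break the two-column layout.
        return " ".join(field.split())

    return "\n".join(f"{clean(c.cue)}\t{clean(c.response)}" for c in cards)
```

The point of the sketch is the loud failure: a missing source raises an error instead of producing cards from nothing, and a two-concept card is dropped rather than quietly kept.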

Rubric grader bots are the second category that works, with caveats. A writing instructor at a community college in Oregon built a custom GPT loaded with her course rubric and three calibration samples (one at each grade level) and uses it as a first-pass reader for draft feedback. She’s explicit with students that it’s not assigning grades; it’s telling them where a human reader will likely flag something. The output is inconsistent enough that she treats it as a starting point, not a verdict, and the students have mostly learned to do the same. The bot’s best trick, she told me, is catching the structural issues students learn to ignore in their own drafts: a thesis that doesn’t match the conclusion, a body paragraph whose topic sentence doesn’t match its evidence. It’s useful the way a spell-checker is useful. It catches the obvious and misses the subtle.

Then there are tutor bots, and this is where the honesty has to come in. “Tutor for Introductory Microeconomics” bots are everywhere in the GPT Store. Most are mediocre. The failure modes cluster. They hallucinate definitions that don’t appear in the student’s textbook. They confuse their own prior turns with authoritative sources. They tend, under pressure, to collapse into the generic Wikipedia version of a concept, which is often subtly wrong for the specific course. A student studying Mankiw will get Krugman-flavored answers. A student studying real analysis at Princeton-level rigor will get answers pitched to a first-year calculus class. The problem isn’t that the model is dumb; it’s that the custom GPT layer is thin. You’re pasting a system prompt on top of a general-purpose model that has already decided what “monopolistic competition” means, and your system prompt is not enough to override that prior.

There are two workarounds in circulation. The first is to load the bot with actual course materials (PDFs of the assigned readings, lecture transcripts, problem sets) so that retrieval augments the answer. This helps, materially. The bot can quote the textbook the student is using rather than a composite textbook the model half-remembers. The second is to narrow the bot’s behavior rather than its knowledge: a tutor that only asks Socratic questions, refuses to give direct answers, and operates in the mode described in our piece on using AI as a study partner rather than an answer machine. These two strategies together produce the best coursework bots I’ve seen. They’re also almost entirely built by instructors with time to tune them, not students building tools for themselves at one in the morning.
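To make the combination concrete, here is a toy sketch in Python of both strategies at once: pick the most relevant course excerpt, then wrap it in an instruction that forbids direct answers. It is not how the GPT Store's built-in file retrieval works. The word-overlap scoring, the function names, and the prompt wording are all assumptions chosen to keep the example short, and `passages` is assumed to be a non-empty list.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def best_passage(question: str, passages: list[str]) -> str:
    """Pick the course passage that shares the most words with the question."""
    question_words = tokenize(question)
    return max(passages, key=lambda p: len(question_words & tokenize(p)))


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine the retrieved excerpt with a Socratic-only instruction."""
    passage = best_passage(question, passages)
    return (
        "Use only the course excerpt below. Do not give the answer directly; "
        "reply with one guiding question.\n\n"
        f"Course excerpt: {passage}\n\n"
        f"Student question: {question}"
    )
```

Even in toy form, the division of labor shows: the excerpt controls what the bot knows, and the instruction controls how it behaves.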

A quieter failure mode is worth naming. Custom GPTs have a way of drifting from their stated purpose over a long conversation. A tutor bot starts the session in Socratic mode and ends it drafting the student’s lab report. This isn’t the student’s fault exactly. The model is trained to be helpful in context, and after an hour of back-and-forth the context becomes the conversation, not the system prompt. Students who want the bot to keep its discipline have to restart sessions more often than they’d like.

Flashcard-generator bots have a more specific limitation, which anyone building one for medical or language vocabulary runs into fast. Generic generators are fine for broad definitions and terrible for anything where the card’s value depends on precise phrasing. Medical terminology is the classic example. A card that says “inflammation of the kidney” for nephritis is not wrong but it’s not the card a medical student needs, which is the one that drills the root, the suffix, and the clinical context together. Our piece on flashcard apps for medical and language study makes the fuller case, but the short version is: a generator is a scaffold, not a replacement for the student writing the cards themselves, because the act of writing the card is half the learning.

The grading bots have a darker edge. A few universities have quietly piloted GPT-based grading for large introductory courses, and the results are what you’d expect: acceptable agreement with human graders on the middle of the distribution, worse agreement at the tails, and the occasional catastrophic miss on a creative answer that the bot marks wrong because it doesn’t fit the template. At one institution I heard about, a student failed a short-answer question because their correct answer used a synonym the bot didn’t recognize; the appeal took three weeks. This is a solvable problem but not a solved one, and the solutions involve human review, which is most of the cost the schools were trying to eliminate.
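The middle-versus-tails pattern is easy to check for any pilot that keeps both sets of scores. A minimal sketch in Python, assuming paired human and bot scores on a 0 to 100 scale; the band cutoffs, the five-point tolerance, and the function name are hypothetical choices for illustration, not figures from any of the pilots.

```python
def agreement_by_band(pairs, low=60, high=85, tolerance=5):
    """Fraction of bot scores within `tolerance` points of the human score,
    reported separately for low, middle, and high human scores.

    pairs: iterable of (human_score, bot_score) tuples.
    A band with no scores reports None rather than a misleading 0.
    """
    totals = {"low": 0, "middle": 0, "high": 0}
    agreed = {"low": 0, "middle": 0, "high": 0}
    for human, bot in pairs:
        if human < low:
            band = "low"
        elif human > high:
            band = "high"
        else:
            band = "middle"
        totals[band] += 1
        if abs(human - bot) <= tolerance:
            agreed[band] += 1
    return {
        band: (agreed[band] / totals[band] if totals[band] else None)
        for band in totals
    }
```

A single overall agreement number would hide exactly the cases that end up in appeals, which is why the sketch reports each band on its own.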

The honest summary, two years in, is that custom GPTs for coursework are most valuable when they do one narrow thing, are tuned by someone who understands the course, and are used by students who already know how to study. They’re a force multiplier, not a force. The students who benefit most are the ones who’d have figured out how to study anyway; the bots just save them time. The students who most need the help, the ones whose study habits are thin, get the least out of a custom bot and are often the most harmed by one. A bot will happily generate a thousand words of plausible-sounding review that the student reads passively, a failure pattern we sketched in the piece on managing cognitive load while studying.

Build the narrow ones. Ignore the broad ones. Don’t trust any bot that promises to replace the thing the student actually needs to do.
