The State of AI Detection Tools in 2026
In the spring of 2023, a student at Texas A&M University-Commerce received a zero on a final assignment after a professor ran her essay through what he believed was an AI detector and concluded, incorrectly, that the entire paper had been generated. The tool he used was ChatGPT itself, which is not a detector at all and will cheerfully tell you it wrote almost anything you paste into it. The story made the rounds that May, and the university eventually intervened. But the pattern the incident exposed, faculty reaching for a tool they didn’t understand and students facing consequences on the basis of its output, has not gone away. It has metastasized.
Three years later, the market for AI-detection software is larger, more professional, and still built on a foundation that the best researchers in the field describe, privately and sometimes publicly, as unfixable. Turnitin rolled out its AI detector in April 2023 with a claim of 98 percent accuracy. Within six months, the company quietly walked that number back, added a false-positive disclaimer, and acknowledged that the tool struggled on short documents, on non-native-English writing, and on heavily edited text. GPTZero, founded by then-Princeton senior Edward Tian in January 2023, has positioned itself more cautiously but publishes accuracy numbers that depend heavily on the type of text being evaluated. Copyleaks, Originality.ai, and a handful of smaller competitors have all made similar claims and all faced similar scrutiny.
The problem is structural, not incremental. Detection works by looking for statistical fingerprints: unusually low perplexity, unusually consistent sentence structure, token distributions that look like they came from a language model rather than a human. These fingerprints exist for some outputs of some models at some points in time. They do not survive paraphrasing. They do not survive a student rewriting every third sentence. They do not survive a second pass through a different model. And, most importantly, they are not present in all human writing, which is the piece that has been breaking students’ careers.
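To make the "statistical fingerprints" concrete, here is a minimal sketch of two of the signals named above: perplexity and sentence-length uniformity. It is an illustration under stated assumptions, not any vendor's actual method. The function names and the log-probability values are hypothetical; a real detector would obtain per-token log-probabilities from a scoring language model, which is not included here.

```python
import math
import re
import statistics


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(-mean(logprob))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sentence_length_spread(text):
    """Population standard deviation of sentence lengths, in words.

    A crude stand-in for the 'consistent sentence structure' signal:
    lower spread means more uniform sentences.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0


# Hypothetical log-probabilities a scoring model might assign to two passages.
predictable = [math.log(0.5)] * 8   # every token fairly likely to the model
surprising = [math.log(0.05)] * 8   # every token unlikely to the model

print(perplexity(predictable))  # close to 2
print(perplexity(surprising))   # close to 20

uniform = "One two three. Four five six."
varied = "Short. This one is a much longer sentence."
print(sentence_length_spread(uniform), sentence_length_spread(varied))
```

Nothing in either measure knows who wrote the text. A human who writes predictable, evenly paced sentences scores "model-like" on both, which is the failure mode the rest of this piece is about.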
The false-positive numbers are worse than the marketing suggests. A 2023 study by Weixin Liang and colleagues at Stanford tested seven detectors on TOEFL essays by non-native English speakers and found that the detectors flagged more than half of those essays as AI-generated. The human essays were written years before ChatGPT existed. The bias was structural: non-native writers tend to produce more uniform sentence structures and a narrower vocabulary, which looks, to a detector, like the statistical signature of a model. The paper ran through peer review and appeared in Patterns in July 2023. Its findings have been replicated. No detector vendor has offered a convincing response except to add caveats to their marketing.
There are other categories of writing that trigger false positives at alarming rates. Technical writing with controlled vocabulary, like a methods section in a lab report. Writing by students with autism, whose prose can have a characteristic regularity that looks statistically like a model. Writing by students who were trained early and well in a specific five-paragraph essay format. Writing that has been through a grammar-checker that smoothed the rough edges. All of these produce false positives at rates that would, in almost any other consequential context (a medical test, a security screen), be considered unacceptable.
Meanwhile, the false-negative problem is getting worse, not better. Every generation of language models produces less detectable output. The 2024 wave of models was meaningfully harder to detect than the 2023 wave, and the 2025 wave made the older detectors close to useless on fresh output. Students who actually want to cheat have, for at least two years, been able to do so with near impunity by using a current-generation model and running the output through a paraphraser, or simply by asking the model to write in a more human-sounding register. The tools catch only the students who don’t know any better, and even that fraction is shrinking.
This leaves institutions with an ugly equation. A detector that produces false positives at, charitably, ten percent and misses most current-generation output is a tool that reliably punishes the wrong students and lets the actual cheaters through. This is not a calibration problem. It’s the math. When the underlying distributions overlap as much as human and AI-generated text now do, no classifier can separate them cleanly, and any threshold you set trades false positives for false negatives at a rate that makes the tool’s use in academic integrity proceedings, at best, negligent.
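The arithmetic can be worked through in a few lines. The sketch below is illustrative only: the ten percent false-positive rate is the charitable figure used above, while the detection rate, the share of AI-written submissions, and the two score distributions are hypothetical numbers chosen to show the shape of the problem, not measurements of any real detector.

```python
from statistics import NormalDist


def share_of_flags_that_are_wrong(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of flagged essays that were actually written by a human."""
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1.0 - prevalence) * false_positive_rate
    return flagged_human / (flagged_ai + flagged_human)


def error_rates(threshold, human_scores, ai_scores):
    """False-positive and false-negative rates when scores >= threshold are flagged."""
    false_positive_rate = 1.0 - human_scores.cdf(threshold)
    false_negative_rate = ai_scores.cdf(threshold)
    return false_positive_rate, false_negative_rate


# Assumed for illustration: 20% of essays are AI-written, the detector
# catches 30% of those, and it wrongly flags 10% of human essays.
wrong = share_of_flags_that_are_wrong(0.20, 0.30, 0.10)
print(f"share of flagged essays that are human-written: {wrong:.2f}")

# Assumed, heavily overlapping detector-score distributions.
human = NormalDist(mu=0.40, sigma=0.15)
ai = NormalDist(mu=0.60, sigma=0.15)

for threshold in (0.4, 0.5, 0.6, 0.7):
    fpr, fnr = error_rates(threshold, human, ai)
    print(f"threshold {threshold:.1f}: false positives {fpr:.2f}, false negatives {fnr:.2f}")
```

Under these assumed numbers, more than half of the flagged essays belong to students who did nothing wrong. Moving the threshold only slides the error from one column to the other; with distributions this close together, no setting drives both down at once.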
A growing number of universities have figured this out. Vanderbilt disabled Turnitin’s AI detector in August 2023, citing the false-positive rate. The University of Pittsburgh, Michigan State, and several UC campuses followed. By the middle of 2024, the trend line was clear: large research universities were turning detection off, while smaller institutions and K-12 districts, which had less internal research capacity and more political pressure, kept the tools running. The Chronicle of Higher Education reported in early 2025 that a majority of flagship state universities had moved away from automated detection as a primary integrity tool. Most didn’t publicize the change.
What’s replacing it is more interesting, and more work. Instructors at the campuses that have moved on have started redesigning assignments around the premise that a student could, in principle, use a language model for any take-home task. Oral defenses of written work. In-class writing under supervision. Drafts that have to be submitted alongside the final paper, with revision history intact. Assignments that require engagement with specific class discussions the model wouldn’t have access to. Process-based grading, where the final artifact is one input among several. This is slower than running a detector. It’s also the only approach that survives contact with the current generation of models.
The pedagogical case for this shift is not new. Long before AI detection was a product category, learning-sciences researchers were arguing that high-stakes single artifacts are poor measures of student learning, and that scaffolded, process-based assessment produces better outcomes almost regardless of the topic, a point we pushed on in the piece about how AI writing assistants fit into academic work. The AI wave has, perversely, forced the profession toward assessment design it should arguably have adopted a decade earlier.
For students caught in the false-positive trap, the practical situation remains bleak. An accusation based on a detector’s output is difficult to rebut because the detector’s output is not reproducible. The same essay run twice on the same day can produce different scores. Students who keep drafts, version histories, and timestamps fare better in appeal proceedings than students who don’t. The advice that’s emerged from student-defense organizations and writing centers is consistent: write in a way that leaves a trail. Google Docs history. Git commits for technical work. Handwritten outlines photographed with a timestamp. None of this should be necessary. All of it is, in 2026, common advice.
Where does this leave the detection industry? In a holding pattern, marketing to administrators who want a simple solution and publishing accuracy numbers that don’t survive independent testing. The tools will improve. The models they’re trying to detect will improve faster. The arithmetic doesn’t change. What does change is whether institutions keep paying for a capability that can’t do what it claims, and the answer, slowly, is no. That shift connects to the broader conversation about integrating AI into study without outsourcing the thinking, which is the harder question detection was always trying to avoid.