Unicode Character Scanners

These scanners detect special Unicode characters that AI models (ChatGPT, Copilot, etc.) frequently insert into text but that students almost never type manually. They run regardless of the selected language.

| Scanner ID | Character | Example | Confidence |
|---|---|---|---|
| `em-dash` | U+2014 — (em dash) | “Text — more text” instead of “Text - more text” | High |
| `en-dash-word-join` | U+2013 – between letters | “word–joiner” instead of “word-joiner” | High |
| `smart-quotes` | U+201D ” and U+2018 ‘ | “quoted” instead of "quoted" | Medium / Low |
| `ellipsis` | U+2026 … (horizontal ellipsis) | “and so on…” instead of “and so on...” | Medium |
| `non-breaking-space` | U+00A0 (non-breaking space) | Invisible; looks like a normal space | Medium |
| `invisible-space` | U+200B, U+200A, U+2009, U+202F, U+FEFF | Invisible or nearly invisible: zero-width characters (U+200B, U+FEFF) and narrow spaces (U+2009, U+200A, U+202F) | High |
| `minus-sign` | U+2212 − (minus sign) | “5 − 3” instead of “5 - 3” | Medium |
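
As an illustration of how the table could translate into code, here is a minimal Python sketch. It is not the tool's actual implementation: the `SCANNERS` dictionary and the `scan` function are hypothetical names, and the sketch applies no special-case filtering and assigns no confidence levels. It only assumes that each scanner ID maps to the code points listed above and that "between letters" means a letter directly on both sides of the en dash.

```python
import re

# Hypothetical rule table: scanner ID -> pattern, built from the table above.
SCANNERS = {
    "em-dash": re.compile("\u2014"),
    # En dash only when a letter sits directly on both sides (assumption).
    "en-dash-word-join": re.compile(r"(?<=[^\W\d_])\u2013(?=[^\W\d_])"),
    "smart-quotes": re.compile("[\u201d\u2018]"),
    "ellipsis": re.compile("\u2026"),
    "non-breaking-space": re.compile("\u00a0"),
    "invisible-space": re.compile("[\u200b\u200a\u2009\u202f\ufeff]"),
    "minus-sign": re.compile("\u2212"),
}


def scan(text):
    """Return (scanner_id, offset, character) tuples sorted by offset."""
    findings = []
    for scanner_id, pattern in SCANNERS.items():
        for match in pattern.finditer(text):
            findings.append((scanner_id, match.start(), match.group()))
    return sorted(findings, key=lambda finding: finding[1])
```

Calling `scan("word\u2013joiner 3\u20135")` reports only the first en dash, because the second one sits between digits rather than letters.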

Why these matter

When students type text in a word processor, they use the standard keyboard characters: hyphens (-), straight quotes ("), three dots (...), and regular spaces. AI models, however, are trained on typographically polished text and tend to output Unicode variants of these characters.

A single em dash is not proof of AI usage. But a document full of em dashes, smart quotes, and non-breaking spaces, combined with other findings, is a strong signal.

Special cases

U+201C “ (left double quotation mark) is not flagged because it is the standard closing quotation mark in German typography.
U+2019 ’ (right single quotation mark) is not flagged because it appears in legitimate German contractions.
Ellipsis characters in tables of contents are filtered out to avoid false positives from dot leaders.
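
The special cases above can be sketched in Python as follows. This is a hedged illustration, not the tool's real logic: the source does not describe how tables of contents are recognized, so the `TOC_LEADER` pattern (an ellipsis followed only by leader characters and a trailing page number) is an assumed heuristic, and `quote_findings` and `ellipsis_findings` are hypothetical helper names.

```python
import re

# Quotes that are flagged. U+201C (German closing quote) and U+2019
# (apostrophe in contractions) are deliberately left out of this set.
FLAGGED_QUOTES = {"\u201d", "\u2018"}

# Assumed dot-leader heuristic: an ellipsis followed only by whitespace,
# dots, or more ellipses, then a page number at the end of the line.
TOC_LEADER = re.compile(r"\u2026[\s.\u2026]*\d+\s*$")


def quote_findings(text):
    """Return (offset, character) for every flagged quotation mark."""
    return [(i, ch) for i, ch in enumerate(text) if ch in FLAGGED_QUOTES]


def ellipsis_findings(text):
    """Return offsets of U+2026, skipping lines that look like TOC entries."""
    findings = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if not TOC_LEADER.search(line.rstrip("\r\n")):
            findings.extend(
                offset + match.start() for match in re.finditer("\u2026", line)
            )
        offset += len(line)
    return findings
```

With this sketch, a line such as `1 Einleitung …… 3` produces no ellipsis findings, while `und so weiter…` in running text is still reported. A heuristic like this can also suppress a genuine ellipsis that happens to precede a number at the end of a line, which is a trade-off to weigh against the false positives it avoids.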