Unicode Character Scanners
These scanners detect special Unicode characters that AI models (ChatGPT, Copilot, etc.) frequently insert into text but that students almost never type manually. They run regardless of the selected language.
| Scanner ID | Character | Example | Confidence |
|---|---|---|---|
| em-dash | U+2014 — (em dash) | “Text — more text” instead of “Text - more text” | High |
| en-dash-word-join | U+2013 – (en dash) between letters | “word–joiner” instead of “word-joiner” | High |
| smart-quotes | U+201D ” (right double quotation mark) and U+2018 ‘ (left single quotation mark) | “quoted” instead of "quoted" | Medium / Low |
| ellipsis | U+2026 … (horizontal ellipsis) | “and so on…” instead of “and so on...” | Medium |
| non-breaking-space | U+00A0 (non-breaking space) | Invisible; looks like a normal space | Medium |
| invisible-space | U+200B, U+200A, U+2009, U+202F, U+FEFF | Zero-width or very narrow spaces that are invisible or nearly invisible | High |
| minus-sign | U+2212 − (minus sign) | “5 − 3” instead of “5 - 3” | Medium |
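The table can be read as a set of character patterns keyed by scanner ID. The following is a minimal sketch of that idea, not the tool's actual implementation: the `SCANNERS` dictionary, the `scan` function, and the shape of the returned tuples are all hypothetical, and only the code points and scanner IDs come from the table.

```python
import re

# Code points and scanner IDs are taken from the table; the dictionary
# layout and every name below are illustrative assumptions.
SCANNERS = {
    "em-dash": re.compile("\u2014"),
    # En dash only when it sits directly between two letters.
    "en-dash-word-join": re.compile(r"(?<=[^\W\d_])\u2013(?=[^\W\d_])"),
    "smart-quotes": re.compile("[\u201d\u2018]"),
    "ellipsis": re.compile("\u2026"),
    "non-breaking-space": re.compile("\u00a0"),
    "invisible-space": re.compile("[\u200b\u200a\u2009\u202f\ufeff]"),
    "minus-sign": re.compile("\u2212"),
}


def scan(text):
    """Return (scanner_id, offset, code_point) tuples sorted by offset."""
    findings = []
    for scanner_id, pattern in SCANNERS.items():
        for match in pattern.finditer(text):
            code_point = f"U+{ord(match.group()):04X}"
            findings.append((scanner_id, match.start(), code_point))
    return sorted(findings, key=lambda finding: finding[1])


if __name__ == "__main__":
    for finding in scan("Text \u2014 more text\u00a0here"):
        print(finding)
```

Reporting the code point alongside the offset is a design choice for this sketch: several of these characters are invisible, so a finding that only quotes the surrounding text would be hard to act on.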
Why these matter
When students type text in a word processor, they use the standard keyboard
characters: hyphens (-), straight quotes ("), three dots (...), and
regular spaces. AI models, however, are trained on typographically polished
text and tend to output Unicode variants of these characters.
A single em dash is not proof of AI usage. But a document full of em dashes, smart quotes, and non-breaking spaces, combined with other findings, is a strong signal.
Special cases
- U+201C “ (left double quotation mark) is not flagged because it is the standard closing quotation mark in German typography, as in „Text“.
- U+2019 ’ (right single quotation mark) is not flagged because it appears in legitimate German contractions.
- Ellipsis in tables of contents is filtered out to avoid false positives from dot leaders.
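These exceptions could be expressed as small filters applied per line. The sketch below is only an assumption about how such filtering might look: the function names are hypothetical, and the table-of-contents heuristic (an ellipsis run followed by a trailing page number) is a guess at what a dot-leader check could be, not the rule the scanner actually uses.

```python
import re

# Flagged by the smart-quotes scanner according to the table.
# U+201C and U+2019 are deliberately absent (see the special cases).
FLAGGED_QUOTES = {"\u201d", "\u2018"}

# Assumed heuristic: a table-of-contents line ends with an ellipsis,
# optional further dots or spaces, and then a page number.
TOC_LINE = re.compile(r"\u2026[\u2026.\s]*\d+\s*$")


def flagged_quotes(line):
    """Return (offset, character) pairs for quotes that would be flagged."""
    return [(i, ch) for i, ch in enumerate(line) if ch in FLAGGED_QUOTES]


def flagged_ellipses(line):
    """Return offsets of U+2026, skipping lines that look like dot leaders."""
    if TOC_LINE.search(line):
        return []
    return [i for i, ch in enumerate(line) if ch == "\u2026"]
```

With these definitions, a German-style quotation such as „Hallo“ and a contraction such as geht’s produce no quote findings, while a line like `1.2 Methods ……… 14` produces no ellipsis findings.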