Text Extraction#

Quadrant IntegrityLens chooses an extraction method for each PDF to balance speed and accuracy: it reads the embedded text layer when that layer is usable and falls back to OCR when it is not.

Embedded text (fast path)#
Most PDFs created from word processors have an embedded text layer. Extracting this text is very fast (~0.2 seconds) and produces high-quality results. This is the default path for most student submissions.
Broken text layer detection#
Some PDFs, particularly those generated by LaTeX, have a text layer that contains garbled characters. Quadrant IntegrityLens detects this automatically by checking the extracted text for specific Unicode indicators (standalone diaeresis characters) that signal a broken text layer. When it detects a broken text layer, it falls back to OCR automatically. No manual intervention is needed.
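The check described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: it assumes "standalone diaeresis" refers to the spacing character U+00A8, and the function name and the min_hits threshold are hypothetical.

```python
# Assumption: the indicator is the spacing diaeresis U+00A8, which shows up
# when an umlaut such as "ä" is extracted as two separate glyphs ("¨" + "a").
STANDALONE_DIAERESIS = "\u00a8"


def looks_broken(text: str, min_hits: int = 1) -> bool:
    """Return True if the text contains at least min_hits standalone diaereses.

    min_hits is a hypothetical tuning knob; the real threshold is not documented.
    """
    return text.count(STANDALONE_DIAERESIS) >= min_hits


if __name__ == "__main__":
    print(looks_broken("Universit\u00a8at"))  # broken layer: diaeresis split off
    print(looks_broken("Universit\u00e4t"))   # intact layer: precomposed "ä"
```

A correctly encoded "ä" (U+00E4) never contains U+00A8, so intact text with umlauts does not trigger the check in this sketch.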
OCR fallback#
When OCR is needed (either automatically or via --force-ocr), Quadrant IntegrityLens uses PaddleOCR to re-extract the text. This takes longer (~25 seconds) but works reliably with all PDF types.
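The overall decision flow might look like the sketch below. It is illustrative only: extract_text and its parameters are hypothetical names, the two extractors are passed in as plain callables rather than calling any real PDF or PaddleOCR API, and the broken-layer test is reduced to a single U+00A8 lookup.

```python
from typing import Callable, Tuple

# Assumption: a standalone diaeresis (U+00A8) marks a broken text layer.
STANDALONE_DIAERESIS = "\u00a8"


def extract_text(
    pdf_path: str,
    extract_embedded: Callable[[str], str],  # hypothetical fast-path extractor
    run_ocr: Callable[[str], str],           # hypothetical OCR extractor
    force_ocr: bool = False,
) -> Tuple[str, str]:
    """Return (text, method), where method is "embedded" or "ocr"."""
    if not force_ocr:
        text = extract_embedded(pdf_path)
        # Keep the fast-path result only if the text layer looks intact.
        if STANDALONE_DIAERESIS not in text:
            return text, "embedded"
    # Reached when OCR is forced or the embedded layer looks broken.
    return run_ocr(pdf_path), "ocr"
```

Passing the extractors as arguments keeps the sketch independent of any particular PDF or OCR library; in the real tool the OCR step is performed by PaddleOCR.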
Page markers#
Regardless of the extraction method, Quadrant IntegrityLens tracks page boundaries so that every finding can be linked back to a specific page in the original PDF. This makes it easy for teachers to locate flagged passages.
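One way to keep findings tied to pages is to record the character offset at which each page starts and look up a finding's offset later. The sketch below shows that idea under stated assumptions: the function names and the newline separator are hypothetical, and the tool's actual marker format is not documented here.

```python
from bisect import bisect_right
from typing import List, Tuple


def build_page_index(pages: List[str]) -> Tuple[str, List[int]]:
    """Join per-page texts and record the start offset of every page."""
    starts: List[int] = []
    offset = 0
    for page in pages:
        starts.append(offset)
        offset += len(page) + 1  # +1 for the newline that separates pages
    return "\n".join(pages), starts


def page_for_offset(starts: List[int], offset: int) -> int:
    """Return the 1-based page number that contains the given character offset."""
    return bisect_right(starts, offset)


if __name__ == "__main__":
    full_text, starts = build_page_index(["first page", "second page"])
    hit = full_text.index("second")
    print(page_for_offset(starts, hit))
```

Because the index is built from per-page text, the same lookup works whether the pages came from the embedded text layer or from OCR.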
