Text Extraction

Quadrant IntegrityLens chooses its extraction method per PDF to balance speed and accuracy.

  1. Embedded text (fast path)

    Most PDFs created from word processors have an embedded text layer. Extracting this text is fast (about 0.2 seconds) and produces high-quality results. This is the default path for most student submissions.

  2. Broken text layer detection

    Some PDFs, particularly those generated by LaTeX, have a text layer that contains garbled characters. Quadrant IntegrityLens detects this by checking for specific Unicode indicators (standalone diaeresis characters) that signal a broken text layer. When it detects one, it falls back to OCR automatically; no manual intervention is needed.
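
The check described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names and the `min_hits` threshold are hypothetical, and it assumes "standalone diaeresis" means the spacing character U+00A8 (as opposed to the combining mark U+0308 or a precomposed letter such as "ö").

```python
# Assumption: the "standalone diaeresis" indicator is U+00A8 (spacing diaeresis).
# Precomposed letters (e.g. U+00F6 "ö") and the combining mark U+0308 are
# legitimate text and are deliberately not counted.
STANDALONE_DIAERESIS = "\u00a8"


def count_standalone_diaeresis(text: str) -> int:
    """Count occurrences of the spacing diaeresis character in the text."""
    return text.count(STANDALONE_DIAERESIS)


def looks_broken(text: str, min_hits: int = 1) -> bool:
    """Hypothetical heuristic: flag the text layer as broken when it
    contains at least `min_hits` standalone diaeresis characters."""
    return count_standalone_diaeresis(text) >= min_hits
```

For example, a text layer that yields "Schr¨odinger" (accent and letter as separate characters) would be flagged, while a correctly encoded "Schrödinger" would not.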

  3. OCR fallback

    When OCR is needed, either through the automatic fallback or because --force-ocr was passed, Quadrant IntegrityLens uses PaddleOCR to re-extract the text. This is slower (about 25 seconds, compared with about 0.2 seconds for embedded text) but works reliably with all PDF types.
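
Taken together, the three steps amount to a small decision procedure, sketched below. The sketch is only an illustration of the control flow: `extract_text` and its parameters are hypothetical names, and the embedded-text reader, the OCR step, and the broken-layer check are passed in as plain callables so that no particular PDF or PaddleOCR API is assumed.

```python
from typing import Callable, Tuple


def extract_text(
    pdf_path: str,
    read_embedded: Callable[[str], str],  # hypothetical: returns the embedded text layer
    run_ocr: Callable[[str], str],        # hypothetical: wraps the OCR engine
    is_broken: Callable[[str], bool],     # hypothetical: broken text layer check
    force_ocr: bool = False,              # mirrors the --force-ocr option
) -> Tuple[str, str]:
    """Return (text, method), where method is "embedded" or "ocr"."""
    if not force_ocr:
        # Fast path: try the embedded text layer first.
        text = read_embedded(pdf_path)
        if not is_broken(text):
            return text, "embedded"
    # Slow path: OCR was forced, or the text layer looked broken.
    return run_ocr(pdf_path), "ocr"
```

Returning the method alongside the text is a design choice of this sketch; it makes it easy to log which path a given submission took.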

  4. Page markers

    Regardless of the extraction method, Quadrant IntegrityLens tracks page boundaries so that every finding can be linked back to a specific page in the original PDF. This makes it easy for teachers to locate flagged passages.
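
One possible way to track page boundaries is to record the character offset at which each page starts and look findings up against those offsets. The sketch below assumes that approach; the source does not say how markers are stored, and `join_pages` and `page_for_offset` are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Tuple


def join_pages(pages: List[str]) -> Tuple[str, List[int]]:
    """Join per-page text with newlines and record where each page starts."""
    starts: List[int] = []
    position = 0
    for page in pages:
        starts.append(position)
        position += len(page) + 1  # +1 for the newline separator
    return "\n".join(pages), starts


def page_for_offset(starts: List[int], offset: int) -> int:
    """Map a character offset in the joined text to a 1-based page number."""
    return bisect_right(starts, offset)


# Usage: locate the page that contains a flagged passage.
text, starts = join_pages(["first page", "second page", "third page"])
flagged_at = text.index("second")
flagged_page = page_for_offset(starts, flagged_at)  # page 2
```

Because the lookup depends only on the per-page text, it works the same way whether the text came from the embedded layer or from OCR.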