Skip to content

File Processing System

The FileProcessor module forms the foundation of Supermat's document handling capabilities, converting various document formats into our structured ParsedDocument model while preserving their hierarchical organization.

Architecture Components

Handler

The Handler orchestrates document processing through two key components:

  1. Converters: A collection of utilities that transform various file formats into a standardized format for parsing. For example:
  2. Converting .docx to .pdf
  3. Converting .pptx to .pdf
  4. Future support planned for additional formats

  5. Parser: Processes the standardized format to generate the ParsedDocument model.

This modular approach allows for:

  • Format flexibility
  • Easy integration of new document types
  • Consistent parsing behavior across different input formats

Parser

The Parser component performs the critical task of transforming documents into our structured ParsedDocument model while:

  • Maintaining complete document fidelity (lossless conversion)
  • Preserving hierarchical relationships (sections, paragraphs, sentences)
  • Converting unstructured text into a structured Pydantic model

Current Implementation

We currently support PDF parsing through two powerful backends:

  1. PyMuPDF: An open-source PDF processing library
  2. Adobe PDF Services API: Professional-grade PDF processing

Document Model

Our ParsedDocument model is designed to capture the complete structure of a document while making it processable for AI pipelines. For detailed information about the model structure and capabilities, refer to our model documentation.

Future Enhancements

We plan to expand the File Processing System with:

  • Support for additional document formats
  • Enhanced structure detection algorithms
  • Improved metadata extraction
  • Advanced formatting preservation
  • Custom parsers for specialized document types