File Processing System

The FileProcessor module is the foundation of Supermat's document handling. It converts supported document formats into our structured ParsedDocument model while preserving each document's hierarchical organization.

Architecture Components

Handler

The Handler orchestrates document processing through two key components:

Converters: A collection of utilities that transform input file formats into one standardized format for parsing. Current examples:
Converting .docx to .pdf
Converting .pptx to .pdf
Support for additional formats is planned.
Parser: Processes the standardized format to generate the ParsedDocument model.

This modular approach allows for:

Format flexibility
Easy integration of new document types
Consistent parsing behavior across different input formats
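
The split between converters and a single parser can be sketched roughly as follows. This is a minimal illustration, not Supermat's actual API: the `Handler` class, `register_converter`, `process`, and both stand-in functions are hypothetical names, and PDF is assumed to be the standardized format because both listed converters target it.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type aliases; none of these names come from Supermat itself.
Converter = Callable[[Path], Path]  # takes a source file, returns a PDF path
Parser = Callable[[Path], dict]     # takes a PDF path, returns parsed output


class Handler:
    """Routes a file through an optional converter, then a single parser."""

    def __init__(self, parser: Parser) -> None:
        self._parser = parser
        self._converters: Dict[str, Converter] = {}

    def register_converter(self, extension: str, converter: Converter) -> None:
        # Keys are lowercase extensions such as ".docx".
        self._converters[extension.lower()] = converter

    def process(self, path: Path) -> dict:
        extension = path.suffix.lower()
        if extension != ".pdf":
            if extension not in self._converters:
                raise ValueError(f"No converter registered for {extension!r}")
            path = self._converters[extension](path)
        # Every input reaches the same parser, whatever its original format.
        return self._parser(path)


def fake_docx_to_pdf(path: Path) -> Path:
    # Stand-in for a real conversion step; it only rewrites the suffix.
    return path.with_suffix(".pdf")


def fake_parser(path: Path) -> dict:
    # Stand-in for a real parser; it records which file it was given.
    return {"source": path.name}


handler = Handler(parser=fake_parser)
handler.register_converter(".docx", fake_docx_to_pdf)
result = handler.process(Path("report.docx"))
```

Under this design, adding a new document type only requires registering one more converter; the parser and its output stay unchanged.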

Parser

The Parser transforms documents into our structured ParsedDocument model while:

Maintaining complete document fidelity (lossless conversion)
Preserving hierarchical relationships (sections, paragraphs, sentences)
Converting unstructured text into a structured Pydantic model
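
To make the section, paragraph, and sentence hierarchy concrete, here is a small sketch of what a lossless nested structure could look like. It uses standard-library dataclasses as a stand-in for the Pydantic model so that it runs without dependencies. The class names, fields, and the naive sentence splitter are assumptions for illustration only and do not reflect the real ParsedDocument schema.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    text: str


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)

    def text(self) -> str:
        # Concatenating the sentences restores the original paragraph text.
        return "".join(s.text for s in self.sentences)


@dataclass
class Section:
    heading: str
    paragraphs: List[Paragraph] = field(default_factory=list)


def split_paragraph(raw: str) -> Paragraph:
    # Naive splitter: cut after sentence-ending punctuation, but keep the
    # trailing whitespace attached to each piece so no character is lost.
    pieces = re.findall(r".+?(?:[.!?]+\s+|\Z)", raw, flags=re.DOTALL)
    return Paragraph([Sentence(piece) for piece in pieces])


raw = "Parsing keeps structure. It also keeps text!  Nothing is dropped."
paragraph = split_paragraph(raw)
section = Section(heading="Overview", paragraphs=[paragraph])
```

Because whitespace stays attached to each sentence, `paragraph.text()` reproduces `raw` exactly, which is one simple way to check the lossless property.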

Current Implementation

We currently support PDF parsing through two backends:

PyMuPDF: An open-source PDF processing library
Adobe PDF Services API: Adobe's cloud-based PDF processing service
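
One way to keep two backends interchangeable is a shared interface with a lookup by name. The sketch below is only an assumption about how such a switch might be organized: `PdfBackend`, `extract_blocks`, `parse_pdf`, and the backend keys are hypothetical, and the stub classes stand in for real PyMuPDF and Adobe PDF Services integrations without calling either library.

```python
from typing import Dict, List, Protocol


class PdfBackend(Protocol):
    """Hypothetical interface; not taken from Supermat's code."""

    def extract_blocks(self, pdf_path: str) -> List[str]: ...


class StubLocalBackend:
    # Placeholder where a PyMuPDF-based implementation would go.
    def extract_blocks(self, pdf_path: str) -> List[str]:
        return [f"local:{pdf_path}"]


class StubRemoteBackend:
    # Placeholder where an Adobe PDF Services API client would go.
    def extract_blocks(self, pdf_path: str) -> List[str]:
        return [f"remote:{pdf_path}"]


BACKENDS: Dict[str, PdfBackend] = {
    "pymupdf": StubLocalBackend(),
    "adobe": StubRemoteBackend(),
}


def parse_pdf(pdf_path: str, backend: str = "pymupdf") -> List[str]:
    # Look up the backend by name and fail loudly on an unknown key.
    try:
        chosen = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Unknown backend {backend!r}") from None
    return chosen.extract_blocks(pdf_path)


local_blocks = parse_pdf("sample.pdf")
remote_blocks = parse_pdf("sample.pdf", backend="adobe")
```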

Document Model

Our ParsedDocument model is designed to capture the complete structure of a document while making it processable for AI pipelines. For detailed information about the model structure and capabilities, refer to our model documentation.

Future Enhancements

We plan to expand the File Processing System with:

Support for additional document formats
Enhanced structure detection algorithms
Improved metadata extraction
Advanced formatting preservation
Custom parsers for specialized document types