File Processing System
The FileProcessor
module forms the foundation of Supermat's document handling capabilities, converting various document formats into our structured ParsedDocument
model while preserving their hierarchical organization.
Architecture Components
Handler
The Handler
orchestrates document processing through two key components:
- Converters: A collection of utilities that transform various file formats into a standardized format for parsing. For example:
- Converting
.docx
to.pdf
- Converting
.pptx
to.pdf
-
Future support planned for additional formats
-
Parser: Processes the standardized format to generate the
ParsedDocument
model.
This modular approach allows for:
- Format flexibility
- Easy integration of new document types
- Consistent parsing behavior across different input formats
Parser
The Parser component performs the critical task of transforming documents into our structured ParsedDocument
model while:
- Maintaining complete document fidelity (lossless conversion)
- Preserving hierarchical relationships (sections, paragraphs, sentences)
- Converting unstructured text into a structured Pydantic model
Current Implementation
We currently support PDF parsing through two powerful backends:
- PyMuPDF: An open-source PDF processing library
- Adobe PDF Services API: Professional-grade PDF processing
Document Model
Our ParsedDocument
model is designed to capture the complete structure of a document while making it processable for AI pipelines. For detailed information about the model structure and capabilities, refer to our model documentation.
Future Enhancements
We plan to expand the File Processing System with:
- Support for additional document formats
- Enhanced structure detection algorithms
- Improved metadata extraction
- Advanced formatting preservation
- Custom parsers for specialized document types