Skip to content

Structure

High Level Overview

Supermat's architecture consists of two primary components:

  1. FileProcessor: A sophisticated component that parses input files and transforms them into our structured ParsedDocument model. This model preserves the document's hierarchical structure while making it machine-processable.

  2. Chunking System: This component takes the lossless ParsedDocument and intelligently segments it into chunks that fit within an LLM's context length. The current version operates primarily at the paragraph level, though future iterations will explore more adaptive chunking strategies.

Design Philosophy

Supermat's primary focus lies in sophisticated document processing rather than traditional retrieval mechanisms. Our unique Structure ID system creates hierarchical connections between different document components, enabling:

  • Precise content location and retrieval
  • Maintenance of context across chunks
  • Efficient navigation through document hierarchies

Current Limitations and Future Work

While the current implementation effectively demonstrates our approach, we acknowledge several areas for future enhancement:

  • Development of more intelligent, adaptive chunking algorithms
  • Optimization of chunk sizes based on content semantics
  • Integration with various document formats
  • Enhanced preservation of document metadata

Our roadmap includes expanding these capabilities while maintaining the system's core strength: preserving document structure throughout the AI processing pipeline.