Supermat Documentation

A novel data representation framework for the AI era—offering structured annotations, granular traceability, and enhanced evaluation metrics to tackle hallucinations and compliance challenges.

Overview

Supermat introduces a structured approach to data processing and retrieval for Large Language Models (LLMs). It preserves annotations even after an LLM is trained, enabling clear traceability from any LLM output back to the original source text. This is critical for:

Hallucination Prevention: Identify and mitigate fabricated answers
Compliance & Auditing: Ensure regulatory standards are met by tracing outputs
Legal & Security: Quickly verify authenticity and control sensitive content

By leveraging Structure IDs (e.g., 2.1.4.8 for document/section/paragraph/sentence), Supermat maintains a transparent map between raw data and tokenized text, thereby reducing hallucinations and offering granular document-level context.
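
To make the ID scheme concrete, here is a minimal sketch of splitting a Structure ID into its four levels. It assumes the dotted document/section/paragraph/sentence layout from the example above. The `StructureID` class and its field names are hypothetical and are not part of Supermat's API.

```python
from typing import NamedTuple


class StructureID(NamedTuple):
    """Hypothetical container for the four levels of a dotted Structure ID."""
    document: int
    section: int
    paragraph: int
    sentence: int

    @classmethod
    def parse(cls, raw: str) -> "StructureID":
        # Assumes exactly four dot-separated integers, e.g. "2.1.4.8".
        parts = raw.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected 4 levels, got {len(parts)}: {raw!r}")
        return cls(*(int(p) for p in parts))

    def paragraph_prefix(self) -> str:
        # ID of the enclosing paragraph: drop the sentence level.
        return f"{self.document}.{self.section}.{self.paragraph}"


sid = StructureID.parse("2.1.4.8")
print(sid.sentence)            # 8
print(sid.paragraph_prefix())  # 2.1.4
```

Keeping the levels as named integers makes it cheap to move up the hierarchy, for example from a sentence to its enclosing paragraph.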

Features

Persistent Annotations
Supermat encodes unique identifiers at the sentence or paragraph level, so the lineage of any output text is never lost—even when building or fine-tuning LLMs.
Structure-Aware Data
Parsed documents maintain hierarchical relationships: sections, paragraphs, and sentences. This allows for more informed chunking and retrieval strategies.
Traceability & Compliance
Instantly link LLM outputs to their original references. Ideal for auditing, legal e-discovery, and policy enforcement.
Drop-In Retriever
The SupermatRetriever class seamlessly integrates with LangChain’s VectorStore, enabling structured queries with minimal refactoring.
Enhanced Evaluation Pipeline
Built-in metrics (Faithfulness, Accuracy, ROUGE, Cosine Similarity, etc.) let you rigorously test and iterate on your retrieval-augmented generation (RAG) workflows.
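
As a rough illustration of structure-aware chunking, the sketch below groups sentences into paragraph-level chunks by their ID prefix while keeping each sentence's own ID. The input dictionary and the `chunk_by_paragraph` helper are hypothetical, and Supermat's own chunking logic may differ.

```python
from collections import defaultdict


def chunk_by_paragraph(sentences: dict[str, str]) -> dict[str, list[tuple[str, str]]]:
    """Group (structure_id, text) pairs under their paragraph-level prefix."""
    chunks: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for structure_id, text in sentences.items():
        # "2.1.4.8" -> "2.1.4": everything before the last dot is the paragraph.
        paragraph_id = structure_id.rsplit(".", 1)[0]
        chunks[paragraph_id].append((structure_id, text))
    return dict(chunks)


# Made-up sentences keyed by hypothetical Structure IDs.
sentences = {
    "2.1.4.1": "Revenue grew in the third quarter.",
    "2.1.4.2": "Growth was driven by new subscriptions.",
    "2.1.5.1": "Operating costs remained flat.",
}
chunks = chunk_by_paragraph(sentences)
print(sorted(chunks))        # ['2.1.4', '2.1.5']
print(len(chunks["2.1.4"]))  # 2
```

Because each chunk keeps the sentence-level IDs, a retrieved paragraph can still be traced back to individual sentences.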

Installation

Supermat uses Poetry for dependency management:

# 1. Clone the repository
git clone https://github.com/supermatai/supermat.git
cd supermat

# 2. Install Poetry (if not already installed)
#    Follow the official Poetry docs for your environment

# 3. Install dependencies
poetry install --with=frontend --all-extras

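# 4. Activate your virtual environment
#    (on Poetry 2.x, `poetry shell` requires the poetry-plugin-shell plugin;
#    alternatively, run the command printed by `poetry env activate`)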
poetry shell

For additional instructions or troubleshooting, check our Documentation.

Quick Start

1. Parse Documents

from supermat import FileProcessor
from pathlib import Path

pdf_path = Path("sample_document.pdf")
parsed_document = FileProcessor.parse_file(pdf_path)

2. Create a Retriever

from supermat.langchain.bindings import SupermatRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Suppose you have multiple parsed documents
documents = [parsed_document]  # or a list of them

embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
vector_store = Chroma(
    embedding_function=embedding_model,
    collection_name="PDFS_SUPERMAT_DEMO",
    persist_directory="./chromadb"
)

retriever = SupermatRetriever(parsed_docs=documents, vector_store=vector_store)

3. Run the Gradio Interface (Optional)

python -m supermat.gradio

Open the provided local URL to see a live demo of how Supermat processes and retrieves text.

4. Explore the Notebook Demo

cd notebooks
poetry run jupyter notebook pdf_demo.ipynb

This end-to-end walkthrough demonstrates:
- Parsing and annotating PDF content
- Structuring data into the ParsedDocument model
- Using the retriever for queries and tracing outputs

Hugging Face Spaces Demo

Try Supermat directly in your browser—no setup required:

Code Overview

`FileProcessor`

Purpose: Converts files (PDF, DOCX, HTML, etc.) into a ParsedDocument model, preserving hierarchical structure.
Usage:

```python
from supermat import FileProcessor, ParsedDocument
from pathlib import Path

doc: ParsedDocument = FileProcessor.parse_file(Path("your_file.pdf"))
```

Handler Management:

```python
handlers = FileProcessor.get_handlers(Path("your_file.pdf"))
doc_custom = FileProcessor.get_handler("some_handler").parse(Path("your_file.pdf"))
```

`SupermatRetriever`

Goal: Serve as a drop-in replacement for LangChain’s standard retrievers, adding structure-aware indexing and traceability.
Usage:

```python
from supermat.langchain.bindings import SupermatRetriever
from langchain.vectorstores import Chroma

retriever = SupermatRetriever(parsed_docs=[doc1, doc2], vector_store=Chroma(...))
```

Advantages:
- Retains hierarchical references (Structure IDs)
- Easily integrates into RAG workflows
- Minimizes hallucination risk by enabling direct text tracebacks
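
The traceback advantage listed above can be sketched without any Supermat code: if every retrieved chunk carries its Structure IDs, a plain dictionary lookup recovers the source sentences behind an answer. The `SOURCE_INDEX` contents and the `trace` helper are hypothetical and only illustrate the concept.

```python
# Hypothetical index from Structure ID to original sentence text.
SOURCE_INDEX = {
    "1.2.1.1": "The policy takes effect on the first day of the month.",
    "1.2.1.2": "Claims must be filed within thirty days.",
    "1.3.2.1": "Appeals are reviewed by an independent panel.",
}


def trace(cited_ids: list[str], index: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split cited IDs into those found in the source index and those missing."""
    found = {sid: index[sid] for sid in cited_ids if sid in index}
    # IDs with no matching source text are candidates for a hallucination review.
    missing = [sid for sid in cited_ids if sid not in index]
    return found, missing


found, missing = trace(["1.2.1.2", "9.9.9.9"], SOURCE_INDEX)
print(found)    # {'1.2.1.2': 'Claims must be filed within thirty days.'}
print(missing)  # ['9.9.9.9']
```

An ID that resolves to no source text does not prove a fabrication, but it is a cheap signal that the answer deserves a closer look.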

Evaluation & Metrics

Supermat includes an evaluation module aligned with LangChain’s frameworks to measure the quality of LLM outputs. Key metrics include:

Faithfulness: Checks if the generated response accurately reflects the source documents (i.e., no made-up facts).
Accuracy: Measures correctness against reference answers or ground truth.
Cosine Similarity: Quantifies semantic closeness between the generated response and reference text.
ROUGE (1, 2, L): Assesses textual overlap at unigram, bigram, and longest common subsequence levels.
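
For intuition, two of these metrics can be computed by hand. The sketch below implements ROUGE-1 recall with whitespace tokenization and cosine similarity over raw term-count vectors. Both are simplifications (real pipelines typically rely on a ROUGE package and on embedding vectors), and the function names are illustrative rather than Supermat's evaluation API.

```python
import math
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams also present in the candidate (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge1_recall(candidate, reference))      # 5 of 6 reference tokens matched
print(cosine_similarity(candidate, reference))  # close to 7/8
```

Recall rewards covering the reference, so a candidate that misses one of six reference tokens scores 5/6 no matter how many extra words it adds.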

Highlights (vs. standard chunking & semantic chunking strategies):

+12.5% improvement in faithfulness
+15.6% improvement in accuracy
+33% ROUGE-1 recall lift
Slightly faster or comparable runtime performance

Such gains emphasize Supermat’s focus on preserving structural context and annotated references, which reduces hallucinations and improves overall LLM response quality.

Conclusion

Supermat is more than just another chunking library. By embedding structured annotations into the document processing pipeline, it ensures every piece of information remains traceable—an essential component for building trustworthy AI systems. Whether you need robust compliance checks, advanced RAG pipelines, or improved user confidence, Supermat delivers a scalable and adaptable solution for AI-driven data workflows.

Contributing

We welcome your contributions! You can help by:

Forking the repository
Creating a feature branch
Submitting a pull request

For guidelines, please see CONTRIBUTING.md (coming soon).

Thanks for trying Supermat!
Find more details and advanced guides at our Documentation. Feel free to open an issue or a pull request if you have any suggestions or improvements!
