A data representation framework for the AI era that offers structured annotations, granular traceability, and enhanced evaluation metrics to tackle hallucinations and compliance challenges.
Table of Contents
- Overview
- Features
- Installation
- Quick Start
- Hugging Face Spaces Demo
- Code Overview
  - FileProcessor
  - SupermatRetriever
- Evaluation & Metrics
- Conclusion
- Contributing
Overview
Supermat introduces a structured approach to data processing and retrieval for Large Language Models (LLMs). It preserves annotations even after an LLM is trained, enabling clear traceability from any LLM output back to the original source text. This is critical for:
- Hallucination Prevention: Identify and mitigate fabricated answers
- Compliance & Auditing: Ensure regulatory standards are met by tracing outputs
- Legal & Security: Quickly verify authenticity and control sensitive content
By leveraging Structure IDs (e.g., 2.1.4.8 for document/section/paragraph/sentence), Supermat maintains a transparent map between raw data and tokenized text, thereby reducing hallucinations and offering granular document-level context.
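As a rough illustration of the four-level scheme, the sketch below splits a Structure ID string into its document, section, paragraph, and sentence components. The `StructureID` class and its `paragraph_prefix` helper are hypothetical names invented for this example; they are not part of Supermat's API, and the library's internal representation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureID:
    """Hypothetical container for a document/section/paragraph/sentence ID."""

    document: int
    section: int
    paragraph: int
    sentence: int

    @classmethod
    def parse(cls, raw: str) -> "StructureID":
        # Assumes exactly four dot-separated integer levels, as in "2.1.4.8".
        parts = raw.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected 4 dot-separated levels, got {raw!r}")
        return cls(*(int(p) for p in parts))

    def paragraph_prefix(self) -> str:
        # ID of the enclosing paragraph: everything except the sentence level.
        return f"{self.document}.{self.section}.{self.paragraph}"


sid = StructureID.parse("2.1.4.8")
print(sid.sentence)            # sentence index within its paragraph
print(sid.paragraph_prefix())  # ID shared by all sentences in that paragraph
```

Keeping the levels as separate integer fields makes it straightforward to sort IDs in document order or to walk up from a sentence to its paragraph or section.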
Features
- Persistent Annotations: Supermat encodes unique identifiers at the sentence or paragraph level, so the lineage of any output text is never lost, even when building or fine-tuning LLMs.
- Structure-Aware Data: Parsed documents maintain hierarchical relationships: sections, paragraphs, and sentences. This allows for more informed chunking and retrieval strategies.
- Traceability & Compliance: Instantly link LLM outputs to their original references. Ideal for auditing, legal e-discovery, and policy enforcement.
- Drop-In Retriever: The SupermatRetriever class seamlessly integrates with LangChain’s VectorStore, enabling structured queries with minimal refactoring.
- Enhanced Evaluation Pipeline: Built-in metrics (Faithfulness, Accuracy, ROUGE, Cosine Similarity, etc.) let you rigorously test and iterate on your retrieval-augmented generation (RAG) workflows.
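To make the idea of structure-aware chunking more concrete, here is a minimal sketch that groups annotated sentences into paragraph-level chunks while keeping every sentence's Structure ID. The sample data and the `chunk_by_paragraph` function are assumptions made for illustration only; Supermat's own chunking logic is not shown in this README and may work differently.

```python
from collections import defaultdict

# Hypothetical annotated sentences as (structure_id, text) pairs.
sentences = [
    ("2.1.4.1", "Supermat parses documents."),
    ("2.1.4.2", "Each sentence keeps an ID."),
    ("2.1.5.1", "Chunks follow paragraph boundaries."),
]


def chunk_by_paragraph(items):
    """Group sentences by their document.section.paragraph prefix."""
    groups = defaultdict(list)
    for sid, text in items:
        prefix = sid.rsplit(".", 1)[0]  # drop the sentence-level component
        groups[prefix].append((sid, text))
    # Each chunk keeps the IDs of the sentences it was built from.
    return [
        {
            "paragraph_id": prefix,
            "sentence_ids": [sid for sid, _ in members],
            "text": " ".join(text for _, text in members),
        }
        for prefix, members in groups.items()
    ]


chunks = chunk_by_paragraph(sentences)
for chunk in chunks:
    print(chunk["paragraph_id"], chunk["sentence_ids"])
```

Because chunk boundaries follow paragraph boundaries rather than a fixed token count, a retrieved chunk can always be reported together with the exact sentences it contains.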
Installation
Supermat uses Poetry for dependency management:
# 1. Clone the repository
git clone https://github.com/supermatai/supermat.git
cd supermat
# 2. Install Poetry (if not already installed)
# Follow the official Poetry docs for your environment
# 3. Install dependencies
poetry install --with=frontend --all-extras
# 4. Activate your virtual environment
poetry shell
For additional instructions or troubleshooting, check our Documentation.
Quick Start
1. Parse Documents
from supermat import FileProcessor
from pathlib import Path
pdf_path = Path("sample_document.pdf")
parsed_document = FileProcessor.parse_file(pdf_path)
2. Create a Retriever
from supermat.langchain.bindings import SupermatRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
# Suppose you have multiple parsed documents
documents = [parsed_document] # or a list of them
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
vector_store = Chroma(
    embedding_function=embedding_model,
    collection_name="PDFS_SUPERMAT_DEMO",
    persist_directory="./chromadb"
)
retriever = SupermatRetriever(parsed_docs=documents, vector_store=vector_store)
3. Run the Gradio Interface (Optional)
python -m supermat.gradio
Open the provided local URL to see a live demo of how Supermat processes and retrieves text.
4. Explore the Notebook Demo
cd notebooks
poetry run jupyter notebook pdf_demo.ipynb
This end-to-end walkthrough demonstrates:
- Parsing and annotating PDF content
- Structuring data into the ParsedDocument model
- Using the retriever for queries and tracing outputs
Hugging Face Spaces Demo
Try Supermat directly in your browser—no setup required:
Code Overview
FileProcessor
- Purpose: Converts files (PDF, DOCX, HTML, etc.) into a ParsedDocument model, preserving hierarchical structure.
- Usage:
```python
from supermat import FileProcessor, ParsedDocument
from pathlib import Path

doc: ParsedDocument = FileProcessor.parse_file(Path("your_file.pdf"))
```
- Handler Management:
```python
handlers = FileProcessor.get_handlers(Path("your_file.pdf"))
doc_custom = FileProcessor.get_handler("some_handler").parse(Path("your_file.pdf"))
```
SupermatRetriever
- Goal: Serve as a drop-in replacement for LangChain’s standard retrievers, adding structure-aware indexing and traceability.
- Usage:
```python
from supermat.langchain.bindings import SupermatRetriever
from langchain.vectorstores import Chroma

retriever = SupermatRetriever(parsed_docs=[doc1, doc2], vector_store=Chroma(...))
```
- Advantages:
  - Retains hierarchical references (Structure IDs)
  - Easily integrates into RAG workflows
  - Minimizes hallucination risk by enabling direct text tracebacks
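The traceback idea can be sketched without any Supermat or LangChain code: if each retrieved passage carries its Structure IDs, an answer's cited IDs can be resolved to source sentences, and any ID with no source can be flagged. The `source_index` dictionary, its sample sentences, and the `trace` function are hypothetical stand-ins for whatever metadata the retriever actually stores.

```python
# Hypothetical index from Structure ID to the original source sentence.
source_index = {
    "2.1.4.7": "The contract term is 24 months.",
    "2.1.4.8": "Either party may terminate with 30 days' notice.",
}


def trace(structure_ids, index):
    """Split cited IDs into those backed by source text and those that are not."""
    found = {sid: index[sid] for sid in structure_ids if sid in index}
    missing = [sid for sid in structure_ids if sid not in index]
    return found, missing


# IDs cited by a generated answer; the second one has no matching source.
found, missing = trace(["2.1.4.8", "9.9.9.9"], source_index)
print(found)    # citations that resolve to source text
print(missing)  # citations that cannot be traced and deserve review
```

An unresolvable ID does not prove a hallucination, but it is a cheap signal that a claim lacks a verifiable source.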
Evaluation & Metrics
Supermat includes an evaluation module aligned with LangChain’s frameworks to measure the quality of LLM outputs. Key metrics include:
- Faithfulness: Checks if the generated response accurately reflects the source documents (i.e., no made-up facts).
- Accuracy: Measures correctness against reference answers or ground truth.
- Cosine Similarity: Quantifies semantic closeness between the generated response and reference text.
- ROUGE (1, 2, L): Assesses textual overlap at unigram, bigram, and longest common subsequence levels.
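For readers unfamiliar with the overlap-based metrics, the sketch below computes cosine similarity between two vectors and ROUGE-1 recall between two strings using their textbook definitions. It is not Supermat's evaluation module: the function names are invented here, tokenization is a simple lowercase whitespace split, and in practice cosine similarity would be applied to embedding vectors rather than hand-written lists.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped overlap: a token counts at most as often as the candidate has it.
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(rouge1_recall("the cat sat", "the cat sat on the mat"))
```

ROUGE-1 recall rewards covering the reference's words, so it tends to rise when retrieved context contains more of the relevant source text.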
Highlights (vs. standard chunking & semantic chunking strategies):
- +12.5% improvement in faithfulness
- +15.6% improvement in accuracy
- +33% ROUGE-1 recall lift
- Slightly faster or comparable runtime performance
Such gains emphasize Supermat’s focus on preserving structural context and annotated references, which reduces hallucinations and improves overall LLM response quality.
Conclusion
Supermat is more than just another chunking library. By embedding structured annotations into the document processing pipeline, it ensures every piece of information remains traceable—an essential component for building trustworthy AI systems. Whether you need robust compliance checks, advanced RAG pipelines, or improved user confidence, Supermat delivers a scalable and adaptable solution for AI-driven data workflows.
Contributing
We welcome your contributions! You can help by:
- Forking the repository
- Creating a feature branch
- Submitting a pull request
For guidelines, please see CONTRIBUTING.md (coming soon).
Thanks for trying Supermat!
Find more details and advanced guides at our Documentation. Feel free to open an issue or a pull request if you have any suggestions or improvements!