# Retrieval System
The Supermat Retriever integrates seamlessly with existing retrieval frameworks while leveraging our structured document approach. Here's how it works and what makes it unique.
## Supermat Retriever
Our retriever is designed as a flexible wrapper that enhances existing vector store capabilities with Supermat's structural awareness. It currently integrates with Langchain's Vector Stores, serving as a drop-in replacement that preserves all standard functionality while adding structural context.
### Key Features
- **Seamless Integration**: Works as a direct replacement for any Langchain vector store
- **Structure-Aware Retrieval**: Maintains document hierarchies during the retrieval process
- **Framework Flexibility**: Built to support multiple retrieval frameworks
- **Preserved Context**: Uses Structure IDs to maintain document relationships
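To make the "Preserved Context" idea concrete, here is an illustrative sketch of how a hierarchical Structure ID can preserve document relationships. The dotted `document.section.sentence` scheme and the helper functions below are assumptions for illustration only; the exact ID format Supermat uses may differ.

```python
# Illustrative sketch only: the real Supermat Structure ID format may differ.
# Assume a dotted ID of the form "document.section.sentence", e.g. "2.1.5".

def parent_id(structure_id: str) -> str:
    """Return the ID of the enclosing structural unit ("2.1.5" -> "2.1")."""
    return structure_id.rsplit(".", 1)[0]

def same_section(id_a: str, id_b: str) -> bool:
    """Two sentence-level IDs share a section if their parent IDs match."""
    return parent_id(id_a) == parent_id(id_b)

print(parent_id("2.1.5"))              # the section containing sentence 5
print(same_section("2.1.5", "2.1.7"))  # sentences from the same section
print(same_section("2.1.5", "2.3.1"))  # sentences from different sections
```

Because each retrieved chunk carries an ID like this, a retriever can always walk back up to the enclosing section or document, which is what lets structure survive the retrieval step.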
## Implementation
To use the Supermat Retriever:

1. Process all the documents using the `FileProcessor`.
2. Choose your preferred vector store.
3. Initialize `SupermatRetriever` with your selected store and the processed documents.
4. Process and retrieve documents while maintaining structural integrity.
Example:

Step 1: Process all your files.

```python
from itertools import chain

from joblib import Parallel, delayed
from tqdm.auto import tqdm

from supermat.core.parser import FileProcessor

# `pdf_files` is your list of document paths to process.
parsed_files = Parallel(n_jobs=-1, backend="threading")(
    delayed(FileProcessor.parse_file)(path) for path in tqdm(pdf_files)
)
# Each file yields a list of parsed documents; flatten them into one list.
documents = list(chain.from_iterable(parsed_files))
```
Step 2: Build the retriever from the processed documents.

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

from supermat.langchain.bindings import SupermatRetriever

retriever = SupermatRetriever(
    parsed_docs=documents,
    vector_store=Chroma(
        embedding_function=HuggingFaceEmbeddings(
            model_name="thenlper/gte-base",
        ),
        persist_directory="./chromadb",
        collection_name="NAME",
    ),
)
```
Step 3: Use the retriever in a Langchain retrieval chain.

You can use the default chain we provide, which includes citations in its answers. Its prompt was tuned on the DeepSeek-R1 8B model.

```python
from langchain_ollama.llms import OllamaLLM

from supermat.langchain.bindings import get_default_chain

llm_model = OllamaLLM(model="deepseek-r1:8b", temperature=0.0)
chain = get_default_chain(
    retriever, llm_model, substitute_references=False, return_context=False
)
```
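One practical note on the model choice: DeepSeek-R1 models emit their reasoning inside `<think>…</think>` tags. Whether those tags reach you depends on the chain's own post-processing, which is not specified here; if they do, a small generic helper (a sketch, not part of supermat) can strip them from the raw response:

```python
import re

# Hypothetical helper, not part of supermat: remove DeepSeek-R1's
# <think>...</think> reasoning block from a raw model response.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text: str) -> str:
    """Drop the reasoning block and surrounding whitespace, keep the answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>Reasoning about the question...</think>The answer is 42."
print(strip_think(raw))  # -> "The answer is 42."
```

Setting `temperature=0.0`, as in the example above, keeps the model's answers deterministic, which is usually what you want for citation-style retrieval.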