Core

This module holds Supermat's core parsing logic: the pydantic models that define the structure of parsed documents, the chunking strategies, and the parser logic that converts documents into the ParsedDocument model.

export_parsed_document(document, output_path, **kwargs)

Export the given ParsedDocument to a JSON file.

Parameters:

Name         Type                Description                                               Default
document     ParsedDocumentType  The ParsedDocument to be dumped.                          required
output_path  Path | str          JSON file location.                                       required
**kwargs                         Keyword arguments forwarded to ParsedDocument.dump_json.  optional

Source code in supermat/core/models/parsed_document.py
def export_parsed_document(document: ParsedDocumentType, output_path: Path | str, **kwargs):
    """Export the given ParsedDocument to a JSON file.

    Args:
        document (ParsedDocumentType): The ParsedDocument to be dumped.
        output_path (Path | str): JSON file location.
        **kwargs: Forwarded unchanged to `ParsedDocument.dump_json`.
    """
    output_path = Path(output_path)
    # dump_json returns bytes, so the file is opened in binary write mode.
    with output_path.open("wb") as fp:
        fp.write(ParsedDocument.dump_json(document, **kwargs))

load_parsed_document(path)

Load a ParsedDocument from a JSON dump.

Parameters:

Name  Type        Description                  Default
path  Path | str  File path to the JSON file.  required

Returns:

Name                Type                Description
ParsedDocumentType  ParsedDocumentType  ParsedDocument model loaded from JSON.

Source code in supermat/core/models/parsed_document.py
def load_parsed_document(path: Path | str) -> ParsedDocumentType:
    """Load a JSON-dumped `ParsedDocument`.

    Args:
        path (Path | str): File path to the JSON file.

    Returns:
        ParsedDocumentType: ParsedDocument model loaded from JSON.

    Raises:
        ValueError: If the JSON is neither a list nor a dict with a single root key.
    """
    path = Path(path)
    with path.open("rb") as fp:
        raw_doc: list[dict[str, Any]] | dict[str, list[dict[str, Any]]] = orjson.loads(fp.read())

    if isinstance(raw_doc, dict) and len(raw_doc) == 1:
        # A single root key wraps the section list; unwrap it and warn the caller.
        root_key = next(iter(raw_doc))
        warn(f"The json document contains a root node {root_key}.", ValidationWarning)
        return ParsedDocument.validate_python(raw_doc[root_key])
    elif isinstance(raw_doc, list):
        return ParsedDocument.validate_python(raw_doc)
    else:
        raise ValueError("Invalid JSON Format")