Utils

`parse_pdf(pdf_file)`

Converts a PDF file to a `PyMuPDFDocument` model for easier parsing. PyMuPDF doesn't provide a pydantic model in its implementation, so the document is converted into one to make it easier to work with.

Parameters:

Name	Type	Description	Default
`pdf_file`	`Path`	The pdf file that needs to be parsed.	required

Returns:

Name	Type	Description
`PyMuPDFDocument`	`PyMuPDFDocument`	Pydantic model representation of the pdf file.

Source code in supermat/core/parser/pymupdf_parser/utils.py

def parse_pdf(pdf_file: Path) -> PyMuPDFDocument:
    """Converts a PDF file to a PyMuPDFDocument model for easier parsing.
    PyMuPDF doesn't provide a pydantic model in its implementation,
    so we convert the document into one to make it easier to work with.

    Args:
        pdf_file (Path): The pdf file that needs to be parsed.

    Returns:
        PyMuPDFDocument: Pydantic model representation of the pdf file.
    """
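    # Open the document in a context manager so it is closed once the pages are read.
    with pymupdf.open(pdf_file) as doc:
        doc_data = {"filename": pdf_file.name, "total_pages": len(doc), "pages": [create_page(page) for page in doc]}
    # Round-trip through JSON so pydantic validates the nested page data.
    return PyMuPDFDocument.model_validate_json(orjson.dumps(doc_data, default=default))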
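
The function above builds a plain dict, serializes it to JSON with a `default` fallback, and then loads that JSON into a typed model. The following is a minimal sketch of that round-trip pattern only, not the library's implementation. It uses the standard library's `json` and `dataclasses` as stand-ins for `orjson` and pydantic so that it runs without extra dependencies. `PageSketch`, `DocumentSketch`, `build_document`, and the behaviour of this `default` function are all hypothetical; the real `default` and `create_page` helpers in `utils.py` may handle different types and fields.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PageSketch:
    """Hypothetical stand-in for a parsed page model."""
    number: int
    text: str


@dataclass
class DocumentSketch:
    """Hypothetical stand-in for PyMuPDFDocument."""
    filename: str
    total_pages: int
    pages: list[PageSketch]


def default(obj):
    """Fallback serializer for objects the JSON encoder cannot handle natively.

    Assumption: the real helper plays a similar role for PyMuPDF-specific types.
    """
    if isinstance(obj, Path):
        return str(obj)
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def build_document(filename: Path, raw_pages: list[dict]) -> DocumentSketch:
    """Serialize raw data to JSON, then load it back into typed models."""
    doc_data = {"filename": filename, "total_pages": len(raw_pages), "pages": raw_pages}
    # The Path value is not JSON-serializable, so the encoder calls `default` for it.
    payload = json.dumps(doc_data, default=default)
    loaded = json.loads(payload)
    return DocumentSketch(
        filename=loaded["filename"],
        total_pages=loaded["total_pages"],
        pages=[PageSketch(**page) for page in loaded["pages"]],
    )


doc = build_document(
    Path("report.pdf"),
    [{"number": 0, "text": "Hello"}, {"number": 1, "text": "World"}],
)
```

Going through JSON first means the model only ever sees plain strings, numbers, lists, and dicts, which is one plausible reason for serializing before validation rather than passing library objects to the model directly.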