Utils

parse_pdf(pdf_file)

Converts a PDF file to a PyMuPDF Document model for easier parsing. pymupdf does not provide a pydantic model in its implementation, so we convert the document into a pydantic model to make it easier to work with.

Parameters:

Name      Type  Description                            Default
pdf_file  Path  The pdf file that needs to be parsed.  required

Returns:

Name             Type             Description
PyMuPDFDocument  PyMuPDFDocument  Pydantic model representation of the pdf file.

Source code in supermat/core/parser/pymupdf_parser/utils.py
def parse_pdf(pdf_file: Path) -> PyMuPDFDocument:
    """Converts a PDF file to a PyMuPDF Document model for easier parsing.
    pymupdf does not provide a pydantic model in its implementation,
    so we convert it into a pydantic model to make it easier to work with.

    Args:
        pdf_file (Path): The pdf file that needs to be parsed.

    Returns:
        PyMuPDFDocument: Pydantic model representation of the pdf file.
    """
    doc = pymupdf.open(pdf_file)
    doc_data = {"filename": pdf_file.name, "total_pages": len(doc), "pages": [create_page(page) for page in doc]}
    return PyMuPDFDocument.model_validate_json(orjson.dumps(doc_data, default=default))
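The function above follows a three-step pattern: collect plain data into a dict, serialize it to JSON with a `default` hook for values the serializer cannot handle natively, then load the JSON into a typed model. The sketch below illustrates that pattern using only the standard library. `json` and `dataclasses` stand in for `orjson` and pydantic, and the names `SketchPage`, `SketchDocument`, `fallback`, and `build_document` are hypothetical and not part of supermat. It is an illustration under those assumptions, not the library's implementation.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchPage:  # hypothetical stand-in for a page model
    number: int
    text: str


@dataclass
class SketchDocument:  # hypothetical stand-in for PyMuPDFDocument
    filename: str
    total_pages: int
    pages: list[SketchPage]


def fallback(obj):
    """Called by json.dumps for objects it cannot serialize natively."""
    if isinstance(obj, Path):
        return str(obj)
    if isinstance(obj, bytes):
        return obj.hex()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def build_document(filename: Path, raw_pages: list[str]) -> SketchDocument:
    # Step 1: collect plain data, comparable to doc_data in parse_pdf.
    data = {
        "filename": filename,
        "total_pages": len(raw_pages),
        "pages": [{"number": i, "text": t} for i, t in enumerate(raw_pages)],
    }
    # Step 2: serialize; the Path value is routed through fallback().
    payload = json.dumps(data, default=fallback)
    # Step 3: load the JSON back into typed objects.
    loaded = json.loads(payload)
    pages = [SketchPage(**page) for page in loaded["pages"]]
    return SketchDocument(loaded["filename"], loaded["total_pages"], pages)


doc = build_document(Path("report.pdf"), ["first page", "second page"])
print(doc.filename, doc.total_pages)
```

Unlike pydantic's `model_validate_json`, the dataclass version performs no type validation or coercion; it only shows where the `default` hook fits in the round trip.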