Parser
The parser submodule contains all Parser implementations, each of which converts a given file type to a ParsedDocument. For a Parser to be registered, it needs to be included here. TODO (@legendof-selda): Dynamically register all parsers.
To create a new Parser, create a submodule for it that contains a parser.py file. This is where the Parser implementation is written. Any utilities associated with that parser go in utils.py. Also import the Parser in the submodule's __init__.py file for easier importing.
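The layout described above can be sketched roughly as follows. This is a minimal illustration, not code from supermat: the `markdown_parser` submodule, `MarkdownParser`, `split_paragraphs`, and the list-of-dicts return shape are all hypothetical, and the `Parser` base class is a stand-in defined here only so the snippet runs on its own.

```python
# Hypothetical layout (names are illustrative, not taken from supermat):
#   parser/markdown_parser/__init__.py  -> from .parser import MarkdownParser
#   parser/markdown_parser/parser.py    -> the Parser implementation
#   parser/markdown_parser/utils.py     -> helpers used only by this parser
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Parser(ABC):
    """Stand-in for the real Parser base class (assumed interface)."""

    @abstractmethod
    def parse(self, file_path: Path) -> list[dict]:
        ...


def split_paragraphs(text: str) -> list[str]:
    """Would live in utils.py: split text on blank lines."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


class MarkdownParser(Parser):
    """Would live in parser.py."""

    def parse(self, file_path: Path) -> list[dict]:
        text = Path(file_path).read_text(encoding="utf-8")
        return [{"index": i, "text": p} for i, p in enumerate(split_paragraphs(text))]


def demo() -> list[dict]:
    """Parse a small temporary file with the hypothetical parser."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.md"
        sample.write_text("First paragraph.\n\nSecond paragraph.\n", encoding="utf-8")
        return MarkdownParser().parse(sample)
```

In the real package the class would additionally be registered with FileProcessor.register, and the return value would be a ParsedDocument rather than a plain list.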
FileProcessor
Source code in supermat/core/parser/file_processor.py
get_handler(handler_name)
staticmethod
Retrieve the registered handler from the given handler_name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
handler_name | str | Unique name given to the registered handler. | required |
Returns:
Name | Type | Description |
---|---|---|
Handler | Handler | The registered handler. |
Source code in supermat/core/parser/file_processor.py
get_handlers(file_path)
staticmethod
Get all the handlers that can handle the given file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | The file that needs to be handled. | required |
Returns:
Type | Description |
---|---|
dict[str, Handler] | The handlers associated with this file type. |
Source code in supermat/core/parser/file_processor.py
get_main_handler(file_path)
staticmethod
Get the 'main' handler that can handle the given file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | The file that needs to be handled. | required |
Returns:
Name | Type | Description |
---|---|---|
Handler | Handler | The main handler associated with this file type. |
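The three lookup methods (get_handler, get_handlers, get_main_handler) can be pictured with a toy registry. This is only a sketch under stated assumptions: `StubHandler`, `add_handler`, the `_HANDLERS` dictionary, the case-insensitive suffix match, and the `ValueError` are hypothetical and say nothing about how FileProcessor stores or resolves handlers internally.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass(frozen=True)
class StubHandler:
    """Hypothetical stand-in for a registered Handler."""

    name: str
    extension: str
    main: bool = False


# Hypothetical registry keyed by the handler's unique name.
_HANDLERS: dict[str, StubHandler] = {}


def add_handler(handler: StubHandler) -> None:
    _HANDLERS[handler.name] = handler


def get_handler(handler_name: str) -> StubHandler:
    """Look a handler up by its unique name."""
    return _HANDLERS[handler_name]


def get_handlers(file_path: Union[Path, str]) -> dict[str, StubHandler]:
    """Return every handler whose extension matches the file's suffix."""
    suffix = Path(file_path).suffix.lower()  # assumption: case-insensitive match
    return {name: h for name, h in _HANDLERS.items() if h.extension == suffix}


def get_main_handler(file_path: Union[Path, str]) -> StubHandler:
    """Return the handler flagged as main for this file type."""
    mains = [h for h in get_handlers(file_path).values() if h.main]
    if not mains:
        raise ValueError(f"No main handler registered for {file_path}")
    return mains[0]


add_handler(StubHandler("parser_a", ".pdf", main=True))
add_handler(StubHandler("parser_b", ".pdf"))
add_handler(StubHandler("parser_c", ".docx", main=True))
```

With these entries, `get_handlers("report.pdf")` yields both `parser_a` and `parser_b`, while `get_main_handler("report.pdf")` narrows that down to `parser_a`.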
Source code in supermat/core/parser/file_processor.py
parse_file(file_path)
staticmethod
Parses a file and returns the ParsedDocument after retrieving the 'main' handler for it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | The path of the file that needs to be parsed. | required |
Returns:
Name | Type | Description |
---|---|---|
ParsedDocumentType | ParsedDocumentType | The parsed format of the file. |
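One way to think about what a handler does with a file is as a chain: any converters run first, in order, and the parser runs on the final result. The sketch below assumes each converter maps a path to a new path, which this page does not confirm; `run_handler` and the three toy functions are hypothetical, and they only rename suffixes instead of touching real files.

```python
from pathlib import Path
from typing import Callable, Sequence, Union

# Assumption: a converter takes a path and returns the path of the converted file.
PathStep = Callable[[Path], Path]


def run_handler(
    file_path: Union[Path, str],
    converters: Sequence[PathStep],
    parse: Callable[[Path], str],
) -> str:
    """Apply each converter in order, then parse the final file."""
    current = Path(file_path)
    for convert in converters:
        current = convert(current)
    return parse(current)


# Toy steps that only change the suffix; real converters would write new files.
def docx_to_pdf(path: Path) -> Path:
    return path.with_suffix(".pdf")


def pdf_to_html(path: Path) -> Path:
    return path.with_suffix(".html")


def parse_html(path: Path) -> str:
    return f"parsed:{path.name}"


result = run_handler("report.docx", [docx_to_pdf, pdf_to_html], parse_html)
```

An empty converter list simply hands the original path to the parser.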
Source code in supermat/core/parser/file_processor.py
process_file(file_path, **kwargs)
staticmethod
Parses a file with the 'main' handler retrieved for it, saves the ParsedDocument as JSON, and returns the file path to the saved JSON.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | The path of the file that needs to be parsed. | required |
Returns:
Name | Type | Description |
---|---|---|
Path | Path | The path to the exported JSON file. |
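The parse, save, and return-path flow can be sketched as below. Everything here is a stand-in: the toy `parse_file` produces one record per non-empty line, and writing the JSON next to the source file with a `.json` suffix is an assumed naming scheme, not the documented behaviour of process_file. Its `**kwargs` are not modelled.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def parse_file(file_path: Union[Path, str]) -> list[dict]:
    """Stand-in parser: one record per non-empty line."""
    lines = Path(file_path).read_text(encoding="utf-8").splitlines()
    return [{"line": i, "text": t} for i, t in enumerate(lines) if t.strip()]


def process_file(file_path: Union[Path, str]) -> Path:
    """Parse the file, write the result as JSON, and return the JSON path."""
    source = Path(file_path)
    target = source.with_suffix(".json")  # assumed naming scheme
    target.write_text(json.dumps(parse_file(source), indent=2), encoding="utf-8")
    return target


def demo() -> tuple[str, list[dict]]:
    """Process a temporary file and read the exported JSON back."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "notes.txt"
        source.write_text("alpha\n\nbeta\n", encoding="utf-8")
        out = process_file(source)
        return out.name, json.loads(out.read_text(encoding="utf-8"))
```

Returning the path rather than the parsed object lets callers defer loading the document until they need it.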
Source code in supermat/core/parser/file_processor.py
register(extension, *, converters=None, main=False)
staticmethod
A register decorator that registers a Parser to the specified document extension type, along with the list of Converters that need to run before parsing the document.
Example:
@FileProcessor.register(".html")
@FileProcessor.register(".pdf", converters=PDF2HTMLConverter, main=True)
@FileProcessor.register(".docx", converters=[Docx2PDFConverter, PDF2HTMLConverter])
class HTMLParser(Parser):
    def parse(self, file_path: Path) -> ParsedDocumentType:
        ...
Parameters:
Name | Type | Description | Default |
---|---|---|---|
extension | str | The file extension that the parser will handle. | required |
converters | type[Converter] \| Iterable[type[Converter]] \| None | List of Converters that need to run before parsing the document. | None |
main | bool | Specifies if the decorated Parser is the main handler for the extension. | False |
Returns:
Type | Description |
---|---|
Callable[[P], P] | Callable[[type[Parser]], type[Parser]]: A decorator that registers the given Parser. |
Source code in supermat/core/parser/file_processor.py
PyMuPDFParser
Bases: Parser
Parses a PDF file using the PyMuPDF library.
Source code in supermat/core/parser/pymupdf_parser/parser.py