# Utils
### extract_meaningful_words(text)
For a given text, extract a set of relevant keywords using nltk.
Source code in supermat/core/parser/utils.py
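The docstring above only names nltk, so the exact pipeline is not specified here. As a rough illustration of stop-word-based keyword extraction, the following self-contained sketch uses a tiny hand-written stop-word set in place of nltk's corpus; the function name and filtering rules are assumptions, not the library's actual implementation.

```python
import re

# Minimal stand-in stop-word list; the real function relies on nltk's
# corpus-based stop words, which are far more complete.
STOPWORDS = {"the", "a", "an", "is", "of", "for", "and", "to", "in"}

def extract_meaningful_words_sketch(text: str) -> set[str]:
    """Lower-case the text, tokenize on alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # Keep only non-stop-word tokens longer than two characters.
    return {tok for tok in tokens if tok not in STOPWORDS and len(tok) > 2}

words = extract_meaningful_words_sketch("The parser extracts keywords for a given text.")
# e.g. words contains "parser" and "keywords", but not "the" or "for"
```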
### get_keywords(text)
For a given text, retrieve a list of relevant keywords using spacy and nltk.
Source code in supermat/core/parser/utils.py
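The docstring describes get_keywords as layering spacy on top of the nltk-based extraction, but the combination is not detailed here. As a purely illustrative sketch, a trivial suffix-stripping "lemmatizer" below stands in for spacy's lemmatization; every name and rule in it is hypothetical.

```python
def lemmatize_sketch(token: str) -> str:
    """Crude stand-in for spacy lemmatization: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def get_keywords_sketch(words: set[str]) -> list[str]:
    """Lemmatize extracted words and return them deduplicated and sorted."""
    return sorted({lemmatize_sketch(w) for w in words})

keywords = get_keywords_sketch({"parsers", "keywords"})
```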
### split_text_into_token_chunks(text, max_tokens=8000, model_name=TOKENIZER_MODEL_NAME)
Splits a text into chunks based on token count using LangChain's token splitter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to be split. | *required* |
| `max_tokens` | `int` | The maximum number of tokens in each chunk. | `8000` |
| `model_name` | `str` | The LLM model name to determine tokenization rules. | `TOKENIZER_MODEL_NAME` |
Returns:

| Name | Type | Description |
|---|---|---|
| `list` | `list[str]` | A list of text chunks, each with up to `max_tokens` tokens. |
Source code in supermat/core/parser/utils.py
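The real function delegates tokenization to LangChain's token splitter with a model-specific tokenizer. To make the chunking behavior concrete without that dependency, the sketch below approximates "tokens" with whitespace-separated words; the function name and the word-based tokenization are assumptions for illustration only.

```python
def split_text_into_token_chunks_sketch(text: str, max_tokens: int = 8000) -> list[str]:
    """Split text into chunks of at most max_tokens tokens.

    Whitespace words stand in for model tokens here; the documented
    function instead counts tokens with the tokenizer for model_name.
    """
    tokens = text.split()
    # Step through the token list in windows of max_tokens and rejoin each window.
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = split_text_into_token_chunks_sketch("one two three four five", max_tokens=2)
# chunks == ["one two", "three four", "five"]
```

Note that chunk boundaries fall only between tokens, so every chunk holds at most `max_tokens` tokens and the last chunk may be shorter.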