Evaluation
LangChain Evaluation Overview
LangChain provides evaluation modules to assess the quality of LLM outputs, particularly for RAG systems. The evaluation process typically involves comparing the generated response against reference answers or source documents.
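As a rough illustration, criterion-based scores like the "faithful" and "accuracy" feedback below can be produced with LangChain's built-in evaluators. The snippet is a minimal sketch: the criterion wording, judge model, and wiring are assumptions for illustration, not the exact configuration used for the tables in this section.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI  # any chat model can act as the judge; illustrative choice

judge_llm = ChatOpenAI(model="gpt-4o-mini")  # assumed judge model, not necessarily the one used here

# "labeled_criteria" grades a prediction against a reference using an LLM judge.
faithfulness_evaluator = load_evaluator(
    "labeled_criteria",
    criteria={
        "faithful": "Is the response fully supported by the reference documents, "
                    "with no hallucinated facts?"
    },
    llm=judge_llm,
)

result = faithfulness_evaluator.evaluate_strings(
    prediction="Generated answer from the RAG pipeline...",
    reference="Retrieved source passage(s)...",
    input="The original question...",
)
print(result)  # includes a score plus the judge's reasoning
```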
Key Metrics Explained
Faithfulness
Measures how well the generated response is grounded in the source documents: it checks that the LLM's response only contains information actually present in the retrieved documents and does not introduce hallucinated facts.
Accuracy
Measures whether the generated response is correct with respect to the reference answer. In RAG contexts, this typically means checking that the key information from the source documents is preserved in the answer.
Cosine Similarity
Measures the semantic similarity between the generated response and the reference text by converting both into vector embeddings and computing the cosine of the angle between them. Values range from -1 to 1, where 1 indicates maximum similarity.
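For two embedding vectors a and b the score is a·b / (|a| |b|). A minimal sketch (the vectors here are toy values, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration.
generated = np.array([0.9, 0.1, 0.3])
reference = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(generated, reference))  # close to 1.0 => semantically similar
```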
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Metrics
ROUGE-L (RougeLsum)
- Overall: Measures the longest common subsequence (LCS) between the generated and reference texts
- Precision: LCS length divided by the number of words in the generated text
- Recall: LCS length divided by the number of words in the reference text
ROUGE-1
- Overall: Measures overlap of unigrams (single words)
- Precision: Ratio of matching unigrams to total unigrams in generated text
- Recall: Ratio of matching unigrams to total unigrams in reference text
ROUGE-2
- Overall: Measures overlap of bigrams (pairs of consecutive words)
- Precision: Ratio of matching bigrams to total bigrams in generated text
- Recall: Ratio of matching bigrams to total bigrams in reference text
These metrics together provide a comprehensive view of RAG system performance, evaluating both factual accuracy and linguistic similarity to reference materials.
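To make the ROUGE definitions concrete, here is a minimal sketch using the rouge-score package; the example strings are illustrative, and the scoring pipeline behind the tables below may be wired differently.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

reference = "The retriever returns the most relevant passages from the indexed documents."
prediction = "The retriever returns relevant passages from indexed documents."

# score(target, prediction) returns precision, recall and F-measure per ROUGE type,
# matching the precision / recall / f1_score columns reported below.
for name, score in scorer.score(reference, prediction).items():
    print(name, round(score.precision, 3), round(score.recall, 3), round(score.fmeasure, 3))
```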
Below we present the results of our evaluation.
Supermat Evaluation Metrics
Summary statistics from the evaluation of the Supermat retriever.
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | error | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 490 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 1 | 492 |
mean | 0.786327 | 0.853564 | 0.791328 | 0.0431129 | 0.0275313 | 0.424209 | 0.0142695 | 0.0107686 | 0.0355544 | 0.04223 | 0.0267061 | 0.423059 | nan | 1.4606 |
std | 0.20909 | 0.221319 | 0.0358326 | 0.0937076 | 0.0747593 | 0.479402 | 0.0692836 | 0.0553136 | 0.164963 | 0.0901142 | 0.0702288 | 0.479115 | nan | 0.46089 |
min | 0.2 | 0.1 | 0.717511 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 0.562514 |
25% | 0.8 | 0.7 | 0.773459 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.14542 |
50% | 0.9 | 1 | 0.787848 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.39943 |
75% | 0.9 | 1 | 0.802803 | 0.0533428 | 0.0277778 | 1 | 0 | 0 | 0 | 0.0533428 | 0.0277778 | 1 | nan | 1.71919 |
max | 1 | 1 | 0.971989 | 0.666667 | 0.692308 | 1 | 0.625 | 0.588235 | 1 | 0.666667 | 0.634615 | 1 | nan | 3.4466 |
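The table above is a per-metric summary (count, mean, standard deviation, quartiles) over the per-example scores. A minimal sketch of how such a summary can be produced, assuming the per-example feedback has been exported to a pandas DataFrame with the column names shown above (the values here are placeholders, not the real results):

```python
import pandas as pd

# Placeholder rows standing in for the exported per-example evaluation results;
# in practice this DataFrame is loaded from the evaluation run, one row per example.
results = pd.DataFrame(
    {
        "feedback.labeled_criteria:faithful": [0.9, 0.8, 1.0, 0.7],
        "feedback.cosine_similarity": [0.79, 0.77, 0.81, 0.76],
        "execution_time": [1.4, 1.2, 1.7, 1.5],
    }
)

summary = results.describe()  # count, mean, std, min, 25%, 50%, 75%, max per column
print(summary)
```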
SOTA Comparison
Aggregated differences in results compared with LangChain's SemanticChunker, broken down by summary statistic (mean, min, quartiles, max).
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 0.125303 | 0.155652 | -0.00757736 | 0.206762 | 0.126177 | 0.333367 | 0.00773695 | -0.0448571 | -0.0776622 | 0.228423 | 0.151071 | 0.343074 | -0.0618311 |
min | 0 | 0 | 0.00461873 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.193716 |
25% | 0.555556 | 0 | -0.00499138 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.0338315 |
50% | 0.125 | 0.25 | -0.00275206 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.00420015 |
75% | 0 | 0 | -0.0117252 | 0.230769 | 0.236842 | 0 | nan | nan | nan | 0.230769 | 0.236842 | 0 | -0.0840367 |
max | 0 | 0 | 0 | 0 | -0.0666667 | 0 | 0 | -0.0271226 | 0 | 0 | -0.0750751 | 0 | -0.417665 |
Faithfulness and Accuracy
- Our method demonstrates higher faithfulness (+12.5% mean improvement)
- Better accuracy scores (+15.6% mean improvement)
- Particularly strong in the lower quartile with +55.6% improvement in faithfulness
Semantic Similarity
- Slightly lower cosine similarity (-0.76% mean difference)
- Small range of variation (from -1.17% to +0.46%)
- Median difference of -0.28%
ROUGE Scores
- Significant improvements in ROUGE-1 metrics:
- F1 score: +20.7% better
- Precision: +12.6% improvement
- Recall: +33.3% higher
- Mixed results in ROUGE-2 scores:
- Marginal improvement in F1 (+0.77%)
- Decreased precision (-4.5%)
- Lower recall (-7.8%)
- Notable improvements in ROUGE-L metrics:
- F1 score: +22.8% better
- Precision: +15.1% higher
- Recall: +34.3% improvement
Performance
- Slightly faster execution times (-0.062s mean difference)
- Variable performance improvements:
- Best case: -0.42s improvement
- Some cases slightly slower (+0.03s in 25th percentile)
- Median case shows minimal difference (-0.004s)
Overall, our methodology shows substantial improvements over semantic chunking in faithfulness, accuracy, and most ROUGE metrics, particularly ROUGE-1 and ROUGE-L. While there is a slight decrease in semantic similarity and in ROUGE-2 precision and recall, execution times remain comparable, with slight improvements in most cases.
Baseline Comparison
Aggregated differences in results compared with LangChain's RecursiveCharacterTextSplitter, broken down by summary statistic (mean, min, quartiles, max).
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 0.118129 | 0.135316 | -0.00548876 | 0.131199 | 0.0417132 | 0.307893 | 0.0591786 | -0.00574848 | 0.0892132 | 0.171545 | 0.0941226 | 0.315888 | -0.446078 |
min | 1 | 0 | -0.018756 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.423493 |
25% | 0.333333 | 0 | -0.001981 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.378166 |
50% | 0.125 | 0 | -0.00224861 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.391275 |
75% | 0 | 0 | -0.00517427 | 0.200213 | 0.222222 | 0 | nan | nan | nan | 0.200213 | 0.222222 | 0 | -0.42135 |
max | 0 | 0 | 0.00623509 | 0 | 0 | 0 | 0.171875 | 0 | 0 | 0.166667 | 0 | 0 | -0.638781 |
Here is a brief summary comparing our method with LangChain's RecursiveCharacterTextSplitter:
Faithfulness and Accuracy
- Our method demonstrates better faithfulness (+11.8% mean improvement)
- Improved accuracy scores (+13.5% mean improvement)
- Notable improvements in lower quartiles, showing better minimum performance
Semantic Similarity
- Very slight decrease in cosine similarity (-0.55% mean difference)
- Minimal variation in differences (range: -1.88% to +0.62%)
- Most differences concentrated near the median (-0.22%)
ROUGE Scores
- Notable improvements in ROUGE-1 metrics:
- F1 score: +13.1% better
- Precision: +4.2% improvement
- Recall: +30.8% higher
- Better ROUGE-2 performance:
- F1 score: +5.9% improvement
- Slightly lower precision (-0.57%)
- Higher recall (+8.9%)
- Significant gains in ROUGE-L metrics:
- F1 score: +17.2% better
- Precision: +9.4% higher
- Recall: +31.6% improvement
Performance
- Substantially faster execution time (-0.45s mean difference)
- Very consistent performance improvements across all quartiles
- Maximum time savings of 0.64s in best cases
Overall, our methodology shows meaningful improvements over recursive chunking across most metrics, with particularly strong gains in faithfulness, accuracy, and ROUGE-L scores. The performance improvement is notably more significant compared to the semantic chunker comparison, with consistent time savings while maintaining better quality metrics.