
Evaluation

LangChain Evaluation Overview

LangChain provides evaluation modules to assess the quality of LLM outputs, particularly for RAG systems. The evaluation process typically involves comparing the generated response against reference answers or source documents.

Key Metrics Explained

Faithfulness

Measures how truthful or accurate the generated response is compared to the source documents. It checks if the LLM's response contains information that is actually present in the retrieved documents and doesn't include hallucinated facts.

Accuracy

A metric that measures whether the generated response is completely correct according to the reference answer. In RAG contexts, this often means checking if all key information from the source documents is preserved.
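Both faithfulness and accuracy are scored by an LLM judge. The sketch below uses LangChain's labeled-criteria evaluator to illustrate the idea; the judge model, criterion wording, and example strings are assumptions for illustration, not the exact configuration used in this evaluation.

```python
# Minimal sketch of LLM-judged criteria scoring with LangChain's
# labeled_criteria evaluator (judge model and criterion text are assumptions).
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed judge model

faithfulness_evaluator = load_evaluator(
    "labeled_criteria",
    criteria={
        "faithful": "Does the response state only facts supported by the reference documents?"
    },
    llm=judge,
)

result = faithfulness_evaluator.evaluate_strings(
    prediction="The warranty covers parts for two years.",   # generated answer
    reference="Warranty: parts are covered for two years.",  # retrieved source text
    input="How long are parts covered under warranty?",      # original question
)
print(result["score"], result["reasoning"])
```

An accuracy criterion can be scored the same way by swapping in a criterion that asks whether the response matches the reference answer.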

Cosine Similarity

Measures the semantic similarity between the generated response and the reference text by converting both into vector embeddings and computing the cosine of the angle between them. Values range from -1 to 1, where 1 indicates perfect similarity.
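A minimal sketch of this computation, assuming an OpenAI embedding model (any embedding model works the same way; the strings are illustrative):

```python
# Embed the prediction and reference, then compute cosine similarity.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # assumed model

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_vec, b_vec = np.asarray(a), np.asarray(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))

pred_vec, ref_vec = embeddings.embed_documents([
    "Parts are covered under warranty for two years.",
    "The warranty covers parts for a two-year period.",
])
print(cosine_similarity(pred_vec, ref_vec))
```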

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Metrics

ROUGE-L (RougeLsum)

  • Overall: Measures the longest common subsequence (LCS) shared by the generated and reference texts
  • Precision: LCS length divided by the number of words in the generated text
  • Recall: LCS length divided by the number of words in the reference text

ROUGE-1

  • Overall: Measures overlap of unigrams (single words)
  • Precision: Ratio of matching unigrams to total unigrams in generated text
  • Recall: Ratio of matching unigrams to total unigrams in reference text

ROUGE-2

  • Overall: Measures overlap of bigrams (pairs of consecutive words)
  • Precision: Ratio of matching bigrams to total bigrams in generated text
  • Recall: Ratio of matching bigrams to total bigrams in reference text
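
A hedged sketch of computing these scores with the rouge-score package (the backend, example strings, and stemming option are assumptions):

```python
# Compute ROUGE-1, ROUGE-2, and ROUGE-Lsum precision/recall/F1 for one pair.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
scores = scorer.score(
    target="The warranty covers parts for two years.",           # reference answer
    prediction="Parts are covered under warranty for two years.",
)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} "
          f"recall={score.recall:.3f} f1={score.fmeasure:.3f}")
```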

These metrics together provide a comprehensive view of RAG system performance, evaluating both factual accuracy and linguistic similarity to reference materials.

Below we present the results of our evaluation.

Supermat Evaluation Metrics

Results of the evaluation on the Supermat retriever.

| statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 490 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 1 | 492 |
| mean | 0.786327 | 0.853564 | 0.791328 | 0.0431129 | 0.0275313 | 0.424209 | 0.0142695 | 0.0107686 | 0.0355544 | 0.04223 | 0.0267061 | 0.423059 | nan | 1.4606 |
| std | 0.20909 | 0.221319 | 0.0358326 | 0.0937076 | 0.0747593 | 0.479402 | 0.0692836 | 0.0553136 | 0.164963 | 0.0901142 | 0.0702288 | 0.479115 | nan | 0.46089 |
| min | 0.2 | 0.1 | 0.717511 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 0.562514 |
| 25% | 0.8 | 0.7 | 0.773459 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.14542 |
| 50% | 0.9 | 1 | 0.787848 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.39943 |
| 75% | 0.9 | 1 | 0.802803 | 0.0533428 | 0.0277778 | 1 | 0 | 0 | 0 | 0.0533428 | 0.0277778 | 1 | nan | 1.71919 |
| max | 1 | 1 | 0.971989 | 0.666667 | 0.692308 | 1 | 0.625 | 0.588235 | 1 | 0.666667 | 0.634615 | 1 | nan | 3.4466 |
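
For reference, a summary table of this shape can be produced with pandas' describe() over per-example feedback scores. The sketch below assumes the results have been exported to a CSV; the file name is a placeholder.

```python
# Aggregate per-example feedback scores into count/mean/std/percentile rows.
import pandas as pd

results = pd.read_csv("supermat_eval_results.csv")  # hypothetical export
feedback_cols = [c for c in results.columns if c.startswith("feedback.")]
summary = results[feedback_cols + ["execution_time"]].describe()
print(summary.round(4).to_markdown())
```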

SOTA Comparison

Aggregated results: difference compared with LangChain's SemanticChunker, by percentile.

| statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.125303 | 0.155652 | -0.00757736 | 0.206762 | 0.126177 | 0.333367 | 0.00773695 | -0.0448571 | -0.0776622 | 0.228423 | 0.151071 | 0.343074 | -0.0618311 |
| min | 0 | 0 | 0.00461873 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.193716 |
| 25% | 0.555556 | 0 | -0.00499138 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.0338315 |
| 50% | 0.125 | 0.25 | -0.00275206 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.00420015 |
| 75% | 0 | 0 | -0.0117252 | 0.230769 | 0.236842 | 0 | nan | nan | nan | 0.230769 | 0.236842 | 0 | -0.0840367 |
| max | 0 | 0 | 0 | 0 | -0.0666667 | 0 | 0 | -0.0271226 | 0 | 0 | -0.0750751 | 0 | -0.417665 |
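
One way to build such a difference table is to subtract the baseline's aggregated statistics from Supermat's. The sketch below is illustrative only: the file names are placeholders, and this is not necessarily the exact procedure used to produce the table above.

```python
# Compare two experiments by differencing their aggregated feedback statistics.
import pandas as pd

supermat = pd.read_csv("supermat_eval_results.csv")            # hypothetical exports
baseline = pd.read_csv("semantic_chunker_eval_results.csv")

cols = [c for c in supermat.columns if c.startswith("feedback.")] + ["execution_time"]
diff = supermat[cols].describe() - baseline[cols].describe()
print(diff.loc[["mean", "min", "25%", "50%", "75%", "max"]].round(4).to_markdown())
```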

Faithfulness and Accuracy

  • Our method demonstrates higher faithfulness (+12.5% mean improvement)
  • Better accuracy scores (+15.6% mean improvement)
  • Particularly strong in the lower quartile with +55.6% improvement in faithfulness

Semantic Similarity

  • Slightly lower cosine similarity (-0.76% mean difference)
  • Small range of variation (from +0.46% to -1.17%)
  • Median difference of -0.28%

ROUGE Scores

  • Significant improvements in ROUGE-1 metrics:
      • F1 score: +20.7% better
      • Precision: +12.6% improvement
      • Recall: +33.3% higher
  • Mixed results in ROUGE-2 scores:
      • Marginal improvement in F1 (+0.77%)
      • Decreased precision (-4.5%)
      • Lower recall (-7.8%)
  • Notable improvements in ROUGE-L metrics:
      • F1 score: +22.8% better
      • Precision: +15.1% higher
      • Recall: +34.3% improvement

Performance

  • Slightly faster execution times (-0.062s mean difference)
  • Variable performance improvements:
      • Best case: -0.42s improvement
      • Some cases slightly slower (+0.03s in 25th percentile)
      • Median case shows minimal difference (-0.004s)

Overall, our methodology shows substantial improvements over semantic chunking in faithfulness, accuracy, and most ROUGE metrics, particularly in ROUGE-1 and ROUGE-L scores. While there's a slight decrease in semantic similarity and ROUGE-2 metrics, the performance times remain comparable with slight improvements in most cases.

Baseline Comparison

Aggregated results: difference compared with LangChain's RecursiveCharacterTextSplitter, by percentile.

| statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.118129 | 0.135316 | -0.00548876 | 0.131199 | 0.0417132 | 0.307893 | 0.0591786 | -0.00574848 | 0.0892132 | 0.171545 | 0.0941226 | 0.315888 | -0.446078 |
| min | 1 | 0 | -0.018756 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.423493 |
| 25% | 0.333333 | 0 | -0.001981 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.378166 |
| 50% | 0.125 | 0 | -0.00224861 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.391275 |
| 75% | 0 | 0 | -0.00517427 | 0.200213 | 0.222222 | 0 | nan | nan | nan | 0.200213 | 0.222222 | 0 | -0.42135 |
| max | 0 | 0 | 0.00623509 | 0 | 0 | 0 | 0.171875 | 0 | 0 | 0.166667 | 0 | 0 | -0.638781 |

Here's a brief summary comparing our method with LangChain's Recursive chunking:

Faithfulness and Accuracy

  • Our method demonstrates better faithfulness (+11.8% mean improvement)
  • Improved accuracy scores (+13.5% mean improvement)
  • Notable improvements in lower quartiles, showing better minimum performance

Semantic Similarity

  • Very slight decrease in cosine similarity (-0.55% mean difference)
  • Minimal variation in differences (range: -1.88% to +0.62%)
  • Most differences concentrated near the median (-0.22%)

ROUGE Scores

  • Notable improvements in ROUGE-1 metrics:
      • F1 score: +13.1% better
      • Precision: +4.2% improvement
      • Recall: +30.8% higher
  • Better ROUGE-2 performance:
      • F1 score: +5.9% improvement
      • Slightly lower precision (-0.57%)
      • Higher recall (+8.9%)
  • Significant gains in ROUGE-L metrics:
      • F1 score: +17.2% better
      • Precision: +9.4% higher
      • Recall: +31.6% improvement

Performance

  • Substantially faster execution time (-0.45s mean difference)
  • Very consistent performance improvements across all quartiles
  • Maximum time savings of 0.64s in best cases

Overall, our methodology shows meaningful improvements over recursive chunking across most metrics, with particularly strong gains in faithfulness, accuracy, and ROUGE-L scores. The performance improvement is notably more significant compared to the semantic chunker comparison, with consistent time savings while maintaining better quality metrics.