Evaluation
LangChain Evaluation Overview
LangChain provides evaluation modules to assess the quality of LLM outputs, particularly for RAG systems. The evaluation process typically involves comparing the generated response against reference answers or source documents.
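As a rough illustration, criterion-based scores like the "faithful" and "accuracy" feedback below can be produced with LangChain's built-in evaluators. The snippet is a minimal sketch: the criterion wording, judge model, and wiring are assumptions for illustration, not the exact configuration used for the tables in this section.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI  # any chat model can act as the judge; illustrative choice

judge_llm = ChatOpenAI(model="gpt-4o-mini")  # assumed judge model, not necessarily the one used here

# "labeled_criteria" grades a prediction against a reference using an LLM judge.
faithfulness_evaluator = load_evaluator(
    "labeled_criteria",
    criteria={
        "faithful": "Is the response fully supported by the reference documents, "
                    "with no hallucinated facts?"
    },
    llm=judge_llm,
)

result = faithfulness_evaluator.evaluate_strings(
    prediction="Generated answer from the RAG pipeline...",
    reference="Retrieved source passage(s)...",
    input="The original question...",
)
print(result)  # includes a score plus the judge's reasoning
```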
Key Metrics Explained
Faithfulness
Measures how well the generated response is grounded in the source documents: it checks that the LLM's response only contains information actually present in the retrieved documents and does not introduce hallucinated facts.
Accuracy
Measures whether the generated response is correct with respect to the reference answer. In RAG contexts, this typically means checking that the key information from the source documents is preserved in the answer.
Cosine Similarity
Measures the semantic similarity between the generated response and the reference text by converting both into vector embeddings and computing the cosine of the angle between them. Values range from -1 to 1, where 1 indicates maximum similarity.
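For two embedding vectors a and b the score is a·b / (|a| |b|). A minimal sketch (the vectors here are toy values, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration.
generated = np.array([0.9, 0.1, 0.3])
reference = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(generated, reference))  # close to 1.0 => semantically similar
```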
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Metrics
ROUGE-L (RougeLsum)
- Overall: Measures the longest common subsequence (LCS) between the generated and reference texts
- Precision: LCS length divided by the number of words in the generated text
- Recall: LCS length divided by the number of words in the reference text
ROUGE-1
- Overall: Measures overlap of unigrams (single words)
- Precision: Ratio of matching unigrams to total unigrams in generated text
- Recall: Ratio of matching unigrams to total unigrams in reference text
ROUGE-2
- Overall: Measures overlap of bigrams (pairs of consecutive words)
- Precision: Ratio of matching bigrams to total bigrams in generated text
- Recall: Ratio of matching bigrams to total bigrams in reference text
These metrics together provide a comprehensive view of RAG system performance, evaluating both factual accuracy and linguistic similarity to reference materials.
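To make the ROUGE definitions concrete, here is a minimal sketch using the rouge-score package; the example strings are illustrative, and the scoring pipeline behind the tables below may be wired differently.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

reference = "The retriever returns the most relevant passages from the indexed documents."
prediction = "The retriever returns relevant passages from indexed documents."

# score(target, prediction) returns precision, recall and F-measure per ROUGE type,
# matching the precision / recall / f1_score columns reported below.
for name, score in scorer.score(reference, prediction).items():
    print(name, round(score.precision, 3), round(score.recall, 3), round(score.fmeasure, 3))
```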
Below we present the results of our evaluation.
Supermat Evaluation Metrics
Summary statistics from the evaluation of the Supermat retriever.
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | error | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 490 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 491 | 1 | 492 |
mean | 0.786327 | 0.853564 | 0.791328 | 0.0431129 | 0.0275313 | 0.424209 | 0.0142695 | 0.0107686 | 0.0355544 | 0.04223 | 0.0267061 | 0.423059 | nan | 1.4606 |
std | 0.20909 | 0.221319 | 0.0358326 | 0.0937076 | 0.0747593 | 0.479402 | 0.0692836 | 0.0553136 | 0.164963 | 0.0901142 | 0.0702288 | 0.479115 | nan | 0.46089 |
min | 0.2 | 0.1 | 0.717511 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 0.562514 |
25% | 0.8 | 0.7 | 0.773459 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.14542 |
50% | 0.9 | 1 | 0.787848 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 1.39943 |
75% | 0.9 | 1 | 0.802803 | 0.0533428 | 0.0277778 | 1 | 0 | 0 | 0 | 0.0533428 | 0.0277778 | 1 | nan | 1.71919 |
max | 1 | 1 | 0.971989 | 0.666667 | 0.692308 | 1 | 0.625 | 0.588235 | 1 | 0.666667 | 0.634615 | 1 | nan | 3.4466 |
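The table above is a per-metric summary (count, mean, standard deviation, quartiles) over the per-example scores. A minimal sketch of how such a summary can be produced, assuming the per-example feedback has been exported to a pandas DataFrame with the column names shown above (the values here are placeholders, not the real results):

```python
import pandas as pd

# Placeholder rows standing in for the exported per-example evaluation results;
# in practice this DataFrame is loaded from the evaluation run, one row per example.
results = pd.DataFrame(
    {
        "feedback.labeled_criteria:faithful": [0.9, 0.8, 1.0, 0.7],
        "feedback.cosine_similarity": [0.79, 0.77, 0.81, 0.76],
        "execution_time": [1.4, 1.2, 1.7, 1.5],
    }
)

summary = results.describe()  # count, mean, std, min, 25%, 50%, 75%, max per column
print(summary)
```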
SOTA Comparison
Aggregated differences in results compared with LangChain's SemanticChunker, broken down by summary statistic (mean, min, quartiles, max).
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 0.125303 | 0.155652 | -0.00757736 | 0.206762 | 0.126177 | 0.333367 | 0.00773695 | -0.0448571 | -0.0776622 | 0.228423 | 0.151071 | 0.343074 | -0.0618311 |
min | 0 | 0 | 0.00461873 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.193716 |
25% | 0.555556 | 0 | -0.00499138 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.0338315 |
50% | 0.125 | 0.25 | -0.00275206 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.00420015 |
75% | 0 | 0 | -0.0117252 | 0.230769 | 0.236842 | 0 | nan | nan | nan | 0.230769 | 0.236842 | 0 | -0.0840367 |
max | 0 | 0 | 0 | 0 | -0.0666667 | 0 | 0 | -0.0271226 | 0 | 0 | -0.0750751 | 0 | -0.417665 |
Faithfulness and Accuracy
- Our method demonstrates higher faithfulness (+12.5% mean improvement)
- Better accuracy scores (+15.6% mean improvement)
- Particularly strong in the lower quartile with +55.6% improvement in faithfulness
Semantic Similarity
- Slightly lower cosine similarity (-0.76% mean difference)
- Small range of variation (from -1.17% to +0.46%)
- Median difference of -0.28%
ROUGE Scores
- Significant improvements in ROUGE-1 metrics:
- F1 score: +20.7% better
- Precision: +12.6% improvement
- Recall: +33.3% higher
- Mixed results in ROUGE-2 scores:
- Marginal improvement in F1 (+0.77%)
- Decreased precision (-4.5%)
- Lower recall (-7.8%)
- Notable improvements in ROUGE-L metrics:
- F1 score: +22.8% better
- Precision: +15.1% higher
- Recall: +34.3% improvement
Performance
- Slightly faster execution times (-0.062s mean difference)
- Variable performance improvements:
- Best case: -0.42s improvement
- Some cases slightly slower (+0.03s in 25th percentile)
- Median case shows minimal difference (-0.004s)
Overall, our methodology shows substantial improvements over semantic chunking in faithfulness, accuracy, and most ROUGE metrics, particularly ROUGE-1 and ROUGE-L. While there is a slight decrease in semantic similarity and in ROUGE-2 precision and recall, execution times remain comparable, with slight improvements in most cases.
Baseline Comparison
Aggregated differences in results compared with LangChain's RecursiveCharacterTextSplitter, broken down by summary statistic (mean, min, quartiles, max).
statistic | feedback.labeled_criteria:faithful | feedback.labeled_criteria:accuracy | feedback.cosine_similarity | feedback.rouge1_f1_score | feedback.rouge1_precision | feedback.rouge1_recall | feedback.rouge2_f1_score | feedback.rouge2_precision | feedback.rouge2_recall | feedback.rougeLsum_f1_score | feedback.rougeLsum_precision | feedback.rougeLsum_recall | execution_time (s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 0.118129 | 0.135316 | -0.00548876 | 0.131199 | 0.0417132 | 0.307893 | 0.0591786 | -0.00574848 | 0.0892132 | 0.171545 | 0.0941226 | 0.315888 | -0.446078 |
min | 1 | 0 | -0.018756 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.423493 |
25% | 0.333333 | 0 | -0.001981 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.378166 |
50% | 0.125 | 0 | -0.00224861 | nan | nan | nan | nan | nan | nan | nan | nan | nan | -0.391275 |
75% | 0 | 0 | -0.00517427 | 0.200213 | 0.222222 | 0 | nan | nan | nan | 0.200213 | 0.222222 | 0 | -0.42135 |
max | 0 | 0 | 0.00623509 | 0 | 0 | 0 | 0.171875 | 0 | 0 | 0.166667 | 0 | 0 | -0.638781 |
Here is a brief summary comparing our method with LangChain's RecursiveCharacterTextSplitter:
Faithfulness and Accuracy
- Our method demonstrates better faithfulness (+11.8% mean improvement)
- Improved accuracy scores (+13.5% mean improvement)
- Notable improvements in lower quartiles, showing better minimum performance
Semantic Similarity
- Very slight decrease in cosine similarity (-0.55% mean difference)
- Minimal variation in differences (range: -1.88% to +0.62%)
- Most differences concentrated near the median (-0.22%)
ROUGE Scores
- Notable improvements in ROUGE-1 metrics:
- F1 score: +13.1% better
- Precision: +4.2% improvement
- Recall: +30.8% higher
- Better ROUGE-2 performance:
- F1 score: +5.9% improvement
- Slightly lower precision (-0.57%)
- Higher recall (+8.9%)
- Significant gains in ROUGE-L metrics:
- F1 score: +17.2% better
- Precision: +9.4% higher
- Recall: +31.6% improvement
Performance
- Substantially faster execution time (-0.45s mean difference)
- Very consistent performance improvements across all quartiles
- Maximum time savings of 0.64s in best cases
Overall, our methodology shows meaningful improvements over recursive chunking across most metrics, with particularly strong gains in faithfulness, accuracy, and ROUGE-L scores. The performance improvement is notably more significant compared to the semantic chunker comparison, with consistent time savings while maintaining better quality metrics.