Detecting Hallucinations: A Python Script to Score RAG Accuracy

Even with Retrieval-Augmented Generation (RAG) technology, Large Language Models (LLMs) continue to generate hallucinations: factually incorrect responses that appear plausible but contradict the provided context. This remains one of the most critical challenges in enterprise AI deployments.

⚠️ Real-World Impact: A major airline lost a court case after its RAG system hallucinated details about its refund policy, demonstrating the serious consequences of undetected hallucinations in production systems.

Why Do Hallucinations Occur?

  • LLM Brittleness: Even when context contains correct answers, models may fail to synthesize information accurately, especially across multiple facts
  • Poor Retrieval: Incomplete context from suboptimal search or poor document chunking can cause models to fill gaps with invented information
  • Knowledge Conflicts: When retrieved context contradicts the model’s training data, it may prioritize training knowledge over provided facts
  • Complex Reasoning: Models struggle with reasoning tasks that require multiple logical steps across different context sections

The solution is to implement automated hallucination detection that can flag untrustworthy responses in real-time, allowing for human review or alternative retrieval strategies.

Types of Hallucinations

Understanding the different types of hallucinations helps select appropriate detection methods:

🚫 Context-Conflicting

Response directly contradicts the provided context or adds facts not supported by retrieved documents

❓ Irrelevant

Response doesn’t address the user’s question or is semantically unrelated to the query

⚠️ Partially Correct

Some parts are accurate while others are fabricated or misleading, making detection more challenging

Most production hallucination detection systems focus on context-conflicting hallucinations, which are the most dangerous because they provide false information grounded in real documents.
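
To make the categories concrete, here is a small, purely illustrative labeled set (the context and answers are invented for this article, not drawn from a benchmark); it can also double as quick test cases for the detectors implemented below.

# Illustrative examples of the three hallucination types (hypothetical data)
CONTEXT = "Acme Corp was founded in 2001 and sells industrial sensors in Europe."
QUESTION = "When was Acme Corp founded and what does it sell?"

EXAMPLES = [
    {   # Context-conflicting: contradicts or adds facts absent from the context
        "type": "context_conflicting",
        "answer": "Acme Corp was founded in 1995 and sells consumer drones.",
    },
    {   # Irrelevant: does not address the question at all
        "type": "irrelevant",
        "answer": "The weather in Berlin is usually mild in spring.",
    },
    {   # Partially correct: one accurate fact plus one fabricated detail
        "type": "partially_correct",
        "answer": "Acme Corp was founded in 2001 and is the market leader in Asia.",
    },
]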

Four Primary Detection Methods

Method 1: LLM Prompt-Based Detection 🤖

Use another LLM instance to evaluate whether the answer is grounded in the context.

How It Works:

  1. Send the context, question, and answer to an evaluator LLM
  2. Provide few-shot examples showing what grounded vs. hallucinated responses look like
  3. Request a hallucination score between 0.0 (grounded) and 1.0 (hallucinated)
  4. Apply a threshold to classify responses (e.g., >0.5 = hallucination)

Python Implementation:

from langchain.prompts import PromptTemplate
from langchain.llms import Ollama

# Hallucination detection prompt
HALLUCINATION_PROMPT = """
You are an expert at detecting hallucinations in LLM responses.

Your task: Determine if the statement is directly supported by the context.
- Score 0.0 if confident the statement is grounded in context
- Score 1.0 if confident the statement contradicts or is absent from context
- Score between 0.0-1.0 if uncertain

Examples:
Context: AWS provides cloud computing services with EC2 for virtual machines.
Statement: "AWS offers EC2 for virtual computing"
Score: 0.05 (directly supported)

Context: AWS provides cloud computing services with EC2 for virtual machines.
Statement: "AWS revenue in 2024 was $100 billion"
Score: 1.0 (not in context)

Context: {context}
Statement: {statement}
Response: [score only, no explanation]
"""

def detect_hallucination_llm(context: str, statement: str, llm) -> float:
    """Detect hallucination using LLM evaluation"""
    prompt = PromptTemplate(
        template=HALLUCINATION_PROMPT,
        input_variables=["context", "statement"]
    )
    
    response = llm(prompt.format(context=context, statement=statement))
    
    try:
        return float(response.strip())
    except ValueError:
        return 0.5  # Return neutral score if parsing fails

# Usage
llm = Ollama(model="llama3.1:8b")
context = "The Earth orbits the Sun in approximately 365.25 days."
statement = "It takes Earth 365 days to orbit the Sun"
score = detect_hallucination_llm(context, statement, llm)
print(f"Hallucination Score: {score:.2f}")  # Expected: ~0.1

Method 2: Semantic Similarity Detection 📊

Compare embeddings of context and answer using cosine similarity.

How It Works:

  1. Generate embeddings for context and answer using an embedding model
  2. Calculate cosine similarity between them
  3. Convert similarity to hallucination score (1 – similarity)
  4. Low similarity = likely hallucination

Python Implementation:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OllamaEmbeddings

def detect_hallucination_semantic(context: str, statement: str, embeddings) -> float:
    """
    Detect hallucination using semantic similarity.
    
    Args:
        context: Retrieved document context
        statement: Generated answer to evaluate
        embeddings: Embedding model instance
        
    Returns:
        Hallucination score (0.0-1.0)
    """
    if not context or not statement:
        return 0.0
    
    # Generate embeddings
    context_emb = embeddings.embed_query(context)
    statement_emb = embeddings.embed_query(statement)
    
    # Reshape for sklearn
    context_emb = np.array(context_emb).reshape(1, -1)
    statement_emb = np.array(statement_emb).reshape(1, -1)
    
    # Calculate cosine similarity
    sim_score = cosine_similarity(context_emb, statement_emb)[0][0]
    
    # Convert to hallucination score (1 - similarity)
    hallucination_score = 1 - sim_score
    
    return float(hallucination_score)

# Usage
embeddings = OllamaEmbeddings(model="nomic-embed-text")
context = "Machine learning is a subset of artificial intelligence."
statement = "ML is a subset of AI"
score = detect_hallucination_semantic(context, statement, embeddings)
print(f"Hallucination Score: {score:.2f}")  # Expected: ~0.05

✅ Pros: Fast execution, easy to understand, high precision (90%)

❌ Cons: Only 48% accuracy, very low recall (2%), misses partial hallucinations

Method 3: BERT Stochastic Checker 🎲

Generate multiple responses and check consistency using BERT scores.

How It Works:

  1. Generate N multiple responses from the same LLM using temperature sampling
  2. Compare original response against all N stochastic samples
  3. Calculate BERT F1 scores for semantic similarity
  4. Low variance across samples = factual, high variance = hallucinated

Python Implementation:

from bert_score import score as bert_score_compute
import numpy as np

def detect_hallucination_bert_stochastic(
    original_response: str,
    stochastic_samples: list,
    model_type: str = "distilbert-base-uncased"
) -> float:
    """
    Detect hallucination using BERT stochastic consistency checking.
    
    Args:
        original_response: The main answer to verify
        stochastic_samples: List of N alternative responses from same model
        model_type: BERT model for scoring
        
    Returns:
        Hallucination score (0.0-1.0, higher = more likely hallucinated)
    """
    if len(stochastic_samples) == 0:
        return 0.5
    
    f1_scores = []
    
    # Compare original against each stochastic sample
    for sample in stochastic_samples:
        try:
            _, _, f1 = bert_score_compute(
                [original_response],
                [sample],
                model_type=model_type,
                verbose=False
            )
            f1_scores.append(f1.item())
        except Exception:
            # Fall back to a neutral score if BERTScore fails for this pair
            f1_scores.append(0.5)
    
    # High variance = hallucination, low variance = factual
    mean_f1 = np.mean(f1_scores)
    std_f1 = np.std(f1_scores)
    
    # Convert to hallucination score
    # Low mean or high variance indicates hallucination
    hallucination_score = (1 - mean_f1) + (std_f1 * 0.1)
    
    return min(1.0, max(0.0, hallucination_score))

# Usage
from langchain.llms import Ollama

llm = Ollama(model="llama3.1:8b")
context = "Python is a programming language known for readability"
question = "What is Python known for?"

# Generate original response
original = llm(f"Context: {context}\nQuestion: {question}\nAnswer: ")

# Generate stochastic samples with temperature-based sampling for diversity
sampler = Ollama(model="llama3.1:8b", temperature=0.9)
stochastic_samples = [
    sampler(f"Context: {context}\nQuestion: {question}\nAnswer: ")
    for _ in range(5)
]

score = detect_hallucination_bert_stochastic(original, stochastic_samples)
print(f"Hallucination Score: {score:.2f}")  # Lower = more hallucinated

Method 4: Token Similarity Detection 🔀

Compare token overlap and BLEU scores between context and response.

Python Implementation:

import re
from nltk.translate.bleu_score import sentence_bleu

def detect_hallucination_token_similarity(
    context: str,
    statement: str,
    stopwords: set = None
) -> dict:
    """
    Detect hallucination using token-level similarity metrics.
    
    Args:
        context: Retrieved context
        statement: Generated answer
        stopwords: Set of words to ignore (e.g., 'the', 'a', 'is')
        
    Returns:
        Dictionary with intersection and BLEU scores
    """
    if stopwords is None:
        stopwords = {
            'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at',
            'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were'
        }
    
    # Clean and tokenize
    context_clean = re.sub(r'[^\w\s]', '', context).lower()
    statement_clean = re.sub(r'[^\w\s]', '', statement).lower()
    
    context_tokens = [t for t in context_clean.split() if t not in stopwords]
    statement_tokens = [t for t in statement_clean.split() if t not in stopwords]
    
    if not statement_tokens:
        # Nothing left to compare after stopword removal; return neutral scores
        return {"intersection_score": 0.5, "bleu_score": 0.5, "combined_score": 0.5}
    
    # Calculate token intersection
    context_set = set(context_tokens)
    statement_set = set(statement_tokens)
    intersection = len(statement_set & context_set) / len(statement_set)
    
    # Calculate BLEU score
    bleu = sentence_bleu(
        [context_tokens],
        statement_tokens,
        weights=(0.25, 0.25, 0.25, 0.25)
    )
    
    # Return as hallucination scores (1 - similarity)
    return {
        "intersection_score": 1 - intersection,  # 1 = hallucinated
        "bleu_score": 1 - bleu,
        "combined_score": (1 - intersection + 1 - bleu) / 2
    }

# Usage
context = "Python supports object-oriented and functional programming paradigms."
statement = "Python is a programming language"
scores = detect_hallucination_token_similarity(context, statement)
print(f"Scores: {scores}")
# Expected: low intersection_score (strong token overlap); BLEU is stricter, so bleu_score stays high for short answers

✅ Pros: Very high precision (96%), no cost, instant execution

❌ Cons: Very low recall (3%), misses sophisticated hallucinations, surface-level only

Comparing Detection Methods

Method Performance Metrics

Comparison of Hallucination Detection Methods

Method              | Accuracy | Precision | Recall | Cost     | Best For
Token Similarity    | 47%      | 96% ⭐    | 3%     | Zero     | Quick filtering of obvious hallucinations
Semantic Similarity | 48%      | 90%       | 2%     | Low      | Fast approximate detection
LLM Prompt-Based    | 75% ⭐   | 94%       | 53%    | Moderate | ⭐ Balanced production systems
BERT Stochastic     | 76% ⭐   | 72%       | 90% ⭐ | High     | Critical systems where recall is essential

📊 Key Insights:

  • Best Accuracy: BERT Stochastic (76%) and LLM Prompt-Based (75%) are statistically equivalent
  • Best Precision: Token Similarity (96%) catches only the most obvious hallucinations
  • Best Recall: BERT Stochastic (90%) detects subtle hallucinations others miss
  • Best Balance: LLM Prompt-Based offers optimal accuracy-to-cost tradeoff
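
To see how the four methods disagree on a single example, a small comparison harness can run them side by side. This is a sketch that assumes the four detector functions defined earlier in this article, plus the Ollama llm and embeddings instances, are available in the same module:

def compare_methods(context: str, question: str, answer: str,
                    llm, embeddings, n_samples: int = 5) -> dict:
    """Run all four detectors on one context/answer pair and collect their scores."""
    # Stochastic samples for the BERT consistency check
    samples = [
        llm(f"Context: {context}\nQuestion: {question}\nAnswer: ")
        for _ in range(n_samples)
    ]
    return {
        "token_similarity": detect_hallucination_token_similarity(context, answer)["combined_score"],
        "semantic_similarity": detect_hallucination_semantic(context, answer, embeddings),
        "llm_prompt_based": detect_hallucination_llm(context, answer, llm),
        "bert_stochastic": detect_hallucination_bert_stochastic(answer, samples),
    }

# Usage (any context/question/answer triple works)
context = "The Eiffel Tower is located in Paris and was completed in 1889."
question = "Where is the Eiffel Tower located?"
answer = "The Eiffel Tower is in Paris."
for method, s in compare_methods(context, question, answer, llm, embeddings).items():
    print(f"{method:>20}: {s:.2f}")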

Production Implementation Strategy

Hybrid Approach: Combining Methods

Rather than choosing a single method, production systems should use a cascading pipeline that combines speed with accuracy:

class HallucinationDetector:
    """Production-grade hallucination detection with cascading methods"""
    
    def __init__(self, llm, embeddings):
        self.llm = llm
        self.embeddings = embeddings
        self.token_threshold = 0.5
        self.semantic_threshold = 0.4
        self.llm_threshold = 0.6
    
    def detect_hallucination(self, context: str, question: str, answer: str) -> dict:
        """
        Cascade detection: fast methods first, expensive methods if needed
        
        Returns:
            {
                'hallucination_probability': float (0-1),
                'confidence': float (0-1),
                'method_used': str,
                'details': dict
            }
        """
        
        # Stage 1: Token-level check (zero cost, instant)
        token_score = self._token_check(context, answer)
        if token_score["combined_score"] < 0.2: # Obviously grounded return { 'hallucination_probability': 0.05, 'confidence': 0.95, 'method_used': 'token_similarity', 'details': token_score } # Stage 2: Semantic similarity (fast, embedding-based) if token_score["combined_score"] > 0.7:  # Likely hallucinated
            semantic_score = self._semantic_check(context, answer)
            if semantic_score > 0.6:
                return {
                    'hallucination_probability': 0.85,
                    'confidence': 0.80,
                    'method_used': 'semantic_similarity',
                    'details': semantic_score
                }
        
        # Stage 3: LLM-based evaluation (more cost, better accuracy)
        llm_score = self._llm_check(context, answer)
        
        return {
            'hallucination_probability': llm_score,
            'confidence': 0.90,
            'method_used': 'llm_prompt_based',
            'details': llm_score
        }
    
    def _token_check(self, context, answer) -> dict:
        # Reuse the token-similarity detector defined earlier in this article
        return detect_hallucination_token_similarity(context, answer)
    
    def _semantic_check(self, context, answer) -> float:
        # Reuse the semantic-similarity detector, which needs the embedding model
        return detect_hallucination_semantic(context, answer, self.embeddings)
    
    def _llm_check(self, context, answer) -> float:
        # Reuse the LLM prompt-based detector, which needs the evaluator LLM
        return detect_hallucination_llm(context, answer, self.llm)

# Usage in RAG pipeline (inside your request-handling function)
detector = HallucinationDetector(llm, embeddings)

# After generating a response
response = rag_system.query(user_question)

# Check for hallucinations
hallucination_check = detector.detect_hallucination(
    context=response.context,
    question=user_question,
    answer=response.answer
)

if hallucination_check['hallucination_probability'] > 0.7:
    # Flag for human review
    flag_for_review(response, hallucination_check)
elif hallucination_check['hallucination_probability'] > 0.5:
    # Request additional retrieval
    response = rag_system.query_with_expanded_search(user_question)
else:
    # Confident - return to user
    return response

Threshold Tuning

The optimal flagging threshold depends on your use case: a lower threshold catches more hallucinations but sends more grounded answers to review, so high-stakes applications typically accept the extra review load while low-risk applications raise the threshold to keep it manageable. A practical way to pick a starting value is shown in the sketch below.
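
The sketch sweeps candidate thresholds over a small human-labeled sample (the scores and labels here are hypothetical) and keeps the value with the best F1; swap in recall if missed hallucinations are the costlier error:

import numpy as np

def tune_threshold(scores, labels, candidates=None):
    """Pick the flagging threshold that maximizes F1 on a labeled sample."""
    if candidates is None:
        candidates = np.arange(0.1, 0.95, 0.05)
    best_threshold, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [s > t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = float(t), f1
    return best_threshold, best_f1

# Hypothetical detector scores paired with human verdicts (True = hallucination)
scores = [0.10, 0.82, 0.41, 0.93, 0.35, 0.71]
labels = [False, True, False, True, False, True]
threshold, f1 = tune_threshold(scores, labels)
print(f"Chosen threshold: {threshold:.2f} (F1 = {f1:.2f})")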

Monitoring and Feedback Loops

from collections import Counter
from datetime import datetime
import numpy as np

class HallucinationMonitor:
    """Track hallucination detection performance in production"""
    
    def __init__(self):
        self.detections = []
        self.human_reviews = []
    
    def log_detection(self, context, question, answer, detection_result):
        """Log hallucination detection for analysis"""
        self.detections.append({
            'timestamp': datetime.now(),
            'context_length': len(context),
            'question': question,
            'answer': answer,
            'hallucination_score': detection_result['hallucination_probability'],
            'method': detection_result['method_used']
        })
        return len(self.detections) - 1  # detection_id, referenced by log_human_feedback
    
    def log_human_feedback(self, detection_id, was_hallucination: bool):
        """Record human review results to improve threshold"""
        # Use this to adjust thresholds based on real feedback
        pass
    
    def analyze_false_positives(self) -> dict:
        """Identify patterns in false positive hallucination detections"""
        # Helps tune thresholds and improve detection
        pass
    
    def get_metrics(self) -> dict:
        """Calculate real-world precision/recall in production"""
        if not self.detections:
            return {'total_detections': 0}
        return {
            'total_detections': len(self.detections),
            'avg_hallucination_score': np.mean([d['hallucination_score'] for d in self.detections]),
            'detection_rate': len([d for d in self.detections if d['hallucination_score'] > 0.6]) / len(self.detections),
            'methods_used': Counter([d['method'] for d in self.detections])
        }

Building Trustworthy RAG Systems

✅ Key Takeaways:

  • No single perfect method: Each detection technique has tradeoffs between accuracy, cost, and latency
  • Cascading pipelines win: Combining fast, cheap methods with expensive accurate ones optimizes production performance
  • Context matters: High-stakes applications justify higher computational cost for better recall
  • Monitoring is essential: Real-world feedback loops improve detection thresholds and catch edge cases
  • Hybrid approaches scale: Semantic + LLM methods provide 75%+ accuracy at manageable cost

Implementation Roadmap

  1. Week 1: Implement token-based detection as a baseline filter
  2. Week 2-3: Add semantic similarity for faster approximate detection
  3. Week 4-5: Integrate LLM-based evaluator for better accuracy
  4. Week 6: Deploy cascading pipeline with appropriate thresholds
  5. Ongoing: Monitor metrics and adjust thresholds based on human feedback

🎯 Next Steps:

  • Evaluate which detection method fits your latency requirements
  • Start with LLM prompt-based method for best accuracy-to-cost ratio
  • Build monitoring infrastructure to track false positives/negatives
  • Establish human feedback loops for continuous improvement
  • Consider combining BERT stochastic for critical applications
