Detecting Hallucinations: A Python Script to Score RAG Accuracy

Even with Retrieval-Augmented Generation (RAG) technology, Large Language Models (LLMs) continue to generate hallucinations: factually incorrect responses that appear plausible but contradict the provided context. This remains one of the most critical challenges in enterprise AI deployments.

⚠️ Real-World Impact: A major airline lost a court case after its RAG system hallucinated details about its refund policy, demonstrating the serious consequences of undetected hallucinations in production systems.

Why Do Hallucinations Occur?

  • LLM Brittleness: Even when context contains correct answers, models may fail to synthesize information accurately, especially across multiple facts
  • Poor Retrieval: Incomplete context from suboptimal search or poor document chunking can cause models to fill gaps with invented information
  • Knowledge Conflicts: When retrieved context contradicts the model’s training data, it may prioritize training knowledge over provided facts
  • Complex Reasoning: Models struggle with reasoning tasks that require multiple logical steps across different context sections

The solution is to implement automated hallucination detection that can flag untrustworthy responses in real-time, allowing for human review or alternative retrieval strategies.

Types of Hallucinations

Understanding the different types of hallucinations helps select appropriate detection methods:

🚫 Context-Conflicting

Response directly contradicts the provided context or adds facts not supported by retrieved documents

❓ Irrelevant

Response doesn’t address the user’s question or is semantically unrelated to the query

⚠️ Partially Correct

Some parts are accurate while others are fabricated or misleading, making detection more challenging

Most production hallucination detection systems focus on context-conflicting hallucinations, which are the most dangerous because they provide false information grounded in real documents.
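
To make the categories concrete, here is a small, purely illustrative labeled set (the context and answers are invented for this article, not drawn from a benchmark); it can also double as quick test cases for the detectors implemented below.

# Illustrative examples of the three hallucination types (hypothetical data)
CONTEXT = "Acme Corp was founded in 2001 and sells industrial sensors in Europe."
QUESTION = "When was Acme Corp founded and what does it sell?"

EXAMPLES = [
    {   # Context-conflicting: contradicts or adds facts absent from the context
        "type": "context_conflicting",
        "answer": "Acme Corp was founded in 1995 and sells consumer drones.",
    },
    {   # Irrelevant: does not address the question at all
        "type": "irrelevant",
        "answer": "The weather in Berlin is usually mild in spring.",
    },
    {   # Partially correct: one accurate fact plus one fabricated detail
        "type": "partially_correct",
        "answer": "Acme Corp was founded in 2001 and is the market leader in Asia.",
    },
]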

Four Primary Detection Methods

Method 1: LLM Prompt-Based Detection 🤖

Use another LLM instance to evaluate whether the answer is grounded in the context.

How It Works:

  1. Send the context, question, and answer to an evaluator LLM
  2. Provide few-shot examples showing what grounded vs. hallucinated responses look like
  3. Request a hallucination score between 0.0 (grounded) and 1.0 (hallucinated)
  4. Apply a threshold to classify responses (e.g., >0.5 = hallucination)

Python Implementation:

from langchain.prompts import PromptTemplate
from langchain.llms import Ollama

# Hallucination detection prompt
HALLUCINATION_PROMPT = """
You are an expert at detecting hallucinations in LLM responses.

Your task: Determine if the statement is directly supported by the context.
- Score 0.0 if confident the statement is grounded in context
- Score 1.0 if confident the statement contradicts or is absent from context
- Score between 0.0-1.0 if uncertain

Examples:
Context: AWS provides cloud computing services with EC2 for virtual machines.
Statement: "AWS offers EC2 for virtual computing"
Score: 0.05 (directly supported)

Context: AWS provides cloud computing services with EC2 for virtual machines.
Statement: "AWS revenue in 2024 was $100 billion"
Score: 1.0 (not in context)

Context: {context}
Statement: {statement}
Response: [score only, no explanation]
"""

def detect_hallucination_llm(context: str, statement: str, llm) -> float:
    """Detect hallucination using LLM evaluation"""
    prompt = PromptTemplate(
        template=HALLUCINATION_PROMPT,
        input_variables=["context", "statement"]
    )
    
    response = llm(prompt.format(context=context, statement=statement))
    
    try:
        return float(response.strip())
    except ValueError:
        return 0.5  # Return neutral score if parsing fails

# Usage
llm = Ollama(model="llama3.1:8b")
context = "The Earth orbits the Sun in approximately 365.25 days."
statement = "It takes Earth 365 days to orbit the Sun"
score = detect_hallucination_llm(context, statement, llm)
print(f"Hallucination Score: {score:.2f}")  # Expected: ~0.1

Method 2: Semantic Similarity Detection 📊

Compare embeddings of context and answer using cosine similarity.

How It Works:

  1. Generate embeddings for context and answer using an embedding model
  2. Calculate cosine similarity between them
  3. Convert similarity to hallucination score (1 – similarity)
  4. Low similarity = likely hallucination

Python Implementation:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OllamaEmbeddings

def detect_hallucination_semantic(context: str, statement: str, embeddings) -> float:
    """
    Detect hallucination using semantic similarity.
    
    Args:
        context: Retrieved document context
        statement: Generated answer to evaluate
        embeddings: Embedding model instance
        
    Returns:
        Hallucination score (0.0-1.0)
    """
    if not context or not statement:
        return 0.0
    
    # Generate embeddings
    context_emb = embeddings.embed_query(context)
    statement_emb = embeddings.embed_query(statement)
    
    # Reshape for sklearn
    context_emb = np.array(context_emb).reshape(1, -1)
    statement_emb = np.array(statement_emb).reshape(1, -1)
    
    # Calculate cosine similarity
    sim_score = cosine_similarity(context_emb, statement_emb)[0][0]
    
    # Convert to hallucination score (1 - similarity)
    hallucination_score = 1 - sim_score
    
    return float(hallucination_score)

# Usage
embeddings = OllamaEmbeddings(model="nomic-embed-text")
context = "Machine learning is a subset of artificial intelligence."
statement = "ML is a subset of AI"
score = detect_hallucination_semantic(context, statement, embeddings)
print(f"Hallucination Score: {score:.2f}")  # Expected: ~0.05

✅ Pros: Fast execution, easy to understand, high precision (90%)

❌ Cons: Only 48% accuracy, very low recall (2%), misses partial hallucinations

Method 3: BERT Stochastic Checker 🎲

Generate multiple responses and check consistency using BERT scores.

How It Works:

  1. Generate N multiple responses from the same LLM using temperature sampling
  2. Compare original response against all N stochastic samples
  3. Calculate BERT F1 scores for semantic similarity
  4. Low variance across samples = factual, high variance = hallucinated

Python Implementation:

from bert_score import score as bert_score_compute
import numpy as np

def detect_hallucination_bert_stochastic(
    original_response: str,
    stochastic_samples: list,
    model_type: str = "distilbert-base-uncased"
) -> float:
    """
    Detect hallucination using BERT stochastic consistency checking.
    
    Args:
        original_response: The main answer to verify
        stochastic_samples: List of N alternative responses from same model
        model_type: BERT model for scoring
        
    Returns:
        Hallucination score (0.0-1.0, higher = more likely hallucinated)
    """
    if len(stochastic_samples) == 0:
        return 0.5
    
    f1_scores = []
    
    # Compare original against each stochastic sample
    for sample in stochastic_samples:
        try:
            _, _, f1 = bert_score_compute(
                [original_response],
                [sample],
                model_type=model_type,
                verbose=False
            )
            f1_scores.append(f1.item())
        except Exception:
            # Fall back to a neutral score if BERTScore fails for this pair
            f1_scores.append(0.5)
    
    # High variance = hallucination, low variance = factual
    mean_f1 = np.mean(f1_scores)
    std_f1 = np.std(f1_scores)
    
    # Convert to hallucination score
    # Low mean or high variance indicates hallucination
    hallucination_score = (1 - mean_f1) + (std_f1 * 0.1)
    
    return min(1.0, max(0.0, hallucination_score))

# Usage
from langchain.llms import Ollama

llm = Ollama(model="llama3.1:8b")
context = "Python is a programming language known for readability"
question = "What is Python known for?"

# Generate original response
original = llm(f"Context: {context}\nQuestion: {question}\nAnswer: ")

# Generate stochastic samples with temperature-based sampling for diversity
sampler = Ollama(model="llama3.1:8b", temperature=0.9)
stochastic_samples = [
    sampler(f"Context: {context}\nQuestion: {question}\nAnswer: ")
    for _ in range(5)
]

score = detect_hallucination_bert_stochastic(original, stochastic_samples)
print(f"Hallucination Score: {score:.2f}")  # Lower = more hallucinated

Method 4: Token Similarity Detection 🔀

Compare token overlap and BLEU scores between context and response.

Python Implementation:

import re
from nltk.translate.bleu_score import sentence_bleu

def detect_hallucination_token_similarity(
    context: str,
    statement: str,
    stopwords: set = None
) -> dict:
    """
    Detect hallucination using token-level similarity metrics.
    
    Args:
        context: Retrieved context
        statement: Generated answer
        stopwords: Set of words to ignore (e.g., 'the', 'a', 'is')
        
    Returns:
        Dictionary with intersection and BLEU scores
    """
    if stopwords is None:
        stopwords = {
            'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at',
            'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were'
        }
    
    # Clean and tokenize
    context_clean = re.sub(r'[^\w\s]', '', context).lower()
    statement_clean = re.sub(r'[^\w\s]', '', statement).lower()
    
    context_tokens = [t for t in context_clean.split() if t not in stopwords]
    statement_tokens = [t for t in statement_clean.split() if t not in stopwords]
    
    if not statement_tokens:
        # Nothing left to compare after stopword removal; return neutral scores
        return {"intersection_score": 0.5, "bleu_score": 0.5, "combined_score": 0.5}
    
    # Calculate token intersection
    context_set = set(context_tokens)
    statement_set = set(statement_tokens)
    intersection = len(statement_set & context_set) / len(statement_set)
    
    # Calculate BLEU score
    bleu = sentence_bleu(
        [context_tokens],
        statement_tokens,
        weights=(0.25, 0.25, 0.25, 0.25)
    )
    
    # Return as hallucination scores (1 - similarity)
    return {
        "intersection_score": 1 - intersection,  # 1 = hallucinated
        "bleu_score": 1 - bleu,
        "combined_score": (1 - intersection + 1 - bleu) / 2
    }

# Usage
context = "Python supports object-oriented and functional programming paradigms."
statement = "Python is a programming language"
scores = detect_hallucination_token_similarity(context, statement)
print(f"Scores: {scores}")
# Expected: low intersection_score (strong token overlap); BLEU is stricter, so bleu_score stays high for short answers

✅ Pros: Very high precision (96%), no cost, instant execution

❌ Cons: Very low recall (3%), misses sophisticated hallucinations, surface-level only

Comparing Detection Methods

Method Performance Metrics

Comparison of Hallucination Detection Methods

Method              | Accuracy | Precision | Recall | Cost     | Best For
Token Similarity    | 47%      | 96% ⭐    | 3%     | Zero     | Quick filtering of obvious hallucinations
Semantic Similarity | 48%      | 90%       | 2%     | Low      | Fast approximate detection
LLM Prompt-Based    | 75% ⭐   | 94%       | 53%    | Moderate | ⭐ Balanced production systems
BERT Stochastic     | 76% ⭐   | 72%       | 90% ⭐ | High     | Critical systems where recall is essential

📊 Key Insights:

  • Best Accuracy: BERT Stochastic (76%) and LLM Prompt-Based (75%) are statistically equivalent
  • Best Precision: Token Similarity (96%) catches only the most obvious hallucinations
  • Best Recall: BERT Stochastic (90%) detects subtle hallucinations others miss
  • Best Balance: LLM Prompt-Based offers optimal accuracy-to-cost tradeoff
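
To see how the four methods disagree on a single example, a small comparison harness can run them side by side. This is a sketch that assumes the four detector functions defined earlier in this article, plus the Ollama llm and embeddings instances, are available in the same module:

def compare_methods(context: str, question: str, answer: str,
                    llm, embeddings, n_samples: int = 5) -> dict:
    """Run all four detectors on one context/answer pair and collect their scores."""
    # Stochastic samples for the BERT consistency check
    samples = [
        llm(f"Context: {context}\nQuestion: {question}\nAnswer: ")
        for _ in range(n_samples)
    ]
    return {
        "token_similarity": detect_hallucination_token_similarity(context, answer)["combined_score"],
        "semantic_similarity": detect_hallucination_semantic(context, answer, embeddings),
        "llm_prompt_based": detect_hallucination_llm(context, answer, llm),
        "bert_stochastic": detect_hallucination_bert_stochastic(answer, samples),
    }

# Usage (any context/question/answer triple works)
context = "The Eiffel Tower is located in Paris and was completed in 1889."
question = "Where is the Eiffel Tower located?"
answer = "The Eiffel Tower is in Paris."
for method, s in compare_methods(context, question, answer, llm, embeddings).items():
    print(f"{method:>20}: {s:.2f}")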

Production Implementation Strategy

Hybrid Approach: Combining Methods

Rather than choosing a single method, production systems should use a cascading pipeline that combines speed with accuracy:

class HallucinationDetector:
    """Production-grade hallucination detection with cascading methods"""
    
    def __init__(self, llm, embeddings):
        self.llm = llm
        self.embeddings = embeddings
        self.token_threshold = 0.5
        self.semantic_threshold = 0.4
        self.llm_threshold = 0.6
    
    def detect_hallucination(self, context: str, question: str, answer: str) -> dict:
        """
        Cascade detection: fast methods first, expensive methods if needed
        
        Returns:
            {
                'hallucination_probability': float (0-1),
                'confidence': float (0-1),
                'method_used': str,
                'details': dict
            }
        """
        
        # Stage 1: Token-level check (zero cost, instant)
        token_score = self._token_check(context, answer)
        if token_score["combined_score"] < 0.2: # Obviously grounded return { 'hallucination_probability': 0.05, 'confidence': 0.95, 'method_used': 'token_similarity', 'details': token_score } # Stage 2: Semantic similarity (fast, embedding-based) if token_score["combined_score"] > 0.7:  # Likely hallucinated
            semantic_score = self._semantic_check(context, answer)
            if semantic_score > 0.6:
                return {
                    'hallucination_probability': 0.85,
                    'confidence': 0.80,
                    'method_used': 'semantic_similarity',
                    'details': semantic_score
                }
        
        # Stage 3: LLM-based evaluation (more cost, better accuracy)
        llm_score = self._llm_check(context, answer)
        
        return {
            'hallucination_probability': llm_score,
            'confidence': 0.90,
            'method_used': 'llm_prompt_based',
            'details': llm_score
        }
    
    def _token_check(self, context, answer) -> dict:
        # Reuse the token-similarity detector defined earlier in this article
        return detect_hallucination_token_similarity(context, answer)
    
    def _semantic_check(self, context, answer) -> float:
        # Reuse the semantic-similarity detector, which needs the embedding model
        return detect_hallucination_semantic(context, answer, self.embeddings)
    
    def _llm_check(self, context, answer) -> float:
        # Reuse the LLM prompt-based detector, which needs the evaluator LLM
        return detect_hallucination_llm(context, answer, self.llm)

# Usage in RAG pipeline (inside your request-handling function)
detector = HallucinationDetector(llm, embeddings)

# After generating a response
response = rag_system.query(user_question)

# Check for hallucinations
hallucination_check = detector.detect_hallucination(
    context=response.context,
    question=user_question,
    answer=response.answer
)

if hallucination_check['hallucination_probability'] > 0.7:
    # Flag for human review
    flag_for_review(response, hallucination_check)
elif hallucination_check['hallucination_probability'] > 0.5:
    # Request additional retrieval
    response = rag_system.query_with_expanded_search(user_question)
else:
    # Confident - return to user
    return response

Threshold Tuning

The optimal flagging threshold depends on your use case: a lower threshold catches more hallucinations but sends more grounded answers to review, so high-stakes applications typically accept the extra review load while low-risk applications raise the threshold to keep it manageable. A practical way to pick a starting value is shown in the sketch below.
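
The sketch sweeps candidate thresholds over a small human-labeled sample (the scores and labels here are hypothetical) and keeps the value with the best F1; swap in recall if missed hallucinations are the costlier error:

import numpy as np

def tune_threshold(scores, labels, candidates=None):
    """Pick the flagging threshold that maximizes F1 on a labeled sample."""
    if candidates is None:
        candidates = np.arange(0.1, 0.95, 0.05)
    best_threshold, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [s > t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = float(t), f1
    return best_threshold, best_f1

# Hypothetical detector scores paired with human verdicts (True = hallucination)
scores = [0.10, 0.82, 0.41, 0.93, 0.35, 0.71]
labels = [False, True, False, True, False, True]
threshold, f1 = tune_threshold(scores, labels)
print(f"Chosen threshold: {threshold:.2f} (F1 = {f1:.2f})")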

Monitoring and Feedback Loops

from collections import Counter
from datetime import datetime
import numpy as np

class HallucinationMonitor:
    """Track hallucination detection performance in production"""
    
    def __init__(self):
        self.detections = []
        self.human_reviews = []
    
    def log_detection(self, context, question, answer, detection_result):
        """Log hallucination detection for analysis"""
        self.detections.append({
            'timestamp': datetime.now(),
            'context_length': len(context),
            'question': question,
            'answer': answer,
            'hallucination_score': detection_result['hallucination_probability'],
            'method': detection_result['method_used']
        })
        return len(self.detections) - 1  # detection_id, referenced by log_human_feedback
    
    def log_human_feedback(self, detection_id, was_hallucination: bool):
        """Record human review results to improve threshold"""
        # Use this to adjust thresholds based on real feedback
        pass
    
    def analyze_false_positives(self) -> dict:
        """Identify patterns in false positive hallucination detections"""
        # Helps tune thresholds and improve detection
        pass
    
    def get_metrics(self) -> dict:
        """Calculate real-world precision/recall in production"""
        if not self.detections:
            return {'total_detections': 0}
        return {
            'total_detections': len(self.detections),
            'avg_hallucination_score': np.mean([d['hallucination_score'] for d in self.detections]),
            'detection_rate': len([d for d in self.detections if d['hallucination_score'] > 0.6]) / len(self.detections),
            'methods_used': Counter([d['method'] for d in self.detections])
        }

Building Trustworthy RAG Systems

✅ Key Takeaways:

  • No single perfect method: Each detection technique has tradeoffs between accuracy, cost, and latency
  • Cascading pipelines win: Combining fast, cheap methods with expensive accurate ones optimizes production performance
  • Context matters: High-stakes applications justify higher computational cost for better recall
  • Monitoring is essential: Real-world feedback loops improve detection thresholds and catch edge cases
  • Hybrid approaches scale: Semantic + LLM methods provide 75%+ accuracy at manageable cost

Implementation Roadmap

  1. Week 1: Implement token-based detection as a baseline filter
  2. Week 2-3: Add semantic similarity for faster approximate detection
  3. Week 4-5: Integrate LLM-based evaluator for better accuracy
  4. Week 6: Deploy cascading pipeline with appropriate thresholds
  5. Ongoing: Monitor metrics and adjust thresholds based on human feedback

🎯 Next Steps:

  • Evaluate which detection method fits your latency requirements
  • Start with LLM prompt-based method for best accuracy-to-cost ratio
  • Build monitoring infrastructure to track false positives/negatives
  • Establish human feedback loops for continuous improvement
  • Consider combining BERT stochastic for critical applications
