Arcana.Evaluation
The Arcana.Evaluation module provides tools for measuring and improving retrieval quality using standard information retrieval metrics.
Overview
Evaluation helps you:
- Generate synthetic test cases from your document chunks
- Measure retrieval performance (MRR, Recall, Precision, NDCG)
- Compare different search modes and configurations
- Track quality improvements over time
Main Functions
generate_test_cases/1
Generates synthetic test cases from existing chunks using an LLM.

Options:
- Your Ecto repo module
- LLM for generating questions. Can be a model string, function, or module implementing Arcana.LLM
- Number of chunks to sample for test case generation
- Limit to chunks from a specific source
- Limit to chunks from a specific collection
- Custom prompt template function (fn chunk_text -> prompt end)

Returns {:ok, [%TestCase{}]} or {:error, reason}.
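A minimal sketch of a call, assuming generate_test_cases/1 takes a keyword list and that the option names are :repo, :llm, :limit, and :collection (inferred from the descriptions above, not confirmed here):

```elixir
# Option names are assumptions inferred from the parameter
# descriptions above; check the module's typespec before use.
{:ok, test_cases} =
  Arcana.Evaluation.generate_test_cases(
    repo: MyApp.Repo,
    llm: "gpt-4o-mini",
    limit: 50,
    collection: "docs"
  )
```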
run/1
Runs evaluation against existing test cases and returns metrics.

Options:
- Your Ecto repo module
- Search mode: :semantic, :fulltext, or :hybrid
- Limit evaluation to a specific source
- Limit evaluation to a specific collection
- Also evaluate answer quality (requires an LLM)
- LLM for answer evaluation (required when evaluate_answers: true)
- Number of results to evaluate (for recall@k, precision@k, and NDCG@k)

Returns {:ok, %Run{}} or {:error, reason}.
The returned Run struct contains:
- metrics: Map of metric name to score
- test_case_count: Number of test cases evaluated
- mode: Search mode used
- inserted_at: Timestamp
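A hedged sketch, again with assumed option names (evaluate_answers is the only one the text confirms); the metric keys shown are illustrative only:

```elixir
# Option names other than evaluate_answers are assumptions.
{:ok, run} =
  Arcana.Evaluation.run(
    repo: MyApp.Repo,
    mode: :hybrid,
    k: 5,
    evaluate_answers: false
  )

run.metrics
#=> illustrative shape: %{"mrr" => 0.71, "recall@5" => 0.82, ...}
run.test_case_count
#=> 100
```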
Test Case Management
list_test_cases/1
Lists all test cases.

Options:
- Your Ecto repo module
- Filter by source ID
- Filter by collection
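For example, assuming :repo and :collection as the option names:

```elixir
# Assumed option names; the return may be a plain list or an :ok tuple.
test_cases =
  Arcana.Evaluation.list_test_cases(repo: MyApp.Repo, collection: "docs")
```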
get_test_case/2
Retrieves a specific test case by ID.
create_test_case/1
Manually creates a test case.

Options:
- Your Ecto repo module
- The test question
- UUID of the chunk that should be retrieved
- Source identifier
- Collection name
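A sketch with assumed option names matching the field descriptions above:

```elixir
# Assumed option names; the UUID is a placeholder.
{:ok, test_case} =
  Arcana.Evaluation.create_test_case(
    repo: MyApp.Repo,
    question: "How do I enable hybrid search?",
    chunk_id: "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
    source: "user_guide",
    collection: "docs"
  )
```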
delete_test_case/2
Deletes a test case.
count_test_cases/1
Returns the total number of test cases.
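A sketch of working with individual test cases, assuming the /2 functions take the repo and a test case ID (test_case_id):

```elixir
# Assumed signatures; return values may differ.
test_case = Arcana.Evaluation.get_test_case(MyApp.Repo, test_case_id)
Arcana.Evaluation.delete_test_case(MyApp.Repo, test_case.id)
total = Arcana.Evaluation.count_test_cases(repo: MyApp.Repo)
```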
Run Management
list_runs/1
Lists evaluation runs.

Options:
- Your Ecto repo module
- Maximum number of runs to return
get_run/2
Retrieves a specific evaluation run.
delete_run/2
Deletes an evaluation run.
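The same pattern presumably applies to runs; a sketch with assumed signatures:

```elixir
# Assumed signatures, mirroring the test case functions above.
runs = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)
run = Arcana.Evaluation.get_run(MyApp.Repo, run_id)
Arcana.Evaluation.delete_run(MyApp.Repo, run.id)
```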
Metrics
Arcana.Evaluation provides standard information retrieval metrics:
Recall@k
Percentage of relevant documents retrieved in the top k results.
Precision@k
Percentage of retrieved documents in the top k results that are relevant.
MRR (Mean Reciprocal Rank)
Mean of the reciprocal rank of the first relevant document across test cases.
NDCG@k (Normalized Discounted Cumulative Gain)
Measures ranking quality, giving more weight to relevant documents at higher positions.
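For reference, the standard formulas behind these metrics (general definitions, not taken from the Arcana source):

```latex
% rank_i is the position of the first relevant result for test case i;
% Q is the set of test cases; rel_i is the relevance of the result at rank i.
\mathrm{Recall@k}    = \frac{|\mathrm{relevant} \cap \mathrm{top\text{-}k}|}{|\mathrm{relevant}|}
\qquad
\mathrm{Precision@k} = \frac{|\mathrm{relevant} \cap \mathrm{top\text{-}k}|}{k}

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
\qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}},
\quad
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{\mathit{rel}_i}{\log_2(i+1)}
```

With a single relevant chunk per test case, Recall@k reduces to whether that chunk appears in the top k.
Complete Example
A sketch of an end-to-end workflow, reusing the hypothetical option names from the sections above:

```elixir
# Option names are assumptions; adjust to the actual API.

# 1. Generate synthetic test cases from existing chunks.
{:ok, _test_cases} =
  Arcana.Evaluation.generate_test_cases(
    repo: MyApp.Repo,
    llm: "gpt-4o-mini",
    limit: 100,
    collection: "docs"
  )

# 2. Run an evaluation, including LLM-judged answer quality.
{:ok, run} =
  Arcana.Evaluation.run(
    repo: MyApp.Repo,
    mode: :hybrid,
    k: 5,
    evaluate_answers: true,
    llm: "gpt-4o-mini"
  )

IO.inspect(run.metrics, label: "metrics")

# 3. Review past runs to track quality over time.
Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)
```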
Best Practices
Generate Representative Test Cases
- Sample from diverse documents and topics
- Include both easy and hard questions
- Aim for 50-200 test cases for reliable metrics
- Regenerate periodically as your content evolves
Run Regular Evaluations
- Evaluate after configuration changes
- Compare different search modes (see the sketch after this list)
- Track metrics over time
- Test with different chunk sizes and overlap
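A sketch of comparing search modes, as suggested above, using the same hypothetical options:

```elixir
# Evaluate each mode against the same test cases and compare
# the resulting metric maps side by side.
for mode <- [:semantic, :fulltext, :hybrid] do
  {:ok, run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: mode, k: 5)
  IO.puts("#{mode}: #{inspect(run.metrics)}")
end
```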
Interpret Metrics Together
- High recall, low precision: Too many irrelevant results
- Low recall, high precision: Missing relevant results
- High MRR: Relevant results ranked highly
- Use NDCG@k for ranking quality assessment
Optimize Based on Metrics
- Recall@5 < 0.7: Increase search limit or adjust chunking
- Precision@5 < 0.6: Add re-ranking or adjust thresholds
- MRR < 0.6: Review embedding model or query rewriting
- Low NDCG@5: Improve ranking (hybrid search, re-ranking)
Related
- Evaluation Guide - Comprehensive evaluation guide
- Search Algorithms - Understanding search modes
- Re-ranking - Improving result quality
- Telemetry - Monitoring search performance