Arcana.Evaluation

The Arcana.Evaluation module provides tools for measuring and improving retrieval quality using standard information retrieval metrics.

Overview

Evaluation helps you:
  • Generate synthetic test cases from your document chunks
  • Measure retrieval performance (MRR, Recall, Precision, NDCG)
  • Compare different search modes and configurations
  • Track quality improvements over time

Main Functions

generate_test_cases/1

Generates synthetic test cases from existing chunks using an LLM.
{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: "openai:gpt-4o-mini",
  sample_size: 50
)

Parameters:
  • repo (module, required): Your Ecto repo module
  • llm (string | function | module, required): LLM for generating questions. Can be a model string, a function, or a module implementing Arcana.LLM
  • sample_size (integer, default: 50): Number of chunks to sample for test case generation
  • source_id (string): Limit to chunks from a specific source
  • collection (string): Limit to chunks from a specific collection
  • prompt (function): Custom prompt template function, fn chunk_text -> prompt end

Returns: {:ok, [%TestCase{}]} or {:error, reason}
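
The prompt option lets you steer question generation. A minimal sketch, reusing the model string from the example above; the prompt wording itself is illustrative, not a built-in template:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: "openai:gpt-4o-mini",
  sample_size: 20,
  # Illustrative wording; any fn chunk_text -> prompt end works
  prompt: fn chunk_text ->
    """
    Write one factual question that can be answered using only this passage:

    #{chunk_text}
    """
  end
)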

run/1

Runs evaluation against existing test cases and returns metrics.
{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic
)

run.metrics
# => %{
#   recall_at_5: 0.84,
#   precision_at_5: 0.68,
#   mrr: 0.76,
#   ndcg_at_5: 0.81
# }

Parameters:
  • repo (module, required): Your Ecto repo module
  • mode (atom, default: :semantic): Search mode, one of :semantic, :fulltext, or :hybrid
  • source_id (string): Limit evaluation to a specific source
  • collection (string): Limit evaluation to a specific collection
  • evaluate_answers (boolean, default: false): Also evaluate answer quality (requires an LLM)
  • llm (string | function | module): LLM for answer evaluation (required when evaluate_answers: true)
  • k (integer, default: 5): Number of results to evaluate (for recall@k, precision@k, and NDCG@k)

Returns: {:ok, %Run{}} or {:error, reason}

The returned Run struct contains:
  • metrics: Map of metric name to score
  • test_case_count: Number of test cases evaluated
  • mode: Search mode used
  • inserted_at: Timestamp
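
To also score answer quality, pass an LLM together with evaluate_answers: true. A minimal sketch using only the options and Run fields documented above:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :hybrid,
  evaluate_answers: true,
  llm: "openai:gpt-4o-mini"
)

# The Run struct fields can be pattern-matched directly
%{metrics: metrics, test_case_count: count, mode: mode} = run
IO.puts("#{mode} over #{count} test cases: MRR #{metrics.mrr}")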

Test Case Management

list_test_cases/1

Lists all test cases.
{:ok, test_cases} = Arcana.Evaluation.list_test_cases(repo: MyApp.Repo)

Parameters:
  • repo (module, required): Your Ecto repo module
  • source_id (string): Filter by source ID
  • collection (string): Filter by collection

get_test_case/2

Retrieves a specific test case by ID.
{:ok, test_case} = Arcana.Evaluation.get_test_case(id, repo: MyApp.Repo)

create_test_case/1

Manually creates a test case.
{:ok, test_case} = Arcana.Evaluation.create_test_case(
  repo: MyApp.Repo,
  question: "What is Elixir?",
  expected_chunk_id: chunk_id,
  source_id: "docs"
)

Parameters:
  • repo (module, required): Your Ecto repo module
  • question (string, required): The test question
  • expected_chunk_id (string, required): UUID of the chunk that should be retrieved
  • source_id (string): Source identifier
  • collection (string): Collection name

delete_test_case/2

Deletes a test case.
:ok = Arcana.Evaluation.delete_test_case(id, repo: MyApp.Repo)

count_test_cases/1

Returns the total number of test cases.
{:ok, count} = Arcana.Evaluation.count_test_cases(repo: MyApp.Repo)

Run Management

list_runs/1

Lists evaluation runs.
{:ok, runs} = Arcana.Evaluation.list_runs(repo: MyApp.Repo)

Parameters:
  • repo (module, required): Your Ecto repo module
  • limit (integer, default: 50): Maximum number of runs to return
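
Because each Run records its metrics and timestamp, listing recent runs is a simple way to track quality over time. A small sketch using only the Run fields documented above:

{:ok, runs} = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)

for run <- runs do
  IO.puts("#{run.inserted_at} #{run.mode}: MRR #{Float.round(run.metrics.mrr, 3)}")
end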

get_run/2

Retrieves a specific evaluation run.
{:ok, run} = Arcana.Evaluation.get_run(id, repo: MyApp.Repo)

delete_run/2

Deletes an evaluation run.
:ok = Arcana.Evaluation.delete_run(id, repo: MyApp.Repo)

Metrics

Arcana.Evaluation provides standard information retrieval metrics:

Recall@k

Percentage of relevant documents retrieved in top k results.

Precision@k

Percentage of retrieved documents that are relevant in top k results.

MRR (Mean Reciprocal Rank)

The average, across test cases, of the reciprocal rank of the first relevant document.

NDCG@k (Normalized Discounted Cumulative Gain)

Measures ranking quality, giving more weight to relevant documents at higher positions.
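
To make these definitions concrete, here is a self-contained sketch in plain Elixir (independent of Arcana) that computes each metric for a single query with one relevant chunk, mirroring the one-expected-chunk setup used by test cases:

relevant = MapSet.new(["chunk-42"])
retrieved = ["chunk-17", "chunk-42", "chunk-08", "chunk-99", "chunk-23"]  # top 5, best first

k = 5
top_k = Enum.take(retrieved, k)
hits = Enum.count(top_k, &MapSet.member?(relevant, &1))

recall_at_k = hits / MapSet.size(relevant)  # 1.0: the relevant chunk appears in the top 5
precision_at_k = hits / k                   # 0.2: 1 of 5 results is relevant

# Reciprocal rank: 1 / rank of the first relevant result (0 if none retrieved)
rr =
  case Enum.find_index(top_k, &MapSet.member?(relevant, &1)) do
    nil -> 0.0
    idx -> 1 / (idx + 1)                    # 0.5: first hit at rank 2
  end

# NDCG@k with binary relevance: DCG divided by the ideal DCG
dcg =
  top_k
  |> Enum.with_index(1)
  |> Enum.map(fn {id, pos} ->
    rel = if MapSet.member?(relevant, id), do: 1, else: 0
    rel / :math.log2(pos + 1)
  end)
  |> Enum.sum()

ideal_dcg = 1 / :math.log2(2)               # one relevant doc, ideally at rank 1
ndcg_at_k = dcg / ideal_dcg                 # ~0.63: relevant doc at rank 2, not rank 1

IO.inspect({recall_at_k, precision_at_k, rr, ndcg_at_k})

MRR is then the mean of rr across all test cases in the run.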

Complete Example

defmodule MyApp.Evaluation do
  alias Arcana.Evaluation

  def evaluate_retrieval do
    repo = MyApp.Repo
    llm = "openai:gpt-4o-mini"

    # 1. Generate test cases
    IO.puts("Generating test cases...")
    {:ok, test_cases} = Evaluation.generate_test_cases(
      repo: repo,
      llm: llm,
      sample_size: 100,
      collection: "docs"
    )
    IO.puts("Generated #{length(test_cases)} test cases")

    # 2. Evaluate semantic search
    IO.puts("\nEvaluating semantic search...")
    {:ok, semantic_run} = Evaluation.run(
      repo: repo,
      mode: :semantic,
      collection: "docs"
    )
    print_metrics("Semantic", semantic_run.metrics)

    # 3. Evaluate hybrid search
    IO.puts("\nEvaluating hybrid search...")
    {:ok, hybrid_run} = Evaluation.run(
      repo: repo,
      mode: :hybrid,
      collection: "docs"
    )
    print_metrics("Hybrid", hybrid_run.metrics)

    # 4. Compare results
    compare_runs(semantic_run, hybrid_run)
  end

  defp print_metrics(label, metrics) do
    IO.puts("#{label} Search Metrics:")
    IO.puts("  Recall@5: #{Float.round(metrics.recall_at_5, 3)}")
    IO.puts("  Precision@5: #{Float.round(metrics.precision_at_5, 3)}")
    IO.puts("  MRR: #{Float.round(metrics.mrr, 3)}")
    IO.puts("  NDCG@5: #{Float.round(metrics.ndcg_at_5, 3)}")
  end

  defp compare_runs(run1, run2) do
    IO.puts("\nComparison:")
    improvement = (run2.metrics.mrr - run1.metrics.mrr) / run1.metrics.mrr * 100
    IO.puts("  MRR improvement: #{Float.round(improvement, 1)}%")
  end
end

Best Practices

When generating test cases:
  • Sample from diverse documents and topics
  • Include both easy and hard questions
  • Aim for 50-200 test cases for reliable metrics
  • Regenerate periodically as your content evolves

When evaluating:
  • Evaluate after configuration changes
  • Compare different search modes
  • Track metrics over time
  • Test with different chunk sizes and overlap

Interpreting metrics:
  • High recall, low precision: too many irrelevant results
  • Low recall, high precision: missing relevant results
  • High MRR: relevant results ranked highly
  • Use NDCG@k for ranking quality assessment

Troubleshooting (see the sketch after this list):
  • Recall@5 < 0.7: increase the search limit or adjust chunking
  • Precision@5 < 0.6: add re-ranking or adjust thresholds
  • MRR < 0.6: review the embedding model or query rewriting
  • Low NDCG@5: improve ranking (hybrid search, re-ranking)
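
A hypothetical helper (not part of Arcana) that applies these rules of thumb to a finished run; the NDCG cutoff is an assumed value, since the list above only describes it as "low":

defmodule MyApp.EvalDiagnostics do
  # Hypothetical helper: flags metrics that fall below the
  # rule-of-thumb thresholds listed above.
  def diagnose(metrics) do
    [
      {metrics.recall_at_5 < 0.7, "Recall@5 low: increase search limit or adjust chunking"},
      {metrics.precision_at_5 < 0.6, "Precision@5 low: add re-ranking or adjust thresholds"},
      {metrics.mrr < 0.6, "MRR low: review embedding model or query rewriting"},
      # 0.7 is an assumed cutoff; the docs only describe NDCG@5 as "low"
      {metrics.ndcg_at_5 < 0.7, "NDCG@5 low: improve ranking (hybrid search, re-ranking)"}
    ]
    |> Enum.filter(fn {low?, _msg} -> low? end)
    |> Enum.each(fn {_low?, msg} -> IO.puts(msg) end)
  end
end

{:ok, run} = Arcana.Evaluation.run(repo: MyApp.Repo)
MyApp.EvalDiagnostics.diagnose(run.metrics)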