Arcana supports two sizing modes:

Tokens (default):
```elixir
# Good for LLM context limits
chunk_size: 450,    # ~450 tokens
size_unit: :tokens

# Converted to ~1800 characters internally (4 chars/token)
# From lib/arcana/chunker/default.ex:42-48
```
Characters:
```elixir
# Good for fixed-width displays
chunk_size: 2000,    # 2000 characters
size_unit: :characters
```
Token estimation uses ~4 characters per token for English text, matching typical BPE tokenizer behavior (lib/arcana/chunker/default.ex:73-77).
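The heuristic is simple enough to reuse in your own code. A minimal sketch of the same arithmetic (a hypothetical `SizeUnits` module, not Arcana's API):

```elixir
defmodule SizeUnits do
  # Mirrors the ~4 chars/token heuristic described above; the real
  # conversion lives in lib/arcana/chunker/default.ex.
  @chars_per_token 4

  def to_character_budget(size, :tokens), do: size * @chars_per_token
  def to_character_budget(size, :characters), do: size
end

SizeUnits.to_character_budget(450, :tokens)
# => 1800, matching the conversion of the default chunk_size above
```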
Why overlap matters:
Without overlap:

```
[Chunk 1: "...discussing machine learning"]
[Chunk 2: "models for prediction tasks..."]
```

❌ "machine learning models" is split across chunks

With a 50-token overlap:

```
[Chunk 1: "...discussing machine learning models for..."]
[Chunk 2: "...learning models for prediction tasks..."]
```

✅ "machine learning models" appears in both chunks
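Mechanically, overlap means each chunk starts `chunk_size - overlap` units after the previous one. A character-level sketch of that sliding window (a hypothetical `NaiveChunker`; Arcana's real chunker also respects semantic boundaries):

```elixir
defmodule NaiveChunker do
  # Sliding-window chunking over graphemes: each chunk begins
  # `chunk_size - overlap` characters after the previous one starts.
  def chunks(text, chunk_size, overlap) do
    step = chunk_size - overlap

    text
    |> String.graphemes()
    |> Enum.chunk_every(chunk_size, step, [])
    |> Enum.map(&Enum.join/1)
  end
end

NaiveChunker.chunks("abcdefghij", 4, 2)
# => ["abcd", "cdef", "efgh", "ghij", "ij"]
```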
text = """# IntroductionElixir is a functional language.## Key Features- Concurrency via processes- Pattern matching## OTP FrameworkBuilt on Erlang's OTP..."""{:ok, document} = Arcana.ingest(text, repo: MyApp.Repo, format: :markdown, # Chunker preserves section boundaries chunk_size: 300)# Result chunks:# 1. "# Introduction\nElixir is a functional language."# 2. "## Key Features\n- Concurrency via processes\n- Pattern matching"# 3. "## OTP Framework\nBuilt on Erlang's OTP..."
Benefits:
Keeps related content together
Preserves hierarchical structure
Better semantic coherence
Respects function boundaries:
code = """defmodule MyApp.User do def create(attrs) do %User{} |> User.changeset(attrs) |> Repo.insert() end def update(user, attrs) do user |> User.changeset(attrs) |> Repo.update() endend"""{:ok, document} = Arcana.ingest(code, repo: MyApp.Repo, format: :elixir, # Chunks at function boundaries chunk_size: 200)# Each function stays together when possible
Supported formats (from the text_chunker library; a direct call is sketched after this list):
:elixir
:python
:javascript
:typescript
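If you want to see what the underlying splitter produces, a direct call looks roughly like this (based on `TextChunker.split/2` as documented in that library; the chunk struct's field names may differ across versions):

```elixir
chunks =
  TextChunker.split(code,
    format: :elixir,
    chunk_size: 200,
    chunk_overlap: 20
  )

Enum.map(chunks, & &1.text)
```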
Chunks at sentence/paragraph boundaries:
text = """First paragraph with multiple sentences. This continues the same idea. More content here.Second paragraph starts a new topic. It has its own sentences and concepts.Third paragraph..."""{:ok, document} = Arcana.ingest(text, repo: MyApp.Repo, format: :plaintext, # default chunk_size: 150)# Tries to break at paragraph boundaries first,# then sentence boundaries if needed
```elixir
# API documentation search
config :arcana,
  chunker: {:default,
    chunk_size: 300,
    chunk_overlap: 50
  }

# Each function gets its own chunk:
# Chunk 1: "Arcana.search/2 - Searches for chunks..."
# Chunk 2: "Arcana.ingest/2 - Ingests text content..."
```
Pros:
✅ High precision
✅ Fast embedding generation
✅ Fits more context chunks in LLM window
Cons:
❌ May split related concepts
❌ More chunks to search through
❌ Less context per chunk
Best for (DEFAULT):
General knowledge bases
Documentation
Blog posts
Technical articles
Example:
```elixir
# Default configuration
config :arcana,
  chunker: {:default,
    chunk_size: 450,   # ← default
    chunk_overlap: 50
  }

# Balanced chunks with full context:
# "The RAG pipeline has six steps. First, chunking
#  splits documents... Second, embedding converts..."
```
Pros:
✅ Good balance of precision and context
✅ Reasonable embedding speed
✅ Works well for most use cases
Cons:
⚠️ May need tuning per domain
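A low-effort way to tune is to ingest the same sample corpus at several candidate sizes into throwaway collections and compare retrieval side by side. A hypothetical loop (`sample_text` and `sample_query` are your own fixtures):

```elixir
# Hypothetical tuning loop built from the Arcana calls shown in this
# guide; results are only inspected, no particular shape is assumed.
for size <- [300, 450, 700, 1000] do
  {:ok, _doc} = Arcana.ingest(sample_text,
    repo: MyApp.Repo,
    collection: "tune-#{size}",
    chunk_size: size,
    chunk_overlap: round(size * 0.1)
  )

  {:ok, results} = Arcana.search(sample_query,
    repo: MyApp.Repo,
    collection: "tune-#{size}",
    limit: 3
  )

  IO.inspect(results, label: "chunk_size=#{size}")
end
```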
Best for:
Narrative content
Research papers
Long-form explanations
Summarization tasks
Example:
```elixir
# Research paper ingestion
{:ok, document} = Arcana.ingest(paper_text,
  repo: MyApp.Repo,
  chunk_size: 1000,
  chunk_overlap: 150,
  format: :markdown   # Preserve section structure
)

# Each chunk contains a complete section/argument
```
Pros:
✅ Full context preserved
✅ Better for complex topics
✅ Fewer chunks overall
Cons:
❌ Lower precision (more irrelevant content per chunk)
```elixir
# Rule of thumb: 10-15% of chunk size

# Small chunks
chunk_size: 300
chunk_overlap: 30    # 10%

# Medium chunks (default)
chunk_size: 450
chunk_overlap: 50    # 11%

# Large chunks
chunk_size: 1000
chunk_overlap: 150   # 15%

# Too little overlap
chunk_overlap: 10    # ❌ Concepts at boundaries get split

# Too much overlap
chunk_overlap: 300   # ❌ Redundant chunks, slower search
```
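If you prefer deriving the overlap instead of hardcoding it, the rule of thumb is a one-liner (hypothetical helper):

```elixir
# Encodes the 10-15% rule of thumb above.
suggested_overlap = fn chunk_size, ratio -> round(chunk_size * ratio) end

suggested_overlap.(300, 0.10)     # => 30
suggested_overlap.(450, 0.11)     # => 50
suggested_overlap.(1000, 0.15)    # => 150
```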
```elixir
# Function documentation (one function per chunk)
defmodule MyApp.API.Ingest do
  def ingest_api_docs(module_docs) do
    module_docs
    |> extract_function_docs()
    |> Enum.each(fn {function_name, doc_text} ->
      # Small chunks - each function separate
      {:ok, _document} = Arcana.ingest(doc_text,
        repo: MyApp.Repo,
        collection: "api-docs",
        chunk_size: 250,     # Small for precision
        chunk_overlap: 25,   # Minimal overlap
        metadata: %{
          function: function_name,
          module: module_docs.module
        }
      )
    end)
  end

  defp extract_function_docs(module_docs) do
    # Parse ExDoc format
    module_docs.functions
    |> Enum.map(fn {name, arity, doc} ->
      function_name = "#{name}/#{arity}"
      {function_name, doc}
    end)
  end
end

# Result: High precision for function lookup
# Query: "Arcana.search/2" → exact function doc
```
```elixir
# Academic papers (large chunks, preserve arguments)
defmodule MyApp.Research.Ingest do
  def ingest_paper(pdf_path) do
    {:ok, document} = Arcana.ingest_file(pdf_path,
      repo: MyApp.Repo,
      collection: "research-papers",
      chunk_size: 1000,      # Large chunks
      chunk_overlap: 150,    # 15% overlap
      format: :plaintext,    # PDFs convert to plain text
      metadata: %{
        content_type: "research-paper",
        source: Path.basename(pdf_path)
      }
    )

    # Also extract citations and metadata
    extract_paper_metadata(document)

    document
  end

  defp extract_paper_metadata(document) do
    # Could use GraphRAG to extract:
    # - Authors
    # - Institutions
    # - Citations
    # - Key terms
    Arcana.Graph.build_and_persist(
      document.chunks,
      document.collection,
      MyApp.Repo,
      graph: true
    )
  end
end

# Result: Each chunk = complete argument/section
# Better for understanding complex concepts
```
```elixir
# FAQ and support articles
defmodule MyApp.Support.Ingest do
  def ingest_faq(faq_list) do
    faq_list
    |> Enum.each(fn %{question: q, answer: a} ->
      # Each Q&A pair = one chunk
      text = "Q: #{q}\n\nA: #{a}"

      {:ok, _document} = Arcana.ingest(text,
        repo: MyApp.Repo,
        collection: "support-faq",
        chunk_size: 400,    # One Q&A per chunk
        chunk_overlap: 0,   # No overlap needed
        metadata: %{
          type: "faq",
          question: q
        }
      )
    end)
  end

  def search_faq(user_question) do
    # High precision search
    {:ok, results} = Arcana.search(user_question,
      repo: MyApp.Repo,
      collection: "support-faq",
      mode: :hybrid,     # Match question semantics + keywords
      limit: 3,          # Top 3 most relevant FAQs
      threshold: 0.75    # High threshold for quality
    )

    results
  end
end

# Result: Precise matching for user questions
```