Ask (AI Chat)

Ask is Contox's conversational interface for querying your project memory. Instead of browsing the brain hierarchy or searching contexts manually, you ask a natural language question and receive an AI-synthesized answer grounded in your actual memory items -- complete with source citations.

How it works

Ask combines three capabilities into a single flow:

  1. Semantic search -- Your question is embedded into a vector and compared against all memory item embeddings for the project. The top matches (by cosine similarity) become the answer's source material.
  2. Context assembly -- The matched memory items are assembled into a structured context block with titles, facts, and file references.
  3. LLM synthesis -- Gemini 2.0 Flash receives the context and your question, then streams a markdown-formatted answer that synthesizes information across all relevant sources.

The result is a single, coherent answer that draws from multiple parts of your project brain, with each source traceable back to specific memory items.

Architecture

Ask uses a ticket-based direct connection architecture to avoid the Appwrite Cloud function timeout (30 seconds), which is too short for LLM streaming responses.

```mermaid
sequenceDiagram
    participant Browser as Browser
    participant API as Appwrite (Next.js API)
    participant Worker as VPS Worker

    Browser->>API: POST /api/v2/ask (question, projectId)
    API->>API: Authenticate user (session/API key)
    API->>API: Check AI token quota
    API->>API: Sign HMAC ticket (60s validity)
    API-->>Browser: { ticket, workerUrl }

    Browser->>Worker: POST /ask (ticket, question)
    Worker->>Worker: Verify HMAC ticket signature + expiry
    Worker->>Worker: Embed question (Mistral Embed)
    Worker->>Worker: Fetch project embeddings (cached 5min)
    Worker->>Worker: Cosine similarity → top 8 sources
    Worker->>Worker: Fetch memory items from Appwrite
    Worker->>Worker: Build system prompt with context
    Worker-->>Browser: SSE stream starts
    Worker->>Browser: data: { type: "token", content: "..." }
    Worker->>Browser: data: { type: "token", content: "..." }
    Worker->>Browser: data: { type: "done", answer, sources, usage }
    Worker->>Worker: Persist messages to Appwrite
    Worker->>Worker: Track token usage on rollup
```

Why a ticket-based flow?

The two-step architecture exists for a specific reason: Appwrite Cloud has a 30-second function timeout. LLM responses can take 10-60 seconds depending on context size and answer length. By having the browser connect directly to the VPS worker via SSE, there is no timeout constraint.
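On the browser side, the stream arrives as standard SSE frames (`data: {...}` lines separated by blank lines). A minimal parser might look like the sketch below; the `token`/`done` event shapes follow the sequence diagram, but the exact frame handling in Contox's client is an assumption.

```typescript
// Hypothetical SSE frame parser for the Ask stream. Event shapes mirror
// the "token" and "done" events shown in the sequence diagram.
type AskEvent =
  | { type: "token"; content: string }
  | { type: "done"; answer: string; sources: unknown[]; usage?: unknown };

function parseSseChunk(chunk: string): AskEvent[] {
  return chunk
    .split("\n")
    .filter((line) => line.startsWith("data: "))
    .map((line) => JSON.parse(line.slice("data: ".length)) as AskEvent);
}
```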

The ticket acts as a short-lived authentication token:

| Property | Value |
| --- | --- |
| Format | `base64url(payload).hmac_sha256_hex` |
| Validity | 60 seconds |
| Contains | `projectId`, `teamId`, `userId`, `chatSessionId`, expiry |
| Signed with | `WORKER_API_SECRET` (shared between API and worker) |
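A ticket with this shape can be signed and verified with Node's built-in `crypto` module. This is a sketch of the format described above, not Contox's actual implementation; the payload field names follow the table, but function names and the expiry representation are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical ticket payload matching the table above.
interface TicketPayload {
  projectId: string;
  teamId: string;
  userId: string;
  chatSessionId: string;
  exp: number; // unix epoch (ms) expiry, ~60s in the future
}

function signTicket(payload: TicketPayload, secret: string): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = createHmac("sha256", secret).update(body).digest("hex");
  return `${body}.${sig}`;
}

function verifyTicket(ticket: string, secret: string): TicketPayload | null {
  const [body, sig] = ticket.split(".");
  if (!body || !sig) return null;
  const expected = createHmac("sha256", secret).update(body).digest("hex");
  // Constant-time comparison to avoid leaking signature bytes via timing.
  if (sig.length !== expected.length ||
      !timingSafeEqual(Buffer.from(sig), Buffer.from(expected))) return null;
  const payload: TicketPayload = JSON.parse(Buffer.from(body, "base64url").toString());
  if (Date.now() > payload.exp) return null; // expired ticket
  return payload;
}
```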

Source grounding

Every answer from Ask is grounded in your project memory. The system is designed to prevent hallucination through multiple mechanisms:

Similarity threshold

When you ask a question, it is converted to a vector embedding (using Mistral Embed) and compared against all memory item embeddings for your project. Only items with a cosine similarity above 0.50 are considered relevant. The top 8 sources are selected.
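The retrieval step can be sketched as plain cosine similarity with the 0.50 threshold and top-8 cut described above. The type and function names here are illustrative, not Contox's API.

```typescript
// Hypothetical shape of a stored embedding.
interface MemoryEmbedding { itemId: string; vector: number[]; }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topSources(question: number[], items: MemoryEmbedding[]) {
  return items
    .map((m) => ({ itemId: m.itemId, score: cosineSimilarity(question, m.vector) }))
    .filter((s) => s.score > 0.5)   // relevance threshold
    .sort((a, b) => b.score - a.score)
    .slice(0, 8);                   // top 8 sources
}
```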

Context window

Each source's facts are truncated to 800 characters and assembled into a numbered context block. The LLM receives only these sources -- it has no other knowledge of your codebase.
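Context assembly then reduces to truncation and numbering. A sketch under the 800-character rule above; the `Source` field names are assumptions:

```typescript
// Hypothetical source record; field names are illustrative.
interface Source { title: string; facts: string; files?: string[]; }

function buildContext(sources: Source[]): string {
  return sources
    .map((s, i) => {
      // Truncate each source's facts to 800 characters.
      const facts = s.facts.length > 800 ? s.facts.slice(0, 800) + "…" : s.facts;
      const files = s.files?.length ? `\nFiles: ${s.files.join(", ")}` : "";
      return `[${i + 1}] ${s.title}\n${facts}${files}`;
    })
    .join("\n\n");
}
```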

Anti-hallucination prompt

The system prompt enforces strict rules:

  • Answer only using facts from the provided context sources
  • Never invent file paths, component names, or code that is not in the sources
  • Never write code blocks pretending to show source code (the system only has descriptions, not actual code)
  • If the context does not cover the topic, explicitly say so and suggest scanning the codebase
  • If the context partially covers the question, clearly state what is known and what is not
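Rules like these are typically baked into the system prompt string itself. The actual Contox prompt wording is not public; this is only an illustrative encoding of the rules listed above:

```typescript
// Hypothetical system prompt builder; wording is an assumption.
function buildSystemPrompt(context: string): string {
  return [
    "You answer questions about a software project using ONLY the numbered context sources below.",
    "Rules:",
    "- Answer only using facts from the provided context sources.",
    "- Never invent file paths, component names, or code not in the sources.",
    "- Never write code blocks pretending to show source code; you only have descriptions.",
    "- If the context does not cover the topic, say so and suggest scanning the codebase.",
    "- If the context partially covers the question, state what is known and what is not.",
    "",
    "Context sources:",
    context,
  ].join("\n");
}
```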

Source attribution

The LLM includes a hidden metadata comment at the end of its response (<!-- USED: 1, 3, 5 -->) indicating which numbered sources it actually used. The worker parses this to mark sources as "Used" vs "Related" in the response.
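Parsing that trailing comment is a small exercise in regex matching. A sketch of how the worker might split the answer from the `<!-- USED: ... -->` marker; the exact parsing logic is an assumption:

```typescript
// Hypothetical parser for the trailing metadata comment, e.g.
// "<!-- USED: 1, 3, 5 -->", used to mark sources as "Used" vs "Related".
function parseUsedSources(answer: string): { text: string; used: number[] } {
  const match = answer.match(/<!--\s*USED:\s*([\d,\s]*)-->\s*$/);
  if (!match) return { text: answer, used: [] };
  const used = match[1]
    .split(",")
    .map((n) => parseInt(n.trim(), 10))
    .filter((n) => !Number.isNaN(n));
  return { text: answer.replace(match[0], "").trimEnd(), used };
}
```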

Model

Ask uses Gemini 2.0 Flash for answer generation:

| Property | Value |
| --- | --- |
| Model | `gemini-2.0-flash` |
| Max output tokens | 2,048 |
| Temperature | 0.3 (low creativity, high factual accuracy) |
| Embedding model | `mistral-embed` (for question embedding) |

Gemini 2.0 Flash was chosen for its speed, accuracy, and strong instruction following. The low temperature setting further reduces the likelihood of generating information not present in the sources.

Token usage tracking

Every Ask interaction is tracked for billing purposes:

  • Embed tokens -- Tokens used to embed the question (Mistral Embed)
  • LLM tokens -- Prompt tokens (system prompt + context + question) and completion tokens (the answer)
  • Storage tokens -- Estimated storage cost for persisted messages (question + answer + sources JSON)

Token usage is tracked in two places:

  1. Per-message -- Each assistant message records promptTokens, completionTokens, and totalTokens
  2. Per-session -- The chat session accumulates aiTokensUsed and storageTokens across all messages

All token usage is tracked against the team's quota via the rollup system.
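The two-level accounting above can be sketched as a pure accumulation step. Field names mirror the ones in the lists (`promptTokens`, `aiTokensUsed`, etc.); the function itself is illustrative:

```typescript
// Per-message usage, as recorded on each assistant message.
interface MessageUsage { promptTokens: number; completionTokens: number; totalTokens: number; }
// Per-session rollup, accumulated across all messages.
interface SessionUsage { aiTokensUsed: number; storageTokens: number; }

function recordUsage(session: SessionUsage, msg: MessageUsage, storageTokens: number): SessionUsage {
  return {
    aiTokensUsed: session.aiTokensUsed + msg.totalTokens,
    storageTokens: session.storageTokens + storageTokens,
  };
}
```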

Chat sessions

Ask conversations are organized into chat sessions. Sessions are distinct from the ingestion sessions used by the enrichment pipeline -- they are specific to the Ask feature.

Lifecycle

  1. Auto-creation -- When you ask your first question (or start a new conversation), a chat session is created automatically. The session title is set to the first question (truncated to 60 characters).
  2. Active -- Messages accumulate in the session. Each question-answer pair adds two messages.
  3. Archived -- Sessions can be archived via the PATCH endpoint. Archived sessions are hidden from the default list.
  4. Deleted -- Deleting a session removes the session and all associated messages. Storage token rollups are decremented.
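The auto-titling step in the lifecycle above amounts to truncating the first question to 60 characters. Whether the real implementation appends an ellipsis is an assumption:

```typescript
// Hypothetical session title derivation: first question, max 60 chars.
function sessionTitle(firstQuestion: string): string {
  const q = firstQuestion.trim();
  return q.length <= 60 ? q : q.slice(0, 60).trimEnd() + "…";
}
```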

Persistence

Both the user question and the assistant answer are persisted to Appwrite after each interaction. Messages include:

  • Role (user or assistant)
  • Content (the question or answer text)
  • Sources (JSON array of search results, for assistant messages)
  • Token usage (for assistant messages)
  • Model identifier

This means conversations survive page reloads and can be resumed at any time.

Embedding cache

To keep Ask fast, the worker maintains an in-memory embedding cache per project:

  • TTL: 5 minutes
  • Scope: All memory item embeddings for a project
  • Invalidation: Automatic after new embeddings are created (e.g., after enrichment)

On the first Ask for a project (or after cache expiry), all embeddings are fetched from Appwrite in paginated batches of 500. Subsequent questions within the 5-minute window use the cache, making the semantic search step near-instant.
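A minimal per-project TTL cache with the 5-minute expiry described above might look like this sketch (the real worker also invalidates after enrichment, handled here by an explicit `invalidate` call):

```typescript
const TTL_MS = 5 * 60 * 1000; // 5-minute TTL, per the docs above

interface CacheEntry<T> { value: T; expiresAt: number; }

// Hypothetical in-memory embedding cache keyed by project.
class EmbeddingCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  get(projectId: string): T | undefined {
    const e = this.entries.get(projectId);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(projectId); // lazy expiry on read
      return undefined;
    }
    return e.value;
  }

  set(projectId: string, value: T): void {
    this.entries.set(projectId, { value, expiresAt: Date.now() + TTL_MS });
  }

  // Called after enrichment creates new embeddings.
  invalidate(projectId: string): void {
    this.entries.delete(projectId);
  }
}
```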

When Ask returns no sources

If no memory items meet the similarity threshold (0.50), Ask returns immediately without calling the LLM:

"I couldn't find relevant information in the project memory to answer this question."

This happens when:

  • The project has no memory items yet (run a scan or save some sessions first)
  • The question is about a topic not covered in the project memory
  • Embeddings have not been generated for existing items

Next steps