# Ask (AI Chat)
Ask is Contox's conversational interface for querying your project memory. Instead of browsing the brain hierarchy or searching contexts manually, you ask a natural language question and receive an AI-synthesized answer grounded in your actual memory items -- complete with source citations.
## How it works
Ask combines three capabilities into a single flow:
- Semantic search -- Your question is embedded into a vector and compared against all memory item embeddings for the project. The top matches (by cosine similarity) become the answer's source material.
- Context assembly -- The matched memory items are assembled into a structured context block with titles, facts, and file references.
- LLM synthesis -- Gemini 2.0 Flash receives the context and your question, then streams a markdown-formatted answer that synthesizes information across all relevant sources.
The result is a single, coherent answer that draws from multiple parts of your project brain, with each source traceable back to specific memory items.
## Architecture
Ask uses a ticket-based direct connection architecture to avoid the Appwrite Cloud function timeout (30 seconds), which is too short for LLM streaming responses.
```mermaid
sequenceDiagram
    participant Browser as Browser
    participant API as Appwrite (Next.js API)
    participant Worker as VPS Worker
    Browser->>API: POST /api/v2/ask (question, projectId)
    API->>API: Authenticate user (session/API key)
    API->>API: Check AI token quota
    API->>API: Sign HMAC ticket (60s validity)
    API-->>Browser: { ticket, workerUrl }
    Browser->>Worker: POST /ask (ticket, question)
    Worker->>Worker: Verify HMAC ticket signature + expiry
    Worker->>Worker: Embed question (Mistral Embed)
    Worker->>Worker: Fetch project embeddings (cached 5min)
    Worker->>Worker: Cosine similarity → top 8 sources
    Worker->>Worker: Fetch memory items from Appwrite
    Worker->>Worker: Build system prompt with context
    Worker-->>Browser: SSE stream starts
    Worker->>Browser: data: { type: "token", content: "..." }
    Worker->>Browser: data: { type: "token", content: "..." }
    Worker->>Browser: data: { type: "done", answer, sources, usage }
    Worker->>Worker: Persist messages to Appwrite
    Worker->>Worker: Track token usage on rollup
```
### Why a ticket-based flow?

The two-step architecture exists for a specific reason: Appwrite Cloud has a 30-second function timeout, while LLM responses can take 10-60 seconds depending on context size and answer length. Because the browser connects directly to the VPS worker over SSE, the streamed response is not subject to that timeout.
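The browser side of this two-step flow can be sketched in TypeScript. The endpoint shapes and event payloads are inferred from the diagram above and may differ from the real implementation; the SSE parser is deliberately simplified and assumes each `data:` line arrives in a single chunk (a production reader would buffer partial lines).

```typescript
type AskEvent =
  | { type: "token"; content: string }
  | { type: "done"; answer: string; sources: unknown[]; usage?: unknown };

// Parse one chunk of an SSE stream into events ("data: <json>" lines).
// Simplified: ignores lines that are not complete JSON payloads.
function parseSseChunk(chunk: string): AskEvent[] {
  const events: AskEvent[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    try {
      events.push(JSON.parse(trimmed.slice(5)) as AskEvent);
    } catch {
      // Ignore partial or keep-alive lines.
    }
  }
  return events;
}

async function ask(question: string, projectId: string): Promise<string> {
  // Step 1: get a short-lived ticket + worker URL from the API.
  const res = await fetch("/api/v2/ask", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question, projectId }),
  });
  const { ticket, workerUrl } = (await res.json()) as {
    ticket: string;
    workerUrl: string;
  };

  // Step 2: stream the answer directly from the VPS worker (no 30 s limit).
  const stream = await fetch(`${workerUrl}/ask`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ticket, question }),
  });
  const reader = stream.body!.getReader();
  const decoder = new TextDecoder();
  let answer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const ev of parseSseChunk(decoder.decode(value, { stream: true }))) {
      if (ev.type === "token") answer += ev.content;
      if (ev.type === "done") return ev.answer;
    }
  }
  return answer;
}
```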
The ticket acts as a short-lived authentication token:
| Property | Value |
|---|---|
| Format | `base64url(payload).hmac_sha256_hex` |
| Validity | 60 seconds |
| Contains | `projectId`, `teamId`, `userId`, `chatSessionId`, expiry |
| Signed with | `WORKER_API_SECRET` (shared between API and worker) |
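A minimal sketch of signing and verifying such a ticket with Node's `crypto` module. The payload field names and millisecond expiry handling are assumptions based on the table above, not the actual implementation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape, from the "Contains" row above.
interface TicketPayload {
  projectId: string;
  teamId: string;
  userId: string;
  chatSessionId: string;
  exp: number; // unix ms expiry, e.g. Date.now() + 60_000
}

function signTicket(payload: TicketPayload, secret: string): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = createHmac("sha256", secret).update(body).digest("hex");
  return `${body}.${sig}`;
}

function verifyTicket(ticket: string, secret: string): TicketPayload | null {
  const [body, sig] = ticket.split(".");
  if (!body || !sig) return null;
  const expected = createHmac("sha256", secret).update(body).digest("hex");
  // Constant-time comparison to avoid timing side channels.
  if (
    sig.length !== expected.length ||
    !timingSafeEqual(Buffer.from(sig), Buffer.from(expected))
  ) {
    return null;
  }
  const payload = JSON.parse(
    Buffer.from(body, "base64url").toString(),
  ) as TicketPayload;
  return payload.exp > Date.now() ? payload : null; // reject expired tickets
}
```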
## Source grounding
Every answer from Ask is grounded in your project memory. The system is designed to prevent hallucination through multiple mechanisms:
### Semantic search
When you ask a question, it is converted to a vector embedding (using Mistral Embed) and compared against all memory item embeddings for your project. Only items with a cosine similarity above 0.50 are considered relevant. The top 8 sources are selected.
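The retrieval step is a threshold-filtered top-k search over cosine similarity. A sketch (the memory-item shape here is illustrative):

```typescript
interface Scored {
  id: string;
  score: number;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Score every item, keep those above the threshold, return the top k.
function topSources(
  question: number[],
  items: { id: string; embedding: number[] }[],
  threshold = 0.5,
  k = 8,
): Scored[] {
  return items
    .map((it) => ({ id: it.id, score: cosine(question, it.embedding) }))
    .filter((s) => s.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```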
### Context window
Each source's facts are truncated to 800 characters and assembled into a numbered context block. The LLM receives only these sources -- it has no other knowledge of your codebase.
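The assembly step could be sketched as follows; the `title`, `facts`, and `files` field names are assumptions inferred from this page, not the actual schema.

```typescript
interface Source {
  title: string;
  facts: string;
  files?: string[];
}

// Truncate each source's facts to maxFacts characters and format the
// sources as a numbered context block for the LLM prompt.
function buildContext(sources: Source[], maxFacts = 800): string {
  return sources
    .map((s, i) => {
      const facts =
        s.facts.length > maxFacts ? s.facts.slice(0, maxFacts) + "…" : s.facts;
      const files = s.files?.length ? `\nFiles: ${s.files.join(", ")}` : "";
      return `[${i + 1}] ${s.title}\n${facts}${files}`;
    })
    .join("\n\n");
}
```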
### Anti-hallucination prompt
The system prompt enforces strict rules:
- Answer only using facts from the provided context sources
- Never invent file paths, component names, or code that is not in the sources
- Never write code blocks pretending to show source code (the system only has descriptions, not actual code)
- If the context does not cover the topic, explicitly say so and suggest scanning the codebase
- If the context partially covers the question, clearly state what is known and what is not
### Source attribution

The LLM includes a hidden metadata comment at the end of its response (`<!-- USED: 1, 3, 5 -->`) indicating which numbered sources it actually used. The worker parses this to mark sources as "Used" vs "Related" in the response.
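Parsing that trailing comment is a small regex job. A sketch of what the worker-side extraction might look like:

```typescript
// Extract the source numbers from a trailing "<!-- USED: 1, 3, 5 -->"
// comment and return the answer with the comment stripped.
function parseUsedSources(answer: string): { used: number[]; clean: string } {
  const m = answer.match(/<!--\s*USED:\s*([\d,\s]*)-->/);
  if (!m) return { used: [], clean: answer };
  const used = m[1]
    .split(",")
    .map((n) => parseInt(n.trim(), 10))
    .filter((n) => !Number.isNaN(n));
  return { used, clean: answer.replace(m[0], "").trimEnd() };
}
```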
## Model
Ask uses Gemini 2.0 Flash for answer generation:
| Property | Value |
|---|---|
| Model | `gemini-2.0-flash` |
| Max output tokens | 2,048 |
| Temperature | 0.3 (low creativity, high factual accuracy) |
| Embedding model | `mistral-embed` (for question embedding) |
Gemini 2.0 Flash was chosen for its speed, accuracy, and strong instruction following. The low temperature setting further reduces the likelihood of generating information not present in the sources.
## Token usage tracking
Every Ask interaction is tracked for billing purposes:
- Embed tokens -- Tokens used to embed the question (Mistral Embed)
- LLM tokens -- Prompt tokens (system prompt + context + question) and completion tokens (the answer)
- Storage tokens -- Estimated storage cost for persisted messages (question + answer + sources JSON)
Token usage is tracked in two places:
- Per-message -- Each assistant message records `promptTokens`, `completionTokens`, and `totalTokens`
- Per-session -- The chat session accumulates `aiTokensUsed` and `storageTokens` across all messages

All token usage is tracked against the team's quota via the rollup system.
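A sketch of how per-message usage might roll up into session totals. The field names come from this page; the accumulation logic itself is illustrative, not the actual rollup implementation.

```typescript
interface MessageUsage {
  promptTokens: number;
  completionTokens: number;
  storageTokens: number;
}

interface SessionTotals {
  aiTokensUsed: number;
  storageTokens: number;
}

// Fold one assistant message's usage into the session-level counters.
function accumulate(session: SessionTotals, msg: MessageUsage): SessionTotals {
  return {
    aiTokensUsed: session.aiTokensUsed + msg.promptTokens + msg.completionTokens,
    storageTokens: session.storageTokens + msg.storageTokens,
  };
}
```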
## Chat sessions
Ask conversations are organized into chat sessions. Sessions are distinct from the ingestion sessions used by the enrichment pipeline -- they are specific to the Ask feature.
### Lifecycle
- Auto-creation -- When you ask your first question (or start a new conversation), a chat session is created automatically. The session title is set to the first question (truncated to 60 characters).
- Active -- Messages accumulate in the session. Each question-answer pair adds two messages.
- Archived -- Sessions can be archived via the PATCH endpoint. Archived sessions are hidden from the default list.
- Deleted -- Deleting a session removes the session and all associated messages. Storage token rollups are decremented.
### Persistence
Both the user question and the assistant answer are persisted to Appwrite after each interaction. Messages include:
- Role (`user` or `assistant`)
- Content (the question or answer text)
- Sources (JSON array of search results, for assistant messages)
- Token usage (for assistant messages)
- Model identifier
This means conversations survive page reloads and can be resumed at any time.
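Put together, a persisted message document might look like the shape below. This is a hypothetical interface inferred from the fields listed above, not the actual Appwrite schema.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;          // the question or answer text
  sources?: string;         // JSON-encoded search results (assistant only)
  promptTokens?: number;    // usage fields (assistant only)
  completionTokens?: number;
  totalTokens?: number;
  model?: string;           // e.g. "gemini-2.0-flash"
}

// Build an assistant message document from a completed interaction.
function buildAssistantMessage(
  answer: string,
  sources: unknown[],
  usage: { prompt: number; completion: number },
): ChatMessage {
  return {
    role: "assistant",
    content: answer,
    sources: JSON.stringify(sources),
    promptTokens: usage.prompt,
    completionTokens: usage.completion,
    totalTokens: usage.prompt + usage.completion,
    model: "gemini-2.0-flash",
  };
}
```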
## Embedding cache
To keep Ask fast, the worker maintains an in-memory embedding cache per project:
- TTL: 5 minutes
- Scope: All memory item embeddings for a project
- Invalidation: Automatic after new embeddings are created (e.g., after enrichment)
On the first Ask for a project (or after cache expiry), all embeddings are fetched from Appwrite in paginated batches of 500. Subsequent questions within the 5-minute window use the cache, making the semantic search step near-instant.
## When Ask returns no sources
If no memory items meet the similarity threshold (0.50), Ask returns immediately without calling the LLM:
"I couldn't find relevant information in the project memory to answer this question."
This happens when:
- The project has no memory items yet (run a scan or save some sessions first)
- The question is about a topic not covered in the project memory
- Embeddings have not been generated for existing items
## Next steps
- Ask Dashboard Guide -- How to use the Ask interface
- Ask Your Codebase -- Tips for getting the best answers
- V2 Ask API -- API endpoint reference
- Context Packs -- Another way to query project memory