Enrichment Pipeline
The enrichment pipeline is the core intelligence layer of Contox. It transforms raw events from coding sessions into structured, verified memory items that populate the project brain. The pipeline combines security validation, AI extraction, and programmatic verification to produce high-quality knowledge.
End-to-End Flow
```mermaid
sequenceDiagram
  participant Client as MCP / CLI / VS Code
  participant Ingest as POST /api/v2/ingest
  participant DB as Raw Events DB
  participant Storage as Blob Storage
  participant User as Dashboard
  participant Enrich as Enrichment Worker
  participant Indexer as Evidence Indexer
  participant LLM as Mistral AI
  participant Verify as Quote Verifier
  participant Dedup as Deduplicator
  participant Items as Memory Items DB
  Client->>Ingest: Send event + HMAC signature
  Ingest->>Ingest: Validate HMAC + timestamp
  Ingest->>Ingest: Compute event hash (idempotence)
  Ingest->>DB: Store raw_event
  Ingest->>Storage: Upload blobs (if any)
  Ingest->>DB: Find/create session
  Ingest-->>Client: 202 Accepted (sessionId, eventId)
  Note over User: User clicks "Generate Memory"
  User->>Enrich: POST /sessions/:id/enrich
  Enrich->>DB: Fetch unprocessed raw_events
  Enrich->>Enrich: Build evidence index
  Enrich->>Indexer: Pre-classify evidence entries
  Indexer-->>Enrich: Schema key hints + relevance filter
  Enrich->>DB: Fetch existing brain context
  Enrich->>LLM: Extract memory items (structured JSON)
  LLM-->>Enrich: V16 response (items + sessionSummary)
  Enrich->>Verify: Programmatic quote verification
  Verify-->>Enrich: Verified items (hallucinated items rejected)
  Enrich->>Dedup: Check for duplicates
  Dedup-->>Enrich: Deduplicated items
  Enrich->>Items: Store memory items
  Enrich->>DB: Mark raw_events as processed
```
Pipeline Stages
Stage 1: Ingestion
All events enter through POST /api/v2/ingest. This endpoint handles four event types from three sources:
| Event Type | Source | Description |
|---|---|---|
| mcp_save | MCP Server | Structured save from AI coding assistants (Claude, etc.) |
| auto_save | CLI | Automatic periodic capture of file changes and commands |
| scan | CLI | Full codebase scan results |
| vscode_capture | VS Code Extension | Git commits with file changes and diffs |
Security Validation
Every ingest request is validated through multiple checks:
- HMAC Signature -- The event payload is signed with a per-project HMAC secret (or a global fallback). The server recomputes the signature and rejects mismatches.
- Timestamp Check -- Anti-replay protection. Events with timestamps too far in the past or future are rejected.
- Idempotence -- A SHA-256 hash of the event payload is computed and checked against existing events. Duplicate events return the original response without re-processing.
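The three checks above can be sketched as follows. This is a minimal illustration, not the actual Contox implementation: the helper names and the ±5-minute replay window are assumptions, and the real secret handling lives server-side.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Assumed replay window -- the real value is server configuration.
const MAX_CLOCK_SKEW_MS = 5 * 60 * 1000;

function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  // Recompute the HMAC over the raw payload and compare in constant time.
  const expected = Buffer.from(createHmac("sha256", secret).update(rawBody).digest("hex"));
  const received = Buffer.from(signatureHex);
  return expected.length === received.length && timingSafeEqual(expected, received);
}

function isTimestampFresh(eventTimestampMs: number, nowMs: number = Date.now()): boolean {
  // Anti-replay: reject events too far in the past or future.
  return Math.abs(nowMs - eventTimestampMs) <= MAX_CLOCK_SKEW_MS;
}

function eventHash(rawBody: string): string {
  // Idempotence key: identical payloads hash to the same ID.
  return createHash("sha256").update(rawBody).digest("hex");
}
```

A duplicate submission produces the same `eventHash`, so the server can return the original response without re-processing.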
Blob Storage
Events can include binary blobs (diffs, transcripts, file snapshots) encoded as base64. These are uploaded to the Appwrite Storage evidence-blobs bucket with metadata tracking.
Session Assignment
Each event is assigned to a session using the 4-hour window logic (see Sessions). If an active session exists and is less than 4 hours old, the event joins that session. Otherwise, a new session is created.
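The window logic can be sketched roughly like this (the `Session` shape and ID scheme are illustrative, not the actual data model):

```typescript
const SESSION_WINDOW_MS = 4 * 60 * 60 * 1000; // 4-hour window

interface Session {
  id: string;
  startedAt: number; // epoch ms
  active: boolean;
}

function assignSession(sessions: Session[], eventTs: number): Session {
  // Join the active session if it started less than 4 hours ago...
  const open = sessions.find(
    (s) => s.active && eventTs - s.startedAt < SESSION_WINDOW_MS
  );
  if (open) return open;
  // ...otherwise create a new session starting at this event.
  const fresh: Session = { id: `sess_${eventTs}`, startedAt: eventTs, active: true };
  sessions.push(fresh);
  return fresh;
}
```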
Stage 2: Evidence Indexing
Before calling the LLM, the pipeline builds an evidence index -- a structured list of evidence entries derived from the raw event:
- For mcp_save: The session summary becomes ev_transcript_01, and each change entry becomes a separate evidence entry (e.g., ev_impl_01, ev_bugfix_02)
- For vscode_capture: The session context becomes ev_session_01, and each commit becomes ev_commit_01, ev_commit_02, etc.
Each evidence entry has an ID, text content, and optional metadata (commit range, etc.).
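For a vscode_capture event, index construction might look like the sketch below. The shapes are illustrative; the real builder carries richer metadata (diffs, commit ranges, and so on).

```typescript
interface EvidenceEntry {
  id: string;
  text: string;
  meta?: Record<string, string>;
}

function indexVscodeCapture(
  sessionContext: string,
  commits: { sha: string; message: string }[]
): EvidenceEntry[] {
  // The session context is always the first entry.
  const entries: EvidenceEntry[] = [{ id: "ev_session_01", text: sessionContext }];
  // Each commit gets its own numbered entry: ev_commit_01, ev_commit_02, ...
  commits.forEach((commit, i) => {
    entries.push({
      id: `ev_commit_${String(i + 1).padStart(2, "0")}`,
      text: commit.message,
      meta: { sha: commit.sha },
    });
  });
  return entries;
}
```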
Stage 3: Pre-Classification
The Evidence Indexer agent runs a lightweight classification pass on each evidence entry before the main LLM extraction. It determines:
- Relevance -- Whether the evidence is worth extracting (filters out noise like version bumps, auto-generated files)
- Schema Key Hints -- Suggested schema keys based on file paths and content
- Tags -- Preliminary categorization tags
- Confidence -- How confident the classifier is
Irrelevant evidence entries are filtered out, reducing noise and token usage for the main extraction step. Classification hints are injected into the evidence entries as metadata for the LLM.
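As a toy illustration of the relevance filter only -- the real pre-classifier is an LLM agent, and these patterns and fields are invented for the sketch:

```typescript
interface Classification {
  relevant: boolean;
  tags: string[];
}

// Hypothetical noise patterns; the actual classifier is not rule-based.
const NOISE_PATTERNS = [/version bump/i, /auto-?generated/i, /package-lock\.json/i];

function preClassify(evidenceText: string): Classification {
  if (NOISE_PATTERNS.some((p) => p.test(evidenceText))) {
    return { relevant: false, tags: ["noise"] };
  }
  // A real pass would also derive schema-key hints from file paths and content.
  const tags = /\b(fix|bug)\b/i.test(evidenceText) ? ["bugfix"] : [];
  return { relevant: true, tags };
}
```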
Stage 4: Context-Aware LLM Extraction
The core extraction step uses Mistral AI to transform evidence into structured memory items. The pipeline is context-aware -- it feeds the existing brain context to the LLM to avoid creating duplicates.
Model Selection
The AI model is selected based on the team's plan, with an optional per-project override available to all paid plans:
| Plan | Default Model | With aiTier Override |
|---|---|---|
| Free | mistral-small-latest | N/A (upgrade required) |
| Personal | mistral-small-latest | small or medium |
| Team | mistral-medium-latest | small or medium |
| Business | mistral-medium-latest | small or medium |
| Enterprise | mistral-medium-latest | small or medium |
Users on any paid plan can override the default model in Project Settings > AI Model using radio cards. Setting aiTier: 'small' or aiTier: 'medium' explicitly selects the model, while null uses the plan default.
V16 Agent Schema
For mcp_save and vscode_capture events, the pipeline uses a strict JSON schema (the "V16 agent") that enforces:
- Structured output with sessionSummary and items arrays
- Each item must have evidenceRefs pointing to actual evidence IDs
- Each item must have evidenceSpans with verbatim quotes from the evidence
- Schema keys must come from the canonical 29-key enum
- Importance calibrated on a strict 1--5 scale
- DedupHints in category:topic:key format
- Maximum 5 items per event (quality over quantity)
The system prompt instructs the LLM to:
- Only extract facts grounded in the evidence (no speculation)
- Focus on genuinely new information not already in the existing memory
- Reuse dedup hints when the session continues work on existing topics
- Prefer skipping over creating redundant items
Auto-Save Extraction
For auto_save events, a simpler extraction is used with mistral-small-latest and free-form JSON output. This produces lower-fidelity items suitable for background capture.
Scan Extraction
For scan events, no LLM is involved. Scan nodes are directly mapped to CodeMapNode items with confidence 0.9.
Stage 5: Quote Verification
After LLM extraction, a programmatic quote verification step checks every item's evidence spans against the actual evidence text:
- Each item's evidenceSpans contains verbatim quotes and the evidenceRefId they came from.
- The verifier looks up the referenced evidence text and checks whether each quote is a substring (case-insensitive, whitespace-normalized).
- If none of an item's quotes can be found in the evidence, the item is rejected as hallucinated.
- If at least one quote matches, the item is accepted. The match ratio is logged for monitoring.
This zero-cost verification step catches cases where the LLM fabricates citations -- a known failure mode when evidence is sparse or ambiguous.
```typescript
// A quote must be >= 5 chars and found as a substring in the evidence
// (both sides lowercased and whitespace-normalized)
const quote = span.quote.toLowerCase().trim().replace(/\s+/g, ' ');
const haystack = evidenceText.toLowerCase().replace(/\s+/g, ' ');
if (quote.length >= 5 && haystack.includes(quote)) {
  matchCount++;
}
```
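The fragment above can be expanded into a complete, illustrative per-item verdict -- accept when at least one quote matches, and report the match ratio. The shapes and names here are assumptions, not the actual API:

```typescript
interface QuoteSpan {
  evidenceRefId: string;
  quote: string;
}

function verifyItem(
  spans: QuoteSpan[],
  evidenceById: Map<string, string>
): { accepted: boolean; matchRatio: number } {
  let matchCount = 0;
  for (const span of spans) {
    const evidenceText = evidenceById.get(span.evidenceRefId);
    if (!evidenceText) continue; // unknown ref: cannot match
    // Normalize both sides: lowercase, trim, collapse whitespace.
    const quote = span.quote.toLowerCase().trim().replace(/\s+/g, " ");
    const haystack = evidenceText.toLowerCase().replace(/\s+/g, " ");
    if (quote.length >= 5 && haystack.includes(quote)) matchCount++;
  }
  return {
    accepted: matchCount > 0, // zero matches => rejected as hallucinated
    matchRatio: spans.length > 0 ? matchCount / spans.length : 0,
  };
}
```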
Stage 6: Deduplication
After verification, items are checked against existing memory items using:
- DedupHint matching -- Items with the same dedup hint prefix (first two segments) as an existing active item are consolidated
- Title similarity -- Near-duplicate titles are detected
- Supersession -- When a new item covers the same topic as an existing one, the old item's status is set to superseded with a pointer to the new item
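The dedup-hint prefix check might look like the following sketch, where the first two segments ("category:topic") identify the topic. Field names are illustrative:

```typescript
interface StoredItem {
  id: string;
  dedupHint: string;
  status: "active" | "superseded" | "review";
}

// First two segments of "category:topic:key".
function dedupPrefix(hint: string): string {
  return hint.split(":").slice(0, 2).join(":");
}

function findConsolidationTarget(
  candidateHint: string,
  existing: StoredItem[]
): StoredItem | undefined {
  // Only active items are candidates for consolidation.
  return existing.find(
    (item) =>
      item.status === "active" &&
      dedupPrefix(item.dedupHint) === dedupPrefix(candidateHint)
  );
}
```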
Stage 7: Storage
Verified, deduplicated items are stored in the memory_items collection with:
- Generated ULID as the item ID
- Association with the session and project
- Initial status of active
- All metadata from the LLM extraction
Raw events are marked as processed with a timestamp.
Stage 8: Deep Enrichment (optional)
If the project has a linked GitHub repository with deep enrichment enabled, an additional stage runs after storage. Deep enrichment reads the actual source files referenced by memory items and uses Devstral Small (Mistral's code-specialized model) to enrich them with implementation details.
Two-pass architecture
- Extract -- For each memory item, the pipeline fetches the related source files from GitHub and asks Devstral to extract structured facts by category: routes (HTTP method, params, validation schema), models (fields, types, relations), functions (signatures, logic steps, error cases), auth flows, config, and middleware.
- Compile -- Extracted facts are compiled into concise markdown (max 3,500 characters per item) and written back to the item's facts field, replacing the original shallow content with implementation-level detail.
Resource limits
| Limit | Value |
|---|---|
| Max files per session | 20 |
| Max file size | 100 KB |
| Total fetch budget | 500 KB |
| Concurrent GitHub requests | 5 |
| Request timeout | 10 s |
Only analyzable file types are fetched (.ts, .tsx, .js, .py, .go, .rs, .java, .json, .yaml, .prisma, .sql, .graphql, .md, .css, etc.). Files are fetched at the session's headCommitSha for reproducibility.
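An extension allowlist gate of this kind could be sketched as follows; the set below mirrors the extensions listed above and the real list is longer:

```typescript
// Subset of analyzable extensions from the docs; not exhaustive.
const ANALYZABLE_EXTENSIONS = new Set([
  ".ts", ".tsx", ".js", ".py", ".go", ".rs", ".java",
  ".json", ".yaml", ".prisma", ".sql", ".graphql", ".md", ".css",
]);

function isAnalyzable(path: string): boolean {
  const dot = path.lastIndexOf(".");
  // Files with no extension (e.g. Makefile) are skipped.
  return dot > -1 && ANALYZABLE_EXTENSIONS.has(path.slice(dot).toLowerCase());
}
```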
Deep enrichment is skipped silently if the project has no linked repository or if the toggle is disabled. Items keep their original facts from Stage 4.
Stage 9: Staleness Detection (post-enrich)
After enrichment completes, a zero-cost staleness check runs automatically. It compares the files modified in the current commit against files referenced by existing memory items:
- The pipeline collects all file paths from the session's commits
- It queries existing active memory items that reference any of these files
- If an item has 2 or more of its referenced files modified, it is flagged as potentially stale
- Flagged items have their status set to review and a problem document is created
Flagged items appear in the Review tab for human triage. Users can accept (item is still accurate), edit (update the content), or archive (item is outdated).
Staleness detection has zero LLM cost — it uses only file path matching. It runs as a post-enrichment hook alongside auto-resolve (which marks BugFix/Todo items as resolved when their files are fixed).
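Because the check is pure path matching, the whole step reduces to a set intersection per item. A minimal sketch, with an illustrative item shape and the ≥ 2 threshold from the rule above:

```typescript
function findStaleItemIds(
  modifiedFiles: string[],
  items: { id: string; referencedFiles: string[] }[]
): string[] {
  const modified = new Set(modifiedFiles);
  return items
    // Flag items with 2 or more of their referenced files modified.
    .filter((item) => item.referencedFiles.filter((f) => modified.has(f)).length >= 2)
    .map((item) => item.id);
}
```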
Enrichment Trigger
Enrichment can be triggered in two ways:
Automatic (default)
When Auto-Learn from commits is enabled in Project Settings, enrichment is triggered automatically after each VS Code commit capture. This is the default behavior for new projects. Disable this toggle to save credits and switch to manual-only mode.
Manual
Click Generate Memory on any session in the dashboard to trigger enrichment manually. This is useful when auto-enrich is disabled or when you want to re-process a session.
The enrichment trigger calls POST /api/v2/sessions/:sessionId/enrich, which creates enrichment jobs for all unprocessed raw events in the session.
Monitoring the Pipeline in the Dashboard
You can monitor enrichment progress in real time from the Memory page:
- Go to the Sessions tab and find your session
- Click the session to open its detail view
- The Pipeline Timeline shows visual progress through each stage with duration tracking
- The Jobs tab lists all enrichment jobs with their status, stage progress, and any error details
- Once complete, extracted memory items appear in the Brain tab
Retrying Failed Jobs
If a stage fails (e.g., due to a temporary API timeout or rate limit), click Retry on the failed job. The pipeline restarts from the failed stage, preserving progress from earlier stages.
AI Model Configuration
The AI model used for enrichment can be configured per-project from Project Settings:
| Plan | Default Model | Override Available |
|---|---|---|
| Free | Small (mistral-small-latest) | No |
| Personal | Small (mistral-small-latest) | Yes -- Small or Medium |
| Team / Business / Enterprise | Medium (mistral-medium-latest) | Yes -- Small or Medium |
To change the enrichment model, open your project settings (gear icon on the project card) and select a model in the AI Model section of the General tab.