Enrichment Pipeline
The enrichment pipeline is the core intelligence layer of Contox. It transforms raw events from coding sessions into structured, verified memory items that populate the project brain. The pipeline combines security validation, AI extraction, and programmatic verification to produce high-quality knowledge.
End-to-End Flow
```mermaid
sequenceDiagram
  participant Client as MCP / CLI / VS Code
  participant Ingest as POST /api/v2/ingest
  participant DB as Raw Events DB
  participant Storage as Blob Storage
  participant User as Dashboard
  participant Enrich as Enrichment Worker
  participant Indexer as Evidence Indexer
  participant LLM as Mistral AI
  participant Verify as Quote Verifier
  participant Dedup as Deduplicator
  participant Items as Memory Items DB
  Client->>Ingest: Send event + HMAC signature
  Ingest->>Ingest: Validate HMAC + timestamp
  Ingest->>Ingest: Compute event hash (idempotence)
  Ingest->>DB: Store raw_event
  Ingest->>Storage: Upload blobs (if any)
  Ingest->>DB: Find/create session
  Ingest-->>Client: 202 Accepted (sessionId, eventId)
  Note over User: User clicks "Generate Memory"
  User->>Enrich: POST /sessions/:id/enrich
  Enrich->>DB: Fetch unprocessed raw_events
  Enrich->>Enrich: Build evidence index
  Enrich->>Indexer: Pre-classify evidence entries
  Indexer-->>Enrich: Schema key hints + relevance filter
  Enrich->>DB: Fetch existing brain context
  Enrich->>LLM: Extract memory items (structured JSON)
  LLM-->>Enrich: V16 response (items + sessionSummary)
  Enrich->>Verify: Programmatic quote verification
  Verify-->>Enrich: Verified items (hallucinated items rejected)
  Enrich->>Dedup: Check for duplicates
  Dedup-->>Enrich: Deduplicated items
  Enrich->>Items: Store memory items
  Enrich->>DB: Mark raw_events as processed
```
Pipeline Stages
Stage 1: Ingestion
All events enter through POST /api/v2/ingest. This endpoint handles four event types from three sources:
| Event Type | Source | Description |
|---|---|---|
| mcp_save | MCP Server | Structured save from AI coding assistants (Claude, etc.) |
| auto_save | CLI | Automatic periodic capture of file changes and commands |
| scan | CLI | Full codebase scan results |
| vscode_capture | VS Code Extension | Git commits with file changes and diffs |
Security Validation
Every ingest request is validated through multiple checks:
- HMAC Signature -- The event payload is signed with a per-project HMAC secret (or a global fallback). The server recomputes the signature and rejects mismatches.
- Timestamp Check -- Anti-replay protection. Events with timestamps too far in the past or future are rejected.
- Idempotence -- A SHA-256 hash of the event payload is computed and checked against existing events. Duplicate events return the original response without re-processing.
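The three checks above can be sketched as follows. This is a minimal illustration, not the actual Contox implementation: the helper names and the ±5-minute replay window are assumptions, and the real secret handling lives server-side.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Assumed replay window -- the real value is server configuration.
const MAX_CLOCK_SKEW_MS = 5 * 60 * 1000;

function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  // Recompute the HMAC over the raw payload and compare in constant time.
  const expected = Buffer.from(createHmac("sha256", secret).update(rawBody).digest("hex"));
  const received = Buffer.from(signatureHex);
  return expected.length === received.length && timingSafeEqual(expected, received);
}

function isTimestampFresh(eventTimestampMs: number, nowMs: number = Date.now()): boolean {
  // Anti-replay: reject events too far in the past or future.
  return Math.abs(nowMs - eventTimestampMs) <= MAX_CLOCK_SKEW_MS;
}

function eventHash(rawBody: string): string {
  // Idempotence key: identical payloads hash to the same ID.
  return createHash("sha256").update(rawBody).digest("hex");
}
```

A duplicate submission produces the same `eventHash`, so the server can return the original response without re-processing.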
Blob Storage
Events can include binary blobs (diffs, transcripts, file snapshots) encoded as base64. These are uploaded to the Appwrite Storage evidence-blobs bucket with metadata tracking.
Session Assignment
Each event is assigned to a session using the 4-hour window logic (see Sessions). If an active session exists and is less than 4 hours old, the event joins that session. Otherwise, a new session is created.
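The window logic can be sketched roughly like this (the `Session` shape and ID scheme are illustrative, not the actual data model):

```typescript
const SESSION_WINDOW_MS = 4 * 60 * 60 * 1000; // 4-hour window

interface Session {
  id: string;
  startedAt: number; // epoch ms
  active: boolean;
}

function assignSession(sessions: Session[], eventTs: number): Session {
  // Join the active session if it started less than 4 hours ago...
  const open = sessions.find(
    (s) => s.active && eventTs - s.startedAt < SESSION_WINDOW_MS
  );
  if (open) return open;
  // ...otherwise create a new session starting at this event.
  const fresh: Session = { id: `sess_${eventTs}`, startedAt: eventTs, active: true };
  sessions.push(fresh);
  return fresh;
}
```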
Stage 2: Evidence Indexing
Before calling the LLM, the pipeline builds an evidence index -- a structured list of evidence entries derived from the raw event:
- For mcp_save: The session summary becomes ev_transcript_01, and each change entry becomes a separate evidence entry (e.g., ev_impl_01, ev_bugfix_02)
- For vscode_capture: The session context becomes ev_session_01, and each commit becomes ev_commit_01, ev_commit_02, etc.
Each evidence entry has an ID, text content, and optional metadata (commit range, etc.).
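For a vscode_capture event, index construction might look like the sketch below. The shapes are illustrative; the real builder carries richer metadata (diffs, commit ranges, and so on).

```typescript
interface EvidenceEntry {
  id: string;
  text: string;
  meta?: Record<string, string>;
}

function indexVscodeCapture(
  sessionContext: string,
  commits: { sha: string; message: string }[]
): EvidenceEntry[] {
  // The session context is always the first entry.
  const entries: EvidenceEntry[] = [{ id: "ev_session_01", text: sessionContext }];
  // Each commit gets its own numbered entry: ev_commit_01, ev_commit_02, ...
  commits.forEach((commit, i) => {
    entries.push({
      id: `ev_commit_${String(i + 1).padStart(2, "0")}`,
      text: commit.message,
      meta: { sha: commit.sha },
    });
  });
  return entries;
}
```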
Stage 3: Pre-Classification
The Evidence Indexer agent runs a lightweight classification pass on each evidence entry before the main LLM extraction. It determines:
- Relevance -- Whether the evidence is worth extracting (filters out noise like version bumps, auto-generated files)
- Schema Key Hints -- Suggested schema keys based on file paths and content
- Tags -- Preliminary categorization tags
- Confidence -- How confident the classifier is
Irrelevant evidence entries are filtered out, reducing noise and token usage for the main extraction step. Classification hints are injected into the evidence entries as metadata for the LLM.
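As a toy illustration of the relevance filter only -- the real pre-classifier is an LLM agent, and these patterns and fields are invented for the sketch:

```typescript
interface Classification {
  relevant: boolean;
  tags: string[];
}

// Hypothetical noise patterns; the actual classifier is not rule-based.
const NOISE_PATTERNS = [/version bump/i, /auto-?generated/i, /package-lock\.json/i];

function preClassify(evidenceText: string): Classification {
  if (NOISE_PATTERNS.some((p) => p.test(evidenceText))) {
    return { relevant: false, tags: ["noise"] };
  }
  // A real pass would also derive schema-key hints from file paths and content.
  const tags = /\b(fix|bug)\b/i.test(evidenceText) ? ["bugfix"] : [];
  return { relevant: true, tags };
}
```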
Stage 4: Context-Aware LLM Extraction
The core extraction step uses Mistral AI to transform evidence into structured memory items. The pipeline is context-aware -- it feeds the existing brain context to the LLM to avoid creating duplicates.
Model Selection
The AI model is selected based on the team's plan, with an optional per-project override available to all paid plans:
| Plan | Default Model | With aiTier Override |
|---|---|---|
| Free | mistral-small-latest | N/A (upgrade required) |
| Personal | mistral-small-latest | small or medium |
| Team | mistral-medium-latest | small or medium |
| Business | mistral-medium-latest | small or medium |
| Enterprise | mistral-medium-latest | small or medium |
Users on any paid plan can override the default model in Project Settings > AI Model using radio cards. Setting aiTier: 'small' or aiTier: 'medium' explicitly selects the model, while null uses the plan default.
V16 Agent Schema
For mcp_save and vscode_capture events, the pipeline uses a strict JSON schema (the "V16 agent") that enforces:
- Structured output with sessionSummary and items arrays
- Each item must have evidenceRefs pointing to actual evidence IDs
- Each item must have evidenceSpans with verbatim quotes from the evidence
- Schema keys must come from the canonical 29-key enum
- Importance calibrated on a strict 1--5 scale
- DedupHints in category:topic:key format
- Maximum 5 items per event (quality over quantity)
The system prompt instructs the LLM to:
- Only extract facts grounded in the evidence (no speculation)
- Focus on genuinely new information not already in the existing memory
- Reuse dedup hints when the session continues work on existing topics
- Prefer skipping over creating redundant items
Auto-Save Extraction
For auto_save events, a simpler extraction is used with mistral-small-latest and free-form JSON output. This produces lower-fidelity items suitable for background capture.
Scan Extraction
For scan events, no LLM is involved. Scan nodes are directly mapped to CodeMapNode items with confidence 0.9.
Stage 5: Quote Verification
After LLM extraction, a programmatic quote verification step checks every item's evidence spans against the actual evidence text:
- Each item's evidenceSpans contains verbatim quotes and the evidenceRefId they came from.
- The verifier looks up the referenced evidence text and checks whether each quote is a substring (case-insensitive, whitespace-normalized).
- If none of an item's quotes can be found in the evidence, the item is rejected as hallucinated.
- If at least one quote matches, the item is accepted. The match ratio is logged for monitoring.
This zero-cost verification step catches cases where the LLM fabricates citations -- a known failure mode when evidence is sparse or ambiguous.
```typescript
// A quote must be >= 5 chars and found as a substring in the evidence
// (both sides lowercased and whitespace-normalized)
const quote = span.quote.toLowerCase().trim().replace(/\s+/g, ' ');
const haystack = evidenceText.toLowerCase().replace(/\s+/g, ' ');
if (quote.length >= 5 && haystack.includes(quote)) {
  matchCount++;
}
```
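The fragment above can be expanded into a complete, illustrative per-item verdict -- accept when at least one quote matches, and report the match ratio. The shapes and names here are assumptions, not the actual API:

```typescript
interface QuoteSpan {
  evidenceRefId: string;
  quote: string;
}

function verifyItem(
  spans: QuoteSpan[],
  evidenceById: Map<string, string>
): { accepted: boolean; matchRatio: number } {
  let matchCount = 0;
  for (const span of spans) {
    const evidenceText = evidenceById.get(span.evidenceRefId);
    if (!evidenceText) continue; // unknown ref: cannot match
    // Normalize both sides: lowercase, trim, collapse whitespace.
    const quote = span.quote.toLowerCase().trim().replace(/\s+/g, " ");
    const haystack = evidenceText.toLowerCase().replace(/\s+/g, " ");
    if (quote.length >= 5 && haystack.includes(quote)) matchCount++;
  }
  return {
    accepted: matchCount > 0, // zero matches => rejected as hallucinated
    matchRatio: spans.length > 0 ? matchCount / spans.length : 0,
  };
}
```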
Stage 6: Deduplication
After verification, items are checked against existing memory items using:
- DedupHint matching -- Items with the same dedup hint prefix (first two segments) as an existing active item are consolidated
- Title similarity -- Near-duplicate titles are detected
- Supersession -- When a new item covers the same topic as an existing one, the old item's status is set to superseded with a pointer to the new item
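The dedup-hint prefix check might look like the following sketch, where the first two segments ("category:topic") identify the topic. Field names are illustrative:

```typescript
interface StoredItem {
  id: string;
  dedupHint: string;
  status: "active" | "superseded" | "review";
}

// First two segments of "category:topic:key".
function dedupPrefix(hint: string): string {
  return hint.split(":").slice(0, 2).join(":");
}

function findConsolidationTarget(
  candidateHint: string,
  existing: StoredItem[]
): StoredItem | undefined {
  // Only active items are candidates for consolidation.
  return existing.find(
    (item) =>
      item.status === "active" &&
      dedupPrefix(item.dedupHint) === dedupPrefix(candidateHint)
  );
}
```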
Stage 7: Storage
Verified, deduplicated items are stored in the memory_items collection with:
- Generated ULID as the item ID
- Association with the session and project
- Initial status of active
- All metadata from the LLM extraction
Raw events are marked as processed with a timestamp.
Stage 8: Deep Enrichment (optional)
If the project has a linked GitHub repository with deep enrichment enabled, an additional stage runs after storage. Deep enrichment reads the actual source files referenced by memory items and uses Devstral Small (Mistral's code-specialized model) to enrich them with implementation details.
Two-pass architecture
- Extract -- For each memory item, the pipeline fetches the related source files from GitHub and asks Devstral to extract structured facts by category: routes (HTTP method, params, validation schema), models (fields, types, relations), functions (signatures, logic steps, error cases), auth flows, config, and middleware.
- Compile -- Extracted facts are compiled into concise markdown (max 3,500 characters per item) and written back to the item's facts field, replacing the original shallow content with implementation-level detail.
Resource limits
| Limit | Value |
|---|---|
| Max files per session | 20 |
| Max file size | 100 KB |
| Total fetch budget | 500 KB |
| Concurrent GitHub requests | 5 |
| Request timeout | 10 s |
Only analyzable file types are fetched (.ts, .tsx, .js, .py, .go, .rs, .java, .json, .yaml, .prisma, .sql, .graphql, .md, .css, etc.). Files are fetched at the session's headCommitSha for reproducibility.
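An extension allowlist gate of this kind could be sketched as follows; the set below mirrors the extensions listed above and the real list is longer:

```typescript
// Subset of analyzable extensions from the docs; not exhaustive.
const ANALYZABLE_EXTENSIONS = new Set([
  ".ts", ".tsx", ".js", ".py", ".go", ".rs", ".java",
  ".json", ".yaml", ".prisma", ".sql", ".graphql", ".md", ".css",
]);

function isAnalyzable(path: string): boolean {
  const dot = path.lastIndexOf(".");
  // Files with no extension (e.g. Makefile) are skipped.
  return dot > -1 && ANALYZABLE_EXTENSIONS.has(path.slice(dot).toLowerCase());
}
```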
Deep enrichment is skipped silently if the project has no linked repository or if the toggle is disabled. Items keep their original facts from Stage 4.
Stage 9: Staleness Detection (post-enrich)
After enrichment completes, a zero-cost staleness check runs automatically. It compares the files modified in the current commit against files referenced by existing memory items:
- The pipeline collects all file paths from the session's commits
- It queries existing active memory items that reference any of these files
- If an item has 2 or more of its referenced files modified, it is flagged as potentially stale
- Flagged items have their status set to review and a problem document is created
Flagged items appear in the Review tab for human triage. Users can accept (item is still accurate), edit (update the content), or archive (item is outdated).
Staleness detection has zero LLM cost — it uses only file path matching. It runs as a post-enrichment hook alongside auto-resolve (which marks BugFix/Todo items as resolved when their files are fixed).
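Because the check is pure path matching, the whole step reduces to a set intersection per item. A minimal sketch, with an illustrative item shape and the ≥ 2 threshold from the rule above:

```typescript
function findStaleItemIds(
  modifiedFiles: string[],
  items: { id: string; referencedFiles: string[] }[]
): string[] {
  const modified = new Set(modifiedFiles);
  return items
    // Flag items with 2 or more of their referenced files modified.
    .filter((item) => item.referencedFiles.filter((f) => modified.has(f)).length >= 2)
    .map((item) => item.id);
}
```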
Enrichment Trigger
Enrichment can be triggered in two ways:
Automatic (default)
When Auto-Learn from commits is enabled in Project Settings, enrichment is triggered automatically after each VS Code commit capture. This is the default behavior for new projects. Disable this toggle to save credits and switch to manual-only mode.
Manual
Click Generate Memory on any session in the dashboard to trigger enrichment manually. This is useful when auto-enrich is disabled or when you want to re-process a session.
The enrichment trigger calls POST /api/v2/sessions/:sessionId/enrich, which creates enrichment jobs for all unprocessed raw events in the session.
Monitoring the Pipeline in the Dashboard
You can monitor enrichment progress in real time from the Memory page:
- Go to the Sessions tab and find your session
- Click the session to open its detail view
- The Pipeline Timeline shows visual progress through each stage with duration tracking
- The Jobs tab lists all enrichment jobs with their status, stage progress, and any error details
- Once complete, extracted memory items appear in the Brain tab
Retrying Failed Jobs
If a stage fails (e.g., due to a temporary API timeout or rate limit), click Retry on the failed job. The pipeline restarts from the failed stage, preserving progress from earlier stages.
AI Model Configuration
The AI model used for enrichment can be configured per-project from Project Settings:
| Plan | Default Model | Override Available |
|---|---|---|
| Free | Small (mistral-small-latest) | No |
| Personal | Small (mistral-small-latest) | Yes -- Small or Medium |
| Team / Business / Enterprise | Medium (mistral-medium-latest) | Yes -- Small or Medium |
To change the enrichment model, open your project settings (gear icon on the project card) and select a model in the AI Model section of the General tab.