How Memory Works¶
This page explains the complete memory lifecycle — from your raw input to a searchable team knowledge base. If you work with Claude Code daily, this is the mental model you need.
The big picture¶
You don't need to tell Claude to "remember" anything. Every tool call is automatically captured in the background and distilled into anonymous team knowledge.
graph LR
A["Tool call (Read, Bash, Edit...)"] --> B["PostToolUse hook"]
B --> C["POST /observe"]
C --> D["JSONL queue"]
D --> E["Secret scan"]
E --> F["Local distillation"]
F --> G["Post-scan"]
G --> H["Embedding"]
H --> I["Dedup check"]
I --> J["Team DB"]
Every stage exists for a reason:
| Stage | Purpose |
|---|---|
| Hook capture | Records tool I/O automatically — zero latency impact on Claude |
| JSONL queue | Persists raw observations locally, survives crashes |
| Secret scan | Redacts API keys, tokens, passwords before they reach even the local LLM |
| Distillation | Strips personal language, names, emotions — keeps only technical facts |
| Post-scan | Catches anything the LLM accidentally reproduced (hard block) |
| Embedding | Converts text to a 768-dim vector for similarity search |
| Dedup check | Rejects near-identical memories (cosine > 0.95) |
What distillation actually does¶
The local LLM (Ollama, running on your machine) transforms your raw input into impersonal, factual knowledge. Here's what that looks like in practice:
Your input:
"I spent 3 hours debugging this yesterday and it turns out the auth middleware was silently swallowing 401 responses because someone hardcoded a fallback to 200 in error_handler.py. So frustrating. @jake found it."
Distilled output:
"Auth middleware in error_handler.py silently converts 401 responses to 200 due to a hardcoded fallback in the error handler. Identified 2026-03-18."
Notice what changed:
- "I spent 3 hours" → removed (first-person, emotional)
- "yesterday" → "2026-03-18" (absolute date)
- "So frustrating" → removed (emotional language)
- "@jake found it" → removed (personal attribution)
- The technical fact is preserved exactly
Another example:
"We decided in standup to use Celery instead of RQ because we need retry logic and RQ's retry support is basically nonexistent"
Distilled output:
"Celery chosen over RQ for task queue. Reason: RQ lacks robust retry support."
The distiller compresses to 1–3 factual sentences. No bullet points, no headers — just dense knowledge.
When distillation rejects input¶
Not everything becomes a memory. The system rejects:
- Noise: "ok", "thanks", "lgtm", "sure" — trivial chat
- Too short: Anything under 20 characters
- Too long: Over 8,000 characters (configurable via MAX_MEMORY_SIZE)
- No technical content: "Had a great weekend" → the distiller returns NO_FACTUAL_CONTENT and the memory is rejected
How observations are captured¶
Memory capture is fully automatic. A Claude Code PostToolUse hook fires after every tool call and POSTs the tool name, input, and output to a local HTTP endpoint (http://127.0.0.1:<port>/observe).
The hook runs in the background — Claude never waits for the save to complete (0ms latency impact).
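For illustration, the observation the hook POSTs might be shaped like this. The endpoint path comes from the text above; the field names are assumptions.

```python
import json

def build_observation(tool_name: str, tool_input: dict, tool_output: str) -> str:
    """Serialize one tool call into the JSON body sent to /observe."""
    payload = {
        "tool": tool_name,      # e.g. "Read", "Bash", "Edit"
        "input": tool_input,    # the arguments Claude passed to the tool
        "output": tool_output,  # what the tool returned
    }
    return json.dumps(payload)

# The hook fires this at http://127.0.0.1:<port>/observe and does not wait
# for a response, which is why there is no latency impact on Claude.
```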
What happens in the background¶
- The /observe endpoint appends the observation to a JSONL file and signals the background worker
- The worker picks up the entry and runs the full distillation pipeline:
- Noise filter rejects trivial entries (short commands, empty output)
- Secret scanner redacts credentials and PII
- Local Ollama distills raw tool I/O into an impersonal factual statement
- Scanner re-checks the distilled output
- Embedding converts the text to a 768-dim vector
- Dedup check rejects near-duplicates (cosine > 0.95)
- Memory is saved to the team database
- Entries that arrive faster than the worker can process them queue up naturally in the JSONL file
What about the raw text?¶
Raw tool I/O is written to the private_store JSONL at ~/.team-memory/private/ with file permissions 0600 (owner-only read/write). These files:
- Are never synced to any remote
- Exist as the observation queue — processed entries are tracked via a cursor
- Can be deleted manually at any time
Error handling¶
- Failed entries are retried up to 3 times, then skipped
- If the worker is behind (burst of tool calls), it catches up when the burst subsides
- If the distill process crashes mid-queue, unprocessed entries survive in the JSONL and are processed on restart
Memory types¶
Every memory has a type that affects how long it stays relevant in search results. Choose the type that matches the nature of the knowledge, not its importance:
| Type | What it captures | Decay rate | Examples |
|---|---|---|---|
| decision | Choices and their rationale | Fast (14 days) | "Chose Celery over RQ", "Moved from REST to gRPC" |
| context | Situational knowledge | Very fast (7 days) | "Deploy freeze until Thursday", "API down for maintenance" |
| failure | What went wrong and why | Medium (45 days) | "OOM on staging due to unbounded cache", "Migration failed on FK constraint" |
| pattern | Established conventions | Slow (90 days) | "All API responses use envelope format", "Tests use factory_boy, not fixtures" |
| dependency | Technology and version choices | Very slow (180 days) | "PostgreSQL 16 on RDS", "Python 3.12 minimum" |
Decay doesn't delete
The decay rate affects search ranking, not storage. A 6-month-old decision still exists — it just ranks lower than yesterday's decision for the same topic. You can always find it with get_memory(id) or by searching specifically.
Why type-aware decay?¶
Consider a decision like "We're using Redis for session storage". After 6 months, either:
- It's still true → it should appear in search, but a newer decision about the same topic should rank higher
- It was superseded → the newer memory naturally outranks it
Meanwhile, a pattern like "All API handlers validate input with Pydantic" stays relevant for months. A flat decay rate would either penalize durable patterns or keep stale decisions artificially high.
The math behind this is a Weibull survival function: S(t) = exp(-(t/λ)^k), where λ is the scale (how many days until significant decay) and k is the shape (how the curve bends). You don't need to know the formula — just pick the right type.
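For the curious, here is the survival function in code. The λ values come from the decay table above; the shape parameter k is an assumption for illustration, and the real system may use a different value per type.

```python
import math

SCALE_DAYS = {          # λ per memory type, from the decay-rate table
    "context": 7, "decision": 14, "failure": 45,
    "pattern": 90, "dependency": 180,
}

def survival(age_days: float, mem_type: str, k: float = 1.0) -> float:
    """Weibull survival S(t) = exp(-(t/λ)^k); 1.0 when fresh, → 0 with age."""
    lam = SCALE_DAYS[mem_type]
    return math.exp(-((age_days / lam) ** k))
```

The same 30-day-old memory scores very differently depending on its type: a decision (λ=14) has decayed well past its scale, while a pattern (λ=90) is still young.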
Memory levels¶
Each memory has a level derived from its type. Levels group types by how broadly the knowledge applies:
| Level | Types | What it means |
|---|---|---|
| short-term | context | Ephemeral, situational — relevant right now |
| long-term | decision, pattern, failure, dependency | Durable project knowledge |
| shared | (multi-repo memories) | Knowledge that spans multiple repositories |
Levels affect search scoring through multipliers applied during ranking:
- Short-term: ×0.8 — slightly deprioritized since it's transient
- Long-term: ×1.0 — baseline weight
- Shared: ×1.2 — boosted because cross-repo knowledge is harder to rediscover
You don't set the level directly. It's derived from the memory type and repo scope.
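The derivation is a simple lookup. This sketch encodes the level table and multipliers above; the function name and the multi_repo flag are hypothetical.

```python
TYPE_TO_LEVEL = {
    "context": "short-term",
    "decision": "long-term", "pattern": "long-term",
    "failure": "long-term", "dependency": "long-term",
}
LEVEL_MULTIPLIER = {"short-term": 0.8, "long-term": 1.0, "shared": 1.2}

def level_for(mem_type: str, multi_repo: bool = False) -> str:
    """Level is derived from type; multi-repo scope overrides to 'shared'."""
    return "shared" if multi_repo else TYPE_TO_LEVEL[mem_type]
```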
How search works¶
When Claude calls search_memory, a hybrid search pipeline runs:
1. Dual retrieval¶
The query is processed through two independent search systems simultaneously:
- Full-text search (FTS): Keyword matching. Good for exact terms like "Redis", "OOM error", "migration"
- Vector similarity: Semantic matching. Good for conceptual queries like "how do we handle auth" even if the memory says "authentication middleware"
Both systems return ranked candidate lists.
2. Reciprocal Rank Fusion (RRF)¶
The two ranked lists are merged using RRF with k=60:
score(doc) = 1/(60 + rank_fts) + 1/(60 + rank_vec)
A memory that ranks high in both lists gets a high combined score. A memory that ranks #1 in FTS but doesn't appear in vector results still gets credit, just less.
Why RRF instead of a weighted average? Because FTS and vector scores are on different scales and aren't directly comparable. RRF only uses rank positions, so it works regardless of how each system scores internally.
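A minimal fusion sketch, taking the two candidate lists as document IDs ordered best-first (the function name is hypothetical; k=60 matches the formula above):

```python
def rrf_merge(fts_ids: list[str], vec_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ids in (fts_ids, vec_ids):
        for rank, doc_id in enumerate(ids, start=1):
            # Only the rank position contributes, never the raw score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists accumulates two sizeable terms, so it reliably beats a document that only one retriever found.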
3. Optional cross-encoder reranking¶
If enabled (RERANK_ENABLED=true), the top candidates are re-scored by a cross-encoder model (Jina) that reads query and document together. This is more accurate than embedding similarity but slower and requires an API key.
4. Weibull recency boost¶
Each result gets a time-decay adjustment based on its type:
final_score = 0.85 × base_score + 0.15 × weibull_recency
A 1-day-old decision gets nearly full recency boost. A 30-day-old decision gets ~5%. A 30-day-old pattern still gets ~72%.
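The blend is a fixed-weight combination. The 0.85/0.15 weights come from the formula above; weibull_recency is assumed to be the survival value S(t) in [0, 1].

```python
def blended_score(base_score: float, weibull_recency: float) -> float:
    """Combine retrieval relevance with type-aware recency."""
    return 0.85 * base_score + 0.15 * weibull_recency
```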
5. Access-frequency boost¶
Memories that are frequently accessed in search results get a small boost:
final_score *= 1.0 + log(access_count + 1) × 0.1
This creates a feedback loop: useful memories surface more often, which makes them even more discoverable. The log dampens the effect so a memory accessed 100 times isn't dramatically different from one accessed 50 times.
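In code, the boost might look like this; the log is assumed to be the natural log, which the source does not specify.

```python
import math

def access_boost(score: float, access_count: int) -> float:
    """Boost frequently accessed memories, with logarithmic damping."""
    return score * (1.0 + math.log(access_count + 1) * 0.1)
```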
6. Score threshold¶
Results below 0.10 are dropped. This prevents low-confidence matches from cluttering the output.
What you get back¶
Search returns a compact index, not full content:
[
{
"id": "abc123",
"type": "decision",
"snippet": "Celery chosen over RQ for task queue. Reason: RQ lacks robust...",
"repos": ["myapp"],
"score": 0.87,
"created_at": "2026-03-18T14:30:00Z",
"est_tokens": 25,
"agent_id": null
}
]
Each result costs ~30 tokens. For 5 results, that's ~150 tokens — cheap enough that Claude can search proactively without burning through your context window.
Progressive disclosure¶
The search response is deliberately compact. This is a design choice called progressive disclosure:
Layer 1 — search_memory: Returns IDs, types, 80-char snippets, scores, and estimated token counts. Claude uses this to decide which memories are relevant.
Layer 2 — get_memories: Claude fetches full content only for the IDs it actually needs.
Why not return full content immediately? Because most search results aren't relevant to the current task. If search_memory returned 5 full memories at ~100 tokens each, that's 500 tokens consumed even if only 1 memory matters. With progressive disclosure, Claude spends ~150 tokens on the index and ~100 tokens on the one memory it actually uses.
This is especially important for agents running search_memory before every architectural decision — the protocol keeps the token budget predictable.
Deduplication¶
Before any memory is saved, the system checks if a near-identical memory already exists by comparing embedding vectors:
- Cosine similarity ≥ 0.95 → duplicate, rejected with a pointer to the existing memory
- Below 0.95 → unique, proceeds to save
This threshold is deliberately high. Two memories about the same topic but with different details (e.g., "chose Redis for caching" vs. "chose Redis for session storage") will both be saved. Only near-verbatim duplicates are caught.
The dedup check runs at confirmation time, not at preview time. This means if you submit two identical memories within the preview window, the first one to be confirmed wins and the second gets a duplicate rejection.
Contradiction detection¶
The background worker doesn't just check for exact duplicates — it also looks for related memories that might contradict a new observation.
After distillation, the system searches for existing memories with cosine similarity > 0.80 (well below the 0.95 dedup threshold). When a contradiction is detected, the worker logs it for review via list_stale or search_memory.
Since observations are captured automatically, contradiction resolution happens at search time rather than at save time. When Claude searches for a topic and finds conflicting memories, it can use update_memory to supersede the outdated one, or forget to remove it.
Updating memories¶
update_memory doesn't edit in place. It creates a new memory and soft-deletes the old one:
- Fetches the existing memory
- Distills your new input
- Embeds the distilled text
- Saves a new memory with supersedes=old_id
- Soft-deletes the old memory (sets deleted_at)
The old memory stops appearing in search results, but the chain of supersession is preserved in the database. This means you can always trace how knowledge evolved.
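The steps above can be sketched against an in-memory store. The row shape is hypothetical, and the distill/embed steps are elided to keep the supersession logic visible.

```python
from datetime import datetime, timezone

def update_memory(store: dict, old_id: str, new_id: str, new_text: str) -> None:
    """Supersede, don't edit: insert a new row, soft-delete the old one."""
    old = store[old_id]              # 1. fetch the existing memory
    store[new_id] = {                # 2-4. save distilled + embedded text
        "text": new_text,            #      (distillation/embedding elided)
        "supersedes": old_id,        #      link back for the audit chain
        "deleted_at": None,
    }
    old["deleted_at"] = datetime.now(timezone.utc)  # 5. soft delete
```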
Forgetting¶
forget is a soft delete — it sets deleted_at on the memory, which excludes it from all search results. The data isn't physically removed from the database.
When agent_id is provided, forget only works if the memory belongs to that agent. This prevents one agent from accidentally deleting another agent's knowledge.
Stale memory detection¶
The list_stale tool identifies memories that have outlived their usefulness. It combines two signals:
- Weibull survival score < 0.1 — the memory has decayed past its type-appropriate lifespan
- Access count < 2 — nobody is finding it useful in search results
Both conditions must be true. A frequently accessed old memory isn't stale — it's still providing value. A rarely accessed recent memory isn't stale either — it hasn't had time to prove itself.
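The two-signal rule is a simple conjunction; the thresholds are the ones stated above, and the function name is illustrative.

```python
def is_stale(survival_score: float, access_count: int) -> bool:
    """Stale only when decayed past its lifespan AND rarely accessed."""
    return survival_score < 0.1 and access_count < 2
```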
The staleness thresholds are type-aware because each type has a different natural lifespan:
| Type | Approximate stale age |
|---|---|
| context | ~15 days |
| decision | ~30 days |
| failure | ~60 days |
| pattern | Several months |
| dependency | Several months |
list_stale returns candidates for review — it doesn't delete anything automatically. You decide whether to forget them or leave them in place.
Multi-agent support¶
Every tool accepts an optional agent_id parameter. When set:
- Remember: The memory is tagged with the agent's ID
- Search: Results can be filtered to only that agent's memories
- Forget: Only the owning agent can delete its memories
This allows multiple Claude Code instances (or custom agents) to maintain isolated knowledge bases within the same database. Omitting agent_id gives access to all memories.
Storage backends¶
Local (SQLite + LanceDB)¶
The default. Everything runs on your machine:
- Memories: SQLite with WAL mode for concurrent reads
- Full-text search: FTS5 with unicode61 tokenizer
- Vectors: LanceDB (embedded, file-based) with cosine distance
- Data directory: ~/.team-memory/ (configurable via DATA_DIR)
Good for: Solo developers, local-first workflows, air-gapped environments.
PostgreSQL (asyncpg + pgvector)¶
For teams sharing a knowledge base:
- Memories: PostgreSQL with JSONB for repos/tags
- Full-text search: Generated tsvector column with GIN index
- Vectors: pgvector extension with IVFFlat index (created after 100+ rows)
- Row-Level Security: When AUTH_ENABLED=true, queries are scoped to the developer's repos
Good for: Teams, shared knowledge bases, cloud deployments (Neon, Cloud SQL, RDS).
Both backends implement the same StoragePort interface. The domain layer doesn't know which one is running — you can switch by changing BACKEND=local to BACKEND=postgres and providing a DATABASE_URL.
Putting it all together¶
Here's what a typical day looks like for a developer using Distill with Claude Code:
Morning — context loading:
Claude calls search_memory before proposing architecture for a new feature. It finds 3 relevant memories from last week's decisions. You see the compact index, Claude fetches full content for 2 of them, and adjusts its proposal accordingly.
During work — automatic capture: You and Claude decide to use WebSockets instead of SSE. As Claude edits code and runs tests, the PostToolUse hook captures every tool call. The background worker distills these into factual memories like "WebSockets chosen over SSE for real-time updates" — no manual intervention needed.
Debugging — finding prior failures: You hit a cryptic error. Claude searches for related failures and finds a memory from 3 weeks ago: "Service mesh timeout caused by Envoy default idle_timeout=1h conflicting with long-polling connections. Fix: set idle_timeout=0 in Envoy config." Crisis averted.
New team member — onboarding: A colleague sets up Distill pointed at the same PostgreSQL database. They immediately have access to months of team decisions, patterns, and failure lessons — without ever reading through Slack history or meeting notes.