How Memory Works¶
This page explains the complete memory lifecycle — from your raw input to a searchable team knowledge base. If you work with Claude Code daily, this is the mental model you need.
The big picture¶
You don't need to tell Claude to "remember" anything. Every tool call is automatically captured in the background and distilled into anonymous team knowledge.
graph LR
A["Tool call (Read, Bash, Edit...)"] --> B["PostToolUse hook"]
B --> C["POST /observe"]
C --> D["JSONL queue"]
D --> E["Secret scan"]
E --> F["Local distillation"]
F --> G["Post-scan"]
G --> H["Embedding"]
H --> I["Dedup check"]
I --> J["Team DB"]
Every stage exists for a reason:
| Stage | Purpose |
|---|---|
| Hook capture | Records tool I/O automatically — zero latency impact on Claude |
| JSONL queue | Persists raw observations locally, survives crashes |
| Secret scan | Redacts API keys, tokens, passwords before they reach even the local LLM |
| Distillation | Strips personal language, names, emotions — keeps only technical facts |
| Post-scan | Catches anything the LLM accidentally reproduced (hard block) |
| Embedding | Converts text to a 768-dim vector for similarity search |
| Dedup check | Rejects near-identical memories (cosine > 0.95) |
What distillation actually does¶
The local LLM (Ollama, running on your machine) transforms your raw input into impersonal, factual knowledge. Here's what that looks like in practice:
Your input:
"I spent 3 hours debugging this yesterday and it turns out the auth middleware was silently swallowing 401 responses because someone hardcoded a fallback to 200 in error_handler.py. So frustrating. @jake found it."
Distilled output:
"Auth middleware in error_handler.py silently converts 401 responses to 200 due to a hardcoded fallback in the error handler. Identified 2026-03-18."
Notice what changed:
- "I spent 3 hours" → removed (first-person, emotional)
- "yesterday" → "2026-03-18" (absolute date)
- "So frustrating" → removed (emotional language)
- "@jake found it" → removed (personal attribution)
- The technical fact is preserved exactly
Another example:
"We decided in standup to use Celery instead of RQ because we need retry logic and RQ's retry support is basically nonexistent"
Distilled output:
"Celery chosen over RQ for task queue. Reason: RQ lacks robust retry support."
The distiller compresses to 1–3 factual sentences. No bullet points, no headers — just dense knowledge.
When distillation rejects input¶
Not everything becomes a memory. The system rejects:
- Noise: "ok", "thanks", "lgtm", "sure" — trivial chat
- Too short: Anything under 20 characters
- Too long: Over 8,000 characters (configurable via MAX_MEMORY_SIZE)
- No technical content: "Had a great weekend" → the distiller returns NO_FACTUAL_CONTENT and the memory is rejected
How observations are captured¶
Memory capture is fully automatic. A Claude Code PostToolUse hook fires after every tool call and POSTs the tool name, input, and output to a local HTTP endpoint (http://127.0.0.1:<port>/observe).
The hook runs in the background — Claude never waits for the save to complete (0ms latency impact).
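For illustration, the observation the hook POSTs might be shaped like this. The endpoint path comes from the text above; the field names are assumptions.

```python
import json

def build_observation(tool_name: str, tool_input: dict, tool_output: str) -> str:
    """Serialize one tool call into the JSON body sent to /observe."""
    payload = {
        "tool": tool_name,      # e.g. "Read", "Bash", "Edit"
        "input": tool_input,    # the arguments Claude passed to the tool
        "output": tool_output,  # what the tool returned
    }
    return json.dumps(payload)

# The hook fires this at http://127.0.0.1:<port>/observe and does not wait
# for a response, which is why there is no latency impact on Claude.
```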
What happens in the background¶
- The /observe endpoint appends the observation to a JSONL file and signals the background worker
- The worker picks up the entry and runs the full distillation pipeline:
- Noise filter rejects trivial entries (short commands, empty output)
- Secret scanner redacts credentials and PII
- Local Ollama distills raw tool I/O into an impersonal factual statement
- Scanner re-checks the distilled output
- Embedding converts the text to a 768-dim vector
- Dedup check rejects near-duplicates (cosine > 0.95)
- Memory is saved to the team database
- Entries that arrive faster than the worker can process them queue up naturally in the JSONL file
What about the raw text?¶
Raw tool I/O is written to the private_store JSONL at ~/.team-memory/private/ with file permissions 0600 (owner-only read/write). These files:
- Are never synced to any remote
- Exist as the observation queue — processed entries are tracked via a cursor
- Can be deleted manually at any time
Error handling¶
- Failed entries are retried up to 3 times, then skipped
- If the worker is behind (burst of tool calls), it catches up when the burst subsides
- If the distill process crashes mid-queue, unprocessed entries survive in the JSONL and are processed on restart
Memory types¶
Every memory has a type that affects how long it stays relevant in search results. Choose the type that matches the nature of the knowledge, not its importance:
| Type | What it captures | Decay rate | Examples |
|---|---|---|---|
| decision | Choices and their rationale | Fast (14 days) | "Chose Celery over RQ", "Moved from REST to gRPC" |
| context | Situational knowledge | Very fast (7 days) | "Deploy freeze until Thursday", "API down for maintenance" |
| failure | What went wrong and why | Medium (45 days) | "OOM on staging due to unbounded cache", "Migration failed on FK constraint" |
| pattern | Established conventions | Slow (90 days) | "All API responses use envelope format", "Tests use factory_boy, not fixtures" |
| dependency | Technology and version choices | Very slow (180 days) | "PostgreSQL 16 on RDS", "Python 3.12 minimum" |
Decay doesn't delete
The decay rate affects search ranking, not storage. A 6-month-old decision still exists — it just ranks lower than yesterday's decision for the same topic. You can always find it with get_memory(id) or by searching specifically.
Why type-aware decay?¶
Consider a decision like "We're using Redis for session storage". After 6 months, either:
- It's still true → it should appear in search, but a newer decision about the same topic should rank higher
- It was superseded → the newer memory naturally outranks it
Meanwhile, a pattern like "All API handlers validate input with Pydantic" stays relevant for months. A flat decay rate would either penalize durable patterns or keep stale decisions artificially high.
The math behind this is a Weibull survival function: S(t) = exp(-(t/λ)^k), where λ is the scale (how many days until significant decay) and k is the shape (how the curve bends). You don't need to know the formula — just pick the right type.
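For the curious, here is the survival function in code. The λ values come from the decay table above; the shape parameter k is an assumption for illustration, and the real system may use a different value per type.

```python
import math

SCALE_DAYS = {          # λ per memory type, from the decay-rate table
    "context": 7, "decision": 14, "failure": 45,
    "pattern": 90, "dependency": 180,
}

def survival(age_days: float, mem_type: str, k: float = 1.0) -> float:
    """Weibull survival S(t) = exp(-(t/λ)^k); 1.0 when fresh, → 0 with age."""
    lam = SCALE_DAYS[mem_type]
    return math.exp(-((age_days / lam) ** k))
```

The same 30-day-old memory scores very differently depending on its type: a decision (λ=14) has decayed well past its scale, while a pattern (λ=90) is still young.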
Memory levels¶
Each memory has a level derived from its type. Levels group types by how broadly the knowledge applies:
| Level | Types | What it means |
|---|---|---|
| short-term | context | Ephemeral, situational — relevant right now |
| long-term | decision, pattern, failure, dependency | Durable project knowledge |
| shared | (multi-repo memories) | Knowledge that spans multiple repositories |
Levels affect search scoring through multipliers applied during ranking:
- Short-term: ×0.8 — slightly deprioritized since it's transient
- Long-term: ×1.0 — baseline weight
- Shared: ×1.2 — boosted because cross-repo knowledge is harder to rediscover
You don't set the level directly. It's derived from the memory type and repo scope.
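The derivation is a simple lookup. This sketch encodes the level table and multipliers above; the function name and the multi_repo flag are hypothetical.

```python
TYPE_TO_LEVEL = {
    "context": "short-term",
    "decision": "long-term", "pattern": "long-term",
    "failure": "long-term", "dependency": "long-term",
}
LEVEL_MULTIPLIER = {"short-term": 0.8, "long-term": 1.0, "shared": 1.2}

def level_for(mem_type: str, multi_repo: bool = False) -> str:
    """Level is derived from type; multi-repo scope overrides to 'shared'."""
    return "shared" if multi_repo else TYPE_TO_LEVEL[mem_type]
```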
How search works¶
When Claude calls search_memory, a hybrid search pipeline runs:
1. Dual retrieval¶
The query is processed through two independent search systems simultaneously:
- Full-text search (FTS): Keyword matching. Good for exact terms like "Redis", "OOM error", "migration"
- Vector similarity: Semantic matching. Good for conceptual queries like "how do we handle auth" even if the memory says "authentication middleware"
Both systems return ranked candidate lists.
2. Reciprocal Rank Fusion (RRF)¶
The two ranked lists are merged using RRF with k=60:
score(doc) = 1/(60 + rank_fts) + 1/(60 + rank_vec)
A memory that ranks high in both lists gets a high combined score. A memory that ranks #1 in FTS but doesn't appear in vector results still gets credit, just less.
Why RRF instead of a weighted average? Because FTS and vector scores are on different scales and aren't directly comparable. RRF only uses rank positions, so it works regardless of how each system scores internally.
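A minimal fusion sketch, taking the two candidate lists as document IDs ordered best-first (the function name is hypothetical; k=60 matches the formula above):

```python
def rrf_merge(fts_ids: list[str], vec_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ids in (fts_ids, vec_ids):
        for rank, doc_id in enumerate(ids, start=1):
            # Only the rank position contributes, never the raw score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists accumulates two sizeable terms, so it reliably beats a document that only one retriever found.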
3. Optional cross-encoder reranking¶
If enabled (RERANK_ENABLED=true), the top candidates are re-scored by a cross-encoder model (Jina) that reads query and document together. This is more accurate than embedding similarity but slower and requires an API key.
4. Weibull recency boost¶
Each result gets a time-decay adjustment based on its type:
final_score = 0.85 × base_score + 0.15 × weibull_recency
A 1-day-old decision gets nearly full recency boost. A 30-day-old decision gets ~5%. A 30-day-old pattern still gets ~72%.
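The blend is a fixed-weight combination. The 0.85/0.15 weights come from the formula above; weibull_recency is assumed to be the survival value S(t) in [0, 1].

```python
def blended_score(base_score: float, weibull_recency: float) -> float:
    """Combine retrieval relevance with type-aware recency."""
    return 0.85 * base_score + 0.15 * weibull_recency
```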
5. Access-frequency boost¶
Memories that are frequently accessed in search results get a small boost:
final_score *= 1.0 + log(access_count + 1) × 0.1
This creates a feedback loop: useful memories surface more often, which makes them even more discoverable. The log dampens the effect so a memory accessed 100 times isn't dramatically different from one accessed 50 times.
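In code, the boost might look like this; the log is assumed to be the natural log, which the source does not specify.

```python
import math

def access_boost(score: float, access_count: int) -> float:
    """Boost frequently accessed memories, with logarithmic damping."""
    return score * (1.0 + math.log(access_count + 1) * 0.1)
```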
6. Score threshold¶
Results below 0.10 are dropped. This prevents low-confidence matches from cluttering the output.
What you get back¶
Search returns a compact index, not full content:
[
{
"id": "abc123",
"type": "decision",
"snippet": "Celery chosen over RQ for task queue. Reason: RQ lacks robust...",
"repos": ["myapp"],
"score": 0.87,
"created_at": "2026-03-18T14:30:00Z",
"est_tokens": 25,
"agent_id": null
}
]
Each result costs ~30 tokens. For 5 results, that's ~150 tokens — cheap enough that Claude can search proactively without burning through your context window.
Progressive disclosure¶
The search response is deliberately compact. This is a design choice called progressive disclosure:
Layer 1 — search_memory: Returns IDs, types, 80-char snippets, scores, and estimated token counts. Claude uses this to decide which memories are relevant.
Layer 2 — get_memories: Claude fetches full content only for the IDs it actually needs.
Why not return full content immediately? Because most search results aren't relevant to the current task. If search_memory returned 5 full memories at ~100 tokens each, that's 500 tokens consumed even if only 1 memory matters. With progressive disclosure, Claude spends ~150 tokens on the index and ~100 tokens on the one memory it actually uses.
This is especially important for agents running search_memory before every architectural decision — the protocol keeps the token budget predictable.
Deduplication¶
Before any memory is saved, the system checks if a near-identical memory already exists by comparing embedding vectors:
- Cosine similarity ≥ 0.95 → duplicate, rejected with a pointer to the existing memory
- Below 0.95 → unique, proceeds to save
This threshold is deliberately high. Two memories about the same topic but with different details (e.g., "chose Redis for caching" vs. "chose Redis for session storage") will both be saved. Only near-verbatim duplicates are caught.
The dedup check runs at confirmation time, not at preview time. This means if you submit two identical memories within the preview window, the first one to be confirmed wins and the second gets a duplicate rejection.
Contradiction detection¶
The background worker doesn't just check for exact duplicates — it also looks for related memories that might contradict a new observation.
After distillation, the system searches for existing memories with cosine similarity > 0.80 (well below the 0.95 dedup threshold). When a contradiction is detected, the worker logs it for review via list_stale or search_memory.
Since observations are captured automatically, contradiction resolution happens at search time rather than at save time. When Claude searches for a topic and finds conflicting memories, it can use update_memory to supersede the outdated one, or forget to remove it.
Updating memories¶
update_memory doesn't edit in place. It creates a new memory and soft-deletes the old one:
- Fetches the existing memory
- Distills your new input
- Embeds the distilled text
- Saves a new memory with supersedes=old_id
- Soft-deletes the old memory (sets deleted_at)
The old memory stops appearing in search results, but the chain of supersession is preserved in the database. This means you can always trace how knowledge evolved.
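The steps above can be sketched against an in-memory store. The row shape is hypothetical, and the distill/embed steps are elided to keep the supersession logic visible.

```python
from datetime import datetime, timezone

def update_memory(store: dict, old_id: str, new_id: str, new_text: str) -> None:
    """Supersede, don't edit: insert a new row, soft-delete the old one."""
    old = store[old_id]              # 1. fetch the existing memory
    store[new_id] = {                # 2-4. save distilled + embedded text
        "text": new_text,            #      (distillation/embedding elided)
        "supersedes": old_id,        #      link back for the audit chain
        "deleted_at": None,
    }
    old["deleted_at"] = datetime.now(timezone.utc)  # 5. soft delete
```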
Forgetting¶
forget is a soft delete — it sets deleted_at on the memory, which excludes it from all search results. The data isn't physically removed from the database.
When agent_id is provided, forget only works if the memory belongs to that agent. This prevents one agent from accidentally deleting another agent's knowledge.
Stale memory detection¶
The list_stale tool identifies memories that have outlived their usefulness. It combines two signals:
- Weibull survival score < 0.1 — the memory has decayed past its type-appropriate lifespan
- Access count < 2 — nobody is finding it useful in search results
Both conditions must be true. A frequently accessed old memory isn't stale — it's still providing value. A rarely accessed recent memory isn't stale either — it hasn't had time to prove itself.
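The two-signal rule is a simple conjunction; the thresholds are the ones stated above, and the function name is illustrative.

```python
def is_stale(survival_score: float, access_count: int) -> bool:
    """Stale only when decayed past its lifespan AND rarely accessed."""
    return survival_score < 0.1 and access_count < 2
```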
The staleness thresholds are type-aware because each type has a different natural lifespan:
| Type | Approximate stale age |
|---|---|
| context | ~15 days |
| decision | ~30 days |
| failure | ~60 days |
| pattern | Several months |
| dependency | Several months |
list_stale returns candidates for review — it doesn't delete anything automatically. You decide whether to forget them or leave them in place.
Multi-agent support¶
Every tool accepts an optional agent_id parameter. When set:
- Remember: The memory is tagged with the agent's ID
- Search: Results can be filtered to only that agent's memories
- Forget: Only the owning agent can delete its memories
This allows multiple Claude Code instances (or custom agents) to maintain isolated knowledge bases within the same database. Omitting agent_id gives access to all memories.
Storage backends¶
Local (SQLite + LanceDB)¶
The default. Everything runs on your machine:
- Memories: SQLite with WAL mode for concurrent reads
- Full-text search: FTS5 with unicode61 tokenizer
- Vectors: LanceDB (embedded, file-based) with cosine distance
- Data directory: ~/.team-memory/ (configurable via DATA_DIR)
Good for: Solo developers, local-first workflows, air-gapped environments.
PostgreSQL (asyncpg + pgvector)¶
For teams sharing a knowledge base:
- Memories: PostgreSQL with JSONB for repos/tags
- Full-text search: Generated tsvector column with GIN index
- Vectors: pgvector extension with IVFFlat index (created after 100+ rows)
- Row-Level Security: When AUTH_ENABLED=true, queries are scoped to the developer's repos
Good for: Teams, shared knowledge bases, cloud deployments (Neon, Cloud SQL, RDS).
Both backends implement the same StoragePort interface. The domain layer doesn't know which one is running — you can switch by changing BACKEND=local to BACKEND=postgres and providing a DATABASE_URL.
Putting it all together¶
Here's what a typical day looks like for a developer using Distill with Claude Code:
Morning — context loading:
Claude calls search_memory before proposing architecture for a new feature. It finds 3 relevant memories from last week's decisions. You see the compact index, Claude fetches full content for 2 of them, and adjusts its proposal accordingly.
During work — automatic capture: You and Claude decide to use WebSockets instead of SSE. As Claude edits code and runs tests, the PostToolUse hook captures every tool call. The background worker distills these into factual memories like "WebSockets chosen over SSE for real-time updates" — no manual intervention needed.
Debugging — finding prior failures: You hit a cryptic error. Claude searches for related failures and finds a memory from 3 weeks ago: "Service mesh timeout caused by Envoy default idle_timeout=1h conflicting with long-polling connections. Fix: set idle_timeout=0 in Envoy config." Crisis averted.
New team member — onboarding: A colleague sets up Distill pointed at the same PostgreSQL database. They immediately have access to months of team decisions, patterns, and failure lessons — without ever reading through Slack history or meeting notes.