
SDK Feedback: Memory Observability Tools

Summary

Add built-in observability and debugging tools to the Memori SDK that let developers understand why specific memories were recalled, inspect similarity scores, trace memory lineage, and diagnose recall failures. This transforms memory from a black box into a transparent, debuggable system.

The Problem

Memory-powered AI applications are difficult to debug because the retrieval process is opaque. When an agent gives an unexpected response, developers face a frustrating investigation: there is no way to see which memories were considered, what similarity scores they received, or why a given memory was or wasn't returned.

Real-world impact: A support agent that "forgets" a user's previous issue mid-conversation is a critical bug, but without observability tools, debugging it requires manually querying the database and reverse-engineering the recall logic.

Proposed Solution

Introduce a debug=True mode and a companion MemoryInspector class that provides rich introspection into every memory operation. The debug mode should be zero-config to enable, with structured output that integrates with existing logging and observability tools.

Core Features

Recall Explainer

See every memory considered during recall, with similarity scores, ranking factors, and the final selection reason.

Query Analysis

Inspect how queries are parsed, embedded, and matched against the memory index.
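The similarity scores surfaced by query analysis are typically cosine similarities between the query embedding and each stored memory embedding. A minimal standalone illustration of that computation (not the SDK's internal code, which debug mode would merely surface):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions
query_vec = [0.2, 0.7, 0.1]
memory_vec = [0.25, 0.65, 0.05]
print(f"{cosine_similarity(query_vec, memory_vec):.3f}")
```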

Lineage Tracing

Track memory provenance from creation through every access, with full attribution chain.
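One possible shape for a lineage record is an ordered chain of events from creation through each access. The field names below are illustrative assumptions, not the SDK's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LineageEvent:
    event: str        # e.g. "created", "recalled", "updated"
    timestamp: str    # ISO 8601
    attribution: str  # who or what triggered the event

def format_lineage(chain: list[LineageEvent]) -> str:
    """Render a provenance chain as a single readable line."""
    return " -> ".join(f"{e.event}@{e.timestamp} ({e.attribution})" for e in chain)

chain = [
    LineageEvent("created", "2024-06-12T09:30:00Z", "user:456"),
    LineageEvent("recalled", "2024-08-03T14:02:11Z", "agent:support"),
]
print(format_lineage(chain))
```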

Performance Profiling

Timing breakdowns for embedding, search, reranking, and total latency per operation.

Memory Health Metrics

Aggregate stats on memory freshness, access patterns, and potential staleness.

Export & Replay

Capture debug sessions for offline analysis or bug report attachments.
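A sketch of what "export for bug report attachment" could look like: dump the captured debug data to a JSON file. The structure mirrors the debug fields shown elsewhere in this proposal but is an assumption, not a shipped format:

```python
import json

# Hypothetical exported debug session (fields mirror this proposal's examples)
debug_session = {
    "query": "What is the user's preferred language?",
    "attribution": "user:456",
    "candidates_evaluated": 12,
    "memories_returned": 3,
    "timing_ms": {"embedding": 12, "search": 8, "rerank": 23, "total": 43},
}

# Write the session to a file suitable for attaching to a bug report
with open("memori-debug-session.json", "w") as f:
    json.dump(debug_session, f, indent=2)
```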

Developer Experience

```python
from memori import MemoriClient

# Enable debug mode
client = MemoriClient(api_key="...", debug=True)

# Perform a recall - debug info is captured automatically
result = client.memory.recall(
    query="What is the user's preferred language?",
    attribution="user:456"
)

# Access the debug report
debug = result.debug

# See all memories that were considered
for candidate in debug.candidates:
    print(f"Memory: {candidate.content[:50]}...")
    print(f"  Score: {candidate.similarity_score:.3f}")
    print(f"  Recency boost: {candidate.recency_factor:.2f}")
    print(f"  Final rank: {candidate.rank}")
    print(f"  Selected: {candidate.selected}")

# Understand why certain memories weren't selected
print(f"Rejection reasons: {debug.rejection_summary}")

# Performance breakdown
print(f"Embedding time: {debug.timing.embedding_ms}ms")
print(f"Search time: {debug.timing.search_ms}ms")
print(f"Total: {debug.timing.total_ms}ms")
```

Example Debug Output

```
MEMORY RECALL DEBUG - query: "What is the user's preferred language?"
Attribution: user:456
Candidates evaluated: 12
Memories returned: 3

TOP CANDIDATES:
  "User set language preference to Spanish"                          0.923
  "User asked about Spanish documentation"                           0.847
  "Conversation was in Spanish"                                      0.812
  "User mentioned visiting Spain" (rejected: below threshold)        0.534

TIMING:
  Embed query:    12ms
  Vector search:   8ms
  Rerank:         23ms
  Total:          43ms
```
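The output above shows both raw similarity scores and ranking factors. How the SDK combines them internally is not specified in this proposal; the sketch below assumes a simple multiplicative recency boost and a hypothetical similarity cutoff, purely to illustrate how a final ordering could be derived from the debug fields:

```python
# Candidates mirroring the debug output above; recency factors are invented
candidates = [
    {"content": "User set language preference to Spanish", "similarity": 0.923, "recency": 1.10},
    {"content": "User asked about Spanish documentation", "similarity": 0.847, "recency": 1.00},
    {"content": "Conversation was in Spanish", "similarity": 0.812, "recency": 1.00},
    {"content": "User mentioned visiting Spain", "similarity": 0.534, "recency": 1.00},
]

THRESHOLD = 0.6  # hypothetical similarity cutoff

def final_score(c: dict) -> float:
    """Assumed combination rule: similarity scaled by a recency boost."""
    return c["similarity"] * c["recency"]

# Drop below-threshold candidates, then rank the rest by combined score
selected = sorted(
    (c for c in candidates if c["similarity"] >= THRESHOLD),
    key=final_score,
    reverse=True,
)
for rank, c in enumerate(selected, start=1):
    print(rank, c["content"], round(final_score(c), 3))
```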

Memory Inspector CLI

For interactive debugging, provide a CLI tool that can inspect memory state:

```shell
# Inspect memories for a specific attribution
$ memori inspect --attribution user:456

Found 47 memories for user:456

Type breakdown:
  preference: 12
  fact:       23
  summary:     8
  rule:        4

Oldest: 2024-01-15 (342 days ago)
Newest: 2025-01-28 (1 day ago)
Average access frequency: 2.3/week

Potential issues:
  - 3 memories have low access (>90 days since last recall)
  - 2 memories have conflicting content (preference type)

# Simulate a recall without executing
$ memori recall --dry-run --query "user's favorite color" --attribution user:456

DRY RUN - No memories will be modified

Would return 2 memories:
  1. "User's favorite color is blue" (score: 0.94, stored: 2024-06-12)
  2. "User mentioned liking blue themes" (score: 0.78, stored: 2024-08-03)
```

Trade-offs Considered

Benefits

  • Dramatically faster debugging of memory-related issues
  • Builds developer confidence in the memory system
  • Enables data-driven tuning of recall thresholds
  • Reduces support burden with self-service diagnostics
  • Creates foundation for memory analytics features
  • CLI tool enables ops teams to investigate production issues
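The "data-driven tuning" benefit can be made concrete: given similarity scores exported from debug sessions and labeled by whether the recall was judged relevant, a simple sweep can pick the cutoff that best separates the two groups. A minimal sketch under those assumptions (not an SDK feature):

```python
def best_threshold(relevant: list[float], irrelevant: list[float]) -> float:
    """Sweep candidate cutoffs; return the one classifying the most scores correctly."""
    best, best_correct = 0.0, -1
    for cut in sorted(set(relevant + irrelevant)):
        # Correct = relevant scores at/above the cut + irrelevant scores below it
        correct = sum(s >= cut for s in relevant) + sum(s < cut for s in irrelevant)
        if correct > best_correct:
            best, best_correct = cut, correct
    return best

# Scores collected from exported debug sessions (illustrative data)
relevant = [0.92, 0.85, 0.81, 0.78]
irrelevant = [0.53, 0.49, 0.61]
print(best_threshold(relevant, irrelevant))
```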

Drawbacks

  • Debug mode adds latency (collecting and structuring metadata)
  • Increased memory usage when debug info is retained
  • Risk of exposing sensitive memory content in logs
  • API surface area increases significantly
  • Debug output format becomes a compatibility concern
  • May encourage over-reliance on debugging vs. proper testing

Mitigations
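One possible mitigation for the sensitive-content risk listed under Drawbacks is a redaction hook applied to memory content before it reaches debug output or logs: replace the text with a stable hash so entries remain correlatable without exposing their contents. A hypothetical helper, not part of the SDK:

```python
import hashlib

def redact(content: str, preview_chars: int = 0) -> str:
    """Replace memory content with a stable short hash, plus an optional preview."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    preview = content[:preview_chars] + "..." if preview_chars else ""
    return f"{preview}<redacted:{digest}>"

# Fully redacted, and redacted with a short non-sensitive preview
print(redact("User's favorite color is blue"))
print(redact("User's favorite color is blue", preview_chars=6))
```

Because the hash is deterministic, the same memory produces the same token across log lines, which preserves debuggability while keeping raw content out of aggregators.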

Integration with Observability Stacks

The debug output should integrate seamlessly with common observability tools:

```python
# OpenTelemetry integration
from memori import MemoriClient
from memori.telemetry import OTelExporter

client = MemoriClient(
    api_key="...",
    telemetry_exporter=OTelExporter()  # Sends spans to configured collector
)

# Every memory operation creates a span with debug attributes
result = client.memory.recall(query="...", attribution="...")

# Spans include:
# - memori.recall.candidates_evaluated: 12
# - memori.recall.memories_returned: 3
# - memori.recall.top_score: 0.923
# - memori.recall.embedding_ms: 12
# - memori.recall.search_ms: 8
```
```python
# JSON logging for structured log aggregators
import logging

from memori import MemoriClient
from memori.logging import JSONDebugHandler

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("memori")
logger.addHandler(JSONDebugHandler())

client = MemoriClient(api_key="...", debug=True)

# Debug output goes to structured logs automatically
# {"event": "memory.recall", "candidates": 12, "returned": 3, ...}
```

Alternatives Considered

1. External APM Integration Only

Rely on Datadog, New Relic, etc. for all observability. Rejected because it requires additional infrastructure, doesn't provide memory-specific insights, and creates vendor lock-in.

2. Verbose Logging Mode

Add a LOG_LEVEL=DEBUG setting that prints detailed logs. Partially adopted as a fallback, but structured debug objects are more useful for programmatic analysis than log parsing.

3. Separate Debug SDK

Ship a memori-debug package with enhanced tooling. Rejected because it fragments the ecosystem and makes debugging feel like an afterthought rather than a first-class feature.

Success Metrics

Recommendation

Ship memory observability tools as a core SDK feature. The ability to understand why memory behaves the way it does is essential for building reliable AI applications. Without these tools, developers are forced to treat memory as a black box, leading to frustration and reduced trust in the platform.

Proposed rollout:

  1. Phase 1 (4 weeks): Basic debug mode with candidate list and timing in Python SDK
  2. Phase 2 (6 weeks): CLI inspector tool and JSON export
  3. Phase 3 (8 weeks): OpenTelemetry integration and TypeScript SDK parity

Related Feedback

Local Development Mode

Have feedback on this proposal? Open an issue on GitHub, explore the Memori Cookbook, or check the official docs.
