Claude Code Debugger

Install

claude plugin marketplace add tyroneross/RossLabs-AI-Toolkit
claude plugin install claude-code-debugger@rosslabs-ai-toolkit

The Problem

Claude Code resets to zero knowledge every session. A database timeout in week one takes the same four hours to diagnose in week three because nothing carries over: not the root cause, not the fix, not the verification path. The debugging intelligence is ephemeral, and difficult bugs consume disproportionate time when every encounter is the first encounter.

What I Built

Claude Code Debugger stores debugging incidents and extracts reusable patterns from them, then surfaces relevant prior work the moment a similar symptom appears. The behavior is now formalized as the debugging-memory skill: a memory-first workflow that runs before any investigation, returns a verdict, and decides whether to apply a known fix, adapt a similar one, or start fresh.

It also serves as a supporting plugin for Build Loop. Build Loop’s debugger phase calls into the bridges this plugin exposes; when no full debugger plugin is reachable, the debugging-memory skill is the standalone fallback.

The debugging-memory Skill

The skill is the contract. It runs before investigation, calls the debugger search MCP tool with the symptom, and returns one of four verdicts:

Verdict	Action
`KNOWN_FIX`	Apply the documented fix directly, adapting for current context. Skip the loop.
`LIKELY_MATCH`	Past incidents exist but need verification. Enter `debug-loop`.
`WEAK_SIGNAL`	Loosely related. Enter `debug-loop` with prior context as a starting point.
`NO_MATCH`	No prior knowledge. Enter `debug-loop` for full investigation.

After resolution, the skill calls store with symptom, root cause (with confidence 0–1), fix approach, files changed, and verification status. The next session benefits.
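
The verdict table and the store step can be sketched as follows. This is an illustrative model, not the plugin's code: the `next_step` labels, the `IncidentRecord` class, and its field names are hypothetical; only the four verdict names and the list of stored fields come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    KNOWN_FIX = "KNOWN_FIX"
    LIKELY_MATCH = "LIKELY_MATCH"
    WEAK_SIGNAL = "WEAK_SIGNAL"
    NO_MATCH = "NO_MATCH"


def next_step(verdict: Verdict) -> str:
    """Map a verdict to the next workflow step (step labels are hypothetical)."""
    if verdict is Verdict.KNOWN_FIX:
        return "apply-documented-fix"  # skip the loop
    return "debug-loop"                # every other verdict investigates


@dataclass
class IncidentRecord:
    """Hypothetical shape for what is stored after resolution."""
    symptom: str
    root_cause: str
    root_cause_confidence: float       # 0-1, as described above
    fix_approach: str
    files_changed: list[str] = field(default_factory=list)
    verified: bool = False

    def __post_init__(self) -> None:
        if not 0.0 <= self.root_cause_confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

Only `KNOWN_FIX` bypasses investigation; the other three verdicts differ in how much prior context they carry into `debug-loop`, not in where they go.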

Root-Cause Investigation

When the verdict isn’t KNOWN_FIX, the workflow escalates to the debug-loop skill. Investigation is structured as a causal tree, not a single chain: the root-cause-investigator agent branches on multiple potential causes, traces each, and flags when external research is needed. A fix-critique agent pressure-tests the proposed fix before declaring it resolved, checking whether it addresses the root cause rather than a symptom, what regression risk it carries, and where evidence is missing. The loop iterates up to five times, with a scorecard tracking pass/fail per criterion.

Every report uses ✅ Verified / ⚠️ Assumed / ❓ Unknown markers. No overclaiming.
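
The iteration cap and scorecard described in this section can be sketched as a small loop. It assumes the investigator and the fix-critique agent can be modeled as plain callables; `run_debug_loop`, `propose`, `critique_stub`, and the criterion names are hypothetical. Only the five-iteration cap and the pass/fail scorecard come from the text.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 5            # "up to five times" per the text
Scorecard = dict[str, bool]   # criterion name -> pass/fail


def run_debug_loop(
    propose_fix: Callable[[int], str],
    critique: Callable[[str], Scorecard],
) -> tuple[Optional[str], list[Scorecard]]:
    """Propose and critique until every criterion passes or the cap is hit."""
    history: list[Scorecard] = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        fix = propose_fix(attempt)
        scorecard = critique(fix)
        history.append(scorecard)
        if all(scorecard.values()):
            return fix, history
    return None, history      # unresolved after five attempts


# Hypothetical stand-ins for the two agents.
def propose(attempt: int) -> str:
    return f"fix-v{attempt}"


def critique_stub(fix: str) -> Scorecard:
    ok = fix == "fix-v3"
    return {
        "addresses_root_cause": ok,
        "regression_risk_acceptable": True,
        "evidence_complete": ok,
    }


accepted, history = run_debug_loop(propose, critique_stub)
```

Keeping the full scorecard history, rather than only the last result, makes it possible to report which criteria were verified on which attempt.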

Multi-Domain Parallel Assessment

For vague symptoms (“app broken”) or issues that span layers (“search is slow and returns wrong results”), the workflow dispatches domain assessors in parallel rather than running a single generic pass:

Assessor	Expertise
`database-assessor`	Prisma, PostgreSQL, queries, migrations, connection pools
`frontend-assessor`	React, hooks, rendering, state, hydration, SSR
`api-assessor`	Endpoints, REST/GraphQL, auth, middleware, CORS
`performance-assessor`	Latency, memory, CPU, bottlenecks

Each returns a JSON assessment with a confidence score, probable causes, recommended actions, and related past incidents. Results are ranked and merged into a unified diagnosis with a prioritized action sequence.
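
A minimal sketch of the dispatch-and-merge step, assuming each assessor is a callable that takes the symptom and returns an assessment. The `Assessment` fields mirror the ones listed above, but the class, the function names, and the stub results are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assessment:
    """Hypothetical shape of one assessor's result."""
    domain: str
    confidence: float                  # 0-1
    probable_causes: list[str]
    recommended_actions: list[str]


def assess_in_parallel(
    symptom: str, assessors: list[Callable[[str], Assessment]]
) -> list[Assessment]:
    """Run every assessor concurrently and collect the results in order."""
    with ThreadPoolExecutor(max_workers=max(1, len(assessors))) as pool:
        return list(pool.map(lambda assess: assess(symptom), assessors))


def merge_assessments(results: list[Assessment]) -> dict:
    """Rank by confidence, then build a de-duplicated action sequence."""
    ranked = sorted(results, key=lambda a: a.confidence, reverse=True)
    actions: list[str] = []
    for assessment in ranked:
        for action in assessment.recommended_actions:
            if action not in actions:  # keep the highest-confidence mention
                actions.append(action)
    return {
        "primary_domain": ranked[0].domain if ranked else None,
        "ranking": [(a.domain, a.confidence) for a in ranked],
        "action_sequence": actions,
    }


# Stub assessors with made-up results, for illustration only.
stubs = [
    lambda s: Assessment("performance", 0.64, ["slow query"],
                         ["profile slow query", "check pool size"]),
    lambda s: Assessment("database", 0.82, ["pool exhaustion"],
                         ["check pool size", "add index on search column"]),
    lambda s: Assessment("api", 0.30, ["auth latency"],
                         ["inspect auth middleware"]),
]
diagnosis = merge_assessments(
    assess_in_parallel("search is slow and returns wrong results", stubs)
)
```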

Memory Architecture

Two storage modes. Local mode (.claude/memory/) keeps incidents scoped to a single project, useful for proprietary codebases. Shared mode (~/.claude-code-debugger/) pools incidents across projects, useful for framework-specific patterns that recur everywhere.

Storage is tiered, following the pattern proven in IBR and NavGator:

Layer	Purpose
`MEMORY_SUMMARY.md`	Compressed context (<150 lines) for LLM cold starts
`index.json`	O(1) lookups by category, tag, file, quality tier
`keyword-index.json`	Inverted keyword → incident ID map
`incidents.jsonl`	Append-only log for fast full-text search
`incidents/INC_*.json`	Full incident details, loaded on demand
`outcomes.jsonl`	Verdict outcome tracking (worked / failed / modified)

Compound incident IDs (INC_CATEGORY_YYYYMMDD_HHMMSS_xxxx) encode the category in the filename, so incidents can be browsed by category without opening them. Auto-archival moves incidents into archive/ when the active set exceeds 200 incidents or when an incident is more than 180 days old.
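
The ID format and the archival thresholds can be sketched as below. The suffix scheme (random hex), the helper names, and the reading of the archival rule (archive anything older than 180 days, plus whatever falls outside the 200 most recent) are assumptions; only the ID layout and the two thresholds come from the text.

```python
import secrets
from datetime import datetime, timedelta, timezone

MAX_ACTIVE = 200
MAX_AGE = timedelta(days=180)


def make_incident_id(category: str, when: datetime) -> str:
    suffix = secrets.token_hex(2)  # 4 hex chars; the real suffix scheme is not specified
    return f"INC_{category.upper()}_{when:%Y%m%d}_{when:%H%M%S}_{suffix}"


def category_of(incident_id: str) -> str:
    # Split from the right so a category containing underscores still parses.
    prefix, _date, _time, _suffix = incident_id.rsplit("_", 3)
    return prefix.removeprefix("INC_")


def select_for_archive(created: dict[str, datetime], now: datetime) -> list[str]:
    """Return the IDs to archive under the assumed reading of the rule."""
    newest_first = sorted(created, key=created.get, reverse=True)
    too_old = {i for i in newest_first if now - created[i] > MAX_AGE}
    overflow = set(newest_first[MAX_ACTIVE:])
    return sorted(too_old | overflow)


when = datetime(2025, 3, 9, 14, 5, 7, tzinfo=timezone.utc)
example_id = make_incident_id("database", when)  # INC_DATABASE_20250309_140507_<suffix>
```

Because the category sits at a fixed position in the ID, a directory listing is enough to filter incidents by category.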

Pattern Extraction

When three or more similar incidents accumulate (Jaccard similarity >0.7), the system extracts a reusable pattern. Incidents become TF-IDF vectors over symptom and fix text; DBSCAN clusters them with epsilon 0.3 and a minimum of 3 incidents per cluster. The centroid becomes the template, with cluster-member variations recorded as alternatives. Pattern matches return ~90% confidence versus ~70% for individual incident matches; outcome tracking feeds back into pattern success rates so the system learns which patterns are reliable.
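
To make the clustering step concrete, here is a small pure-Python DBSCAN using the epsilon and minimum cluster size quoted above. It is a simplification: it measures Jaccard distance on token sets rather than building TF-IDF vectors, and it stops at cluster labels without extracting a centroid template. The function names and sample token sets are illustrative.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dbscan(items: list[set[str]], eps: float = 0.3, min_samples: int = 3) -> list[int]:
    """Label each item with a cluster index, or -1 for noise."""
    n = len(items)
    # Neighborhoods include the point itself, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if 1.0 - jaccard(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    visited = [False] * n
    cluster = -1
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue                   # not a core point; noise unless claimed later
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])
    return labels


incidents = [
    {"prisma", "connection", "pool", "timeout", "query"},
    {"prisma", "connection", "pool", "timeout", "query", "slow"},
    {"prisma", "connection", "pool", "timeout", "query", "retry"},
    {"react", "hydration", "mismatch", "ssr"},
]
labels = dbscan(incidents)  # three pool-timeout incidents cluster; the React one is noise
```

With distance defined as one minus similarity, an epsilon of 0.3 admits neighbors whose similarity is roughly 0.7 or higher, which lines up with the threshold quoted above.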

Retrieval

Retrieval is progressive and pattern-first. Symptom keywords hit the inverted index for O(log n) lookup; patterns are matched first, then incidents. Results return as one-liner summaries (~40 tokens each) with verdicts, and full details load on demand via detail. The default token budget is 2,500, split 30% patterns / 60% incidents / 10% metadata, and the system automatically chooses the summary, compact, or full tier based on what fits.
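
The budget arithmetic works out as follows. The split and the ~40-token summary size come from the text; the per-result costs for the compact and full tiers, and the rule of picking the richest tier that fits, are assumptions made for this sketch.

```python
DEFAULT_BUDGET = 2_500
SPLIT_PERCENT = {"patterns": 30, "incidents": 60, "metadata": 10}
SUMMARY_TOKENS = 40  # approximate cost of a one-liner summary

# Hypothetical per-result costs for the richer tiers.
TIER_COST = {"full": 400, "compact": 120, "summary": SUMMARY_TOKENS}


def allocate(budget: int = DEFAULT_BUDGET) -> dict[str, int]:
    """Split the budget using integer math to avoid float rounding."""
    return {name: budget * pct // 100 for name, pct in SPLIT_PERCENT.items()}


def choose_tier(result_count: int, tokens_available: int) -> str:
    """Pick the richest tier whose total cost fits; fall back to summary."""
    for tier in ("full", "compact", "summary"):
        if result_count * TIER_COST[tier] <= tokens_available:
            return tier
    return "summary"


shares = allocate()                                    # 750 / 1500 / 250 tokens
max_summaries = shares["incidents"] // SUMMARY_TOKENS  # 37 one-liners fit
tier_for_five = choose_tier(5, shares["incidents"])    # "compact" under these costs
```

At the default budget the incident share holds about 37 one-line summaries, which is why summaries are returned first and details are loaded only on request.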

For symptom matching, Jaro-Winkler outperformed Levenshtein and cosine similarity on short phrases with typos (common in error messages). Cosine similarity wins for long-form descriptions and is used there.
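
For reference, a textbook implementation of Jaro-Winkler is short enough to read in one pass. This is the standard algorithm with the conventional defaults (prefix scale 0.1, prefix capped at four characters), not code taken from the plugin.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings that share a prefix (up to 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


typo_score = jaro_winkler("connection timout", "connection timeout")
unrelated_score = jaro_winkler("connection timeout", "hydration mismatch")
```

The prefix boost is what makes the metric forgiving of typos late in a short error phrase: a single dropped character in "timout" still scores close to 1.0 against "timeout".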

Trace Ingestion

Adapters convert traces from external systems into incident drafts:

OpenTelemetry: error-status spans become drafts with stack trace, operation name, and duration
Sentry: error events with breadcrumbs as tags
LangChain / LangSmith: prompt failures with full input/output
Browser: Chrome DevTools, Playwright, console logs

Traces are summarized to preserve diagnostic information while keeping token cost low.
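
As an illustration of the OpenTelemetry adapter, the sketch below converts a simplified span dictionary into an incident draft. The span shape, the draft field names, and the truncation rule (keep the first few stack lines, which in JavaScript-style traces hold the error message and nearest frames) are all assumptions, not the plugin's actual format.

```python
from typing import Optional

MAX_STACK_LINES = 5  # hypothetical cap to keep drafts small


def span_to_draft(span: dict) -> Optional[dict]:
    """Turn an error-status span into an incident draft; ignore healthy spans."""
    if span.get("status") != "ERROR":
        return None
    stack_lines = span.get("stacktrace", "").strip().splitlines()
    return {
        "symptom": f"{span['name']} failed",
        "operation": span["name"],
        "duration_ms": (span["end_ns"] - span["start_ns"]) / 1_000_000,
        "stack_trace": stack_lines[:MAX_STACK_LINES],  # summarized, not full
        "source": "opentelemetry",
    }


# Made-up span for illustration.
span = {
    "name": "GET /api/search",
    "status": "ERROR",
    "start_ns": 1_000_000_000,
    "end_ns": 4_500_000_000,
    "stacktrace": "Error: query timed out\n" + "\n".join(
        f"    at frame{n} (file.ts:{n})" for n in range(1, 11)
    ),
}
draft = span_to_draft(span)
```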

Context Engine

Three layers surface memory automatically without manual /debugger calls:

CLAUDE.md dynamic section: memory stats, hot files (those with the most past incidents), and trigger instructions, refreshed on rebuild-index or session start
SessionStart hook: runs session-context to inject ~150 tokens of memory state at the beginning of each Claude session
PreToolUse hook: runs check-file when a file is about to be edited; surfaces past incident IDs if any exist, otherwise returns {"ok": true} with zero noise
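
The file check behind the hook can be pictured as a lookup against a file-to-incidents index. In this sketch the index shape, the `check_file` function, and the `past_incidents` key are hypothetical; only the quiet `{"ok": true}` response for files with no history comes from the text.

```python
import json


def check_file(path: str, file_index: dict[str, list[str]]) -> str:
    """Return JSON output for a file, listing past incidents if any exist."""
    incident_ids = file_index.get(path, [])
    if not incident_ids:
        return json.dumps({"ok": True})  # nothing known: stay silent
    return json.dumps({"ok": True, "past_incidents": incident_ids})


# Made-up index for illustration.
file_index = {"src/db/pool.ts": ["INC_DATABASE_20250309_140507_ab12"]}
quiet = check_file("src/ui/button.tsx", file_index)
noisy = check_file("src/db/pool.ts", file_index)
```

Returning the same minimal payload for every file without history keeps the hook from adding tokens to sessions that never touch a previously broken file.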

Outcome Tracking

After a suggested fix is tried, the outcome is recorded as worked, failed, or modified. This feeds back into pattern success rates so verdicts improve over time. A pattern that worked four times and failed once gets ranked above one that’s never been verified, even if textual similarity scores match.
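
One way to get the ranking behavior described above is to weight similarity by a smoothed success rate. The Laplace smoothing and the multiplicative score are assumptions chosen for illustration, and modified outcomes are left out of this sketch; the plugin's actual scoring may differ.

```python
def success_rate(worked: int, failed: int) -> float:
    """Laplace-smoothed success rate; an unverified pattern lands at 0.5."""
    return (worked + 1) / (worked + failed + 2)


def rank(candidates: list[dict]) -> list[dict]:
    """Order candidates by similarity weighted by outcome history."""
    return sorted(
        candidates,
        key=lambda c: c["similarity"] * success_rate(c["worked"], c["failed"]),
        reverse=True,
    )


candidates = [
    {"id": "unverified", "similarity": 0.8, "worked": 0, "failed": 0},
    {"id": "proven", "similarity": 0.8, "worked": 4, "failed": 1},
]
ordered = [c["id"] for c in rank(candidates)]  # "proven" ranks first
```

With four successes and one failure the smoothed rate is 5/7 (about 0.71), against 0.5 for a pattern with no recorded outcomes, so equal similarity scores no longer tie.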

Distribution and Verification

Available via the Claude Code plugin marketplace and as @tyroneross/claude-code-debugger on npm. The npm package is published with provenance through npm trusted publishing, so the published package can be verified against the GitHub source commit it was built from. The package ships 45 end-to-end tests covering incident CRUD, parallel retrieval, pattern extraction, and trace ingestion, plus a 6-dimension benchmark suite that scores retrieval accuracy, verdict precision, context efficiency, pattern quality, scalability, and cold-start quality.