MACP Control Plane — Troubleshooting
Runtime Connection Failures
Symptom: GET /readyz returns runtime.ok: false
Checks:
- Verify the runtime is running: grpcurl -plaintext 127.0.0.1:50051 list
- Check that the RUNTIME_ADDRESS env var matches the runtime's listen address
- If using TLS, ensure RUNTIME_TLS=true and certificates are valid
- Check RUNTIME_REQUEST_TIMEOUT_MS (default 30s); increase it if the runtime is slow
- Check circuit breaker state in GET /readyz; reset with POST /admin/circuit-breaker/reset
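The checks above can be sketched as a small triage helper. This is only an illustration: the guide names `runtime.ok` in the readiness body, but the `circuitBreaker` field and the `nextChecks` function are hypothetical names, so adapt them to the actual GET /readyz response.

```typescript
// Hypothetical readiness shape: only `runtime.ok` is named in this guide.
// `circuitBreaker` is an assumed field name for the breaker state.
type Readyz = {
  runtime: { ok: boolean };
  circuitBreaker?: "OPEN" | "CLOSED";
};

// Returns the checks from the list above that are still worth running, in order.
function nextChecks(body: Readyz): string[] {
  if (body.runtime.ok) return []; // runtime reachable, nothing to do
  const checks: string[] = [];
  if (body.circuitBreaker === "OPEN") {
    checks.push("POST /admin/circuit-breaker/reset once the runtime is back");
  }
  checks.push(
    "grpcurl -plaintext 127.0.0.1:50051 list",
    "compare RUNTIME_ADDRESS with the runtime's listen address",
    "if RUNTIME_TLS=true, validate the certificates",
    "consider raising RUNTIME_REQUEST_TIMEOUT_MS",
  );
  return checks;
}
```

Calling `nextChecks` with a parsed readiness body gives an ordered to-do list; an empty array means the runtime connection is healthy.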
Circuit Breaker Open
Symptom: All runtime calls fail with CIRCUIT_BREAKER_OPEN
Cause: 5 consecutive gRPC failures tripped the circuit breaker.
Fix:
- Check runtime health: GET /runtime/health
- If the runtime is back, reset: POST /admin/circuit-breaker/reset
- Or wait for auto-reset after RUNTIME_CIRCUIT_BREAKER_RESET_MS (default 30s)
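To make the trip and reset behaviour concrete, here is a minimal sketch of a consecutive-failure breaker. It is not the control plane's implementation: the class and method names are hypothetical, and it assumes the breaker simply closes again once the reset window has elapsed. Only the threshold (5 failures) and the default window (30s) come from this guide.

```typescript
// Illustrative consecutive-failure circuit breaker (hypothetical names).
class ConsecutiveFailureBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly resetMs: number;

  constructor(threshold = 5, resetMs = 30000) {
    this.threshold = threshold;
    this.resetMs = resetMs;
  }

  // True while calls would be rejected with CIRCUIT_BREAKER_OPEN.
  isOpen(nowMs: number): boolean {
    if (this.openedAt === null) return false;
    if (nowMs - this.openedAt >= this.resetMs) {
      this.reset(); // auto-reset once the window has elapsed
      return false;
    }
    return true;
  }

  // Any success breaks the consecutive-failure streak.
  recordSuccess(): void {
    this.failures = 0;
  }

  // Opens the breaker when the streak reaches the threshold.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold && this.openedAt === null) {
      this.openedAt = nowMs;
    }
  }

  // Manual reset, analogous to POST /admin/circuit-breaker/reset.
  reset(): void {
    this.failures = 0;
    this.openedAt = null;
  }
}
```

The clock is passed in as `nowMs` rather than read from `Date.now()` so the behaviour can be checked deterministically.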
Migration Issues
Symptom: Application fails to start with database errors
Steps:
- Ensure PostgreSQL is running and accessible via DATABASE_URL
- Migrations run automatically on startup (see src/db/migrate.ts)
- Check the drizzle/ directory for migration SQL files
- Use npm run drizzle:studio to inspect database state
Stuck Runs
Symptom: Runs stay in starting or running state indefinitely
Steps:
- Check stream consumer logs for reconnection errors
- Verify runtime session state: GET /readyz
- Check STREAM_MAX_RETRIES (default 5) and STREAM_IDLE_TIMEOUT_MS (default 120s)
- Manually cancel: POST /runs/{id}/cancel
- If recovery is enabled (RUN_RECOVERY_ENABLED=true), the system auto-recovers orphaned runs on startup
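When many runs are affected, it can help to pick out the candidates before cancelling them one by one. The sketch below assumes a hypothetical run summary with `id`, `status`, and `updatedAtMs` fields, and an age threshold of your choosing; a long-lived run can legitimately stay in running, so treat the result as a list to review rather than a list to cancel blindly.

```typescript
// Hypothetical run summary; only the status names come from this guide.
type RunSummary = { id: string; status: string; updatedAtMs: number };

// Returns ids of runs that have sat in starting/running longer than maxAgeMs.
function findStuckRuns(runs: RunSummary[], nowMs: number, maxAgeMs: number): string[] {
  return runs
    .filter((r) => r.status === "starting" || r.status === "running")
    .filter((r) => nowMs - r.updatedAtMs > maxAgeMs)
    .map((r) => r.id);
}

// Builds the manual-cancel path for a run (POST /runs/{id}/cancel).
function cancelPath(runId: string): string {
  return `/runs/${encodeURIComponent(runId)}/cancel`;
}
```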
Auth-service unreachable / JWT mint failure
Symptom: Log line auth_mint_failure reason=... or JWT mint failed; falling back to static bearer.
Explanation: MACP_AUTH_SERVICE_URL is set, but the auth-service is down, returned non-2xx, or its response was unparseable. The credential resolver automatically falls back to RUNTIME_BEARER_TOKEN for this call.
Checks:
- Is the auth-service reachable? curl -X POST $MACP_AUTH_SERVICE_URL/tokens -d '{}' -H 'content-type: application/json' (expect a 4xx response, not a connection error).
- Is RUNTIME_BEARER_TOKEN set as a fallback? Without it the call eventually proceeds with no Authorization header (dev-header mode) or fails auth on the runtime side.
- If the auth-service is healthy but calls still fail, check MACP_AUTH_SERVICE_TIMEOUT_MS (default 5000 ms); slow auth-services can time out under load.
See also: runtime/docs/getting-started#authentication → Resolver order for how the runtime evaluates inbound credentials, and ARCHITECTURE.md § Runtime Credential Resolution for the control-plane side of the chain.
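The fallback order described in this section can be sketched as a pure function. This is an assumption-laden illustration, not the resolver's actual code: `MintResult`, `Resolved`, and `resolveCredential` are hypothetical names, and the real chain is documented in ARCHITECTURE.md.

```typescript
// Outcome of a JWT mint attempt against the auth-service (hypothetical type).
type MintResult =
  | { ok: true; jwt: string }
  | { ok: false; reason: string };

// Which credential ends up on the outgoing runtime call.
type Resolved = { authorization?: string; source: "jwt" | "static" | "none" };

// `mint` is null when MACP_AUTH_SERVICE_URL is unset, so no mint was attempted.
// `staticBearer` stands in for RUNTIME_BEARER_TOKEN.
function resolveCredential(mint: MintResult | null, staticBearer?: string): Resolved {
  if (mint !== null && mint.ok) {
    return { authorization: `Bearer ${mint.jwt}`, source: "jwt" };
  }
  if (staticBearer) {
    // Mint failed or was skipped: fall back to the static bearer token.
    return { authorization: `Bearer ${staticBearer}`, source: "static" };
  }
  // No credential at all: the call goes out without an Authorization header.
  return { source: "none" };
}
```

A `source` of "none" corresponds to the case in the checks above where the call either runs in dev-header mode or fails auth on the runtime side.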
bindSession ConflictException in logs
Symptom: Log line bindSession no-op for run <uuid>: cannot transition ... (current status=running).
Explanation: Not an error. Two paths can race to bind the same run: RunExecutorService for runs created via POST /runs, and SessionDiscoveryService for runs auto-discovered via WatchSessions. Whichever arrives second sees the run already past binding_session. As of the subscribe-session PR, the second call is a logged no-op; it no longer crashes the process.
When to investigate: only if you see this repeatedly for the same runId — that would indicate a loop somewhere retrying the bind. A single occurrence per run is normal.
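The race is harmless because the bind behaves like a guarded state transition: only the first caller finds the run in binding_session. A minimal sketch of that idea follows. The types and function are hypothetical, and the assumption that a successful bind moves the run straight to running is made only for illustration.

```typescript
// Subset of run statuses, for illustration only.
type RunStatus = "binding_session" | "running";
type Run = { id: string; status: RunStatus; runtimeSessionId?: string };
type BindOutcome = { applied: boolean; note?: string };

// Binds a session only if the run is still waiting for one.
// A second caller gets a no-op outcome instead of an exception.
function bindSession(run: Run, sessionId: string): BindOutcome {
  if (run.status !== "binding_session") {
    return {
      applied: false,
      note: `run ${run.id} already past binding_session (status=${run.status})`,
    };
  }
  run.runtimeSessionId = sessionId;
  run.status = "running"; // assumed next state
  return { applied: true };
}
```

In a real service the check and the update would need to happen atomically (for example in a single conditional database update); the in-memory version above only shows the control flow.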
Legacy Write Endpoints Return 410 Gone
Symptom: POST /runs/:id/messages, /signal, or /context returns 410 Gone with errorCode: ENDPOINT_REMOVED.
Explanation: The control-plane is observer-only as of the 2026-04-15 direct-agent-auth refactor. Agents authenticate to the runtime directly and emit their own envelopes via macp-sdk-python / macp-sdk-typescript. See docs/API.md § "Messages & Signals — emission is NOT via the control-plane" for the mapping, and the SDK guides for the new agent flow: python-sdk direct-agent-auth, typescript-sdk agent-framework.
Agent Envelopes Not Appearing in Projection
Symptom: Agents call session.send(...) via the SDK but events don't appear in GET /runs/:id/state.
Checks:
- Confirm the run's runtimeSessionId matches the session_id the agent is writing to (GET /runs/:id).
- Check stream consumer logs for StreamSession reconnection loops; the observer subscribes read-only and must be connected.
- Confirm the runtime echoes envelopes back on the stream (some runtimes only echo certain message types). signal.emitted and message.sent canonical events require stream-envelope entries on the observer stream. See runtime/docs/API#message-transport for StreamSession semantics and runtime/docs/sdk-guide#streaming for the observer lifecycle.
- For session discovery, verify SESSION_DISCOVERY_ENABLED=true so externally-launched sessions auto-create runs. Concepts: python-sdk/docs/guides/session-discovery.md.
SSE Stream Drops
Symptom: Live stream disconnects frequently
Checks:
- Check heartbeat interval: STREAM_SSE_HEARTBEAT_MS (default 15s)
- Ensure no proxy/load balancer is timing out idle connections
- Check STREAM_IDLE_TIMEOUT_MS (default 120s)
- Client should handle reconnection using the Last-Event-Id header
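For clients that do not use the browser's EventSource (which resends the last id automatically), reconnection has to be handled by hand. The sketch below shows one way to track the last seen event id and build the resume header. The function names are hypothetical, the parser handles only complete frames, and it assumes the server sends standard `id:` fields; HTTP header names are case-insensitive, so `Last-Event-ID` and `Last-Event-Id` are equivalent.

```typescript
type SseEvent = { id?: string; event?: string; data: string };

// Parses complete SSE frames (separated by a blank line) from a text chunk.
function parseSseFrames(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    const ev: SseEvent = { data: "" };
    const dataLines: string[] = [];
    let hasField = false;
    for (const line of frame.split(/\r?\n/)) {
      // Skip blanks and comment lines (comments are often used as heartbeats).
      if (line === "" || line.startsWith(":")) continue;
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      if (field === "id") { ev.id = value; hasField = true; }
      else if (field === "event") { ev.event = value; hasField = true; }
      else if (field === "data") { dataLines.push(value); hasField = true; }
    }
    if (hasField) {
      ev.data = dataLines.join("\n");
      events.push(ev);
    }
  }
  return events;
}

// Returns the most recent event id, keeping the previous one if none was seen.
function lastEventId(events: SseEvent[], previous?: string): string | undefined {
  let last = previous;
  for (const ev of events) {
    if (ev.id !== undefined) last = ev.id;
  }
  return last;
}

// Headers to send on reconnect so the server can resume after the last event.
function reconnectHeaders(last?: string): Record<string, string> {
  return last === undefined ? {} : { "Last-Event-ID": last };
}
```

On each disconnect, the client would reopen the stream with `reconnectHeaders(lastSeen)` and continue updating `lastSeen` from the parsed frames.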
High Memory Usage
Causes:
- Too many active SSE subscribers: StreamHub cleans up idle subjects after 60s
- Large replay queries: batch size is configurable via REPLAY_BATCH_SIZE (default 500)
- Database connection pool exhaustion: check DB_POOL_MAX (default 20)
- Event accumulation: check STREAM_MAX_RETRIES for stuck reconnection loops
Integration Test Issues
Test DB connection fails:
- Start test postgres: docker compose -f docker-compose.test.yml up -d postgres-test
- Test DB uses port 5433 (not 5432) to avoid conflict with dev DB
Real runtime tests fail with InvalidPayload:
- Use payloadEnvelope with proto encoding instead of plain payload
- Set INTEGRATION_RUNTIME=remote and RUNTIME_ADDRESS=127.0.0.1:50051
Prometheus metric re-registration error:
- Tests that create multiple NestJS apps must call promClient.register.clear() between apps
"Test suite failed to run" even though every assertion passed:
- Teardown leak: background observation services (StreamConsumerService, SignalConsumerService, SessionDiscoveryService) had in-flight persistRawAndCanonical work when the DB pool closed. Fixed by test/helpers/test-app.ts → drainBackgroundWork(), which awaits each service's bounded drain before Nest's own onModuleDestroy sweep. If you see this in a new test, make sure you created the app via createTestApp(...) so the app.close() wrapper is in place.
Common Error Codes
| Code | HTTP | Meaning |
|---|---|---|
| RUN_NOT_FOUND | 404 | Run ID does not exist |
| INVALID_STATE_TRANSITION | 409 | Cannot transition run to requested state |
| RUNTIME_UNAVAILABLE | 502 | Cannot connect to gRPC runtime |
| RUNTIME_TIMEOUT | 504 | gRPC call exceeded deadline |
| CIRCUIT_BREAKER_OPEN | 503 | Runtime circuit breaker is open |
| STREAM_EXHAUSTED | 500 | Max stream reconnection retries reached |
| SESSION_EXPIRED | 410 | Runtime session has expired |
| MODE_NOT_SUPPORTED | 400 | Runtime does not support requested mode |
| VALIDATION_ERROR | 400 | Request body validation failed |
| INVALID_SESSION_ID | 400 | Session ID not recognized by runtime |
| UNKNOWN_POLICY_VERSION | 400 | Policy version not found in registry |
| POLICY_DENIED | 403 | Commitment rejected by policy rules |
| INVALID_POLICY_DEFINITION | 400 | Policy rules fail schema validation |
| SESSION_ALREADY_EXISTS | 409 | Duplicate session start attempt |
| INTERNAL_ERROR | 500 | Unexpected server error |
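
Client code that handles these errors may find it convenient to have the table as a typed lookup. The following is a sketch transcribed from the table above; the constant and helper names are hypothetical, and the client/server split is simply derived from the HTTP status class rather than from any documented retry policy.

```typescript
// Error codes and HTTP statuses transcribed from the table above.
const ERROR_HTTP_STATUS = {
  RUN_NOT_FOUND: 404,
  INVALID_STATE_TRANSITION: 409,
  RUNTIME_UNAVAILABLE: 502,
  RUNTIME_TIMEOUT: 504,
  CIRCUIT_BREAKER_OPEN: 503,
  STREAM_EXHAUSTED: 500,
  SESSION_EXPIRED: 410,
  MODE_NOT_SUPPORTED: 400,
  VALIDATION_ERROR: 400,
  INVALID_SESSION_ID: 400,
  UNKNOWN_POLICY_VERSION: 400,
  POLICY_DENIED: 403,
  INVALID_POLICY_DEFINITION: 400,
  SESSION_ALREADY_EXISTS: 409,
  INTERNAL_ERROR: 500,
} as const;

type ErrorCode = keyof typeof ERROR_HTTP_STATUS;

// Type guard for error codes received as plain strings.
function isKnownErrorCode(code: string): code is ErrorCode {
  return Object.prototype.hasOwnProperty.call(ERROR_HTTP_STATUS, code);
}

// 5xx codes point at the runtime or control plane; 4xx codes point at the request.
function faultSide(code: ErrorCode): "server" | "client" {
  return ERROR_HTTP_STATUS[code] >= 500 ? "server" : "client";
}
```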