Before you start
Use safe structured timing only while diagnosing.
First cluster read
Run read-only checks from a shell where kubectl is configured for the target
cluster and namespace, and look for:
- restarts;
- readiness or liveness probe failures;
- /api/health timeouts;
- OOM kills;
- CPU or memory limits;
- scheduling failures;
- node scale-up events.
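
The checks in the list above map onto a handful of standard read-only kubectl subcommands. The sketch below only builds and prints the command lines; the namespace and label selector are illustrative placeholders, not values from this runbook, and the top command assumes metrics-server is installed.

```python
import shlex

# Verbs that only read cluster state. Anything else is rejected below.
READ_ONLY_VERBS = {"get", "describe", "top", "logs"}


def first_read_commands(namespace: str, selector: str) -> list[list[str]]:
    """Return kubectl argv lists for a first, read-only look at a namespace.

    `namespace` and `selector` are placeholders for the target environment.
    """
    ns = ["-n", namespace]
    commands = [
        # Restart counts, readiness, and pod age.
        ["kubectl", "get", "pods", *ns, "-l", selector, "-o", "wide"],
        # Probe failures, OOMKilled last state, and configured limits.
        ["kubectl", "describe", "pods", *ns, "-l", selector],
        # Scheduling failures and scale-up events, oldest first.
        ["kubectl", "get", "events", *ns, "--sort-by=.lastTimestamp"],
        # Current CPU and memory usage (needs metrics-server).
        ["kubectl", "top", "pods", *ns, "-l", selector],
        # Node count and readiness.
        ["kubectl", "get", "nodes"],
    ]
    for argv in commands:
        # Guard against a mutating verb slipping into the list later.
        assert argv[1] in READ_ONLY_VERBS, f"not read-only: {argv}"
    return commands


if __name__ == "__main__":
    # "staging" and "app=web" are illustrative values only.
    for argv in first_read_commands("staging", "app=web"):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to execute anywhere; paste the output into the configured shell once the namespace and selector are correct.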
A /api/health probe that fails with context deadline exceeded means the
whole web pod was not answering quickly. That is different from one slow Teams
query or one slow UI component.
Capture logs
Capture web and worker logs for the incident window.
Parse perf traces
Structured perf log lines contain JSON with "type":"second.perf". Group by
requestId, route, and second.
This script gives a quick route-level summary:
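
A minimal sketch of such a summary, under stated assumptions: each perf line ends with a single JSON object, and that object carries an ISO 8601 timestamp field. The field name timestamp is an assumption; requestId, route, and the type value come from the description above. Each request is counted once, in the second of its first perf line.

```python
import json
import sys
from collections import Counter


def parse_perf_line(line: str):
    """Return the perf event embedded in a log line, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        # Assumption: the JSON object is the last thing on the line.
        event = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("type") != "second.perf":
        return None
    return event


def summarize(lines):
    """Count distinct requests per (second, route)."""
    seen_requests = set()
    counts = Counter()
    for line in lines:
        event = parse_perf_line(line)
        if event is None:
            continue
        request_id = event.get("requestId")
        if request_id is not None:
            if request_id in seen_requests:
                continue  # several perf lines can belong to one request
            seen_requests.add(request_id)
        # Assumption: `timestamp` is ISO 8601, so its first 19 characters
        # ("YYYY-MM-DDTHH:MM:SS") identify the second.
        second = str(event.get("timestamp", ""))[:19]
        counts[(second, event.get("route"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_perf.py web.log
    with open(sys.argv[1], encoding="utf-8") as handle:
        for (second, route), n in summarize(handle).most_common(20):
            print(f"{n:5d}  {second}  {route}")
```

The output is sorted by count, so seconds with an unusual number of requests to one route appear first.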
Check whether the output shows many identical Teams, Members,
Invitations, or Integrations GETs in the same second. If it does, treat that as
request amplification, not user behavior.
Split slow requests
For slow.response events, inspect the same requestId and split elapsed
time into:
- auth.workspace;
- app access checks;
- settings read model;
- DB subqueries;
- stream readiness wait;
- resumable stream resume;
- replay fallback;
- total elapsed.
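
The split above can be sketched as a small aggregation over already-parsed perf events. The event shape used here (phase and durationMs fields, plus one event with phase "total") is hypothetical; adapt the field names to what the perf lines actually contain.

```python
from collections import defaultdict


def split_elapsed(events, request_id):
    """Sum per-phase durations for one request and report unattributed time.

    Assumed (hypothetical) event shape:
        {"requestId": str, "phase": str, "durationMs": number}
    with one {"phase": "total"} event carrying total elapsed time.
    """
    by_phase = defaultdict(float)
    for event in events:
        if event.get("requestId") != request_id:
            continue
        by_phase[event.get("phase", "unknown")] += float(event.get("durationMs", 0))

    total = by_phase.pop("total", None)
    accounted = sum(by_phase.values())

    # Largest phases first, so the dominant cost is easy to spot.
    breakdown = dict(sorted(by_phase.items(), key=lambda kv: kv[1], reverse=True))
    # Time not covered by any recorded phase; None when no total was logged.
    # Overlapping phases can sum past the total, so clamp at zero.
    breakdown["unattributed"] = None if total is None else max(total - accounted, 0.0)
    breakdown["total"] = total
    return breakdown
```

A large unattributed value suggests time spent outside the instrumented phases, for example waiting on a saturated event loop.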
If auth.workspace is slow across many concurrent requests, suspect external
auth, membership lookup pressure, or request amplification. If a DB subquery is
slow for tiny result counts, suspect concurrency, missing indexes, network
latency, or saturation rather than data size.
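
The two rules of thumb above can be applied mechanically to a batch of slow requests from the same time window. This is a sketch only: the input shape (authWorkspaceMs, dbSubqueries with ms and rows) is hypothetical, and the thresholds are illustrative defaults rather than tuned values.

```python
def triage_hints(requests, slow_ms=500.0, tiny_rows=10, many=5):
    """Return hints from per-request timing summaries.

    Assumed (hypothetical) input, one dict per slow request, e.g.
        {"authWorkspaceMs": 620, "dbSubqueries": [{"ms": 480, "rows": 2}]}
    """
    hints = []

    # Rule 1: auth.workspace slow in many requests from the same window.
    slow_auth = sum(1 for r in requests if r.get("authWorkspaceMs", 0) >= slow_ms)
    if slow_auth >= many:
        hints.append(
            f"auth.workspace slow in {slow_auth} requests: check external auth, "
            "membership lookup pressure, or request amplification"
        )

    # Rule 2: a DB subquery that is slow despite returning very few rows.
    slow_tiny = sum(
        1
        for r in requests
        for q in r.get("dbSubqueries", [])
        if q.get("ms", 0) >= slow_ms and q.get("rows", 0) <= tiny_rows
    )
    if slow_tiny:
        hints.append(
            f"{slow_tiny} slow DB subqueries returned <= {tiny_rows} rows: check "
            "concurrency, missing indexes, network latency, or saturation"
        )
    return hints
```

The hints only narrow the search; confirm each one against pod metrics and database-side timings before acting on it.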
Interpret common patterns
Request amplification
Symptoms:
- many identical GETs in the same second;
- settings pages repeatedly loading tiny result sets;
- web pod health probes timing out;
- worker mostly quiet.
Likely causes:
- realtime invalidation loop;
- component remount loop;
- repeated useEffect fetches;
- component-local polling added on top of workspace realtime;
- read route publishing mutation events;
- browser connection pressure from too many EventSource subscriptions.
Start with:
- apps/web/src/components/workspace-realtime-provider.tsx;
- apps/web/src/lib/events/workspace-events.ts;
- apps/web/src/lib/workspace-settings/read-models.ts;
- apps/web/src/lib/workspace-settings/request-dedupe.ts;
- Members, Teams, Integrations settings clients;
- routes that publish member.changed, integration.changed, or app/run events.
Read-side mutation loop
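Treat any write from a GET/read path as suspicious. A read path that repairs membership, ensures a default team, upserts metadata, or publishes invalidation events can create a loop: each read writes or publishes a change, which invalidates subscribed clients and triggers the same read again.
Whole-app stall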
If /api/health probes time out, do not focus only on the UI page that was
visible. Check request volume, auth latency, CPU throttling, memory pressure,
and event-loop saturation. A later low kubectl top pods sample does not
disprove a short earlier stall.
Stream attach delay
Start with:
- docs/streaming.mdx;
- apps/web/src/components/app-chat.tsx;
- apps/web/src/app/api/workspaces/[workspaceId]/apps/[appId]/runs/[runId]/chat/stream/route.ts;
- apps/web/src/lib/streams/run-replay.ts.
Check whether the run was streaming, whether activeStreamId existed,
whether the attach path waited for readiness, and whether replay or resumable
stream was used.
Capacity checks
Kubernetes node autoscaling is not pod autoscaling. Check whether more pods can exist.
What to record
When the issue teaches a durable lesson, update the active plan or docs with:
- what the user observed;
- the exact time window checked;
- pod health and resource state;
- request counts by route;
- slow request breakdown by request ID;
- what was ruled out;
- the likely root cause;
- the code or infra change;
- the exact staging validation that should prove the fix.