# Deployed Slowness Playbook

> How to diagnose staging or production slowness with perf traces, Kubernetes events, and request amplification checks.

Local speed is not enough proof that a deployed issue is gone. Deployed environments add external auth, real Redis, real Mongo/network latency, Kubernetes resource limits, health probes, and a load balancer. Start from evidence and avoid guessing.

Use this playbook when the app feels frozen, navigation stalls, settings pages hang, chat history appears late, stream attach is slow, or users see temporary server errors.

## Before you start

Use safe structured timing only while diagnosing:

```bash
SECOND_PERF_TRACE=1
```

Perf traces are designed to be content-minimal. They include route names, request IDs, workspace/app/run IDs, elapsed timings, counts, CPU, and memory. They must not include prompts, source files, cookies, tokens, headers, secret values, or integration secret values.

If tracing is not already enabled in the deployed environment, ask before changing deployment config. Turn it back off after diagnosis unless the current incident still needs it.

## First cluster read

Run read-only checks from a shell where `kubectl` is configured for the target cluster and namespace:

```bash
kubectl config current-context
kubectl get pods -o wide
kubectl top pods
kubectl get events --sort-by=.lastTimestamp | tail -n 80
```

Then inspect the active web pod:

```bash
kubectl describe pod <web-pod-name>
```

Look for:

* restarts;
* readiness or liveness probe failures;
* `/api/health` timeouts;
* OOM kills;
* CPU or memory limits;
* scheduling failures;
* node scale-up events.

A health probe failure like `/api/health context deadline exceeded` means the whole web pod was not answering quickly.
That is different from one slow Teams query or one slow UI component.

## Capture logs

Capture web and worker logs for the incident window:

```bash
kubectl logs deploy/second --since=30m --timestamps > /tmp/second-web.log
kubectl logs deploy/second-worker --since=30m --timestamps > /tmp/second-worker.log
```

If the current deployment has multiple web pods, capture all matching pods or use labels:

```bash
kubectl logs -l app.kubernetes.io/name=second,app.kubernetes.io/component=web \
  --since=30m \
  --timestamps \
  --all-containers > /tmp/second-web.log
```

Search for application errors first:

```bash
rg "Error|Unhandled|Exception|ECONN|timeout|second.perf" /tmp/second-web.log
rg "Error|Unhandled|Exception|ECONN|timeout" /tmp/second-worker.log
```

If the browser showed a temporary server error but the web logs do not show a route stack trace, compare the timestamp against Kubernetes events. It may have been a load-balancer/backend timeout while the web pod was saturated.

## Parse perf traces

Structured perf log lines contain JSON with `"type":"second.perf"`. Group by `requestId`, route, and second.

This script gives a quick route-level summary:

```bash
node - <<'NODE'
const fs = require("fs");

const path = "/tmp/second-web.log";
const rows = fs.readFileSync(path, "utf8").split(/\n/).filter(Boolean);

// Collect every structured perf event; lines that fail to parse are skipped.
const events = [];
for (const line of rows) {
  const i = line.indexOf('{"type":"second.perf"');
  if (i === -1) continue;
  try {
    events.push(JSON.parse(line.slice(i)));
  } catch {}
}

// Group events by request and remember each request's ".response" event.
const byReq = new Map();
for (const e of events) {
  if (!e.requestId) continue;
  const r = byReq.get(e.requestId) ?? {
    route: e.route,
    response: null,
    events: [],
  };
  r.events.push(e);
  if (String(e.event).endsWith(".response")) r.response = e;
  byReq.set(e.requestId, r);
}

// Aggregate per-route request counts, slow-response counts, and the slowest response.
const stats = new Map();
for (const r of byReq.values()) {
  const s = stats.get(r.route) ?? {
    requests: 0,
    responded: 0,
    over2s: 0,
    over5s: 0,
    max: 0,
  };
  s.requests++;
  if (r.response) {
    s.responded++;
    const ms = Number(r.response.totalElapsedMs ?? r.response.sinceStartMs ?? 0);
    s.max = Math.max(s.max, ms);
    if (ms > 2000) s.over2s++;
    if (ms > 5000) s.over5s++;
  }
  stats.set(r.route, s);
}

console.log("perf events", events.length, "requests", byReq.size);
for (const [route, s] of [...stats.entries()].sort()) {
  console.log(route, s);
}
NODE
```

This script finds request-start bursts:

```bash
node - <<'NODE'
const fs = require("fs");

const rows = fs.readFileSync("/tmp/second-web.log", "utf8").split(/\n/);

// Count request starts per (second, route) bucket.
const buckets = new Map();
for (const line of rows) {
  const i = line.indexOf('{"type":"second.perf"');
  if (i === -1) continue;
  let e;
  try {
    e = JSON.parse(line.slice(i));
  } catch {
    continue;
  }
  if (!String(e.event).endsWith("request_start")) continue;
  // The first 19 characters of the timestamp truncate it to whole seconds.
  const key = `${String(e.at).slice(0, 19)} ${e.route}`;
  buckets.set(key, (buckets.get(key) ?? 0) + 1);
}

// Print the 25 busiest buckets, largest first.
for (const [key, count] of [...buckets.entries()].sort((a, b) => b[1] - a[1]).slice(0, 25)) {
  console.log(count, key);
}
NODE
```

A normal click should not create dozens of identical `Teams`, `Members`, `Invitations`, or `Integrations` GETs. If it does, treat that as request amplification, not user behavior.

## Split slow requests

For slow `.response` events, inspect the same `requestId` and split elapsed time into:

* `auth.workspace`;
* app access checks;
* settings read model;
* DB subqueries;
* stream readiness wait;
* resumable stream resume;
* replay fallback;
* total elapsed.

Useful searches:

```bash
rg '"requestId":"<request-id>"' /tmp/second-web.log
rg '"event":"auth.workspace"|"event":"settings.|"event":"run.stream_attach' /tmp/second-web.log
```

If `auth.workspace` is slow across many concurrent requests, suspect external auth, membership lookup pressure, or request amplification.
If a DB subquery is slow for tiny result counts, suspect concurrency, missing indexes, network latency, or saturation rather than data size.

## Interpret common patterns

### Request amplification

Symptoms:

* many identical GETs in the same second;
* settings pages repeatedly loading tiny result sets;
* web pod health probes timing out;
* worker mostly quiet.

Likely causes:

* realtime invalidation loop;
* component remount loop;
* repeated `useEffect` fetches;
* component-local polling added on top of workspace realtime;
* read route publishing mutation events;
* browser connection pressure from too many EventSource subscriptions.

Start by inspecting:

* `apps/web/src/components/workspace-realtime-provider.tsx`;
* `apps/web/src/lib/events/workspace-events.ts`;
* `apps/web/src/lib/workspace-settings/read-models.ts`;
* `apps/web/src/lib/workspace-settings/request-dedupe.ts`;
* Members, Teams, Integrations settings clients;
* routes that publish `member.changed`, `integration.changed`, or app/run events.

### Read-side mutation loop

Treat any write from a GET/read path as suspicious. A read path that repairs membership, ensures a default team, upserts metadata, or publishes invalidation events can create this loop:

```
read → publish event → mounted client refetches → read → publish event
```

Fix by making ensure paths idempotent and publishing only after a real insert, update, delete, or status transition.

### Whole-app stall

If `/api/health` probes time out, do not focus only on the UI page that was visible. Check request volume, auth latency, CPU throttling, memory pressure, and event-loop saturation. A later low `kubectl top pods` sample does not disprove a short earlier stall.

### Stream attach delay

Start with:

* `docs/streaming.mdx`;
* `apps/web/src/components/app-chat.tsx`;
* `apps/web/src/app/api/workspaces/[workspaceId]/apps/[appId]/runs/[runId]/chat/stream/route.ts`;
* `apps/web/src/lib/streams/run-replay.ts`.

Check whether the run was already `streaming`, whether `activeStreamId` existed, whether the attach path waited for readiness, and whether replay or resumable stream was used.

## Capacity checks

Kubernetes node autoscaling is not pod autoscaling. Check whether more pods can exist:

```bash
kubectl get deploy second -o jsonpath='{.spec.replicas}{"\n"}'
kubectl get hpa
kubectl describe deploy second
```

Managed GKE Autopilot can add nodes for schedulable pod requests, but it will not create more web pods without replicas or an autoscaling policy. A single web pod can still become the bottleneck under request amplification or many active streams.

Worker scaling needs separate thought because active SDK sessions and workspace files live in worker memory/filesystem while durable state is saved through web and MongoDB.

## What to record

When the issue teaches a durable lesson, update the active plan or docs with:

* what the user observed;
* the exact time window checked;
* pod health and resource state;
* request counts by route;
* slow request breakdown by request ID;
* what was ruled out;
* the likely root cause;
* the code or infra change;
* the exact staging validation that should prove the fix.
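
The "slow request breakdown by request ID" item in the list above can be produced from the captured web log. What follows is a minimal sketch, not the project's own tooling: `breakdownForRequest` is a hypothetical helper, and it assumes every perf event for a request carries a cumulative `sinceStartMs` (or `totalElapsedMs`) value. If your traces emit different timing fields, adjust the field names. The sample log lines are synthetic.

```javascript
// Hypothetical helper: list one request's perf events in elapsed order, with
// the time spent since the previous event. Assumes a cumulative `sinceStartMs`
// (falling back to `totalElapsedMs`) on each event, which is an assumption.
function breakdownForRequest(logText, requestId) {
  const steps = [];
  for (const line of logText.split(/\n/)) {
    const i = line.indexOf('{"type":"second.perf"');
    if (i === -1) continue;
    let e;
    try {
      e = JSON.parse(line.slice(i));
    } catch {
      continue; // skip lines whose JSON payload does not parse
    }
    if (e.requestId !== requestId) continue;
    steps.push({
      event: String(e.event),
      sinceStartMs: Number(e.sinceStartMs ?? e.totalElapsedMs ?? 0),
    });
  }
  // Order by elapsed time, then compute the gap between consecutive events.
  steps.sort((a, b) => a.sinceStartMs - b.sinceStartMs);
  let previous = 0;
  return steps.map((s) => {
    const row = { ...s, deltaMs: s.sinceStartMs - previous };
    previous = s.sinceStartMs;
    return row;
  });
}

// Synthetic example lines, not real trace output.
const sample = [
  '2024-01-01T00:00:00Z {"type":"second.perf","requestId":"r1","event":"auth.workspace","sinceStartMs":120}',
  '2024-01-01T00:00:01Z {"type":"second.perf","requestId":"r1","event":"settings.response","sinceStartMs":900}',
].join("\n");

console.table(breakdownForRequest(sample, "r1"));
```

To run it against a real capture, pass the file contents instead of `sample`, for example `breakdownForRequest(require("fs").readFileSync("/tmp/second-web.log", "utf8"), "<request-id>")`. A large `deltaMs` on one step points at where that request spent its time.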