Fast local behavior is not proof that a deployed issue is gone. Deployed
environments add external auth, real Redis, real Mongo/network latency,
Kubernetes resource limits, health probes, and a load balancer. Start from
evidence and avoid guessing.
Use this playbook when the app feels frozen, navigation stalls, settings pages
hang, chat history appears late, stream attach is slow, or users see temporary
server errors.
Before you start
Use only safe, structured timing while diagnosing:
Perf traces are designed to be content-minimal. They include route names,
request IDs, workspace/app/run IDs, elapsed timings, counts, CPU, and memory.
They must not include prompts, source files, cookies, tokens, headers, secret
values, or integration secret values.
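The parsing scripts later on this page assume roughly the event shape sketched below. Only the fields those scripts read (type, event, at, requestId, route, sinceStartMs, totalElapsedMs) come from this page; the ID field names are assumptions to verify against a real log line.
// Assumed shape of a second.perf event, reconstructed from the fields this
// page references; confirm against an actual log line before relying on it.
interface SecondPerfEvent {
  type: "second.perf";
  event: string;            // e.g. "<route>.request_start", "<route>.response", "auth.workspace"
  at: string;               // ISO timestamp
  requestId?: string;
  route?: string;
  workspaceId?: string;     // assumption: exact workspace/app/run ID field names may differ
  appId?: string;
  runId?: string;
  sinceStartMs?: number;    // elapsed time since the request started
  totalElapsedMs?: number;  // present on ".response" events
  // plus counts, CPU, and memory fields whose exact names are not shown here
}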
If tracing is not already enabled in the deployed environment, ask before
changing deployment config. Turn it back off after diagnosis unless the current
incident still needs it.
First cluster read
Run read-only checks from a shell where kubectl is configured for the target
cluster and namespace:
kubectl config current-context
kubectl get pods -o wide
kubectl top pods
kubectl get events --sort-by=.lastTimestamp | tail -n 80
Then inspect the active web pod:
kubectl describe pod <web-pod-name>
Look for:
- restarts;
- readiness or liveness probe failures;
- /api/health timeouts;
- OOM kills;
- CPU or memory limits;
- scheduling failures;
- node scale-up events.
A health probe failure like /api/health context deadline exceeded means the
whole web pod was not answering quickly. That is different from one slow Teams
query or one slow UI component.
Capture logs
Capture web and worker logs for the incident window:
kubectl logs deploy/second --since=30m --timestamps > /tmp/second-web.log
kubectl logs deploy/second-worker --since=30m --timestamps > /tmp/second-worker.log
If the current deployment has multiple web pods, capture all matching pods or
use labels:
kubectl logs -l app.kubernetes.io/name=second,app.kubernetes.io/component=web \
--since=30m \
--timestamps \
--all-containers > /tmp/second-web.log
Search for application errors first:
rg "Error|Unhandled|Exception|ECONN|timeout|second.perf" /tmp/second-web.log
rg "Error|Unhandled|Exception|ECONN|timeout" /tmp/second-worker.log
If the browser showed a temporary server error but the web logs do not show a
route stack trace, compare the timestamp against Kubernetes events. It may have
been a load-balancer/backend timeout while the web pod was saturated.
Parse perf traces
Structured perf log lines contain JSON with "type":"second.perf". Group by
requestId, route, and second.
This script gives a quick route-level summary:
node - <<'NODE'
const fs = require("fs");
const path = "/tmp/second-web.log";
const rows = fs.readFileSync(path, "utf8").split(/\n/).filter(Boolean);
const events = [];
for (const line of rows) {
  const i = line.indexOf('{"type":"second.perf"');
  if (i === -1) continue;
  try {
    events.push(JSON.parse(line.slice(i)));
  } catch {}
}
const byReq = new Map();
for (const e of events) {
  if (!e.requestId) continue;
  const r = byReq.get(e.requestId) ?? {
    route: e.route,
    response: null,
    events: [],
  };
  r.events.push(e);
  if (String(e.event).endsWith(".response")) r.response = e;
  byReq.set(e.requestId, r);
}
const stats = new Map();
for (const r of byReq.values()) {
  const s = stats.get(r.route) ?? {
    requests: 0,
    responded: 0,
    over2s: 0,
    over5s: 0,
    max: 0,
  };
  s.requests++;
  if (r.response) {
    s.responded++;
    const ms = Number(r.response.totalElapsedMs ?? r.response.sinceStartMs ?? 0);
    s.max = Math.max(s.max, ms);
    if (ms > 2000) s.over2s++;
    if (ms > 5000) s.over5s++;
  }
  stats.set(r.route, s);
}
console.log("perf events", events.length, "requests", byReq.size);
for (const [route, s] of [...stats.entries()].sort()) {
  console.log(route, s);
}
NODE
This script finds request-start bursts:
node - <<'NODE'
const fs = require("fs");
const rows = fs.readFileSync("/tmp/second-web.log", "utf8").split(/\n/);
const buckets = new Map();
for (const line of rows) {
  const i = line.indexOf('{"type":"second.perf"');
  if (i === -1) continue;
  let e;
  try {
    e = JSON.parse(line.slice(i));
  } catch {
    continue;
  }
  if (!String(e.event).endsWith("request_start")) continue;
  const key = `${String(e.at).slice(0, 19)} ${e.route}`;
  buckets.set(key, (buckets.get(key) ?? 0) + 1);
}
for (const [key, count] of [...buckets.entries()].sort((a, b) => b[1] - a[1]).slice(0, 25)) {
  console.log(count, key);
}
NODE
A normal click should not create dozens of identical Teams, Members,
Invitations, or Integrations GETs. If it does, treat that as request
amplification, not user behavior.
Split slow requests
For slow .response events, inspect the same requestId and split elapsed
time into:
- auth.workspace;
- app access checks;
- settings read model;
- DB subqueries;
- stream readiness wait;
- resumable stream resume;
- replay fallback;
- total elapsed.
Useful searches:
rg '"requestId":"<id>"' /tmp/second-web.log
rg '"event":"auth.workspace"|"event":"settings.|"event":"run.stream_attach' /tmp/second-web.log
If auth.workspace is slow across many concurrent requests, suspect external
auth, membership lookup pressure, or request amplification. If a DB subquery is
slow for tiny result counts, suspect concurrency, missing indexes, network
latency, or saturation rather than data size.
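To see that split programmatically, a minimal TypeScript sketch like the one below orders one request's perf events and prints the gap between consecutive events; it assumes the same field names as the parsing scripts above and is illustrative, not an existing tool.
// Sketch: attribute elapsed time inside one request to the phase that just
// finished, assuming events carry requestId, event, and sinceStartMs.
type PerfEvent = { requestId?: string; event?: string; sinceStartMs?: number };

function printBreakdown(events: PerfEvent[], requestId: string): void {
  const own = events
    .filter((e) => e.requestId === requestId && typeof e.sinceStartMs === "number")
    .sort((a, b) => (a.sinceStartMs ?? 0) - (b.sinceStartMs ?? 0));
  let prev = 0;
  for (const e of own) {
    const at = e.sinceStartMs ?? 0;
    console.log(`${String(Math.round(at - prev)).padStart(6)} ms -> ${e.event}`);
    prev = at;
  }
}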
Interpret common patterns
Request amplification
Symptoms:
- many identical GETs in the same second;
- settings pages repeatedly loading tiny result sets;
- web pod health probes timing out;
- worker mostly quiet.
Likely causes:
- realtime invalidation loop;
- component remount loop;
- repeated useEffect fetches (see the sketch after this list);
- component-local polling added on top of workspace realtime;
- read route publishing mutation events;
- browser connection pressure from too many EventSource subscriptions.
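The repeated-useEffect cause is usually a dependency that is recreated on every render, so each response triggers a re-render and the next fetch. The component and endpoint below are hypothetical, shown only to make the loop concrete.
// Hypothetical settings client that refetches forever: `query` is a new
// object on every render, so the effect re-runs after each setMembers.
import { useEffect, useState } from "react";

function MembersSettings({ workspaceId }: { workspaceId: string }) {
  const [members, setMembers] = useState<unknown[]>([]);
  const query = { workspaceId, limit: 50 };
  useEffect(() => {
    fetch(`/api/workspaces/${query.workspaceId}/members?limit=${query.limit}`)
      .then((res) => res.json())
      .then(setMembers);
  }, [query]); // fix: depend on primitives (workspaceId, limit) or memoize the object
  return <div>{members.length} members</div>;
}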
Start by inspecting:
- apps/web/src/components/workspace-realtime-provider.tsx;
- apps/web/src/lib/events/workspace-events.ts;
- apps/web/src/lib/workspace-settings/read-models.ts;
- apps/web/src/lib/workspace-settings/request-dedupe.ts (a dedupe sketch follows this list);
- Members, Teams, Integrations settings clients;
- routes that publish member.changed, integration.changed, or app/run events.
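The request-dedupe module is listed because in-flight deduplication is the standard guard against this amplification. The sketch below shows the general technique, not the actual contents of request-dedupe.ts.
// Sketch of in-flight deduplication: concurrent callers asking for the same
// key share one promise instead of issuing N identical GETs.
const inFlight = new Map<string, Promise<unknown>>();

export function dedupe<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const pending = load().finally(() => {
    // Drop the entry once settled so later calls can fetch fresh data.
    inFlight.delete(key);
  });
  inFlight.set(key, pending);
  return pending;
}

// Usage: dedupe(`members:${workspaceId}`, () => fetchMembers(workspaceId));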
Read-side mutation loop
Treat any write from a GET/read path as suspicious. A read path that repairs
membership, ensures a default team, upserts metadata, or publishes invalidation
events can create this loop:
read → publish event → mounted client refetches → read → publish event
Fix by making ensure paths idempotent and publishing only after a real insert,
update, delete, or status transition.
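A minimal sketch of that fix, using hypothetical collection and event names (the MongoDB driver's upsertedCount/modifiedCount fields are real), gates the publish on the write result:
import type { Db } from "mongodb";

// Sketch: an "ensure default team" read-path helper that stays idempotent and
// only tells clients to refetch when the write actually changed something.
async function ensureDefaultTeam(
  db: Db,
  workspaceId: string,
  publish: (event: { type: string }) => Promise<void>,
): Promise<void> {
  const result = await db.collection("teams").updateOne(
    { workspaceId, isDefault: true },
    { $setOnInsert: { workspaceId, isDefault: true, name: "Default" } },
    { upsert: true },
  );
  // A repeat call matches the existing document and publishes nothing,
  // which breaks the read -> publish -> refetch -> read loop.
  if (result.upsertedCount > 0 || result.modifiedCount > 0) {
    await publish({ type: "team.changed" });
  }
}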
Whole-app stall
If /api/health probes time out, do not focus only on the UI page that was
visible. Check request volume, auth latency, CPU throttling, memory pressure,
and event-loop saturation. A later low kubectl top pods sample does not
disprove a short earlier stall.
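Event-loop saturation in particular is invisible to kubectl top; Node's built-in perf_hooks can quantify it. This is a sketch of the technique, not an existing endpoint in the app:
import { monitorEventLoopDelay } from "node:perf_hooks";

// Sketch: sample event-loop delay so a brief stall shows up as a rising p99
// even when a later CPU or memory snapshot looks calm.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6; // histogram values are nanoseconds
  console.log("event_loop_p99_ms", p99Ms.toFixed(1));
  histogram.reset();
}, 10_000);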
Stream attach delay
Start with:
- docs/streaming.mdx;
- apps/web/src/components/app-chat.tsx;
- apps/web/src/app/api/workspaces/[workspaceId]/apps/[appId]/runs/[runId]/chat/stream/route.ts;
- apps/web/src/lib/streams/run-replay.ts.
Check whether the run was already streaming, whether activeStreamId existed,
whether the attach path waited for readiness, and whether replay or resumable
stream was used.
Capacity checks
Kubernetes node autoscaling is not pod autoscaling.
Check whether more pods can exist:
kubectl get deploy second -o jsonpath='{.spec.replicas}{"\n"}'
kubectl get hpa
kubectl describe deploy second
Managed GKE Autopilot can add nodes for schedulable pod requests, but it will
not create more web pods without replicas or an autoscaling policy. A single web
pod can still become the bottleneck under request amplification or many active
streams.
Worker scaling needs separate thought: active SDK sessions and workspace files
live in worker memory and on the worker filesystem, while durable state is
persisted through the web tier to MongoDB.
What to record
When the issue teaches a durable lesson, update the active plan or docs with:
- what the user observed;
- the exact time window checked;
- pod health and resource state;
- request counts by route;
- slow request breakdown by request ID;
- what was ruled out;
- the likely root cause;
- the code or infra change;
- the exact staging validation that should prove the fix.