Operations

What to alert on, which metrics matter, and the oncall checklist when a Tale instance starts behaving badly.


The operations page is the alert playbook — which signals are worth waking someone for, which can ride out a coffee, and what the first five minutes of an incident look like. Tale's metrics surface lives behind METRICS_BEARER_TOKEN; this page assumes you have wired up Prometheus and Grafana per Observability config and now need to know which numbers to watch.

The symptom-first index is at Troubleshooting. This page is the proactive side — signals first, oncall checklist second.

Signals worth alerting on

Signal	Severity	Why it matters
`tale-proxy` health probe failing > 1 min	page	Every user sees a connection error
`tale-platform` HTTP 5xx rate > 5 %	page	The UI is broken for a meaningful share of requests
`tale-convex` WebSocket reconnect storm	page	UI loads but no data flows
Postgres connections > 80 % of pool	warn	The next spike will start blocking
`db-data` volume > 80 % full	warn	The operational Postgres goes read-only at full
`knowledge-db-data` volume > 80 % full	warn	Ingestion fails when the corpus database is full
`tale-knowledge-db` unreachable from convex	warn	Knowledge search returns empty; ingestion stalls
Provider request error rate > 20 %	warn	The upstream LLM provider is having a bad day
Daily backup did not write	page	Restore drill will fail at the worst moment
TLS cert renewal failed	warn	Renews 30 d before expiry — you have time

The first two page-severity rows are the directly customer-impacting ones. The warn rows catch trends before they tip into page territory.
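Several rows in the table share the same shape: a percentage compared against a threshold. A minimal sketch of that comparison follows, assuming the reading is an integer percentage. The helper names `pct_used_from_df` and `classify_pct` are hypothetical, as is the `DB_DATA_PATH` variable in the usage comment; this page does not say where the `db-data` volume is mounted on the host.

```shell
# Sketch only: helper names and DB_DATA_PATH are assumptions, not part of Tale.

# Read `df -P <path>` output on stdin and print the Capacity column
# of the first filesystem line as a bare integer (for example 85).
pct_used_from_df() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# classify_pct VALUE [THRESHOLD]
# Print "warn" when VALUE is strictly above THRESHOLD (default 80), else "ok".
classify_pct() {
  if [ "$1" -gt "${2:-80}" ]; then
    echo warn
  else
    echo ok
  fi
}

# Usage, with DB_DATA_PATH pointing at wherever the volume lives on the host:
#   classify_pct "$(df -P "$DB_DATA_PATH" | pct_used_from_df)"
```

The same `classify_pct` call covers the 20 % provider error rate by passing the threshold explicitly, for example `classify_pct "$error_pct" 20`. In a real deployment these comparisons would more likely live in Prometheus alert rules; the shell form is just a quick way to check a threshold by hand.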

Log signals to grep for

Logs come through stdout per container, captured by Docker's json-file driver. The four phrases that consistently mean trouble:

- `panic` or `unexpected error` in `tale-convex` logs: Convex action crash.
- `decryption failed` in `tale-platform` logs: SOPS age key mismatch with the file on disk.
- `429 Too Many Requests` repeated from a provider: rate limit hit, agents will start failing.
- `connection refused` or `ECONNREFUSED` to `knowledge-db` in `tale-convex` logs: the backend cannot reach the corpus database; ingestion and knowledge search fail.

Pipe these to your aggregator as derived alerts; the metrics endpoints do not surface them as gauges.
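Before the aggregator rules exist, the same phrases can be checked by hand. The sketch below counts matching lines in whatever log text arrives on stdin. The function name `count_trouble_lines` and the variable `TROUBLE_PATTERN` are hypothetical, and the single combined pattern does not enforce the per-container pairing described above, so a `panic` in any container's output is counted.

```shell
# Sketch only: names are assumptions; phrases are the ones listed on this page.
TROUBLE_PATTERN='panic|unexpected error|decryption failed|429 Too Many Requests|connection refused|ECONNREFUSED'

# Read log text on stdin and print the number of lines that contain
# at least one trouble phrase.
count_trouble_lines() {
  # grep -c still prints 0 when nothing matches but exits non-zero,
  # so the exit status is masked to keep pipelines from failing.
  grep -E -c -- "$TROUBLE_PATTERN" || true
}

# Usage:
#   docker compose logs --no-color --since 10m tale-convex | count_trouble_lines
```

A non-zero count is a prompt to read the matching lines, not a diagnosis in itself; dropping `-c` from the `grep` call prints the lines instead of counting them.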

Oncall checklist

When a page lands, the first five minutes follow the same shape every time.

1. Confirm the alert is real. Open `$SITE_URL` in a browser. If the UI loads and chat works, you are looking at a metrics or scraper issue, not a customer-impacting one.
2. Identify the container. `docker compose ps` shows which is unhealthy; `docker compose logs --tail=200 <service>` shows the last error.
3. Restart the most-likely culprit. `docker compose restart <service>` resolves a surprising fraction of incidents: process crashes, file watchers gone stale, exhausted connection pools. The architecture is built to survive a single container restart cleanly.
4. Check upstream providers. https://status.openai.com, https://status.anthropic.com, etc. If the provider is on fire, agents fail; Tale is not the cause.
5. Page the on-call engineer if the user-visible symptom persists after a restart. No need to escalate sooner, since most incidents resolve in the first three steps.
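The container-identification step can be narrowed down with a small filter. This is a sketch under assumptions: it reads tab-separated `name` and `status` lines, such as those printed by `docker ps -a --format '{{.Names}}\t{{.Status}}'`, and prints the names whose status looks unhealthy, exited, or restarting. The function name `suspect_containers` is hypothetical, and the names printed are container names, which match the service names only if the Compose file sets them that way.

```shell
# Sketch only: function name is an assumption, not part of Tale.

# Read "name<TAB>status" lines on stdin and print the names of containers
# whose status contains "(unhealthy)" or starts with Exited or Restarting.
suspect_containers() {
  awk -F '\t' '$2 ~ /\(unhealthy\)|^Exited|^Restarting/ { print $1 }'
}

# Usage:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | suspect_containers
# Then, for each name printed, follow the checklist:
#   docker compose logs --tail=200 <service>
#   docker compose restart <service>
```

Keeping the restart as a manual follow-up rather than automating it is deliberate here: reading the last 200 log lines first preserves the evidence a restart might otherwise push out of view.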

What does not need oncall

A tale-knowledge-db outage is a warn, not a page. The web-crawl schedule absorbs hours of downtime without user impact, and document ingestion retries rather than dropping work — uploads sit in "indexing" until the corpus database is back. Knowledge search returns empty in the meantime, but chats that do not retrieve knowledge keep working. Catch this in the warn band and fix it in business hours.

Where this fits

The signals above are the proactive side of operating a Tale instance; the reactive side is Troubleshooting, and the configuration that gets the metrics into Prometheus is Observability config. If you have not yet set METRICS_BEARER_TOKEN, every threshold above is unmonitored — start there.
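A quick preflight for the token can be sketched as follows. It assumes, based only on the variable's name, that the metrics endpoints expect a standard `Authorization: Bearer` header; Observability config is the authority on that. The function name `metrics_auth_header` and the `METRICS_URL` placeholder are hypothetical, since this page does not give the metrics path.

```shell
# Sketch only: the header format is inferred from the variable name, and
# METRICS_URL is a placeholder for the real metrics endpoint.

# Print the Authorization header for a metrics scrape, or fail with a
# message on stderr when METRICS_BEARER_TOKEN is unset or empty.
metrics_auth_header() {
  if [ -z "${METRICS_BEARER_TOKEN:-}" ]; then
    echo "METRICS_BEARER_TOKEN is not set; metrics are unmonitored" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s\n' "$METRICS_BEARER_TOKEN"
}

# Usage:
#   curl -fsS -H "$(metrics_auth_header)" "$METRICS_URL"
```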
