Troubleshooting
Start with readiness and logs, then isolate webhook, queue, AI, REES, RAG, or write-suppression problems.
First checks
docker compose ps
docker compose logs --tail=200 gittensory
curl http://localhost:8787/ready
curl http://localhost:8787/metrics
No review appears
- Webhook: Check GitHub App deliveries and confirm /v1/github/webhook receives 2xx responses.
- Allowlist: Confirm the repo is in GITTENSORY_REVIEW_REPOS for per-PR features.
- Write mode: SELFHOST_DEPLOYMENT_MODE=dry-run or disabled suppresses writes even when review computes.
- Policy: gate.aiReview.mode=off or commentMode=off can make AI/comment output intentionally quiet.
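The write-mode check above can be sketched as a small shell helper. This is an illustration only: write_mode_status is a hypothetical name, and treating every value other than dry-run or disabled as not suppressing writes is an assumption, since only those two values are named here.

```bash
# Sketch: classify a SELFHOST_DEPLOYMENT_MODE value.
# Only dry-run and disabled are documented as suppressing writes;
# every other non-empty value is ASSUMED not to suppress them.
write_mode_status() {
  case "${1:-}" in
    dry-run|disabled) echo "suppressed" ;;
    "")               echo "unset" ;;
    *)                echo "not-suppressed" ;;
  esac
}
```

Call it with the value the app container actually sees, for example write_mode_status "$SELFHOST_DEPLOYMENT_MODE" from a shell inside that container.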
AI summary unavailable
- Confirm AI_PROVIDER is set and supported.
- Confirm the provider key or local endpoint works from inside the container.
- Set the matching provider model env, such as ANTHROPIC_AI_MODEL, OPENAI_COMPATIBLE_AI_MODEL, OLLAMA_AI_MODEL, CLAUDE_AI_MODEL, or CODEX_AI_MODEL.
- Increase the matching provider timeout env, such as CLAUDE_AI_TIMEOUT_MS or CODEX_AI_TIMEOUT_MS, for large subscription-CLI reviews.
- For CLI providers, confirm the CLI binary and credential path are available.
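To see at a glance which of the model variables named above are set, something like the following may help. It is a sketch, not part of the product: report_model_envs is a hypothetical helper, and it prints only set or unset, never the value.

```bash
# Sketch: report which provider model env vars are set in this shell.
# The body runs in a subshell so the loop variables do not leak.
report_model_envs() (
  for name in ANTHROPIC_AI_MODEL OPENAI_COMPATIBLE_AI_MODEL \
              OLLAMA_AI_MODEL CLAUDE_AI_MODEL CODEX_AI_MODEL; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name=set"
    else
      echo "$name=unset"
    fi
  done
)
```

Run it in a shell inside the app container so it reflects the environment the app actually sees.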
REES is silent
A no-finding REES response can be intentionally invisible. For failures, search logs for review_context_fetch_failed with contextType set to enrichment.
review_context_fetch_failed
rees_analyzer_config_invalid
Check REES enrichment for enablement and REES analyzer reference for analyzer names, network calls, and token requirements.
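A quick way to count those two events in a log stream is sketched below. The log line format is not specified here, so the helper assumes only that the event name and the word enrichment appear on the same line; rees_failure_counts is a hypothetical name.

```bash
# Sketch: count REES-related failure events in log text read from stdin.
# ASSUMPTION: an enrichment fetch failure has both the event name and
# the word "enrichment" on a single log line.
rees_failure_counts() {
  awk '
    /review_context_fetch_failed/ && /enrichment/ { fetch++ }
    /rees_analyzer_config_invalid/                { config++ }
    END {
      printf "review_context_fetch_failed %d\n", fetch
      printf "rees_analyzer_config_invalid %d\n", config
    }
  '
}
```

For example: docker compose logs gittensory | rees_failure_counts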
RAG returns no context
- Confirm GITTENSORY_REVIEW_RAG=true and repo activation.
- Confirm Qdrant or the vector backend is reachable from the app container.
- Confirm the embedding endpoint and model are running.
- Confirm the repo has been indexed after enabling the feature.
Queue stuck or dead jobs
Watch pending, processed, failed, and dead metrics. A high pending count can be webhook replay or maintenance work; dead jobs need direct investigation.
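To surface dead jobs quickly from a raw scrape, a filter along these lines can help. It is a sketch under assumptions: the exact queue metric names are not listed here, so it relies only on the gittensory_queue prefix and on the dead-job series having dead in its name, and it expects sample lines without a trailing timestamp. dead_queue_series is a hypothetical helper name.

```bash
# Sketch: read Prometheus text exposition on stdin and print queue
# series whose name or labels mention "dead" and whose value is > 0.
# ASSUMPTION: the value is the last field (no timestamp column).
dead_queue_series() {
  awk '/^gittensory_queue/ && /dead/ && ($NF + 0) > 0 { print }'
}
```

Usage: curl -s http://localhost:8787/metrics | dead_queue_series. Empty output means no matching series is above zero.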
curl http://localhost:8787/metrics | grep gittensory_queue
docker compose logs gittensory | grep selfhost_job_dead
GitHub rate-limit responses or admission deferrals
Two independent signals cover this: gittensory_github_rest_rate_limit_responses_total counts actual 403/429 responses from GitHub, and the gittensory_jobs_rate_limit_admission_deferred_total / gittensory_jobs_rate_limit_budget_deferred_total / gittensory_jobs_rate_limited_by_type_total counters track jobs the queue itself held back before making a request, to avoid tripping a limit. All three job-side counters carry the same three labels — kind (webhook or background), key_scope (installation, public, global, or other), and job_type (the queue job's type, e.g. agent-regate-pr) — so you can break a spike down to exactly which token pool and which job type is under pressure.
A short burst of deferrals is expected and self-resolving: the queue is deliberately trading a few seconds of delay to avoid a real 429. Treat it as a real problem only once it is sustained. The alerts are tuned for that: GittensoryGitHubRateLimitResponses fires on real 403/429s observed, and GittensoryQueueRateLimitDeferralsHigh fires on a sustained deferral rate rather than on every brief admission hold.
# Deferrals broken down by token pool and job type over the last 10m
sum by (key_scope, job_type) (rate(gittensory_jobs_rate_limit_admission_deferred_total[10m]))
# Is one key_scope (e.g. a single installation token) the bottleneck?
topk(5, sum by (key_scope) (rate(gittensory_jobs_rate_limit_budget_deferred_total[10m])))
# Real rate-limit responses from GitHub itself (not just internal deferrals)
sum(rate(gittensory_github_rest_rate_limit_responses_total[10m]))
If a single key_scope=installation pool is consistently the bottleneck, the fix is usually spreading load across more installation tokens (fewer repos per installation) or raising the GitHub App's own rate-limit tier, not code changes here.
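When no Prometheus server is scraping the app, a rough per-pool breakdown can be taken from a raw /metrics scrape. The sketch below sums one counter by one label. It reports cumulative totals since process start, not a rate, and it assumes label values contain no escaped quotes and sample lines carry no timestamp. sum_by_label is a hypothetical helper name.

```bash
# Sketch: sum_by_label METRIC LABEL < metrics.txt
# Prints "<label value> <summed counter>" sorted by label value.
sum_by_label() {
  awk -v metric="$1" -v label="$2" '
    # Only sample lines for the requested metric that carry labels.
    index($0, metric "{") == 1 {
      key = "(none)"
      if (match($0, "[{,]" label "=\"[^\"]*\"")) {
        pair = substr($0, RSTART + 1, RLENGTH - 1)   # label="value"
        key = substr(pair, length(label) + 3)        # value"
        key = substr(key, 1, length(key) - 1)        # value
      }
      total[key] += $NF
    }
    END { for (k in total) printf "%s %s\n", k, total[k] }
  ' | sort
}
```

For example: curl -s http://localhost:8787/metrics | sum_by_label gittensory_jobs_rate_limit_admission_deferred_total key_scope. Taking two scrapes a few minutes apart and comparing the totals gives a crude rate.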
Low GitHub response-cache hit rate
gittensory_github_response_cache_total (REST) and gittensory_github_graphql_cache_total (GraphQL) both carry a result label — hit, miss, set, coalesced, bypassed, or error — and a class label identifying the endpoint family. A healthy cache should show most traffic as hit for endpoints that are read repeatedly in one review/maintenance pass (PR reads, check-run lookups); a low hit rate on those specific classes, not the overall average, is the useful signal.
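Where only a raw /metrics scrape is available, a per-class hit ratio can be approximated directly from the counters. This sketch divides hit samples by all samples for each class, so it gives a lifetime ratio since process start rather than a windowed rate. cache_hit_ratio is a hypothetical helper name, and the parsing assumes no escaped quotes in label values and no timestamp column.

```bash
# Sketch: cache_hit_ratio METRIC < metrics.txt
# Prints "<class> <hits / all results>" for each class label value.
cache_hit_ratio() {
  awk -v metric="$1" '
    # Return the value of one label on the current line, or "".
    function label_value(name,   pair) {
      if (!match($0, "[{,]" name "=\"[^\"]*\"")) return ""
      pair = substr($0, RSTART + 1, RLENGTH - 1)     # name="value"
      return substr(pair, length(name) + 3, length(pair) - length(name) - 3)
    }
    index($0, metric "{") == 1 {
      cls = label_value("class")
      all[cls] += $NF
      if (label_value("result") == "hit") hits[cls] += $NF
    }
    END {
      for (c in all) if (all[c] > 0) printf "%s %.2f\n", c, hits[c] / all[c]
    }
  ' | sort
}
```

For example: curl -s http://localhost:8787/metrics | cache_hit_ratio gittensory_github_response_cache_total. Pass gittensory_github_graphql_cache_total for the GraphQL cache.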
# REST hit rate by endpoint class over the last 15m
sum by (class) (rate(gittensory_github_response_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_response_cache_total[15m]))
# GraphQL hit rate — same shape, separate metric
sum by (class) (rate(gittensory_github_graphql_cache_total{result="hit"}[15m]))
/
sum by (class) (rate(gittensory_github_graphql_cache_total[15m]))
Qdrant / vector-store errors
gittensory_qdrant_errors_total carries an op label (upsert, query, or delete) so you can tell whether indexing or retrieval is failing. GittensoryQdrantErrorRateHigh fires on a sustained error ratio, not an isolated blip.
- Confirm QDRANT_URL (e.g. http://qdrant:6333) is reachable from the app container and the qdrant Compose profile is running.
- If Qdrant requires auth, confirm QDRANT_API_KEY is set and matches the Qdrant deployment's configuration.
- A dimension-mismatch error means the existing gittensory collection (the fixed collection name self-host always uses) was created with a different embedding model than the one currently configured (AI_EMBED_MODEL). Recreating it (delete the collection and let the next index run recreate it at the current width) is the fix, but it temporarily removes ALL indexed RAG context for every repo until re-indexing completes, so treat it as a deliberate, disruptive step, not a routine one.
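Before deleting anything, it can help to read the width the existing collection was created with and compare it with the output width of the configured embedding model. The sketch below pulls the first "size" field out of the collection-info JSON. It assumes a single unnamed vector configuration, where Qdrant reports the width under the vectors parameters; collection_vector_size is a hypothetical helper name, and a real JSON parser is more robust where one is available.

```bash
# Sketch: read Qdrant collection-info JSON on stdin and print the
# first "size" value found.
# ASSUMPTION: one unnamed vector config, so the first match is the width.
collection_vector_size() {
  grep -o '"size"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}
```

For example: curl -s "$QDRANT_URL/collections/gittensory" | collection_vector_size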
curl "$QDRANT_URL/collections/gittensory"
docker compose --profile qdrant ps qdrant
# Only after confirming a dimension mismatch is the actual cause:
curl -X DELETE "$QDRANT_URL/collections/gittensory"
Orb export or relay problems
For brokered self-host deployments, gittensory_orb_events_exported_total and gittensory_orb_export_errors_total track the hourly outcome-export loop; GittensoryOrbExportErrorRateHigh fires on a sustained error ratio there. The pull-mode relay loop (for installations receiving events outbound from Orb) reports through gittensory_orb_relay_drains_total (result=events when it drained something, result=empty otherwise) and gittensory_orb_webhook_total (event + result labels) for what happened to each relayed event once enqueued locally.
If exports are failing but the relay itself looks healthy, the export loop's Sentry cron monitor (see Self-host operations) is the fastest way to confirm whether the loop is even running, before digging into the error counters.
AI provider circuit breaker keeps opening
Each AI provider (self-host AI_PROVIDER entries) has its own circuit breaker: after 3 consecutive failures it stops attempting real calls to that provider for 60 seconds, recorded as gittensory_ai_provider_circuit_open_total{provider="..."} (skipped calls) alongside gittensory_ai_provider_failures_total{provider="..."} (real failures). It self-heals automatically — there is no manual reset — but it will reopen immediately if the underlying problem is still there.
- Search logs for circuit_open: provider "..." to confirm which provider tripped, and selfhost_ai_provider_failed_in_chain for the real error each failed attempt hit before the breaker opened.
- A provider that keeps re-tripping after its cooldown almost always means a persistent problem, not a transient blip: an expired/invalid API key, a CLI binary missing from the image (see selfhost_ai_cli_missing at boot), or the endpoint being genuinely unreachable from the container.
- GittensoryAiProviderCircuitOpen fires on any circuit-open event in a 15-minute window. A single trip during a real but brief outage is expected; a rule that keeps firing across multiple windows points at the persistent case above.
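The breaker behaviour described in this section (3 consecutive failures, 60 seconds of skipped calls, automatic recovery) can be illustrated with a small state machine. This is a sketch for building intuition, not the application's code: all names are hypothetical, it tracks a single provider, and keeping the failure count across the cooldown, so that one more failure reopens the breaker at once, is an assumption about how "reopen immediately" is achieved.

```bash
# Sketch of a consecutive-failure circuit breaker for ONE provider.
BREAKER_THRESHOLD=3     # consecutive failures before the breaker opens
BREAKER_COOLDOWN=60     # seconds during which real calls are skipped
breaker_failures=0
breaker_open_until=0    # epoch seconds; 0 means never opened

# breaker_allows NOW: succeeds if a real call may be attempted at NOW.
breaker_allows() {
  [ "$1" -ge "$breaker_open_until" ]
}

# breaker_record RESULT NOW: update state after a real attempt.
# RESULT is "ok" or anything else for a failure.
breaker_record() {
  if [ "$1" = "ok" ]; then
    breaker_failures=0            # any success clears the streak
    return 0
  fi
  breaker_failures=$((breaker_failures + 1))
  if [ "$breaker_failures" -ge "$BREAKER_THRESHOLD" ]; then
    # Open (or reopen) the breaker. The streak is kept on purpose, so a
    # failure right after the cooldown reopens it immediately (ASSUMPTION).
    breaker_open_until=$(( $2 + BREAKER_COOLDOWN ))
  fi
}
```

A caller would check breaker_allows "$(date +%s)" before each attempt and treat a refusal as a skipped call, which is the case the circuit-open counter records, as opposed to a real failure.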
Grafana traces error or show no data
The trace path is app or smoke process → OTEL collector → Tempo → Grafana. Tempo is only started by the observability profile, and app traces are only emitted when OTEL_TRACES_EXPORTER includes otlp.
docker compose --profile observability ps tempo otel-collector grafana
docker compose logs --tail=80 tempo otel-collector grafana
# Send one synthetic span through the collector and read it back from Tempo.
npm run test:smoke:observability
- If the smoke command fails at otel-collector:4318/v1/traces, the collector is not reachable from the app container.
- If it pushes successfully but cannot read tempo:3200/api/traces/<trace_id>, Tempo is unhealthy, not ingesting, or not sharing the Compose network.
- If the smoke command passes but Grafana Explore fails, check the Tempo data source URL. It should point at http://tempo:3200, not the OTLP ingest ports.
- For a temporary live debugging run, set OTEL_TRACES_SAMPLER_ARG=1 so every root trace is sampled, then lower it again after diagnosis.
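Because app traces depend on OTEL_TRACES_EXPORTER including otlp, it is worth checking that value before debugging the collector or Tempo. The sketch below treats the variable as a comma-separated list, which is how OpenTelemetry SDKs generally read it; traces_exporter_includes_otlp is a hypothetical helper name.

```bash
# Sketch: succeed if a comma-separated exporter list contains "otlp".
# Spaces around entries are ignored; partial matches do not count.
traces_exporter_includes_otlp() {
  case ",$(printf '%s' "${1:-}" | tr -d ' ')," in
    *,otlp,*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, from a shell inside the app container: traces_exporter_includes_otlp "$OTEL_TRACES_EXPORTER" || echo "app traces are not being exported"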
Readiness fails
- DB: Check DATABASE_URL or DATABASE_PATH, volume permissions, Postgres reachability, and migrations.
- Migrations: Read startup logs for migration errors before recreating volumes.
- Dependencies: If Qdrant or Postgres profiles are enabled, confirm those services are healthy first.