OpenClaw Production Monitoring: Health Check Endpoints & Best Practices
Why Health Checks Matter in Production
In development, you know your OpenClaw gateway is running because you're staring at the logs. In production, on a VPS somewhere in the cloud, you need automated ways to answer two questions:
- Is the gateway alive? — Is the process running and responding to requests?
- Is the gateway ready? — Can it actually handle work (AI API connections intact, channels connected, no startup lag)?
Before OpenClaw 2026.3.3, answering these questions required custom scripts or third-party monitoring tools. Now, health checking is built in.
The Four Health Endpoints
OpenClaw 2026.3.3 exposes four HTTP endpoints for health monitoring:
| Endpoint | Status Code (Healthy) | Status Code (Unhealthy) | Purpose |
|---|---|---|---|
| `GET /health` | 200 | 503 | Basic liveness: gateway process is running |
| `GET /healthz` | 200 | 503 | Kubernetes-style liveness (alias for `/health`) |
| `GET /ready` | 200 | 503 | Readiness: gateway can handle work (AI configured, channels connected) |
| `GET /readyz` | 200 | 503 | Kubernetes-style readiness (alias for `/ready`) |
Liveness vs. Readiness:
- Liveness (/health, /healthz): "Is the process alive?" If this fails, restart the container/process.
- Readiness (/ready, /readyz): "Can it handle work right now?" If this fails, don't send traffic yet (but don't restart).
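That distinction maps directly onto an orchestration decision. A minimal sketch in shell (the `probe_action` helper is hypothetical; in production the status code would come from `curl` against the real endpoints):

```shell
# Map a probe type and HTTP status code to an action (hypothetical helper).
probe_action() {
  probe="$1"   # "liveness" or "readiness"
  status="$2"  # HTTP status code returned by the endpoint
  if [ "$status" -eq 200 ]; then
    echo "ok"              # healthy: do nothing
  elif [ "$probe" = "liveness" ]; then
    echo "restart"         # dead process: restart the container
  else
    echo "hold-traffic"    # alive but not ready: stop routing, don't restart
  fi
}

# In production the status would come from the gateway, e.g.:
#   status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health)
probe_action liveness 503    # prints "restart"
probe_action readiness 503   # prints "hold-traffic"
```

The point of the sketch: a failed liveness check and a failed readiness check return the same 503, but they call for different responses.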
Endpoint Behavior in Detail
/health and /healthz (Liveness)
Returns 200 OK if the gateway process is running, regardless of:
- Whether AI providers are configured
- Whether channels are connected
- Whether the gateway is under heavy load
Returns 503 Service Unavailable only if:
- The gateway process is crashed or not responding
- HTTP server is not listening on the expected port
Use case: Basic process monitoring. If this fails, something is fundamentally wrong — restart the gateway.
/ready and /readyz (Readiness)
Returns 200 OK only if:
- Gateway process is running (same as /health)
- AND at least one AI provider is configured and accessible
- AND all enabled channels are connected (or have a valid configuration pending connection)
Returns 503 Service Unavailable if:
- No AI provider is configured
- A configured AI provider API key is invalid or the API is unreachable
- A channel is configured but failed to connect (e.g., Slack WebSocket connection failed)
The response body includes details about what's not ready:
```
HTTP 503 Service Unavailable
Content-Type: application/json

{
  "status": "not_ready",
  "issues": [
    "No AI provider configured",
    "Slack connection failed: invalid_bot_token"
  ]
}
```
Use case: Traffic routing. Don't send webhooks or API requests to the gateway until /ready returns 200.
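In a deployment script, that rule can be automated with a small polling loop. A sketch (the check command is injected as an argument so the URL and port, which depend on your setup, stay out of the logic):

```shell
# Poll a readiness check until it succeeds or a retry budget is exhausted.
#   $1: command that exits 0 when ready
#   $2: max attempts (default 30)
#   $3: delay between attempts in seconds (default 2)
wait_for_ready() {
  check="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $check; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# Usage against a local gateway (assumes port 8080):
#   wait_for_ready 'curl -sf http://localhost:8080/ready' 30 2
```

Gate the step that registers the instance with your load balancer (or sends the first webhook) on this function returning 0.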
Docker HEALTHCHECK Configuration
The most common use case is Docker health checks. In your Dockerfile:
```dockerfile
FROM openclaw/gateway:latest

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```
Breakdown:
- `--interval=30s` — Check health every 30 seconds
- `--timeout=3s` — Fail if the check takes longer than 3 seconds
- `--start-period=10s` — Give the gateway 10 seconds to start before counting failures
- `--retries=3` — Only mark unhealthy after 3 consecutive failures
In docker-compose.yml:
```yaml
version: '3.8'

services:
  openclaw:
    image: openclaw/gateway:latest
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 3s
      start_period: 10s
      retries: 3
```
Note that plain Docker only marks the container as `unhealthy`; a `restart: always` or `restart: on-failure` policy reacts to the process exiting, not to failed health checks. To restart on unhealthy status, use an orchestrator (Docker Swarm, Kubernetes) or a watchdog container such as the community `autoheal` image.
Kubernetes Probes
In Kubernetes, use the livenessProbe and readinessProbe in your deployment manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      containers:
        - name: gateway
          image: openclaw/gateway:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 2
```
Liveness probe: If the gateway crashes or hangs, Kubernetes restarts the pod.
Readiness probe: If the gateway is not ready (AI not configured, channel connection failed), Kubernetes stops sending traffic to it until it recovers.
Prometheus Metrics
OpenClaw also exposes metrics in Prometheus format at /metrics (if enabled in config):
```yaml
openclaw:
  metrics:
    enabled: true
    port: 9090
```
Metrics exposed include:
- `openclaw_health_status` — Liveness status (1 = healthy, 0 = unhealthy)
- `openclaw_ready_status` — Readiness status (1 = ready, 0 = not ready)
- `openclaw_uptime_seconds` — Gateway uptime in seconds
- `openclaw_request_count` — Total requests handled (by channel)
- `openclaw_request_duration_seconds` — Request latency histogram
- `openclaw_error_count` — Total errors (by type)
Configure Prometheus to scrape these:
```yaml
scrape_configs:
  - job_name: 'openclaw'
    static_configs:
      - targets: ['openclaw-gateway:9090']
    scrape_interval: 15s
```
Create a Grafana dashboard to visualize:
- Gateway uptime and health status
- Request rate and error rate
- Latency percentiles (p50, p95, p99)
- AI provider response times
- Channel connection status
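As a starting point, panels like those could be backed by PromQL queries over the metrics listed above. These query shapes are a sketch, assuming the latency histogram exposes standard `_bucket` series; exact label names depend on your deployment:

```promql
# Error rate over the last 5 minutes
rate(openclaw_error_count[5m])

# p95 request latency
histogram_quantile(0.95, rate(openclaw_request_duration_seconds_bucket[5m]))

# Instances currently not ready
count(openclaw_ready_status == 0)
```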
High Availability Patterns
Active-Active with Load Balancer
Run multiple OpenClaw instances behind a load balancer:
```nginx
upstream openclaw_backend {
    least_conn;
    server openclaw-1:8080 max_fails=2 fail_timeout=30s;
    server openclaw-2:8080 max_fails=2 fail_timeout=30s;
    server openclaw-3:8080 max_fails=2 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://openclaw_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
Configure the load balancer to health check /ready and route traffic only to healthy instances. If one instance fails (e.g., AI provider connection issue), the load balancer automatically routes to the remaining healthy instances.
Active-Passive Failover
Run a primary instance with a hot standby:
```yaml
# Primary (active)
instance:
  id: openclaw-primary

# Standby (passive)
instance:
  id: openclaw-standby
```
Use an orchestrator (Kubernetes, Nomad) or a custom watchdog to monitor the primary's /health endpoint. If it fails, promote the standby to active.
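One iteration of such a watchdog might look like the sketch below. Both commands are injected and entirely hypothetical; real promotion logic (flipping a DNS record, moving a floating IP) depends on your infrastructure:

```shell
# One watchdog iteration: run the health check, count consecutive failures,
# and promote the standby once the failure threshold is reached.
#   $1: health-check command (exit 0 = healthy)
#   $2: promotion command (run when the threshold is hit)
#   $3: file holding the consecutive-failure count
#   $4: failure threshold (default 3)
watchdog_tick() {
  check="$1"; promote="$2"; fail_file="$3"; threshold="${4:-3}"
  if $check; then
    echo 0 > "$fail_file"        # healthy: reset the counter
    return 0
  fi
  fails=$(cat "$fail_file" 2>/dev/null || echo 0)
  fails=$(( ${fails:-0} + 1 ))
  echo "$fails" > "$fail_file"
  if [ "$fails" -ge "$threshold" ]; then
    $promote                     # threshold reached: promote the standby
  fi
}

# Run from cron or a loop, e.g. (both paths hypothetical):
#   watchdog_tick 'curl -sf http://primary:8080/health' ./promote-standby.sh /tmp/oc_fails 3
```

Requiring several consecutive failures before promoting avoids flipping to the standby on a single transient network blip.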
This is simpler than active-active but has slower failover (seconds to minutes vs. milliseconds).
Alerting Rules
Set up alerts based on health check failures:
Prometheus Alertmanager
```yaml
groups:
  - name: openclaw
    rules:
      - alert: OpenClawGatewayDown
        expr: openclaw_health_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw gateway {{ $labels.instance }} is down"
          description: "Health check has been failing for 1 minute"

      - alert: OpenClawNotReady
        expr: openclaw_ready_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OpenClaw gateway {{ $labels.instance }} is not ready"
          description: "Readiness check has been failing for 5 minutes"
```
PagerDuty / Opsgenie Integration
Route critical alerts to on-call engineers:
- OpenClawGatewayDown: Page immediately — something is broken and needs human attention
- OpenClawNotReady: Create a ticket — investigate when available, but not an emergency
Monitoring Stack Options
Lightweight: Uptime Robot / Pingdom
For simple setups, use an external monitoring service to ping /health every minute:
- Free tier covers 1-5 monitors
- Email/Slack alerts on downtime
- No infrastructure to manage
Limitation: Can't check /ready status or capture detailed metrics.
Mid-Tier: Prometheus + Grafana
Self-hosted metrics and alerting:
- Prometheus scrapes `/metrics`
- Grafana dashboards visualize health, latency, errors
- Alertmanager routes alerts to email, Slack, PagerDuty
Setup complexity: Medium. Requires running Prometheus and Grafana (or using managed offerings).
Enterprise: Datadog / New Relic
Managed observability platforms:
- Auto-discovery of OpenClaw containers
- Built-in dashboards for API gateways
- Intelligent alerting on anomaly detection
- Distributed tracing for debugging latency issues
Cost: $15-50/host/month. Worth it for teams that want minimal ops overhead.
Troubleshooting Health Checks
"/ready returns 503 but /health returns 200"
Symptom: Gateway is alive but not ready.
Diagnosis: Check the response body from /ready for details:
```bash
curl http://localhost:8080/ready
```
Common causes:
- No AI provider: Run `openclaw onboard` to configure Anthropic, OpenAI, or Google.
- Invalid API key: Check `openclaw health` for details. Verify keys at your provider's console.
- Channel connection failed: Check logs for channel-specific errors (e.g., Slack WebSocket error, Telegram webhook not received).
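The `issues` array in the 503 body is built for exactly this triage. A sketch of pulling it out with `jq` (the sample body is hard-coded here, copied from the example response earlier; in practice you would pipe `curl -s http://localhost:8080/ready` instead):

```shell
# Extract the list of readiness issues from a /ready response body.
body='{"status":"not_ready","issues":["No AI provider configured","Slack connection failed: invalid_bot_token"]}'

# One issue per line:
echo "$body" | jq -r '.issues[]'

# How many issues are blocking readiness:
echo "$body" | jq '.issues | length'   # prints 2
```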
"Health check passes but gateway doesn't respond"
Symptom: /health returns 200, but actual requests to the gateway timeout or error.
Diagnosis: This is usually a network issue, not a health check issue:
- Firewall blocking traffic on the gateway port
- Load balancer routing to the wrong port
- DNS misconfiguration (the health check hits localhost or an IP directly, while real requests resolve through DNS)
Fix: Verify network connectivity separately from health checks (e.g., curl http://gateway:8080/).
"Flapping health checks (up and down repeatedly)"
Symptom: Health checks alternate between 200 and 503 every few seconds.
Causes:
- Resource exhaustion: Gateway is CPU/memory constrained and intermittently unresponsive. Check `top` or `htop`.
- AI provider rate limiting: Gateway is being throttled by the AI API and temporarily fails. Check logs for 429 errors.
- Network instability: Intermittent connectivity to AI providers or channels. Check `ping` and `traceroute`.
Fix: Address the root cause (scale up resources, implement rate limiting, fix network). In the meantime, increase the health check failure threshold (e.g., --retries=5 in Docker).
Production Checklist
- ✅ Configure `/health` endpoint for basic liveness monitoring
- ✅ Configure `/ready` endpoint for traffic routing decisions
- ✅ Set up Docker HEALTHCHECK or Kubernetes probes
- ✅ Configure Prometheus metrics scraping (optional but recommended)
- ✅ Set up Grafana dashboards for health, latency, errors
- ✅ Configure alerts for gateway down (critical) and gateway not ready (warning)
- ✅ Test failover: kill a gateway instance and verify automatic recovery
- ✅ Document runbooks for common health check failures
Going Beyond: Advanced Monitoring
Health checks are the foundation. For mature production deployments, add:
- Synthetic transactions: Periodic test messages through the gateway to verify end-to-end functionality (not just that the HTTP server responds)
- Channel-specific health: Custom checks for Slack webhook delivery, Telegram bot responsiveness, etc.
- Business metrics: Track AI API spend per channel, user engagement, resolution rate
- Log aggregation: Centralize logs (ELK, Loki, Datadog) for debugging issues across instances
Need production-grade monitoring for your OpenClaw deployment? We set up Prometheus, Grafana, alerting, and high availability with active-active failover. Includes runbook creation and team training.
Need Help with OpenClaw?
Our experts handle the entire setup — installation, configuration, integrations, and ongoing support. Get your AI assistant running in 24 hours.
Related Articles
OpenClaw PDF Analysis Tool: Native Document Processing at Scale
9 min read
OpenClaw Secrets Management: Secure Credential Configuration Guide
11 min read
OpenClaw Security Hardening Guide 2026
11 min read