
OpenClaw Production Monitoring: Health Check Endpoints & Best Practices

Advanced Guides


OpenClaw Expert Team
10 min read

Why Health Checks Matter in Production

In development, you know your OpenClaw gateway is running because you're staring at the logs. In production, on a VPS somewhere in the cloud, you need automated ways to answer two questions:

  1. Is the gateway alive? — Is the process running and responding to requests?
  2. Is the gateway ready? — Can it actually handle work (AI API connections intact, channels connected, no startup lag)?

Before OpenClaw 2026.3.3, answering these required custom scripts or third-party monitoring tools. Now, health checking is built-in.

The Four Health Endpoints

OpenClaw 2026.3.3 exposes four HTTP endpoints for health monitoring:

| Endpoint | Status Code (Healthy) | Status Code (Unhealthy) | Purpose |
|---|---|---|---|
| GET /health | 200 | 503 | Basic liveness: gateway process is running |
| GET /healthz | 200 | 503 | Kubernetes-style liveness (alias for /health) |
| GET /ready | 200 | 503 | Readiness: gateway can handle work (AI configured, channels connected) |
| GET /readyz | 200 | 503 | Kubernetes-style readiness (alias for /ready) |

Liveness vs. Readiness:

  • Liveness (/health, /healthz): "Is the process alive?" If this fails, restart the container/process.
  • Readiness (/ready, /readyz): "Can it handle work right now?" If this fails, don't send traffic yet (but don't restart).
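The two reactions can be encoded in a small helper. A sketch (the `probe_action` function is hypothetical, purely to illustrate the decision rule above):

```shell
# probe_action PROBE HTTP_STATUS: print what a supervisor should do.
# Encodes the rule above: a failed liveness probe means restart,
# a failed readiness probe means stop routing traffic.
probe_action() {
  probe="$1" status="$2"
  if [ "$status" = "200" ]; then
    echo "ok"          # healthy: nothing to do
  elif [ "$probe" = "liveness" ]; then
    echo "restart"     # process is wedged: restart it
  else
    echo "unroute"     # not ready: hold traffic, but do not restart
  fi
}

# In production the status code would come from something like:
#   curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz
```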

Endpoint Behavior in Detail

/health and /healthz (Liveness)

Returns 200 OK if the gateway process is running, regardless of:

  • Whether AI providers are configured
  • Whether channels are connected
  • Whether the gateway is under heavy load

The liveness check fails only if:

  • The gateway process has crashed or hung (in practice, a crashed process produces a connection error rather than a 503 — monitors should treat both as failure)
  • The HTTP server is not listening on the expected port

Use case: Basic process monitoring. If this fails, something is fundamentally wrong — restart the gateway.

/ready and /readyz (Readiness)

Returns 200 OK only if:

  • Gateway process is running (same as /health)
  • AND at least one AI provider is configured and accessible
  • AND all enabled channels are connected (or have a valid configuration pending connection)

Returns 503 Service Unavailable if:

  • No AI provider is configured
  • A configured AI provider API key is invalid or the API is unreachable
  • A channel is configured but failed to connect (e.g., Slack WebSocket connection failed)

The response body includes details about what's not ready:

HTTP 503 Service Unavailable
Content-Type: application/json

{
  "status": "not_ready",
  "issues": [
    "No AI provider configured",
    "Slack connection failed: invalid_bot_token"
  ]
}

Use case: Traffic routing. Don't send webhooks or API requests to the gateway until /ready returns 200.
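Deployment scripts can gate on readiness before cutting traffic over. A minimal sketch (the `wait_for_ready` helper is hypothetical; it takes the check command as an argument, so the curl call is a swappable placeholder):

```shell
# wait_for_ready CHECK_CMD TIMEOUT_SECONDS: run CHECK_CMD once per
# second until it succeeds; give up (return 1) after TIMEOUT_SECONDS.
wait_for_ready() {
  cmd="$1" timeout="${2:-60}" elapsed=0
  until eval "$cmd"; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Production usage (assumes the gateway port from the examples above):
#   wait_for_ready "curl -sf http://localhost:8080/ready > /dev/null" 120
```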

Docker HEALTHCHECK Configuration

The most common use case is Docker health checks. In your Dockerfile:

FROM openclaw/gateway:latest

HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Breakdown:

  • --interval=30s — Check health every 30 seconds
  • --timeout=3s — Fail if the check takes longer than 3 seconds
  • --start-period=10s — Give the gateway 10 seconds to start before counting failures
  • --retries=3 — Only mark unhealthy after 3 consecutive failures

In docker-compose.yml:

version: '3.8'
services:
  openclaw:
    image: openclaw/gateway:latest
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 3s
      start_period: 10s
      retries: 3

Note: plain Docker does not restart a container just because its health check fails — it only marks the container as unhealthy (visible in docker ps). Restart policies (restart: always, restart: on-failure) react to the process exiting, not to health status. Docker Swarm services do restart unhealthy tasks automatically.
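To get automatic restarts on unhealthy status in plain Docker Compose, a common pattern is a watchdog sidecar such as willfarrell/autoheal, which watches Docker's health status and restarts unhealthy containers. A sketch based on that project's documented options (verify against its README before use):

```yaml
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all   # watch all containers, not just labeled ones
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```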

Kubernetes Probes

In Kubernetes, use the livenessProbe and readinessProbe in your deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      containers:
      - name: gateway
        image: openclaw/gateway:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 2
          failureThreshold: 2

Liveness probe: If the gateway crashes or hangs, Kubernetes restarts the pod.

Readiness probe: If the gateway is not ready (AI not configured, channel connection failed), Kubernetes stops sending traffic to it until it recovers.

Prometheus Metrics

OpenClaw also exposes metrics in Prometheus format at /metrics (if enabled in config):

openclaw:
  metrics:
    enabled: true
    port: 9090

Metrics exposed include:

  • openclaw_health_status — Liveness status (1 = healthy, 0 = unhealthy)
  • openclaw_ready_status — Readiness status (1 = ready, 0 = not ready)
  • openclaw_uptime_seconds — Gateway uptime in seconds
  • openclaw_request_count — Total requests handled (by channel)
  • openclaw_request_duration_seconds — Request latency histogram
  • openclaw_error_count — Total errors (by type)

Configure Prometheus to scrape these:

scrape_configs:
  - job_name: 'openclaw'
    static_configs:
      - targets: ['openclaw-gateway:9090']
    scrape_interval: 15s

Create a Grafana dashboard to visualize:

  • Gateway uptime and health status
  • Request rate and error rate
  • Latency percentiles (p50, p95, p99)
  • AI provider response times
  • Channel connection status
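The panels above map to PromQL queries over the exported metrics. For example (assuming the latency histogram follows standard Prometheus naming, i.e. a `_bucket` suffix):

```promql
# Request rate per channel, averaged over 5 minutes
rate(openclaw_request_count[5m])

# Error rate as a fraction of all requests
sum(rate(openclaw_error_count[5m])) / sum(rate(openclaw_request_count[5m]))

# p95 request latency
histogram_quantile(0.95, sum by (le) (rate(openclaw_request_duration_seconds_bucket[5m])))
```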

High Availability Patterns

Active-Active with Load Balancer

Run multiple OpenClaw instances behind a load balancer:

upstream openclaw_backend {
    least_conn;
    server openclaw-1:8080 max_fails=2 fail_timeout=30s;
    server openclaw-2:8080 max_fails=2 fail_timeout=30s;
    server openclaw-3:8080 max_fails=2 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://openclaw_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Configure the load balancer to health check /ready and route traffic only to healthy instances. If one instance fails (e.g., AI provider connection issue), the load balancer automatically routes to the remaining healthy instances.
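Note that max_fails/fail_timeout in open-source nginx are passive checks: they count failures of real proxied requests, and nginx does not probe /ready on its own (active health checks are an NGINX Plus feature). For active probing of /ready, HAProxy is one option. A sketch using HAProxy's option httpchk (timings are illustrative):

```
backend openclaw_backend
    balance leastconn
    option httpchk GET /ready
    http-check expect status 200
    server openclaw-1 openclaw-1:8080 check inter 10s fall 2 rise 2
    server openclaw-2 openclaw-2:8080 check inter 10s fall 2 rise 2
    server openclaw-3 openclaw-3:8080 check inter 10s fall 2 rise 2
```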

Active-Passive Failover

Run a primary instance with a hot standby:

# Primary (active)
instance:
  id: openclaw-primary

# Standby (passive)
instance:
  id: openclaw-standby

Use an orchestrator (Kubernetes, Nomad) or a custom watchdog to monitor the primary's /health endpoint. If it fails, promote the standby to active.

This is simpler than active-active but has slower failover (seconds to minutes vs. milliseconds).
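The watchdog loop can be sketched in a few lines of shell (a hypothetical helper; the check and promote commands are injected as arguments, so both the curl call and the promotion mechanism are placeholders):

```shell
# watchdog CHECK_CMD PROMOTE_CMD THRESHOLD INTERVAL: run CHECK_CMD every
# INTERVAL seconds; after THRESHOLD consecutive failures, run PROMOTE_CMD.
watchdog() {
  check="$1" promote="$2" threshold="${3:-3}" interval="${4:-10}"
  failures=0
  while :; do
    if eval "$check"; then
      failures=0                 # primary recovered: reset the counter
    else
      failures=$((failures + 1))
      if [ "$failures" -ge "$threshold" ]; then
        eval "$promote"          # e.g. repoint DNS or the LB at the standby
        return 0
      fi
    fi
    sleep "$interval"
  done
}

# Production usage (placeholder commands):
#   watchdog "curl -sf http://openclaw-primary:8080/health > /dev/null" \
#            "./promote-standby.sh" 3 10
```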

Alerting Rules

Set up alerts based on health check failures:

Prometheus Alertmanager

groups:
- name: openclaw
  rules:
  - alert: OpenClawGatewayDown
    expr: openclaw_health_status == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "OpenClaw gateway {{ $labels.instance }} is down"
      description: "Health check has been failing for 1 minute"

  - alert: OpenClawNotReady
    expr: openclaw_ready_status == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "OpenClaw gateway {{ $labels.instance }} is not ready"
      description: "Readiness check has been failing for 5 minutes"

PagerDuty / Opsgenie Integration

Route critical alerts to on-call engineers:

  • GatewayDown: Page immediately — something is broken and needs human attention
  • GatewayNotReady: Create a ticket — investigate when available, but not an emergency
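In Alertmanager this split is a severity-based route. A sketch (receiver names are placeholders, and the PagerDuty integration key must be filled in):

```yaml
route:
  receiver: tickets            # default: warnings become tickets
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty      # critical: page the on-call engineer
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: tickets
    email_configs:
      - to: ops@example.com
```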

Monitoring Stack Options

Lightweight: Uptime Robot / Pingdom

For simple setups, use an external monitoring service to ping /health every minute:

  • Free tier covers 1-5 monitors
  • Email/Slack alerts on downtime
  • No infrastructure to manage

Limitation: Can't check /ready status or capture detailed metrics.

Mid-Tier: Prometheus + Grafana

Self-hosted metrics and alerting:

  • Prometheus scrapes /metrics
  • Grafana dashboards visualize health, latency, errors
  • Alertmanager routes alerts to email, Slack, PagerDuty

Setup complexity: Medium. Requires running Prometheus and Grafana (or using managed offerings).

Enterprise: Datadog / New Relic

Managed observability platforms:

  • Auto-discovery of OpenClaw containers
  • Built-in dashboards for API gateways
  • Intelligent alerting on anomaly detection
  • Distributed tracing for debugging latency issues

Cost: $15-50/host/month. Worth it for teams that want minimal ops overhead.

Troubleshooting Health Checks

"/ready returns 503 but /health returns 200"

Symptom: Gateway is alive but not ready.

Diagnosis: Check the response body from /ready for details:

curl http://localhost:8080/ready

Common causes:

  • No AI provider: Run openclaw onboard to configure Anthropic, OpenAI, or Google.
  • Invalid API key: Check openclaw health for details. Verify keys at your provider's console.
  • Channel connection failed: Check logs for channel-specific errors (e.g., Slack WebSocket error, Telegram webhook not received).

"Health check passes but gateway doesn't respond"

Symptom: /health returns 200, but actual requests to the gateway time out or error.

Diagnosis: This is usually a network issue, not a health check issue:

  • Firewall blocking traffic on the gateway port
  • Load balancer routing to the wrong port
  • DNS misconfiguration (the health check hits localhost or an IP directly, while real requests resolve through a stale or wrong DNS record)

Fix: Verify network connectivity separately from health checks (e.g., curl http://gateway:8080/).

"Flapping health checks (up and down repeatedly)"

Symptom: Health checks alternate between 200 and 503 every few seconds.

Causes:

  • Resource exhaustion: Gateway is CPU/memory constrained and intermittently unresponsive. Check top or htop.
  • AI provider rate limiting: Gateway is being throttled by the AI API and temporarily fails. Check logs for 429 errors.
  • Network instability: Intermittent connectivity to AI providers or channels. Check ping and traceroute.

Fix: Address the root cause (scale up resources, implement rate limiting, fix network). In the meantime, increase the health check failure threshold (e.g., --retries=5 in Docker).

Production Checklist

  • ✅ Configure /health endpoint for basic liveness monitoring
  • ✅ Configure /ready endpoint for traffic routing decisions
  • ✅ Set up Docker HEALTHCHECK or Kubernetes probes
  • ✅ Configure Prometheus metrics scraping (optional but recommended)
  • ✅ Set up Grafana dashboards for health, latency, errors
  • ✅ Configure alerts for gateway down (critical) and gateway not ready (warning)
  • ✅ Test failover: kill a gateway instance and verify automatic recovery
  • ✅ Document runbooks for common health check failures

Going Beyond: Advanced Monitoring

Health checks are the foundation. For mature production deployments, add:

  • Synthetic transactions: Periodic test messages through the gateway to verify end-to-end functionality (not just that the HTTP server responds)
  • Channel-specific health: Custom checks for Slack webhook delivery, Telegram bot responsiveness, etc.
  • Business metrics: Track AI API spend per channel, user engagement, resolution rate
  • Log aggregation: Centralize logs (ELK, Loki, Datadog) for debugging issues across instances

Need production-grade monitoring for your OpenClaw deployment? We set up Prometheus, Grafana, alerting, and high availability with active-active failover. Includes runbook creation and team training.

Book a free consultation or explore our Enterprise package.

Tags: openclaw production, openclaw monitoring, openclaw health check, openclaw high availability, openclaw docker
